
Processing Mode Configuration

Overview

Control whether your pipeline uses Pandas or Dask for data processing at the job level.

Configuration Options

1. Auto Mode (Default) - Smart Switching

processing_mode = 'auto'

Behavior:

  • Small datasets (< 10,000 rows): Uses Pandas
  • Large datasets (≥ 10,000 rows): Uses Dask
  • Recommended for most use cases
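The threshold rule above can be sketched as a small selection function. This is an illustration only: `AUTO_THRESHOLD_ROWS` and `choose_backend` are hypothetical names, not part of the pipeline's API, and only the 10,000-row cutoff comes from this page.

```python
# Hypothetical sketch of the auto-mode rule; not the pipeline's actual code.
AUTO_THRESHOLD_ROWS = 10_000  # cutoff stated in this document


def choose_backend(row_count: int, processing_mode: str = "auto") -> str:
    """Return 'pandas' or 'dask' for a dataset with row_count rows."""
    if processing_mode in ("pandas", "dask"):
        return processing_mode  # forced modes ignore data size
    if processing_mode != "auto":
        raise ValueError(f"unknown processing_mode: {processing_mode!r}")
    # Auto mode: strictly fewer rows than the cutoff stays on Pandas.
    return "pandas" if row_count < AUTO_THRESHOLD_ROWS else "dask"
```

Note that the boundary is inclusive on the Dask side: a dataset of exactly 10,000 rows is treated as large.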

2. Force Dask Mode

processing_mode = 'dask'

Behavior:

  • ALL operations use Dask, regardless of data size
  • Even small datasets (100 rows) will use Dask
  • Enables distributed processing for everything

When to use:

  • Testing distributed processing
  • Ensuring consistent behavior across all data sizes

3. Force Pandas Mode

processing_mode = 'pandas'

Behavior:

  • ALL operations use Pandas, regardless of data size
  • No distributed processing

When to use:

  • Debugging issues
  • When you need deterministic, in-process execution without a distributed scheduler

How to Set Processing Mode

Method 1: Pipeline Configuration (JSON)

{
  "processing_mode": "dask",
  "nodes": [...],
  "edges": [...]
}
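Before handing a configuration like this to the pipeline, it can help to check the mode value early. The following is a minimal sketch, assuming the config is available as a JSON string; `read_processing_mode` and `VALID_MODES` are hypothetical helpers, and whether the pipeline itself validates the value is not stated on this page.

```python
import json

VALID_MODES = {"auto", "dask", "pandas"}  # the three modes described on this page


def read_processing_mode(config_text: str) -> str:
    """Parse a pipeline config and return its validated processing mode."""
    config = json.loads(config_text)
    # 'auto' is the documented default when the key is absent.
    mode = config.get("processing_mode", "auto")
    if mode not in VALID_MODES:
        raise ValueError(
            f"processing_mode must be one of {sorted(VALID_MODES)}, got {mode!r}"
        )
    return mode


# Empty node and edge lists stand in for a real pipeline definition.
example = '{"processing_mode": "dask", "nodes": [], "edges": []}'
print(read_processing_mode(example))  # prints: dask
```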

Method 2: Python API

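from pipeline_runner import PipelineRunner

# pipeline_config is the pipeline definition (nodes, edges, ...) loaded elsewhere
runner = PipelineRunner(
    config=pipeline_config,
    processing_mode='dask',
)

# Call from inside an async function, or wrap the call with asyncio.run(...)
await runner.run_async()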

Method 3: Programmatic Control

from dataframe_adapter import DataFrameAdapter

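# Force Dask for everything that runs after this call
DataFrameAdapter.set_processing_mode('dask')
# Your pipeline code here...
# Clear the override once the pipeline code has finished
DataFrameAdapter.reset_processing_mode()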
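If the pipeline code raises between the two calls, the reset is skipped and the forced mode stays active. One way to guard against that is a context manager, sketched below. The `DataFrameAdapter` class here is a stand-in so the example runs on its own, and it assumes that resetting returns the mode to 'auto'. `forced_processing_mode` is a hypothetical helper, not part of `dataframe_adapter`.

```python
from contextlib import contextmanager


class DataFrameAdapter:
    """Stand-in for the real adapter (assumption: reset restores 'auto')."""

    _mode = "auto"

    @classmethod
    def set_processing_mode(cls, mode: str) -> None:
        cls._mode = mode

    @classmethod
    def reset_processing_mode(cls) -> None:
        cls._mode = "auto"


@contextmanager
def forced_processing_mode(mode: str):
    """Hypothetical helper: force a mode for one block, then always reset."""
    DataFrameAdapter.set_processing_mode(mode)
    try:
        yield
    finally:
        # Runs even if the block raises, so the override never leaks.
        DataFrameAdapter.reset_processing_mode()


with forced_processing_mode("dask"):
    inside = DataFrameAdapter._mode  # mode while the block runs
after = DataFrameAdapter._mode  # mode once the block has exited
```

With the real adapter, the stand-in class would be replaced by the import shown above, and the pipeline code would go inside the `with` block.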

Behavior Matrix

Data Size    Mode: auto    Mode: dask    Mode: pandas
100 rows     Pandas        Dask          Pandas
5K rows      Pandas        Dask          Pandas
10K rows     Dask          Dask          Pandas
50K rows     Dask          Dask          Pandas
1M rows      Dask          Dask          Pandas

Configuration Priority

The system looks for the processing mode in the following order and uses the first value it finds:

  1. Pipeline config (config['processing_mode'])
  2. Pipeline data (config['data']['processing_mode'])
  3. PipelineRunner parameter (processing_mode='dask')
  4. Default ('auto')
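The lookup order above can be sketched as a first-match search. This is an illustration under two assumptions: that the first value found wins, and that an unset value shows up as a missing key or `None`. `resolve_processing_mode` is a hypothetical name, not a function of the pipeline.

```python
from typing import Optional


def resolve_processing_mode(config: dict, runner_mode: Optional[str] = None) -> str:
    """Return the first processing mode found, following the listed order."""
    candidates = (
        config.get("processing_mode"),                       # 1. pipeline config
        (config.get("data") or {}).get("processing_mode"),   # 2. pipeline data
        runner_mode,                                         # 3. PipelineRunner parameter
    )
    for mode in candidates:
        if mode is not None:
            return mode
    return "auto"  # 4. default
```

Under this reading, a mode set in the pipeline config takes precedence over one passed to `PipelineRunner`, so a runner parameter only takes effect when the config leaves the mode unset.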

Best Practices

  1. Default to Auto: Use auto mode unless you have a specific reason not to
  2. Test with Dask: Use dask mode to test distributed processing
  3. Debug with Pandas: Use pandas mode for easier debugging
  4. Document Choice: If using non-auto mode, document why in comments