Architecture

System Overview

Data Flow IO is a high-performance, visual ETL/ELT pipeline builder designed for scale and flexibility. It leverages a modern stack combining synchronous visual editing with an asynchronous, distributed execution engine.

Key Components & Data Flow

1. Pipeline Execution (pipeline_runner.py)

Pipelines are represented as Directed Acyclic Graphs (DAGs). Execution follows these steps:

  1. Topological Sort: Determines the execution order based on dependencies.
  2. Master Data Initialization: Nodes marked as "Master Data" execute first, and their results are cached for use by any node in the pipeline.
  3. Hybrid Querying: SQL Source nodes support "Fetch then Filter" execution, allowing remote queries to be post-processed locally via DuckDB (joining with local Master Data).
  4. Node Execution: Each node type (Source, Transform, Destination) has a specialized executor.
  5. Data Handoff: DataFrameAdapter manages data movement between nodes, automatically switching between Pandas and Dask based on row count.
  6. Staging: Intermediate results are optionally persisted to local DuckDB files for observability and restartability.
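Steps 1 and 2 can be sketched with the standard library's `graphlib`. The node names, the DAG shape, and the `master_nodes` set below are illustrative, not taken from the actual runner:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: node id -> set of upstream dependencies.
dag = {
    "source_a": set(),
    "master_customers": set(),          # marked as "Master Data"
    "transform": {"source_a", "master_customers"},
    "destination": {"transform"},
}
master_nodes = {"master_customers"}

def execution_order(dag, master_nodes):
    """Return nodes in dependency order, putting Master Data nodes
    first within each ready batch so their results are cached early."""
    ts = TopologicalSorter(dag)
    ts.prepare()
    order = []
    while ts.is_active():
        # False sorts before True, so master nodes come first in the batch.
        ready = sorted(ts.get_ready(), key=lambda n: n not in master_nodes)
        order.extend(ready)
        ts.done(*ready)
    return order

print(execution_order(dag, master_nodes))
# ['master_customers', 'source_a', 'transform', 'destination']
```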

2. Observability System

  • Logger: AsyncObservabilityLogger provides non-blocking event logging.
  • Persistence: Events are batched and synced to Supabase observability_logs.
  • Metrics: Tracks row counts (input/output), duration, memory usage, and data samples at every node.
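The non-blocking, batched pattern behind `AsyncObservabilityLogger` can be sketched with an `asyncio.Queue`: callers enqueue events without awaiting any I/O, and a background task drains the queue in batches. Class and field names below are illustrative; the real logger syncs batches to the Supabase `observability_logs` table.

```python
import asyncio

class BatchedLogger:
    """Sketch of a non-blocking logger: log() only enqueues; a background
    task flushes events in batches (in production, to observability_logs)."""

    def __init__(self, batch_size: int = 10):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_size = batch_size
        self.flushed = []               # stand-in for the remote sink

    def log(self, event: dict) -> None:
        self.queue.put_nowait(event)    # non-blocking: caller never awaits I/O

    async def flush_loop(self) -> None:
        batch = []
        while True:
            event = await self.queue.get()
            if event is None:           # sentinel: shut down
                break
            batch.append(event)
            if len(batch) >= self.batch_size:
                self.flushed.append(batch)   # real code would POST the batch
                batch = []
        if batch:
            self.flushed.append(batch)       # flush the remainder

async def main():
    logger = BatchedLogger(batch_size=3)
    task = asyncio.create_task(logger.flush_loop())
    for i in range(7):
        logger.log({"node": "transform", "rows_out": i})
    logger.queue.put_nowait(None)
    await task
    return logger.flushed

batches = asyncio.run(main())
print([len(b) for b in batches])  # [3, 3, 1]
```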

3. Airbyte Connector Architecture

  • Managers: AirbyteConnectorManager handles connector lifecycle.
  • Specs: Dynamic generation of configuration forms based on Airbyte specifications.
  • Secrets: Integrated with Supabase for secure credential storage (OAuth tokens, passwords).

Configuration & Rules

  • No Localhost DB: All persistent data must reside in Supabase.
  • Absolute Paths: Mandatory for all file system operations in the server.
  • Processing Modes:
    • auto: Default, switches between Pandas and Dask.
    • pandas: Forces in-memory processing.
    • dask: Forces distributed processing.
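The mode-resolution logic can be sketched as follows. The row-count threshold is a made-up placeholder; the actual cutoff used by `DataFrameAdapter` is not documented here:

```python
# Hypothetical threshold; the real cutoff in DataFrameAdapter may differ.
ROW_THRESHOLD = 1_000_000

def pick_engine(row_count: int, mode: str = "auto") -> str:
    """Resolve the processing engine for a data handoff between nodes."""
    if mode in ("pandas", "dask"):
        return mode                     # explicit override wins
    # "auto": spill to distributed processing only for large inputs
    return "dask" if row_count > ROW_THRESHOLD else "pandas"

print(pick_engine(50_000))               # pandas
print(pick_engine(5_000_000))            # dask
print(pick_engine(5_000_000, "pandas"))  # forced in-memory
```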

Data Flow Rules

1. Input Merging

When a node has multiple inputs:

Union (Vertical Concatenation):

  • If all inputs have identical schemas
  • Rows are stacked vertically
  • Example: Combining data from multiple sources

Join (Horizontal Concatenation):

  • If inputs have different schemas
  • Columns are aliased with source prefix
  • Example: Enriching data from multiple sources
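A minimal pandas sketch of the merging rule, assuming the join case simply places the inputs side by side with source-prefixed columns (the real join semantics may involve key matching):

```python
import pandas as pd

def merge_inputs(inputs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Input-merging sketch: union when all schemas match,
    otherwise a prefixed horizontal join."""
    frames = list(inputs.values())
    schemas = {tuple(f.columns) for f in frames}
    if len(schemas) == 1:                           # identical schemas -> union
        return pd.concat(frames, ignore_index=True)
    # different schemas -> join, aliasing columns with the source name
    prefixed = [f.add_prefix(f"{name}_") for name, f in inputs.items()]
    return pd.concat(prefixed, axis=1)

a = pd.DataFrame({"id": [1], "v": [10]})
b = pd.DataFrame({"id": [2], "v": [20]})
c = pd.DataFrame({"id": [1], "w": [5]})
print(list(merge_inputs({"s1": a, "s2": b}).columns))  # ['id', 'v'] (union)
print(list(merge_inputs({"s1": a, "s3": c}).columns))  # prefixed join columns
```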

2. Edge Transforms

Edges can transform data as it moves between nodes:

  • Pass Entire Record: Send all columns
  • Selected Columns: Filter specific columns

3. Execution Order

  • Topological sorting ensures dependencies are met
  • Nodes in same generation run in parallel
  • Async execution with asyncio.gather()
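The generation-parallel pattern can be sketched by combining `graphlib` with `asyncio.gather()`; the DAG and `run_node` body below are placeholders for real node execution:

```python
import asyncio
from graphlib import TopologicalSorter

# Hypothetical DAG: 'c' depends on 'a' and 'b'; 'd' depends on 'c'.
dag = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}

async def run_node(node: str) -> str:
    await asyncio.sleep(0)              # stand-in for real node execution
    return node

async def run_pipeline(dag):
    """Run each topological generation concurrently with asyncio.gather()."""
    ts = TopologicalSorter(dag)
    ts.prepare()
    generations = []
    while ts.is_active():
        ready = list(ts.get_ready())    # one generation: no mutual dependencies
        await asyncio.gather(*(run_node(n) for n in ready))
        generations.append(sorted(ready))
        ts.done(*ready)
    return generations

print(asyncio.run(run_pipeline(dag)))   # [['a', 'b'], ['c'], ['d']]
```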