Architecture
System Overview
Data Flow IO is a high-performance, visual ETL/ELT pipeline builder designed for scale and flexibility. It leverages a modern stack combining synchronous visual editing with an asynchronous, distributed execution engine.
Key Components & Data Flow
1. Pipeline Execution (pipeline_runner.py)
Pipelines are represented as Directed Acyclic Graphs (DAGs). Execution follows these steps:
- Topological Sort: Determines the execution order based on dependencies.
- Master Data Initialization: Nodes marked as "Master Data" execute first, and their results are cached for use by any node in the pipeline.
- Hybrid Querying: SQL Source nodes support "Fetch then Filter" execution, allowing remote queries to be post-processed locally via DuckDB (joining with local Master Data).
- Node Execution: Each node type (Source, Transform, Destination) has a specialized executor.
- Data Handoff: `DataFrameAdapter` manages data movement between nodes, automatically switching between Pandas and Dask based on row count.
- Staging: Intermediate results are optionally persisted to local DuckDB files for observability and restartability.
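The topological-sort and parallel-generation steps above can be sketched with the standard-library `graphlib` module. The node names and the `deps` structure are illustrative, not the real `pipeline_runner.py` API:

```python
# Minimal sketch of the DAG scheduling step: group nodes into "generations"
# so that each generation only depends on earlier ones and can run in parallel.
from graphlib import TopologicalSorter


def execution_generations(deps: dict[str, set[str]]) -> list[list[str]]:
    """deps maps each node to the set of nodes it depends on."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    generations = []
    while ts.is_active():
        ready = list(ts.get_ready())   # all nodes whose dependencies are met
        generations.append(sorted(ready))
        ts.done(*ready)
    return generations


# Hypothetical four-node pipeline: two sources feeding a join, then a destination.
deps = {
    "source_a": set(),
    "source_b": set(),
    "join": {"source_a", "source_b"},
    "dest": {"join"},
}
print(execution_generations(deps))
# [['source_a', 'source_b'], ['join'], ['dest']]
```

Each inner list is a generation whose members have no unmet dependencies, which is what makes the parallel execution described later safe.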
2. Observability System
- Logger: `AsyncObservabilityLogger` provides non-blocking event logging.
- Persistence: Events are batched and synced to the Supabase `observability_logs` table.
- Metrics: Tracks row counts (input/output), duration, memory usage, and data samples at every node.
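A queue-based sketch of the non-blocking, batched logging pattern described above. The class name, batch size, and in-memory sink are assumptions; the real `AsyncObservabilityLogger` syncs batches to Supabase:

```python
# Illustrative batched event logger: callers never block, and events are
# flushed in batches (the `flushed` list stands in for the Supabase sink).
import asyncio


class BatchedLogger:
    def __init__(self, flush_size: int = 3):
        self.queue = asyncio.Queue()
        self.flush_size = flush_size
        self.flushed = []  # stand-in for the observability_logs table

    def log(self, event: dict) -> None:
        self.queue.put_nowait(event)  # non-blocking for the caller

    async def flush(self) -> None:
        batch = []
        while not self.queue.empty() and len(batch) < self.flush_size:
            batch.append(self.queue.get_nowait())
        if batch:
            self.flushed.append(batch)  # real code: insert batch into Supabase


async def main():
    logger = BatchedLogger()
    for i in range(3):
        logger.log({"node": f"n{i}", "rows_out": i * 10})
    await logger.flush()
    return logger.flushed


batches = asyncio.run(main())
print(len(batches), len(batches[0]))  # 1 3
```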
3. Airbyte Connector Architecture
- Managers: `AirbyteConnectorManager` handles the connector lifecycle.
- Specs: Configuration forms are generated dynamically from Airbyte connector specifications.
- Secrets: Integrated with Supabase for secure credential storage (OAuth tokens, passwords).
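The spec-to-form step can be illustrated as below. The spec shape follows the JSON-Schema style Airbyte uses for connector `connectionSpecification`s, but the field-descriptor keys and helper name are hypothetical:

```python
# Sketch: turn a connector spec (JSON Schema properties) into form-field
# descriptors, marking secret fields so the UI can mask them and route
# their values to Supabase credential storage.
spec = {
    "properties": {
        "host": {"type": "string", "title": "Host"},
        "port": {"type": "integer", "title": "Port", "default": 5432},
        "password": {"type": "string", "airbyte_secret": True},
    },
    "required": ["host", "password"],
}


def spec_to_form_fields(spec: dict) -> list[dict]:
    fields = []
    for name, prop in spec["properties"].items():
        fields.append({
            "name": name,
            "label": prop.get("title", name),
            "type": prop["type"],
            "required": name in spec.get("required", []),
            # secret fields are masked in the UI, never logged
            "secret": prop.get("airbyte_secret", False),
        })
    return fields


for f in spec_to_form_fields(spec):
    print(f["name"], f["required"], f["secret"])
```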
Configuration & Rules
- No Localhost DB: All persistent data must reside in Supabase.
- Absolute Paths: Mandatory for all file system operations in the server.
- Processing Modes:
  - `auto`: Default; switches between Pandas and Dask based on row count.
  - `pandas`: Forces in-memory processing.
  - `dask`: Forces distributed processing.
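The mode selection can be sketched as a single decision function. The threshold value is an assumption for illustration; the real `DataFrameAdapter` cutoff may differ:

```python
# Sketch of the processing-mode decision: explicit modes win, otherwise
# "auto" picks an engine from the row count (threshold is assumed).
ROW_THRESHOLD = 1_000_000


def choose_engine(row_count: int, mode: str = "auto") -> str:
    if mode in ("pandas", "dask"):
        return mode  # forced mode overrides the row-count heuristic
    # auto: small data stays in-memory, large data goes distributed
    return "dask" if row_count > ROW_THRESHOLD else "pandas"


print(choose_engine(50_000))                # pandas
print(choose_engine(5_000_000))             # dask
print(choose_engine(5_000_000, "pandas"))   # pandas (forced)
```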
Data Flow Rules
1. Input Merging
When a node has multiple inputs:
Union (Vertical Concatenation):
- If all inputs have identical schemas
- Rows are stacked vertically
- Example: Combining data from multiple sources
Join (Horizontal Concatenation):
- If inputs have different schemas
- Columns are aliased with source prefix
- Example: Enriching data from multiple sources
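Both merge rules can be sketched in pandas. The prefixing scheme (`<source>_<column>`) and the function name are assumptions about how the aliasing works:

```python
# Sketch of input merging: identical schemas are unioned (stacked
# vertically); differing schemas are joined side by side with
# source-prefixed column names.
import pandas as pd


def merge_inputs(inputs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    frames = list(inputs.values())
    schemas = [tuple(f.columns) for f in frames]
    if all(s == schemas[0] for s in schemas):
        # Union: same schema -> vertical concatenation
        return pd.concat(frames, ignore_index=True)
    # Join: different schemas -> horizontal concat, columns aliased by source
    prefixed = [f.add_prefix(f"{name}_") for name, f in inputs.items()]
    return pd.concat(prefixed, axis=1)


a = pd.DataFrame({"id": [1], "v": [10]})
b = pd.DataFrame({"id": [2], "v": [20]})
c = pd.DataFrame({"score": [0.5]})

print(merge_inputs({"a": a, "b": b}).shape)          # (2, 2)  -> union
print(list(merge_inputs({"a": a, "c": c}).columns))  # ['a_id', 'a_v', 'c_score']
```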
2. Edge Transforms
Edges can transform data:
- Pass Entire Record: Send all columns
- Selected Columns: Filter specific columns
3. Execution Order
- Topological sorting ensures dependencies are met
- Nodes in the same generation run in parallel
- Async execution with `asyncio.gather()`
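Per-generation parallelism with `asyncio.gather()` can be sketched as follows; the node coroutine is a stand-in for the real executors:

```python
# Sketch: run each generation of the sorted DAG concurrently, awaiting the
# whole generation before moving on, so dependencies are always satisfied.
import asyncio


async def run_node(name: str, results: dict) -> None:
    await asyncio.sleep(0)  # placeholder for real I/O-bound node work
    results[name] = f"{name}:done"


async def run_pipeline(generations: list[list[str]]) -> dict:
    results = {}
    for gen in generations:
        # every node in a generation has its upstream results available,
        # so the whole generation can run concurrently
        await asyncio.gather(*(run_node(n, results) for n in gen))
    return results


out = asyncio.run(run_pipeline([["src_a", "src_b"], ["join"]]))
print(out)
```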