Architecture

System Overview

Data Flow IO is a high-performance, visual ETL/ELT pipeline builder designed for scale and flexibility. It leverages a modern stack combining synchronous visual editing with an asynchronous, distributed execution engine.

Key Components & Data Flow

1. Pipeline Execution (pipeline_runner.py)

Pipelines are represented as Directed Acyclic Graphs (DAGs). Execution follows these steps:

  1. Topological Sort: Determines the execution order based on dependencies.
  2. Master Data Initialization: Nodes marked as "Master Data" execute first, and their results are cached for use by any node in the pipeline.
  3. Hybrid Querying: SQL Source nodes support "Fetch then Filter" execution, allowing remote queries to be post-processed locally via DuckDB (joining with local Master Data).
  4. Node Execution: Each node type (Source, Transform, Destination) has a specialized executor.
  5. Data Handoff: DataFrameAdapter manages data movement between nodes, automatically switching between Pandas and Dask based on row count.
  6. Staging: Intermediate results are optionally persisted to local DuckDB files for observability and restartability.
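Steps 1 and 2 can be sketched with the standard library's `graphlib`. The node names, the DAG shape, and the `master_nodes` set below are illustrative, not taken from the actual runner:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: node id -> set of upstream dependencies.
dag = {
    "source_a": set(),
    "master_customers": set(),          # marked as "Master Data"
    "transform": {"source_a", "master_customers"},
    "destination": {"transform"},
}
master_nodes = {"master_customers"}

def execution_order(dag, master_nodes):
    """Return nodes in dependency order, putting Master Data nodes
    first within each ready batch so their results are cached early."""
    ts = TopologicalSorter(dag)
    ts.prepare()
    order = []
    while ts.is_active():
        # False sorts before True, so master nodes come first in the batch.
        ready = sorted(ts.get_ready(), key=lambda n: n not in master_nodes)
        order.extend(ready)
        ts.done(*ready)
    return order

print(execution_order(dag, master_nodes))
# ['master_customers', 'source_a', 'transform', 'destination']
```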

2. Observability System

  • Logger: AsyncObservabilityLogger provides non-blocking event logging.
  • Persistence: Events are batched and synced to Supabase observability_logs.
  • Metrics: Tracks row counts (input/output), duration, memory usage, and data samples at every node.
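The non-blocking, batched pattern behind `AsyncObservabilityLogger` can be sketched with an `asyncio.Queue`: callers enqueue events without awaiting any I/O, and a background task drains the queue in batches. Class and field names below are illustrative; the real logger syncs batches to the Supabase `observability_logs` table.

```python
import asyncio

class BatchedLogger:
    """Sketch of a non-blocking logger: log() only enqueues; a background
    task flushes events in batches (in production, to observability_logs)."""

    def __init__(self, batch_size: int = 10):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_size = batch_size
        self.flushed = []               # stand-in for the remote sink

    def log(self, event: dict) -> None:
        self.queue.put_nowait(event)    # non-blocking: caller never awaits I/O

    async def flush_loop(self) -> None:
        batch = []
        while True:
            event = await self.queue.get()
            if event is None:           # sentinel: shut down
                break
            batch.append(event)
            if len(batch) >= self.batch_size:
                self.flushed.append(batch)   # real code would POST the batch
                batch = []
        if batch:
            self.flushed.append(batch)       # flush the remainder

async def main():
    logger = BatchedLogger(batch_size=3)
    task = asyncio.create_task(logger.flush_loop())
    for i in range(7):
        logger.log({"node": "transform", "rows_out": i})
    logger.queue.put_nowait(None)
    await task
    return logger.flushed

batches = asyncio.run(main())
print([len(b) for b in batches])  # [3, 3, 1]
```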

3. Airbyte Connector Architecture

  • Managers: AirbyteConnectorManager handles connector lifecycle.
  • Specs: Dynamic generation of configuration forms based on Airbyte specifications.
  • Secrets: Integrated with Supabase for secure credential storage (OAuth tokens, passwords).

Configuration & Rules

  • No Localhost DB: All persistent data must reside in Supabase.
  • Absolute Paths: Mandatory for all file system operations in the server.
  • Processing Modes:
    • auto: Default, switches between Pandas and Dask.
    • pandas: Forces in-memory processing.
    • dask: Forces distributed processing.
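The mode-resolution logic can be sketched as follows. The row-count threshold is a made-up placeholder; the actual cutoff used by `DataFrameAdapter` is not documented here:

```python
# Hypothetical threshold; the real cutoff in DataFrameAdapter may differ.
ROW_THRESHOLD = 1_000_000

def pick_engine(row_count: int, mode: str = "auto") -> str:
    """Resolve the processing engine for a data handoff between nodes."""
    if mode in ("pandas", "dask"):
        return mode                     # explicit override wins
    # "auto": spill to distributed processing only for large inputs
    return "dask" if row_count > ROW_THRESHOLD else "pandas"

print(pick_engine(50_000))               # pandas
print(pick_engine(5_000_000))            # dask
print(pick_engine(5_000_000, "pandas"))  # forced in-memory
```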

Data Flow Rules

1. Input Merging

When a node has multiple inputs:

Union (Vertical Concatenation):

  • If all inputs have identical schemas
  • Rows are stacked vertically
  • Example: Combining data from multiple sources

Join (Horizontal Concatenation):

  • If inputs have different schemas
  • Columns are aliased with source prefix
  • Example: Enriching data from multiple sources
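A minimal pandas sketch of the merging rule, assuming the join case simply places the inputs side by side with source-prefixed columns (the real join semantics may involve key matching):

```python
import pandas as pd

def merge_inputs(inputs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Input-merging sketch: union when all schemas match,
    otherwise a prefixed horizontal join."""
    frames = list(inputs.values())
    schemas = {tuple(f.columns) for f in frames}
    if len(schemas) == 1:                           # identical schemas -> union
        return pd.concat(frames, ignore_index=True)
    # different schemas -> join, aliasing columns with the source name
    prefixed = [f.add_prefix(f"{name}_") for name, f in inputs.items()]
    return pd.concat(prefixed, axis=1)

a = pd.DataFrame({"id": [1], "v": [10]})
b = pd.DataFrame({"id": [2], "v": [20]})
c = pd.DataFrame({"id": [1], "w": [5]})
print(list(merge_inputs({"s1": a, "s2": b}).columns))  # ['id', 'v'] (union)
print(list(merge_inputs({"s1": a, "s3": c}).columns))  # prefixed join columns
```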

2. Edge Transforms

Edges can transform data as it moves between nodes:

  • Pass Entire Record: Send all columns
  • Selected Columns: Filter specific columns

3. Execution Order

  • Topological sorting ensures dependencies are met
  • Nodes in same generation run in parallel
  • Async execution with asyncio.gather()
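The generation-parallel pattern can be sketched by combining `graphlib` with `asyncio.gather()`; the DAG and `run_node` body below are placeholders for real node execution:

```python
import asyncio
from graphlib import TopologicalSorter

# Hypothetical DAG: 'c' depends on 'a' and 'b'; 'd' depends on 'c'.
dag = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}

async def run_node(node: str) -> str:
    await asyncio.sleep(0)              # stand-in for real node execution
    return node

async def run_pipeline(dag):
    """Run each topological generation concurrently with asyncio.gather()."""
    ts = TopologicalSorter(dag)
    ts.prepare()
    generations = []
    while ts.is_active():
        ready = list(ts.get_ready())    # one generation: no mutual dependencies
        await asyncio.gather(*(run_node(n) for n in ready))
        generations.append(sorted(ready))
        ts.done(*ready)
    return generations

print(asyncio.run(run_pipeline(dag)))   # [['a', 'b'], ['c'], ['d']]
```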