Incremental Loading

Overview

Process only new or changed data since the last run, reducing execution time and resource usage.

Use Cases

  • Daily ETL pipelines processing only today's data
  • CDC-style (change data capture) updates
  • Time-series data ingestion

Configuration

Enable Incremental Mode

{
  "id": "source-1",
  "type": "database",
  "data": {
    "connection_id": "uuid",
    "query": "SELECT * FROM orders WHERE created_at > {{state.last_run}}",
    "incremental": {
      "enabled": true,
      "column": "created_at",
      "state_key": "orders_last_run"
    }
  }
}
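The page does not say how the {{state.last_run}} placeholder is resolved, so the following is only a sketch of one plausible approach: swap each placeholder for a bind marker and pass the stored value as a query parameter instead of splicing it into the SQL text. The render_query helper and the shape of the state dictionary are hypothetical, not part of the product's API.

```python
import re


def render_query(template: str, state: dict) -> tuple:
    """Replace {{state.<name>}} placeholders with '?' bind markers.

    Hypothetical helper: returns the rewritten SQL plus the parameter
    values in order, so the driver handles quoting of the stored value.
    """
    params = []

    def bind(match):
        # match.group(1) is the name after "state.", e.g. "last_run"
        params.append(state[match.group(1)])
        return "?"

    sql = re.sub(r"\{\{\s*state\.(\w+)\s*\}\}", bind, template)
    return sql, params


template = "SELECT * FROM orders WHERE created_at > {{state.last_run}}"
sql, params = render_query(template, {"last_run": "2024-01-15T00:00:00Z"})
print(sql)     # SELECT * FROM orders WHERE created_at > ?
print(params)  # ['2024-01-15T00:00:00Z']
```

Binding the value as a parameter keeps the timestamp from being interpreted as SQL, which matters if state can ever be edited by hand.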

State Management

The system tracks state automatically, storing the last processed value under the configured state_key:

{
  "state_key": "orders_last_run",
  "value": "2024-01-15T00:00:00Z",
  "updated_at": "2024-01-16T00:00:00Z"
}
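To make the state record concrete, here is a minimal sketch of how such a record could be advanced after a successful run. It assumes the new value is the largest incremental-column value seen in the batch and that all timestamps are UTC strings in the same ISO 8601 layout, so plain string comparison orders them correctly. The advance_state function is illustrative; the actual update rule is not documented here.

```python
from datetime import datetime, timezone


def advance_state(state: dict, rows: list, column: str, now=None) -> dict:
    """Return a state record moved forward to the highest value in rows.

    Assumption: values are same-format UTC ISO 8601 strings, which sort
    lexicographically in chronological order.
    """
    if not rows:
        return state  # nothing processed, keep the old watermark
    high_water = max(row[column] for row in rows)
    if high_water <= state["value"]:
        return state  # never move the watermark backwards
    now = now or datetime.now(timezone.utc)
    return {
        "state_key": state["state_key"],
        "value": high_water,
        "updated_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


state = {
    "state_key": "orders_last_run",
    "value": "2024-01-15T00:00:00Z",
    "updated_at": "2024-01-16T00:00:00Z",
}
rows = [
    {"created_at": "2024-01-15T08:30:00Z"},
    {"created_at": "2024-01-15T23:59:59Z"},
]
new_state = advance_state(
    state, rows, "created_at", now=datetime(2024, 1, 17, tzinfo=timezone.utc)
)
print(new_state["value"])
```

Writing the new record only after the batch has been committed downstream means a failed run is retried from the old value rather than skipped.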

Incremental Strategies

By Timestamp

{
  "incremental": {
    "type": "timestamp",
    "column": "updated_at",
    "format": "iso8601"
  }
}
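As an illustration of what a timestamp comparison with "format": "iso8601" might involve, the sketch below parses each value before comparing it to the last run time. Parsing matters when values carry different UTC offsets, where comparing raw strings would give the wrong order. The helper names and the in-memory rows are assumptions for the example only.

```python
from datetime import datetime


def parse_iso8601(text: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def changed_since(rows: list, column: str, last_run: str) -> list:
    """Keep rows whose timestamp is strictly later than last_run."""
    cutoff = parse_iso8601(last_run)
    return [row for row in rows if parse_iso8601(row[column]) > cutoff]


rows = [
    {"id": 1, "updated_at": "2024-01-14T23:59:59Z"},
    {"id": 2, "updated_at": "2024-01-15T00:00:01Z"},
    # 01:00 at +02:00 is 23:00 UTC on the previous day, so it is not new
    {"id": 3, "updated_at": "2024-01-15T01:00:00+02:00"},
]
new_rows = changed_since(rows, "updated_at", "2024-01-15T00:00:00Z")
print([row["id"] for row in new_rows])  # [2]
```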

By ID

{
  "incremental": {
    "type": "id",
    "column": "id",
    "last_value": 1000
  }
}
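The ID strategy can be pictured as a query for rows whose id exceeds last_value, read in ascending order so the final row supplies the next last_value. The sketch below uses an in-memory SQLite table; the orders schema, the fetch_new_rows function, and the limit parameter are assumptions made for the example.

```python
import sqlite3


def fetch_new_rows(conn, last_value: int, limit: int = 500):
    """Fetch rows with id > last_value and report the new high-water id."""
    rows = conn.execute(
        "SELECT id, total FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_value, limit),
    ).fetchall()
    # If nothing new arrived, keep the previous value unchanged
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(999, 5.0), (1000, 7.5), (1001, 9.0), (1002, 3.25)],
)

rows, last_value = fetch_new_rows(conn, 1000)
print(rows)        # [(1001, 9.0), (1002, 3.25)]
print(last_value)  # 1002
```

Because the filter looks only at id, this approach picks up inserted rows but not later updates to rows that were already loaded.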

By Partition

{
  "incremental": {
    "type": "partition",
    "column": "date",
    "partition_format": "yyyy-MM-dd"
  }
}
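One way to read the partition strategy is that each run loads every daily partition after the last one processed. The sketch below lists those partition values, assuming daily partitions and that the pattern yyyy-MM-dd corresponds to Python's %Y-%m-%d. The pending_partitions function and its through argument (the last complete partition, inclusive) are hypothetical.

```python
from datetime import date, timedelta


def pending_partitions(last_done: str, through: str) -> list:
    """List daily partition values after last_done, up to and including through."""
    start = date.fromisoformat(last_done) + timedelta(days=1)
    end = date.fromisoformat(through)
    count = (end - start).days + 1
    # range() of a non-positive count is empty, so an up-to-date state yields []
    return [
        (start + timedelta(days=offset)).strftime("%Y-%m-%d")
        for offset in range(max(count, 0))
    ]


print(pending_partitions("2024-01-12", "2024-01-15"))
# ['2024-01-13', '2024-01-14', '2024-01-15']
```

Loading whole partitions makes a rerun idempotent as long as each load replaces the partition rather than appending to it.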

Backfill

To process historical data:

{
  "backfill": {
    "enabled": true,
    "start_date": "2024-01-01",
    "end_date": "2024-01-15",
    "batch_size": 7
  }
}
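The unit of batch_size is not stated above. Assuming it counts days and that end_date is inclusive, the configured range would be split into windows roughly as follows; backfill_batches is an illustrative function, not the system's implementation.

```python
from datetime import date, timedelta


def backfill_batches(start_date: str, end_date: str, batch_size: int) -> list:
    """Split an inclusive date range into windows of at most batch_size days."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    batches = []
    while start <= end:
        # The last window is cut short so it never runs past end_date
        stop = min(start + timedelta(days=batch_size - 1), end)
        batches.append((start.isoformat(), stop.isoformat()))
        start = stop + timedelta(days=1)
    return batches


for window in backfill_batches("2024-01-01", "2024-01-15", 7):
    print(window)
# ('2024-01-01', '2024-01-07')
# ('2024-01-08', '2024-01-14')
# ('2024-01-15', '2024-01-15')
```

Running the windows one at a time bounds the size of each query and lets a failed backfill resume from the first unfinished window.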

Best Practices

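  1. Index the incremental column so each run's filter does not scan the full table
  2. Use appropriate data types for the incremental column (timestamp preferred)
  3. Handle deletions with soft deletes or CDC, since an incremental query cannot see rows that no longer exist
  4. Monitor the stored state to confirm it advances after each run and stays consistent with the source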
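Practices 1 and 3 can be sketched together with an in-memory SQLite table. The example assumes a soft delete sets a deleted_at column and also bumps updated_at, so the deletion appears in the next incremental query. The schema, index name, and soft_delete helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL, deleted_at TEXT)"
)
# Practice 1: index the column used in the incremental filter
conn.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-14T09:00:00Z", None),
        (2, "2024-01-15T10:00:00Z", None),
        (3, "2024-01-15T11:00:00Z", None),
    ],
)


def soft_delete(conn, order_id: int, when: str) -> None:
    # Practice 3: mark the row as deleted and touch updated_at,
    # instead of removing the row from the table
    conn.execute(
        "UPDATE orders SET deleted_at = ?, updated_at = ? WHERE id = ?",
        (when, when, order_id),
    )


soft_delete(conn, 1, "2024-01-15T12:00:00Z")

changes = conn.execute(
    "SELECT id, deleted_at IS NOT NULL FROM orders "
    "WHERE updated_at > ? ORDER BY id",
    ("2024-01-15T00:00:00Z",),
).fetchall()
print(changes)  # [(1, 1), (2, 0), (3, 0)]
```

Row 1 was last modified before the cutoff, yet it still shows up with its deleted flag set because the soft delete moved its updated_at forward.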