Incremental Loading
Overview
Incremental loading processes only the data that is new or has changed since the last run, which reduces execution time and resource usage compared with a full reload.
Use Cases
- Daily ETL pipelines processing only today's data
- Change data capture (CDC) style updates
- Time-series data ingestion
Configuration
Enable Incremental Mode
{
"id": "source-1",
"type": "database",
"data": {
"connection_id": "uuid",
"query": "SELECT * FROM orders WHERE created_at > {{state.last_run}}",
"incremental": {
"enabled": true,
"column": "created_at",
"state_key": "orders_last_run"
}
}
}
State Management
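The system tracks state automatically. For each state_key it keeps a record with the last processed value and the time that record was updated: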
{
"state_key": "orders_last_run",
"value": "2024-01-15T00:00:00Z",
"updated_at": "2024-01-16T00:00:00Z"
}
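The state record above behaves like a high-water mark. The following is a minimal sketch of that idea, not the system's actual implementation: it assumes an in-memory SQLite table named orders, a plain dict standing in for the state store, and a hypothetical helper run_incremental. The watermark is advanced to the largest created_at value actually read, which is one common choice; the real system may store a different value.

```python
import sqlite3

def run_incremental(conn, state, state_key="orders_last_run"):
    """Fetch rows newer than the stored watermark, then advance the watermark."""
    # Fall back to the epoch on the very first run (assumed behavior).
    last_run = state.get(state_key, "1970-01-01T00:00:00Z")
    rows = conn.execute(
        "SELECT id, created_at FROM orders WHERE created_at > ? ORDER BY created_at",
        (last_run,),
    ).fetchall()
    if rows:
        # Advance to the newest timestamp seen, not to wall-clock time,
        # so rows committed during the run are not skipped next time.
        state[state_key] = rows[-1][1]
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-01-14T09:00:00Z"),
        (2, "2024-01-15T12:00:00Z"),
        (3, "2024-01-16T08:30:00Z"),
    ],
)

state = {"orders_last_run": "2024-01-15T00:00:00Z"}
first = run_incremental(conn, state)   # picks up orders 2 and 3
second = run_incremental(conn, state)  # nothing new since the first run
```

ISO 8601 timestamps in a single fixed format and time zone compare correctly as strings, which is why the sketch can store them as text.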
Incremental Strategies
By Timestamp
{
"incremental": {
"type": "timestamp",
"column": "updated_at",
"format": "iso8601"
}
}
By ID
{
"incremental": {
"type": "id",
"column": "id",
"last_value": 1000
}
}
By Partition
{
"incremental": {
"type": "partition",
"column": "date",
"partition_format": "yyyy-MM-dd"
}
}
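To make the difference between the three strategies concrete, here is a hedged sketch of how each "incremental" block might translate into a query filter. The function build_filter, the PARTITION_FORMATS mapping, and the run_date parameter are hypothetical names introduced for illustration, and the exact comparison the system applies (for example > versus >=) is an assumption.

```python
from datetime import date

# Assumed mapping from the Java-style pattern used in the config
# to the equivalent Python strftime pattern.
PARTITION_FORMATS = {"yyyy-MM-dd": "%Y-%m-%d", "yyyyMMdd": "%Y%m%d"}

def build_filter(incremental, last_value=None, run_date=None):
    """Return (sql_fragment, params) for one incremental strategy."""
    kind = incremental["type"]
    # The column name comes from trusted pipeline config, not user input,
    # so it is interpolated; values always go through bound parameters.
    column = incremental["column"]
    if kind == "timestamp":
        return f"{column} > ?", [last_value]
    if kind == "id":
        # Use the tracked value if there is one, else the configured start.
        start = last_value if last_value is not None else incremental.get("last_value", 0)
        return f"{column} > ?", [start]
    if kind == "partition":
        fmt = PARTITION_FORMATS[incremental["partition_format"]]
        return f"{column} = ?", [run_date.strftime(fmt)]
    raise ValueError(f"unknown incremental type: {kind}")

by_id = build_filter({"type": "id", "column": "id", "last_value": 1000})
by_ts = build_filter(
    {"type": "timestamp", "column": "updated_at", "format": "iso8601"},
    last_value="2024-01-15T00:00:00Z",
)
by_partition = build_filter(
    {"type": "partition", "column": "date", "partition_format": "yyyy-MM-dd"},
    run_date=date(2024, 1, 16),
)
```

Timestamp and ID strategies both read everything past a stored value, while the partition strategy selects one whole partition per run.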
Backfill
To process historical data, enable backfill and give the date range to reprocess:
{
"backfill": {
"enabled": true,
"start_date": "2024-01-01",
"end_date": "2024-01-15",
"batch_size": 7
}
}
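The configuration above does not state the unit of batch_size. Assuming it counts days per batch and that end_date is inclusive, the range would be split roughly as in this sketch; backfill_windows is a hypothetical helper, not part of the system.

```python
from datetime import date, timedelta

def backfill_windows(start_date, end_date, batch_size):
    """Split an inclusive date range into consecutive windows of batch_size days."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    windows = []
    while start <= end:
        # The last window is shortened so it never runs past end_date.
        stop = min(start + timedelta(days=batch_size - 1), end)
        windows.append((start.isoformat(), stop.isoformat()))
        start = stop + timedelta(days=1)
    return windows

windows = backfill_windows("2024-01-01", "2024-01-15", 7)
for window_start, window_end in windows:
    print(window_start, "to", window_end)
```

Under those assumptions the example range yields three batches: two full weeks and a final one-day batch.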
Best Practices
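- Index the incremental column so that each run's filter does not force a full table scan
- Use a data type that sorts correctly for the incremental column (timestamp preferred)
- Handle deletions with soft deletes or CDC, because a query on the incremental column cannot see rows that were physically removed
- Monitor stored state values so that a stalled or reset state_key is caught before it causes missed or duplicated rows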
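As an illustration of the indexing and soft-delete advice, the sketch below uses an in-memory SQLite database. The table layout, the deleted_at column, and the index name are assumptions chosen for the example. The key point is that a soft delete also bumps updated_at, so the deletion shows up in the next incremental read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL, deleted_at TEXT)"
)
# Index the incremental column so the range filter can use it.
conn.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)")
conn.executemany(
    "INSERT INTO orders (id, updated_at) VALUES (?, ?)",
    [(1, "2024-01-14T09:00:00Z"), (2, "2024-01-14T10:00:00Z")],
)

# Soft delete: mark the row instead of removing it, and bump updated_at
# so the change is visible to the next incremental run.
deleted_time = "2024-01-16T08:00:00Z"
conn.execute(
    "UPDATE orders SET deleted_at = ?, updated_at = ? WHERE id = ?",
    (deleted_time, deleted_time, 2),
)

# Incremental read since the last watermark, with a deletion flag.
changed = conn.execute(
    "SELECT id, deleted_at IS NOT NULL AS is_deleted "
    "FROM orders WHERE updated_at > ?",
    ("2024-01-15T00:00:00Z",),
).fetchall()
```

Here changed contains only order 2, flagged as deleted, so the downstream target can remove or mark its copy.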