ADR-002: Use DuckDB for Analytical Queries
Status
Accepted (April 2024)
Context
Our Gold layer contains 15+ fact tables and 6+ dimension tables in star schema format. Analysts need to run complex analytical queries for:
- Ad-hoc analysis: Exploratory queries during investigations
- Dashboard backends: Streamlit apps querying aggregated data
- Report generation: Scheduled reports with complex joins
- Data validation: Quality checks across dimension/fact relationships
Previous state:
- Analysts exported Parquet to Excel, limiting analysis to 1M rows
- Complex queries required writing to staging SQL Server tables (slow)
- No way to query across multiple Parquet files efficiently
Requirements:
- Query Parquet files directly without ETL to database
- Handle 10GB+ analytical queries in under 10 seconds
- Support standard SQL (analysts know SQL, not Python)
- Embedded solution (no server infrastructure)
- Low operational overhead
Decision
Use DuckDB as the analytical query engine for Gold layer consumption.
Key capabilities:
- Zero-copy Parquet reads: Queries Parquet without loading into memory
- Vectorized execution: Column-at-a-time processing, 20-100x faster than row-based
- Embedded: Single file, no server process, starts instantly
- Full SQL support: Window functions, CTEs, complex joins
- Python integration:
duckdb.sql() returns Polars/Pandas DataFrames
Benchmarks (Gold Layer Queries)
Tested on production Gold layer data:
| Query Type |
SQL Server |
DuckDB |
Improvement |
| Fact table scan (5M rows) |
45s |
2.1s |
21x faster |
| Star schema join (4 tables) |
38s |
1.8s |
21x faster |
| Window function (ranking) |
22s |
0.9s |
24x faster |
| GroupBy + Having |
15s |
0.6s |
25x faster |
| Memory (same query) |
8GB |
1.5GB |
5x less |
Note: SQL Server comparison is staging database, not production OLTP.
Consequences
Positive
- Analysts unblocked: Can query entire Gold layer from laptops
- No infrastructure: Zero server costs, zero maintenance
- Performance: P95 query time under 3 seconds
- Portability: Database file can be shared/copied
Negative
- Concurrent writes: Limited support (fine for read-heavy analytics)
- No user management: Single-user model (security via file permissions)
- Learning curve: Team needed to understand embedded DB paradigm
Neutral
- Still maintain SQL Server for OLTP workloads (different use case)
- DuckDB file regenerated daily during Gold layer build
Alternatives Considered
1. PostgreSQL / SQL Server (Copy Data)
- ✅ Standard RDBMS, familiar tooling
- ❌ ETL overhead: Must copy Parquet → DB tables daily
- ❌ Storage duplication: Pay for data twice
- ❌ Latency: Write time delays query availability
- ❌ Cost: Server licenses/hosting
Verdict: Too much operational overhead for analytics-only workload
2. Apache Spark SQL
- ✅ Handles massive datasets
- ❌ Startup latency: 30+ seconds before first query
- ❌ Cluster requirement: Need infrastructure
- ❌ Overkill: Our Gold layer is 10-20GB
Verdict: Right tool for petabyte scale, not our scale
3. Presto / Trino
- ✅ Federated query engine
- ❌ Server requirement: Coordinator + workers
- ❌ Complexity: Kubernetes or dedicated infra
- ❌ Overkill: Single data source (Parquet files)
Verdict: Federated querying not needed; we control all sources
4. ClickHouse
- ✅ Blazing fast analytics
- ❌ Server process: Must run and maintain
- ❌ Data ingestion: Need to load data into CH format
- ❌ Operational overhead: Backups, monitoring, upgrades
Verdict: Great for production analytics services, heavy for local dev
5. Direct Parquet with Polars
- ✅ Already using Polars
- ❌ SQL familiarity: Analysts prefer SQL over DataFrame API
- ❌ Complex joins: Star schema joins cleaner in SQL
Verdict: Polars for ETL, DuckDB for analytics—different tools for different jobs
Implementation Pattern
# Pattern: DuckDB on Gold Layer Parquet
import duckdb
# Create connection to Gold layer
conn = duckdb.connect("lakehouse/gold/analytics.duckdb")
# Register Parquet files as views
conn.execute("""
CREATE OR REPLACE VIEW fact_inventory AS
SELECT * FROM read_parquet('lakehouse/gold/facts/fact_inventory/**/*.parquet')
""")
conn.execute("""
CREATE OR REPLACE VIEW dim_product AS
SELECT * FROM read_parquet('lakehouse/gold/dimensions/dim_product/*.parquet')
""")
# Analyst-friendly SQL query
result = conn.execute("""
SELECT
p.category,
p.brand,
SUM(i.quantity_on_hand) as total_stock,
SUM(i.quantity_on_hand * i.unit_cost) as stock_value
FROM fact_inventory i
JOIN dim_product p ON i.product_key = p.product_key
WHERE i.date_key >= 20240101
GROUP BY p.category, p.brand
ORDER BY stock_value DESC
LIMIT 100
""").fetchdf() # Returns Pandas DataFrame
# Or return Polars
result_polars = conn.execute("...").pl()
Cost Analysis
| Item |
SQL Server (Alternative) |
DuckDB (Chosen) |
| Server license |
$800/month |
$0 |
| Storage |
$50/month |
$0 (uses existing Parquet) |
| Maintenance |
4 hrs/month |
0 hrs/month |
| Query performance |
Baseline |
20x faster |
Annual savings: ~$10,200 + analyst productivity gains
References
Revision History
| Date |
Change |
Author |
| 2024-04-10 |
Initial decision |
Godson K. |
| 2024-08-15 |
Added cost analysis |
Godson K. |