
ADR-002: Use DuckDB for Analytical Queries

Status

Accepted (April 2024)

Context

Our Gold layer contains 15+ fact tables and 6+ dimension tables in a star schema. Analysts need to run complex analytical queries for:

  1. Ad-hoc analysis: Exploratory queries during investigations
  2. Dashboard backends: Streamlit apps querying aggregated data
  3. Report generation: Scheduled reports with complex joins
  4. Data validation: Quality checks across dimension/fact relationships

Previous state:

Requirements:

Decision

Use DuckDB as the analytical query engine for Gold layer consumption.

Key capabilities:

Benchmarks (Gold Layer Queries)

Tested on production Gold layer data:

| Query Type | SQL Server | DuckDB | Improvement |
|---|---|---|---|
| Fact table scan (5M rows) | 45s | 2.1s | 21x faster |
| Star schema join (4 tables) | 38s | 1.8s | 21x faster |
| Window function (ranking) | 22s | 0.9s | 24x faster |
| GroupBy + Having | 15s | 0.6s | 25x faster |
| Memory (same query) | 8 GB | 1.5 GB | 5x less |

Note: the SQL Server numbers were measured against the staging database, not the production OLTP system.

Consequences

Positive

Negative

Neutral

Alternatives Considered

1. PostgreSQL / SQL Server (Copy Data)

Verdict: Too much operational overhead for analytics-only workload

2. Apache Spark SQL

Verdict: Right tool for petabyte scale, not our scale

3. Presto / Trino

Verdict: Federated querying not needed; we control all sources

4. ClickHouse

Verdict: Great for production analytics services, heavy for local dev

5. Direct Parquet with Polars

Verdict: Polars for ETL, DuckDB for analytics—different tools for different jobs

Implementation Pattern

```python
# Pattern: DuckDB on Gold layer Parquet

import duckdb

# Create a persistent connection for the Gold layer
conn = duckdb.connect("lakehouse/gold/analytics.duckdb")

# Register Parquet files as views
conn.execute("""
    CREATE OR REPLACE VIEW fact_inventory AS
    SELECT * FROM read_parquet('lakehouse/gold/facts/fact_inventory/**/*.parquet')
""")

conn.execute("""
    CREATE OR REPLACE VIEW dim_product AS
    SELECT * FROM read_parquet('lakehouse/gold/dimensions/dim_product/*.parquet')
""")

# Analyst-friendly SQL query
result = conn.execute("""
    SELECT
        p.category,
        p.brand,
        SUM(i.quantity_on_hand) AS total_stock,
        SUM(i.quantity_on_hand * i.unit_cost) AS stock_value
    FROM fact_inventory i
    JOIN dim_product p ON i.product_key = p.product_key
    WHERE i.date_key >= 20240101
    GROUP BY p.category, p.brand
    ORDER BY stock_value DESC
    LIMIT 100
""").fetchdf()  # Returns a pandas DataFrame

# Or return a Polars DataFrame
result_polars = conn.execute("...").pl()
```

Cost Analysis

| Item | SQL Server (Alternative) | DuckDB (Chosen) |
|---|---|---|
| Server license | $800/month | $0 |
| Storage | $50/month | $0 (uses existing Parquet) |
| Maintenance | 4 hrs/month | 0 hrs/month |
| Query performance | Baseline | 20x faster |

Annual savings: ~$10,200 + analyst productivity gains

References

Revision History

| Date | Change | Author |
|---|---|---|
| 2024-04-10 | Initial decision | Godson K. |
| 2024-08-15 | Added cost analysis | Godson K. |