data-engineering-portfolio

ADR-001: Choose Polars over Pandas for ETL Processing

Status

Accepted (March 2024)

Context

Our ETL pipelines process 10,000+ SKUs daily across multiple source systems (ERP, CRM, WMS, OBI). The existing Pandas-based pipelines had several issues:

Memory pressure: DataFrames exceeding 500MB caused OOM errors on 16GB workstations
Single-threaded bottleneck: GIL prevented parallel processing within transformations
Slow aggregations: GroupBy operations on 2M+ rows took 15+ seconds
CSV intermediate files: Pandas I/O forced us to use CSV, adding overhead

Requirements:

Process 2-5GB daily data within 30-minute batch window
Run on single-machine infrastructure (no cluster budget approved)
Maintain SQL-like expressiveness for analyst handoff
First-class Parquet support (our target storage format)

Decision

Replace Pandas with Polars for all new ETL development. Migrate critical existing pipelines.

Polars advantages for our use case:

Lazy evaluation: Build query plan, execute once (60% memory reduction)
Multi-threaded by default: Utilizes all CPU cores without GIL constraints
Native Parquet: scan_parquet() with predicate pushdown
Expressive API: Method chaining feels like SQL
Rust backend: Compiled performance, not interpreted Python

Benchmarks (Our Data)

Tested on actual production datasets:

Operation	Pandas	Polars	Improvement
Read CSV (1.2GB)	52s	9s	5.8x faster
GroupBy + Agg	18s	1.4s	12.9x faster
Join (2 tables, 1M rows each)	11s	1.1s	10x faster
Filter + Select	3.2s	0.2s	16x faster
Memory (same operation)	4.8GB	1.2GB	4x less

Consequences

Positive

Batch processing now completes in 30 minutes (was 4 hours)
No more OOM errors on analyst workstations
Code is more readable (.pipe() chains vs. nested assignments)
Parquet read/write is seamless

Negative

Learning curve: Team needed 2 weeks to adjust to new API
Ecosystem gaps: Some niche libraries assume Pandas (workaround: .to_pandas())
Stack Overflow scarcity: Fewer community answers than Pandas (improving rapidly)

Neutral

Need to maintain Pandas interop for legacy dashboards (trivial with .to_pandas())

Alternatives Considered

1. Apache Spark (PySpark)

✅ Distributed processing capability
❌ Overkill: Our data fits on one machine; cluster overhead not justified
❌ Latency: Spark startup time (~30s) exceeds some of our pipeline runtimes
❌ Cost: Would require EMR/Databricks spend we haven’t budgeted

Verdict: Keep Spark as migration path if we hit 50GB+ daily volume

2. Dask

✅ Pandas-compatible API
❌ Slower than Polars in our benchmarks (7x slower on groupby)
❌ Less mature lazy evaluation
❌ Debugging complexity with task graphs

Verdict: Polars outperformed on every metric we tested

3. Vaex

✅ Memory-mapped for huge files
❌ API inconsistencies between versions
❌ Development has slowed (fewer releases)
❌ Community smaller than Polars

Verdict: Polars has more momentum and better ergonomics

4. Stick with Pandas (optimize)

✅ No migration cost
❌ Fundamental GIL limitation can’t be optimized away
❌ Memory model requires full DataFrame in RAM

Verdict: Optimizing Pandas yielded 20% improvement; Polars gave 500%+

Implementation Notes

# Migration pattern: Pandas → Polars

# BEFORE (Pandas)
df = pd.read_csv("inventory.csv")
result = df.groupby("location")["quantity"].sum()

# AFTER (Polars - Lazy)
result = (
    pl.scan_csv("inventory.csv")
    .group_by("location")
    .agg(pl.col("quantity").sum())
    .collect()
)

# For Parquet (our actual approach)
result = (
    pl.scan_parquet("lakehouse/bronze/inventory/**/*.parquet")
    .filter(pl.col("extract_date") >= "2024-01-01")
    .group_by("location")
    .agg(pl.col("quantity").sum().alias("total_qty"))
    .collect()
)

References

Revision History

Date	Change	Author
2024-03-15	Initial decision	Godson K.
2024-06-01	Added 3-month retrospective metrics	Godson K.