data-engineering-portfolio

ADR-001: Choose Polars over Pandas for ETL Processing

Status

Accepted (March 2024)

Context

Our ETL pipelines process 10,000+ SKUs daily across multiple source systems (ERP, CRM, WMS, OBI). The existing Pandas-based pipelines had several issues:

  1. Memory pressure: DataFrames exceeding 500MB caused OOM errors on 16GB workstations
  2. Single-threaded bottleneck: GIL prevented parallel processing within transformations
  3. Slow aggregations: GroupBy operations on 2M+ rows took 15+ seconds
  4. CSV intermediate files: Pandas I/O forced us to use CSV, adding overhead

Requirements:

Decision

Replace Pandas with Polars for all new ETL development. Migrate critical existing pipelines.

Polars advantages for our use case:

Benchmarks (Our Data)

Tested on actual production datasets:

Operation Pandas Polars Improvement
Read CSV (1.2GB) 52s 9s 5.8x faster
GroupBy + Agg 18s 1.4s 12.9x faster
Join (2 tables, 1M rows each) 11s 1.1s 10x faster
Filter + Select 3.2s 0.2s 16x faster
Memory (same operation) 4.8GB 1.2GB 4x less

Consequences

Positive

Negative

Neutral

Alternatives Considered

1. Apache Spark (PySpark)

Verdict: Keep Spark as migration path if we hit 50GB+ daily volume

2. Dask

Verdict: Polars outperformed on every metric we tested

3. Vaex

Verdict: Polars has more momentum and better ergonomics

4. Stick with Pandas (optimize)

Verdict: Optimizing Pandas yielded 20% improvement; Polars gave 500%+

Implementation Notes

# Migration pattern: Pandas → Polars

# BEFORE (Pandas)
df = pd.read_csv("inventory.csv")
result = df.groupby("location")["quantity"].sum()

# AFTER (Polars - Lazy)
result = (
    pl.scan_csv("inventory.csv")
    .group_by("location")
    .agg(pl.col("quantity").sum())
    .collect()
)

# For Parquet (our actual approach)
result = (
    pl.scan_parquet("lakehouse/bronze/inventory/**/*.parquet")
    .filter(pl.col("extract_date") >= "2024-01-01")
    .group_by("location")
    .agg(pl.col("quantity").sum().alias("total_qty"))
    .collect()
)

References

Revision History

Date Change Author
2024-03-15 Initial decision Godson K.
2024-06-01 Added 3-month retrospective metrics Godson K.