Production-grade data lakehouse powering supply chain operations at scale
A Medallion architecture data lakehouse processing 850,000+ records daily from 4 source systems (ERP, CRM, WMS, OBI). Reduced data latency from 48 hours to 2-4 hours, enabling same-day supply chain decisions for 10,000+ SKUs.
| Metric | Before | After | Business Impact |
|---|---|---|---|
| Processing Time | 4 hours | 30 minutes | Team can iterate faster on reports |
| Data Freshness | 48 hours | 2-4 hours | Same-day inventory decisions |
| Pipeline Reliability | ~70% | 95%+ | Fewer morning firefights |
| Analyst Time on Wrangling | 60% | 15% | 45% more time for actual analysis |
| Data Quality Issues | Discovered in reports | 500+ caught at source | No more “bad data” escalations |
The supply chain operations team at a major retail company made decisions on stale data:
“I don’t know what inventory I have until it’s already a problem.”
— Warehouse Manager
“I spend 60% of my time cleaning data, not analyzing it.”
— Business Analyst
Why Medallion over alternatives?
| Alternative | Why Not |
|---|---|
| Single-hop ETL | No recovery from source issues; can’t debug transformations |
| Lambda Architecture | Overkill; we don’t have real-time requirements |
| Data Vault | Modeling complexity not justified for our scale |
| Delta Lake | Requires Spark; our volume fits single-machine processing |
Decision documented: ADR-003: Medallion Architecture
| Component | Technology | Why |
|---|---|---|
| Processing | Polars | 5-10x faster than Pandas (ADR-001) |
| Analytics | DuckDB | Zero-copy Parquet reads (ADR-002) |
| Storage | Parquet | Columnar, 80% smaller than CSV |
| Extraction | Abstract Base Classes | Reusable patterns for 50+ pipelines |
Configuration-driven pipelines — Adding a new data source is YAML, not code:
pipeline:
name: new_source_daily
source:
type: database
connection: ${NEW_SOURCE_CONN}
query: "SELECT * FROM table WHERE modified > :last_run"
destination:
layer: bronze
partition_by: [extract_date]
3-tier data quality — Catch issues early, not in reports:
Tier 1: Schema Validation → Pipeline BLOCKS on failure
Tier 2: Business Rules → Flags issues, continues pipeline
Tier 3: Statistical Anomalies → Alerts team, logs for review
| Outcome | Measurement |
|---|---|
| Same-day reporting enabled | Previously next-day |
| Analyst productivity | +45% time on insights vs. cleaning |
| Data incidents | 70% reduction in “bad data” escalations |
| Quality automation | 500+ anomalies auto-detected in 12 months |
| Reusable framework | 5 additional pipelines built from template |
“I can finally trust the numbers in my reports.”
— Finance Analyst
“Same-day visibility changed how we manage replenishment.”
— Supply Chain Manager
| Metric | Value |
|---|---|
| Pipeline uptime | 98.7% (12-month average) |
| P95 query latency | < 3 seconds |
| Storage efficiency | 70% savings (Parquet vs CSV) |
| Test coverage | 87% |
Over-engineering initially: Started with Airflow + Spark. Killed 2 weeks before realizing Task Scheduler + Polars was simpler and sufficient.
Building in isolation: Spent 3 weeks on features nobody asked for. Started weekly demos after that.
No monitoring at first: Silent failures went unnoticed for days. Added observability from day 1 on subsequent projects.
A Medallion architecture data lakehouse designed for enterprise retail supply chain operations. This platform processes data from multiple source systems (ERP, CRM, WMS, OBI) and delivers analytics-ready datasets for demand forecasting, inventory optimization, and operational reporting.
| Metric | Before | After | Improvement |
|---|---|---|---|
| ETL Processing Time | 4 hours | 30 minutes | 87% ⬇️ |
| Data Freshness | 48 hours | 2-4 hours | 90% ⬇️ |
| Pipeline Reliability | ~70% | 95%+ | 25% ⬆️ |
| Data Quality Issues | Unknown | 500+ caught | Automated |
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│ ERP │ CRM │ WMS │ OBI │ External APIs │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴──────────┬──────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ EXTRACTION LAYER │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ DB Conn │ │ API │ │ RPA Bot │ │ File │ │
│ │ Extractor│ │ Extractor│ │ Extractor│ │ Extractor│ │
│ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │
└────────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 🥉 BRONZE LAYER │
│ │
│ • Raw data landing zone │
│ • Immutable historical record │
│ • Hive-partitioned Parquet (by extract_date) │
│ • Full audit trails │
│ │
│ 📁 /lakehouse/bronze/{source}/{table}/extract_date={YYYY-MM-DD}/ │
└────────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 🥈 SILVER LAYER │
│ │
│ • Cleaned & validated data │
│ • Schema enforcement │
│ • Deduplication & null handling │
│ • 3-tier anomaly detection │
│ │
│ 📁 /lakehouse/silver/{domain}/{entity}/ │
└────────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ 🥇 GOLD LAYER │
│ │
│ • Business-ready star schema │
│ • 15+ fact tables, 6+ dimension tables │
│ • Optimized for analytical queries │
│ • Powers ML models & dashboards │
│ │
│ 📁 /lakehouse/gold/facts/ & /lakehouse/gold/dimensions/ │
└────────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONSUMPTION LAYER │
├─────────────────┬─────────────────┬─────────────────┬───────────────────────┤
│ Streamlit │ Power BI │ ML Models │ Ad-hoc Analysis │
│ Dashboards │ Reports │ (Forecasting) │ (DuckDB) │
└─────────────────┴─────────────────┴─────────────────┴───────────────────────┘
enterprise-data-platform/
├── README.md # This file
├── architecture/
│ ├── medallion-architecture.md # Detailed layer specifications
│ ├── data-flow.md # End-to-end data flow documentation
│ └── system-integration-diagram.png
├── etl-framework/
│ ├── __init__.py
│ ├── config/
│ │ ├── pipeline_config.yaml # Pipeline definitions
│ │ └── source_config.yaml # Source system configs
│ ├── extractors/
│ │ ├── base_extractor.py # Abstract base class
│ │ ├── database_extractor.py # SQL Server, Oracle
│ │ ├── api_extractor.py # REST APIs
│ │ └── rpa_extractor.py # Selenium/PyAutoGUI bots
│ ├── transformers/
│ │ ├── base_transformer.py # Abstract base class
│ │ ├── cleaner.py # Data cleaning operations
│ │ ├── validator.py # Schema validation
│ │ └── enricher.py # Data enrichment
│ └── loaders/
│ ├── base_loader.py # Abstract base class
│ └── parquet_loader.py # Hive-partitioned Parquet
├── orchestration/
│ ├── task-scheduler-configs/ # Windows Task Scheduler XMLs
│ └── batch-scripts/ # Orchestration batch files
├── data-quality/
│ ├── validation-rules/ # Business rule definitions
│ └── anomaly-detection/ # 3-tier detection system
└── docs/
├── performance-benchmarks.md # Speed comparisons
├── reliability-metrics.md # SLA tracking
└── deployment-guide.md # Production setup
# Python 3.10+
python --version
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Verify installation
python -c "import polars; import duckdb; print('Ready!')"
# Execute a single pipeline
python -m etl_framework.run --pipeline inventory_daily
# Execute all pipelines for a source
python -m etl_framework.run --source erp
# Dry run (no writes)
python -m etl_framework.run --pipeline inventory_daily --dry-run
| Operation | Pandas | Polars | Speedup |
|---|---|---|---|
| CSV Read (1GB) | 45s | 8s | 5.6x |
| GroupBy Agg | 12s | 1.2s | 10x |
| Join (2 tables) | 8s | 0.9s | 8.9x |
| Memory Usage | 4GB | 1.1GB | 3.6x |
# New data source? Just add config:
pipelines:
- name: new_source_daily
source:
type: database
connection: ${NEW_SOURCE_CONN}
query: "SELECT * FROM table"
destination:
layer: bronze
partition_by: [extract_date]
schedule: "0 6 * * *"
| Table | Grain | Key Metrics |
|---|---|---|
fact_sales |
Transaction | Revenue, Quantity, Discount |
fact_inventory |
SKU × Location × Day | On-hand, In-transit, Reserved |
fact_orders |
Order Line | Order qty, Fulfilled qty, Lead time |
fact_replenishment |
SKU × Day | Reorder point, Safety stock, EOQ |
fact_forecast |
SKU × Week | Predicted demand, Confidence |
| Table | Attributes |
|---|---|
dim_product |
SKU, Category, Brand, Supplier |
dim_location |
Store, Warehouse, Region, Country |
dim_time |
Date, Week, Month, Quarter, Year |
dim_supplier |
Supplier, Lead time, MOQ |
dim_customer |
Customer segment, Loyalty tier |
dim_channel |
Online, Retail, Wholesale |
Tier 1: VALIDATION
├── Schema conformance
├── Required fields
├── Data type enforcement
└── Referential integrity
Tier 2: OUTLIER DETECTION
├── Statistical bounds (IQR, Z-score)
├── Historical range checks
└── Velocity checks (rate of change)
Tier 3: BUSINESS RULES
├── Domain-specific validations
├── Cross-field consistency
└── Temporal logic checks
| Pipeline | Records/Day | Avg Runtime | P95 Runtime |
|---|---|---|---|
| Inventory Sync | 2.5M | 12 min | 18 min |
| Sales Transactions | 500K | 8 min | 12 min |
| Replenishment Calc | 10K SKUs | 25 min | 35 min |
| Forecast Generation | 10K SKUs | 45 min | 60 min |
| Category | Technology |
|---|---|
| Language | Python 3.10+ |
| Data Processing | Polars, DuckDB |
| Storage Format | Parquet (Hive-partitioned) |
| Databases | SQL Server, Oracle |
| Automation | Selenium, PyAutoGUI |
| Visualization | Streamlit, Power BI |
| Orchestration | Windows Task Scheduler |
| Version Control | Git |
MIT License - See LICENSE for details.
Godson Kurishinkal
Data Engineer | Dubai, UAE