Medallion Data Lakehouse
Building a Bronze/Silver/Gold architecture that transformed supply chain data operations from next-day to same-day reporting.
The Challenge
The supply chain operations team was drowning in data chaos. Multiple source systems—ERP, CRM, WMS, and Oracle BI (OBI)—each operated as an isolated silo with inconsistent data formats and no single source of truth.
- Data extraction took 4+ hours daily, delaying critical business decisions
- Reports were always 24-48 hours stale, making same-day decisions impossible
- Manual data reconciliation consumed 40% of analyst time
- No lineage tracking—when issues arose, root cause analysis was guesswork
- Dimensional data quality issues caused inventory miscounts and stockouts
The Solution
I designed and implemented a Medallion architecture data lakehouse—a three-tier system that progressively refines raw data into analytics-ready assets.
Bronze Layer: Raw data landing zone. All source systems dump data here in native formats with full audit trails. Nothing is transformed—this is our immutable historical record.
Silver Layer: Cleaned, validated, and standardized data. Schema enforcement, deduplication, null handling, and data type casting happen here. This is where a 3-tier anomaly detection system has caught 500+ dimension issues.
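The three tiers of Silver-layer validation can be sketched as successive passes over the same batch of rows. This is an illustrative minimal version in plain Python (the function name, tier labels, and check logic are assumptions, not the production system, which runs on Polars):

```python
def validate_silver(rows, required_cols, key_col, valid_keys):
    """Run three tiers of checks and collect anomalies as (kind, row_index, column) tuples."""
    anomalies = []

    # Tier 1: schema/null checks — every required column must be populated
    for i, row in enumerate(rows):
        for col in required_cols:
            if row.get(col) is None:
                anomalies.append(("null", i, col))

    # Tier 2: duplicate business keys — the same key must not appear twice
    seen = set()
    for i, row in enumerate(rows):
        key = row.get(key_col)
        if key in seen:
            anomalies.append(("duplicate", i, key_col))
        seen.add(key)

    # Tier 3: referential integrity — keys must exist in the dimension table
    for i, row in enumerate(rows):
        if row.get(key_col) not in valid_keys:
            anomalies.append(("orphan", i, key_col))

    return anomalies
```

Rows that trip any tier are quarantined for review rather than promoted to Gold, which is what keeps downstream fact tables trustworthy.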
Gold Layer: Business-ready star schema dimensional models. 15+ fact tables and 6+ dimension tables optimized for analytical queries. This powers all downstream reporting and ML models.
Technical Implementation
The architecture leverages configuration-driven ETL pipelines with Abstract Base Class patterns for maximum reusability. Each pipeline inherits from a base extractor/transformer/loader class, ensuring consistent behavior across all 50+ pipelines.
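The base-class contract might look like the minimal sketch below; the class and method names are illustrative, not the production code. The key point is that `run()` fixes the extract-transform-load order once, so every subclass only fills in the three steps:

```python
from abc import ABC, abstractmethod

class Pipeline(ABC):
    """Base class every pipeline inherits; run() fixes the E-T-L order."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def extract(self): ...

    @abstractmethod
    def transform(self, raw): ...

    @abstractmethod
    def load(self, clean): ...

    def run(self):
        # Template method: subclasses never reimplement the orchestration
        return self.load(self.transform(self.extract()))

class InventoryPipeline(Pipeline):
    """Toy subclass standing in for a real source-specific pipeline."""

    def extract(self):
        return [{"sku": "A", "qty": 3}]  # placeholder for a real source query

    def transform(self, raw):
        return [r for r in raw if r["qty"] is not None]

    def load(self, clean):
        return len(clean)  # placeholder for a real Parquet write
```

With this template-method structure, adding pipeline number 51 means writing three small methods, and logging, retries, and audit trails added to the base class apply everywhere at once.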
```yaml
# Configuration-driven ETL pipeline example
pipeline:
  name: "inventory_daily"
  source:
    type: "oracle_obi"
    query: "SELECT * FROM inventory_snapshot"
  destination:
    layer: "bronze"
    format: "parquet"
    partition_by: ["extract_date"]
  schedule: "0 6 * * *"  # 6 AM daily
```
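At runtime, a parsed config of that shape is dispatched to the matching extractor. The sketch below works on a plain dict mirroring the YAML (in practice it would come from `yaml.safe_load`); the registry contents and the `run_from_config` helper are hypothetical:

```python
# Registry mapping source types to extractor callables (placeholder lambda
# stands in for a real Oracle OBI client)
EXTRACTORS = {
    "oracle_obi": lambda query: f"rows from OBI for: {query}",
}

def run_from_config(cfg):
    """Look up the extractor named in the config and build the destination path."""
    source = cfg["source"]
    extract = EXTRACTORS[source["type"]]
    data = extract(source["query"])
    dest = cfg["destination"]
    target = f"{dest['layer']}/{cfg['name']}.{dest['format']}"
    return target, data

# Dict form of the YAML config above
cfg = {
    "name": "inventory_daily",
    "source": {"type": "oracle_obi", "query": "SELECT * FROM inventory_snapshot"},
    "destination": {"layer": "bronze", "format": "parquet"},
}
```

This is what makes onboarding a new source cheap: register one extractor, write one YAML file, and no pipeline code changes.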
Key architectural decisions:
- Hive-partitioned Parquet files for efficient time-based queries
- Polars for 5-10x faster transformations vs. pandas
- DuckDB for ad-hoc analytical queries on local lakehouse
- Windows Task Scheduler for orchestration (legacy system constraint)
- Selenium/PyAutoGUI RPA bots for systems without API access
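Hive partitioning is just a directory convention: each partition key becomes a `key=value` folder, which lets engines like DuckDB and Polars skip whole directories when a query filters on that key. A stdlib sketch of the path layout (the helper name and file naming are assumptions):

```python
from datetime import date
from pathlib import PurePosixPath

def bronze_partition_path(root, table, extract_date):
    """Build a Hive-style key=value partition path for one daily extract."""
    return (
        PurePosixPath(root)
        / table
        / f"extract_date={extract_date.isoformat()}"  # partition directory
        / "part-0000.parquet"
    )

p = bronze_partition_path("lake/bronze", "inventory_daily", date(2024, 1, 15))
# A query filtered to one extract_date only touches that one directory,
# which is what makes time-based queries on the lakehouse fast.
```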
Tech Stack
- Python (Polars, DuckDB, Selenium, PyAutoGUI)
- Hive-partitioned Parquet storage
- Oracle OBI, ERP, CRM, and WMS source systems
- Windows Task Scheduler for orchestration
Results & Impact
The new data platform transformed how the supply chain team operates:
- Same-day reporting enabled—decisions that previously waited 24-48 hours now happen within hours
- Analyst time reclaimed—40% of time previously spent on data wrangling now goes to actual analysis
- 500+ dimension anomalies identified and fixed through automated quality checks
- 70% reduction in data-related incidents after implementing the 3-tier validation system
- Foundation for ML—the clean Gold layer now powers demand forecasting and replenishment models
Key Learnings
Configuration Over Code
Pipeline behavior defined in YAML means new data sources can be onboarded in hours, not days.
Immutable Bronze Layer
Never transform raw data. Having the original source enables debugging and reprocessing.
Fail Fast, Fail Loud
Aggressive validation at Silver layer catches issues before they corrupt Gold tables.
RPA as Last Resort
When APIs don't exist, Selenium bots work. They're fragile but better than manual exports.
Interested in Similar Solutions?
I help organizations build reliable data infrastructure that scales.