Medallion Data Lakehouse
Building a Bronze/Silver/Gold architecture that transformed supply chain data operations from next-day to same-day reporting.
The Challenge
The supply chain operations team was drowning in data chaos. Multiple source systems—ERP, CRM, WMS, and Oracle BI (OBI)—each operated as an isolated silo with inconsistent data formats and no single source of truth.
- Data extraction took 4+ hours daily, delaying critical business decisions
- Reports were always 24-48 hours stale, making same-day decisions impossible
- Manual data reconciliation consumed 40% of analyst time
- No lineage tracking—when issues arose, root cause analysis was guesswork
- Dimensional data quality issues caused inventory miscounts and stockouts
The Solution
I designed and implemented a Medallion architecture data lakehouse—a three-tier system that progressively refines raw data into analytics-ready assets.
Bronze Layer: Raw data landing zone. All source systems dump data here in native formats with full audit trails. Nothing is transformed—this is our immutable historical record.
Silver Layer: Cleaned, validated, and standardized data. Schema enforcement, deduplication, null handling, and data type casting happen here. This is where a 3-tier anomaly detection system has caught 500+ dimension issues.
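The three tiers of Silver-layer validation can be sketched as successive passes over the same batch of rows. This is an illustrative minimal version in plain Python (the function name, tier labels, and check logic are assumptions, not the production system, which runs on Polars):

```python
def validate_silver(rows, required_cols, key_col, valid_keys):
    """Run three tiers of checks and collect anomalies as (kind, row_index, column) tuples."""
    anomalies = []

    # Tier 1: schema/null checks — every required column must be populated
    for i, row in enumerate(rows):
        for col in required_cols:
            if row.get(col) is None:
                anomalies.append(("null", i, col))

    # Tier 2: duplicate business keys — the same key must not appear twice
    seen = set()
    for i, row in enumerate(rows):
        key = row.get(key_col)
        if key in seen:
            anomalies.append(("duplicate", i, key_col))
        seen.add(key)

    # Tier 3: referential integrity — keys must exist in the dimension table
    for i, row in enumerate(rows):
        if row.get(key_col) not in valid_keys:
            anomalies.append(("orphan", i, key_col))

    return anomalies
```

Rows that trip any tier are quarantined for review rather than promoted to Gold, which is what keeps downstream fact tables trustworthy.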
Gold Layer: Business-ready star schema dimensional models. 15+ fact tables and 6+ dimension tables optimized for analytical queries. This powers all downstream reporting and ML models.
Technical Implementation
The architecture leverages configuration-driven ETL pipelines with Abstract Base Class patterns for maximum reusability. Each pipeline inherits from a base extractor/transformer/loader class, ensuring consistent behavior across all 50+ pipelines.
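The base-class contract might look like the minimal sketch below; the class and method names are illustrative, not the production code. The key point is that `run()` fixes the extract-transform-load order once, so every subclass only fills in the three steps:

```python
from abc import ABC, abstractmethod

class Pipeline(ABC):
    """Base class every pipeline inherits; run() fixes the E-T-L order."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def extract(self): ...

    @abstractmethod
    def transform(self, raw): ...

    @abstractmethod
    def load(self, clean): ...

    def run(self):
        # Template method: subclasses never reimplement the orchestration
        return self.load(self.transform(self.extract()))

class InventoryPipeline(Pipeline):
    """Toy subclass standing in for a real source-specific pipeline."""

    def extract(self):
        return [{"sku": "A", "qty": 3}]  # placeholder for a real source query

    def transform(self, raw):
        return [r for r in raw if r["qty"] is not None]

    def load(self, clean):
        return len(clean)  # placeholder for a real Parquet write
```

With this template-method structure, adding pipeline number 51 means writing three small methods, and logging, retries, and audit trails added to the base class apply everywhere at once.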
```yaml
# Configuration-driven ETL pipeline example
pipeline:
  name: "inventory_daily"
  source:
    type: "oracle_obi"
    query: "SELECT * FROM inventory_snapshot"
  destination:
    layer: "bronze"
    format: "parquet"
    partition_by: ["extract_date"]
  schedule: "0 6 * * *"  # 6 AM daily
```
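At runtime, a parsed config of that shape is dispatched to the matching extractor. The sketch below works on a plain dict mirroring the YAML (in practice it would come from `yaml.safe_load`); the registry contents and the `run_from_config` helper are hypothetical:

```python
# Registry mapping source types to extractor callables (placeholder lambda
# stands in for a real Oracle OBI client)
EXTRACTORS = {
    "oracle_obi": lambda query: f"rows from OBI for: {query}",
}

def run_from_config(cfg):
    """Look up the extractor named in the config and build the destination path."""
    source = cfg["source"]
    extract = EXTRACTORS[source["type"]]
    data = extract(source["query"])
    dest = cfg["destination"]
    target = f"{dest['layer']}/{cfg['name']}.{dest['format']}"
    return target, data

# Dict form of the YAML config above
cfg = {
    "name": "inventory_daily",
    "source": {"type": "oracle_obi", "query": "SELECT * FROM inventory_snapshot"},
    "destination": {"layer": "bronze", "format": "parquet"},
}
```

This is what makes onboarding a new source cheap: register one extractor, write one YAML file, and no pipeline code changes.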
Key architectural decisions:
- Hive-partitioned Parquet files for efficient time-based queries
- Polars for 5-10x faster transformations vs. pandas
- DuckDB for ad-hoc analytical queries on local lakehouse
- Windows Task Scheduler for orchestration (legacy system constraint)
- Selenium/PyAutoGUI RPA bots for systems without API access
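Hive partitioning is just a directory convention: each partition key becomes a `key=value` folder, which lets engines like DuckDB and Polars skip whole directories when a query filters on that key. A stdlib sketch of the path layout (the helper name and file naming are assumptions):

```python
from datetime import date
from pathlib import PurePosixPath

def bronze_partition_path(root, table, extract_date):
    """Build a Hive-style key=value partition path for one daily extract."""
    return (
        PurePosixPath(root)
        / table
        / f"extract_date={extract_date.isoformat()}"  # partition directory
        / "part-0000.parquet"
    )

p = bronze_partition_path("lake/bronze", "inventory_daily", date(2024, 1, 15))
# A query filtered to one extract_date only touches that one directory,
# which is what makes time-based queries on the lakehouse fast.
```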
Tech Stack
- Python (Polars, DuckDB, Selenium, PyAutoGUI)
- Hive-partitioned Parquet storage
- Oracle OBI, ERP, CRM, and WMS source systems
- Windows Task Scheduler for orchestration
Results & Impact
The new data platform transformed how the supply chain team operates:
- Same-day reporting enabled—decisions that previously waited 24-48 hours now happen within hours
- Analyst time reclaimed—40% of time previously spent on data wrangling now goes to actual analysis
- 500+ dimension anomalies identified and fixed through automated quality checks
- 70% reduction in data-related incidents after implementing the 3-tier validation system
- Foundation for ML—the clean Gold layer now powers demand forecasting and replenishment models
Key Learnings
Configuration Over Code
Pipeline behavior defined in YAML means new data sources can be onboarded in hours, not days.
Immutable Bronze Layer
Never transform raw data. Having the original source enables debugging and reprocessing.
Fail Fast, Fail Loud
Aggressive validation at Silver layer catches issues before they corrupt Gold tables.
RPA as Last Resort
When APIs don't exist, Selenium bots work. They're fragile but better than manual exports.
Interested in Similar Solutions?
I help organizations build reliable data infrastructure that scales.