DATA PLATFORM

Medallion Data Lakehouse

Building a Bronze/Silver/Gold architecture that transformed supply chain data operations from next-day to same-day reporting.

Role: Data Engineer
Industry: Retail / Supply Chain
Timeline: 6 months
Data Scale: 10,000+ SKUs
87% Time Reduction (4 hours → 30 minutes)
90% Freshness Improvement (48 hours → 2-4 hours)
95%+ Pipeline Reliability (across 50+ pipelines)

The Challenge

The supply chain operations team was drowning in data chaos. Multiple source systems—ERP, CRM, WMS, and OBI—each operated as an isolated silo, with inconsistent data formats and no single source of truth.

The Solution

I designed and implemented a Medallion architecture data lakehouse—a three-tier system that progressively refines raw data into analytics-ready assets.

🥉 BRONZE Raw ingestion
🥈 SILVER Cleaned & validated
🥇 GOLD Business-ready

Bronze Layer: Raw data landing zone. All source systems dump data here in native formats with full audit trails. Nothing is transformed—this is our immutable historical record.
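A minimal sketch of the bronze landing step, assuming plain dict records for illustration (the function and column names like `land_bronze` and `_source_system` are my own, not from the project): raw values are written untouched, and only audit metadata is appended.

```python
from datetime import datetime, timezone

def land_bronze(records, source_system):
    """Append audit metadata to raw records without transforming any source fields."""
    extracted_at = datetime.now(timezone.utc).isoformat()
    return [
        {**row, "_source_system": source_system, "_extracted_at": extracted_at}
        for row in records
    ]

raw = [{"sku": "A-100", "qty": "7"}]  # qty stays a raw string at this layer
landed = land_bronze(raw, "oracle_obi")
```

Keeping source values unparsed at this stage is what makes the layer safe to reprocess later.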

Silver Layer: Cleaned, validated, and standardized data. Schema enforcement, deduplication, null handling, and data type casting happen here. This is where the 3-tier anomaly detection system catches 500+ dimension issues.
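The silver steps above (deduplication, null handling, type casting) can be sketched over plain dicts; the real pipelines use Polars, and `refine_silver` is an illustrative name of my own.

```python
def refine_silver(rows):
    """Deduplicate on sku, drop rows with missing values, cast qty to int."""
    seen, out = set(), []
    for row in rows:
        sku = row.get("sku")
        if sku is None or row.get("qty") is None:
            continue  # null handling: drop incomplete rows
        if sku in seen:
            continue  # deduplication: keep the first occurrence
        seen.add(sku)
        out.append({"sku": sku, "qty": int(row["qty"])})  # type casting
    return out

clean = refine_silver([
    {"sku": "A-100", "qty": "7"},
    {"sku": "A-100", "qty": "7"},   # duplicate
    {"sku": "B-200", "qty": None},  # missing value
])
```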

Gold Layer: Business-ready star schema dimensional models. 15+ fact tables and 6+ dimension tables optimized for analytical queries. This powers all downstream reporting and ML models.
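A sketch of the kind of analytical query the gold star schema serves, using stdlib `sqlite3` as a stand-in for DuckDB; the table and column names are hypothetical, not the project's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, sku TEXT, category TEXT);
    CREATE TABLE fact_inventory (product_key INTEGER, qty_on_hand INTEGER);
    INSERT INTO dim_product VALUES (1, 'A-100', 'widgets'), (2, 'B-200', 'gadgets');
    INSERT INTO fact_inventory VALUES (1, 7), (2, 3), (1, 5);
""")
# Typical gold-layer pattern: a fact table joined to a dimension, then aggregated
rows = con.execute("""
    SELECT d.category, SUM(f.qty_on_hand) AS total_qty
    FROM fact_inventory f
    JOIN dim_product d USING (product_key)
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

Because facts carry only keys and measures, every report joins through the same conformed dimensions.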

Technical Implementation

The architecture leverages configuration-driven ETL pipelines with Abstract Base Class patterns for maximum reusability. Each pipeline inherits from a base extractor/transformer/loader class, ensuring consistent behavior across all 50+ pipelines.

pipeline_config.yaml
# Configuration-driven ETL pipeline example
pipeline:
  name: "inventory_daily"
  source:
    type: "oracle_obi"
    query: "SELECT * FROM inventory_snapshot"
  destination:
    layer: "bronze"
    format: "parquet"
    partition_by: ["extract_date"]
  schedule: "0 6 * * *"  # 6 AM daily
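The Abstract Base Class pattern described above can be sketched as a template-method base class; the class and method names here are illustrative, and the extract step is stubbed in place of a real Oracle query.

```python
from abc import ABC, abstractmethod

class BasePipeline(ABC):
    """Template-method base: subclasses supply extract/transform, run() is shared."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def extract(self): ...

    @abstractmethod
    def transform(self, rows): ...

    def load(self, rows):
        # Shared loader: the real system writes Parquet to the configured layer
        return {"layer": self.config["destination"]["layer"], "rows": rows}

    def run(self):
        return self.load(self.transform(self.extract()))

class InventoryDaily(BasePipeline):
    def extract(self):
        return [{"sku": "A-100", "qty": "7"}]  # stand-in for the source query

    def transform(self, rows):
        return [{**r, "qty": int(r["qty"])} for r in rows]

result = InventoryDaily({"destination": {"layer": "silver"}}).run()
```

Because `run()` lives in the base class, every pipeline shares one orchestration path, which is what keeps behavior consistent across dozens of pipelines.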


Tech Stack

Python · Polars · DuckDB · Parquet · SQL Server · Selenium · PyAutoGUI · Streamlit

Results & Impact

The new data platform transformed how the supply chain team operates, cutting report turnaround from 4 hours to 30 minutes and improving data freshness from 48 hours to 2-4 hours.

Key Learnings

Configuration Over Code

Pipeline behavior defined in YAML means new data sources can be onboarded in hours, not days.
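One way this can work, sketched with a registry and a JSON config as a stand-in for the YAML file (the registry mechanism and names are my own illustration): a new source needs only a config entry, not new orchestration code.

```python
import json

PIPELINES = {}  # registry: config "name" -> pipeline callable

def register(name):
    """Decorator that maps a config name to its pipeline function."""
    def wrap(fn):
        PIPELINES[name] = fn
        return fn
    return wrap

@register("inventory_daily")
def inventory_daily(source):
    return f"extracting from {source['type']}"

# Onboarding a source means adding configuration, not writing a new pipeline
config = json.loads('{"name": "inventory_daily", "source": {"type": "oracle_obi"}}')
message = PIPELINES[config["name"]](config["source"])
```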

Immutable Bronze Layer

Never transform raw data. Having the original source enables debugging and reprocessing.

Fail Fast, Fail Loud

Aggressive validation at Silver layer catches issues before they corrupt Gold tables.
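A sketch of what a fail-fast gate can look like, assuming a hypothetical required-column schema: the check raises on the first violation instead of letting a bad row reach the gold tables.

```python
REQUIRED = {"sku": str, "qty": int}  # illustrative schema, not the project's

def enforce_schema(rows):
    """Raise immediately on the first schema violation instead of loading bad data."""
    for i, row in enumerate(rows):
        for col, typ in REQUIRED.items():
            if not isinstance(row.get(col), typ):
                raise ValueError(f"row {i}: column {col!r} failed {typ.__name__} check")
    return rows

try:
    enforce_schema([{"sku": "A-100", "qty": "7"}])  # qty is a str, not an int
except ValueError as err:
    caught = str(err)
```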

RPA as Last Resort

When APIs don't exist, Selenium bots work. They're fragile but better than manual exports.
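One way to make fragile browser automation tolerable is a retry wrapper; this sketch uses a stub in place of the Selenium-driven download, and the wrapper itself is my own illustration rather than the project's code.

```python
import time

def with_retries(fn, attempts=3, delay=0.0):
    """Re-run a flaky export step a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts:
                raise
            time.sleep(delay)  # real bots back off longer between attempts

calls = {"n": 0}

def flaky_export():  # stand-in for a Selenium-driven download
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("stale element")
    return "report.csv"

result = with_retries(flaky_export)
```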

Interested in Similar Solutions?

I help organizations build reliable data infrastructure that scales.