data-engineering-portfolio

Enterprise Data Platform

Production-grade data lakehouse powering supply chain operations at scale

Python Polars DuckDB License Tests Coverage


Executive Summary

A Medallion architecture data lakehouse processing 850,000+ records daily from 4 source systems (ERP, CRM, WMS, OBI). Reduced data latency from 48 hours to 2-4 hours, enabling same-day supply chain decisions for 10,000+ SKUs.

Impact at a Glance

Metric Before After Business Impact
Processing Time 4 hours 30 minutes Team can iterate faster on reports
Data Freshness 48 hours 2-4 hours Same-day inventory decisions
Pipeline Reliability ~70% 95%+ Fewer morning firefights
Analyst Time on Wrangling 60% 15% 45% more time for actual analysis
Data Quality Issues Discovered in reports 500+ caught at source No more “bad data” escalations

The Problem

Business Context

The supply chain operations team at a major retail company made decisions on stale data:

“I don’t know what inventory I have until it’s already a problem.”
— Warehouse Manager

“I spend 60% of my time cleaning data, not analyzing it.”
— Business Analyst

Technical Pain Points

  1. Fragmented data: 4 source systems (ERP, CRM, WMS, OBI) with no integration
  2. Manual exports: Daily 4-hour Excel extraction ritual (fragile, error-prone)
  3. No historical tracking: Only point-in-time snapshots, no trend analysis
  4. Quality discovered late: Issues found in reports, not at data ingestion
  5. Single point of failure: One person knew the Excel macros

What Was at Stake


The Solution

Architecture Choice: Medallion (Bronze/Silver/Gold)

Why Medallion over alternatives?

Alternative Why Not
Single-hop ETL No recovery from source issues; can’t debug transformations
Lambda Architecture Overkill; we don’t have real-time requirements
Data Vault Modeling complexity not justified for our scale
Delta Lake Requires Spark; our volume fits single-machine processing

Decision documented: ADR-003: Medallion Architecture

Technology Choices

Component Technology Why
Processing Polars 5-10x faster than Pandas (ADR-001)
Analytics DuckDB Zero-copy Parquet reads (ADR-002)
Storage Parquet Columnar, 80% smaller than CSV
Extraction Abstract Base Classes Reusable patterns for 50+ pipelines

Key Implementation Details

Configuration-driven pipelines — Adding a new data source is YAML, not code:

pipeline:
  name: new_source_daily
  source:
    type: database
    connection: ${NEW_SOURCE_CONN}
    query: "SELECT * FROM table WHERE modified > :last_run"
  destination:
    layer: bronze
    partition_by: [extract_date]

3-tier data quality — Catch issues early, not in reports:

Tier 1: Schema Validation    → Pipeline BLOCKS on failure
Tier 2: Business Rules       → Flags issues, continues pipeline
Tier 3: Statistical Anomalies → Alerts team, logs for review

Results & Impact

Quantified Business Outcomes

Outcome Measurement
Same-day reporting enabled Previously next-day
Analyst productivity +45% time on insights vs. cleaning
Data incidents 70% reduction in “bad data” escalations
Quality automation 500+ anomalies auto-detected in 12 months
Reusable framework 5 additional pipelines built from template

Stakeholder Feedback

“I can finally trust the numbers in my reports.”
— Finance Analyst

“Same-day visibility changed how we manage replenishment.”
— Supply Chain Manager

Technical Metrics

Metric Value
Pipeline uptime 98.7% (12-month average)
P95 query latency < 3 seconds
Storage efficiency 70% savings (Parquet vs CSV)
Test coverage 87%

What I Learned (Failures Included)

What Didn’t Work

  1. Over-engineering initially: Started with Airflow + Spark. Killed 2 weeks before realizing Task Scheduler + Polars was simpler and sufficient.

  2. Building in isolation: Spent 3 weeks on features nobody asked for. Started weekly demos after that.

  3. No monitoring at first: Silent failures went unnoticed for days. Added observability from day 1 on subsequent projects.

What I’d Do Differently


🎯 Overview

A Medallion architecture data lakehouse designed for enterprise retail supply chain operations. This platform processes data from multiple source systems (ERP, CRM, WMS, OBI) and delivers analytics-ready datasets for demand forecasting, inventory optimization, and operational reporting.

Key Metrics

Metric Before After Improvement
ETL Processing Time 4 hours 30 minutes 87% ⬇️
Data Freshness 48 hours 2-4 hours 90% ⬇️
Pipeline Reliability ~70% 95%+ 25% ⬆️
Data Quality Issues Unknown 500+ caught Automated

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           DATA SOURCES                                       │
├─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┤
│     ERP     │     CRM     │     WMS     │     OBI     │   External APIs     │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴──────────┬──────────┘
       │             │             │             │                 │
       ▼             ▼             ▼             ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        EXTRACTION LAYER                                      │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐                 │
│  │  DB Conn  │  │  API      │  │  RPA Bot  │  │  File     │                 │
│  │  Extractor│  │  Extractor│  │  Extractor│  │  Extractor│                 │
│  └───────────┘  └───────────┘  └───────────┘  └───────────┘                 │
└────────────────────────────────────┬────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         🥉 BRONZE LAYER                                      │
│                                                                              │
│  • Raw data landing zone                                                    │
│  • Immutable historical record                                              │
│  • Hive-partitioned Parquet (by extract_date)                              │
│  • Full audit trails                                                        │
│                                                                              │
│  📁 /lakehouse/bronze/{source}/{table}/extract_date={YYYY-MM-DD}/          │
└────────────────────────────────────┬────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         🥈 SILVER LAYER                                      │
│                                                                              │
│  • Cleaned & validated data                                                 │
│  • Schema enforcement                                                       │
│  • Deduplication & null handling                                           │
│  • 3-tier anomaly detection                                                │
│                                                                              │
│  📁 /lakehouse/silver/{domain}/{entity}/                                    │
└────────────────────────────────────┬────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         🥇 GOLD LAYER                                        │
│                                                                              │
│  • Business-ready star schema                                               │
│  • 15+ fact tables, 6+ dimension tables                                    │
│  • Optimized for analytical queries                                         │
│  • Powers ML models & dashboards                                           │
│                                                                              │
│  📁 /lakehouse/gold/facts/ & /lakehouse/gold/dimensions/                   │
└────────────────────────────────────┬────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        CONSUMPTION LAYER                                     │
├─────────────────┬─────────────────┬─────────────────┬───────────────────────┤
│   Streamlit     │   Power BI      │   ML Models     │   Ad-hoc Analysis     │
│   Dashboards    │   Reports       │   (Forecasting) │   (DuckDB)            │
└─────────────────┴─────────────────┴─────────────────┴───────────────────────┘

📁 Project Structure

enterprise-data-platform/
├── README.md                    # This file
├── architecture/
│   ├── medallion-architecture.md    # Detailed layer specifications
│   ├── data-flow.md                 # End-to-end data flow documentation
│   └── system-integration-diagram.png
├── etl-framework/
│   ├── __init__.py
│   ├── config/
│   │   ├── pipeline_config.yaml     # Pipeline definitions
│   │   └── source_config.yaml       # Source system configs
│   ├── extractors/
│   │   ├── base_extractor.py        # Abstract base class
│   │   ├── database_extractor.py    # SQL Server, Oracle
│   │   ├── api_extractor.py         # REST APIs
│   │   └── rpa_extractor.py         # Selenium/PyAutoGUI bots
│   ├── transformers/
│   │   ├── base_transformer.py      # Abstract base class
│   │   ├── cleaner.py               # Data cleaning operations
│   │   ├── validator.py             # Schema validation
│   │   └── enricher.py              # Data enrichment
│   └── loaders/
│       ├── base_loader.py           # Abstract base class
│       └── parquet_loader.py        # Hive-partitioned Parquet
├── orchestration/
│   ├── task-scheduler-configs/      # Windows Task Scheduler XMLs
│   └── batch-scripts/               # Orchestration batch files
├── data-quality/
│   ├── validation-rules/            # Business rule definitions
│   └── anomaly-detection/           # 3-tier detection system
└── docs/
    ├── performance-benchmarks.md    # Speed comparisons
    ├── reliability-metrics.md       # SLA tracking
    └── deployment-guide.md          # Production setup

🚀 Quick Start

Prerequisites

# Python 3.10+
python --version

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate     # Windows

Installation

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import polars; import duckdb; print('Ready!')"

Run a Pipeline

# Execute a single pipeline
python -m etl_framework.run --pipeline inventory_daily

# Execute all pipelines for a source
python -m etl_framework.run --source erp

# Dry run (no writes)
python -m etl_framework.run --pipeline inventory_daily --dry-run

💡 Key Design Decisions

Why Polars over Pandas?

Operation Pandas Polars Speedup
CSV Read (1GB) 45s 8s 5.6x
GroupBy Agg 12s 1.2s 10x
Join (2 tables) 8s 0.9s 8.9x
Memory Usage 4GB 1.1GB 3.6x

Why Parquet with Hive Partitioning?

Why Configuration-Driven Pipelines?

# New data source? Just add config:
pipelines:
  - name: new_source_daily
    source:
      type: database
      connection: ${NEW_SOURCE_CONN}
      query: "SELECT * FROM table"
    destination:
      layer: bronze
      partition_by: [extract_date]
    schedule: "0 6 * * *"

📊 Data Model

Fact Tables (Gold Layer)

Table Grain Key Metrics
fact_sales Transaction Revenue, Quantity, Discount
fact_inventory SKU × Location × Day On-hand, In-transit, Reserved
fact_orders Order Line Order qty, Fulfilled qty, Lead time
fact_replenishment SKU × Day Reorder point, Safety stock, EOQ
fact_forecast SKU × Week Predicted demand, Confidence

Dimension Tables

Table Attributes
dim_product SKU, Category, Brand, Supplier
dim_location Store, Warehouse, Region, Country
dim_time Date, Week, Month, Quarter, Year
dim_supplier Supplier, Lead time, MOQ
dim_customer Customer segment, Loyalty tier
dim_channel Online, Retail, Wholesale

🔒 Data Quality

3-Tier Anomaly Detection

Tier 1: VALIDATION
├── Schema conformance
├── Required fields
├── Data type enforcement
└── Referential integrity

Tier 2: OUTLIER DETECTION
├── Statistical bounds (IQR, Z-score)
├── Historical range checks
└── Velocity checks (rate of change)

Tier 3: BUSINESS RULES
├── Domain-specific validations
├── Cross-field consistency
└── Temporal logic checks

Results


📈 Performance Benchmarks

Pipeline Records/Day Avg Runtime P95 Runtime
Inventory Sync 2.5M 12 min 18 min
Sales Transactions 500K 8 min 12 min
Replenishment Calc 10K SKUs 25 min 35 min
Forecast Generation 10K SKUs 45 min 60 min

🛠️ Tech Stack

Category Technology
Language Python 3.10+
Data Processing Polars, DuckDB
Storage Format Parquet (Hive-partitioned)
Databases SQL Server, Oracle
Automation Selenium, PyAutoGUI
Visualization Streamlit, Power BI
Orchestration Windows Task Scheduler
Version Control Git

📄 License

MIT License - See LICENSE for details.


👤 Author

Godson Kurishinkal
Data Engineer | Dubai, UAE