ADR-003: Adopt Medallion (Bronze/Silver/Gold) Architecture

Status

Accepted (February 2024)

Context

We needed an architecture for organizing data as it flows from source systems to analytics consumption. The existing state was:

  1. No separation of concerns: Raw and transformed data mixed in the same directories
  2. No data lineage: Impossible to trace issues back to source
  3. Reprocessing nightmares: Schema changes required re-extracting from source
  4. Quality issues discovered late: Bad data reached reports before detection

Requirements:

Decision

Adopt the Medallion Architecture with three distinct layers:

Bronze (Raw)     →    Silver (Validated)    →    Gold (Business-Ready)
   ↓                       ↓                          ↓
Immutable raw        Cleaned, deduplicated      Star schema models
Full audit trail     Schema enforced            Optimized for queries
Source format        Standard types             Dimensional models

Layer Specifications

Bronze Layer

Added Metadata:

_source_system: str       # e.g., "erp", "wms"
_extract_timestamp: datetime
_extract_date: str        # Partition key
_row_hash: str            # MD5 for deduplication
_file_name: str           # Source file if applicable

Silver Layer

Quality Gates:

  1. Schema validation (required columns, data types)
  2. Null checks (configurable by column)
  3. Referential integrity (foreign key validation)
  4. Business rules (range checks, pattern matching)

Gold Layer

Consequences

Positive

Negative

Trade-offs We Accept

Alternatives Considered

1. Lambda Architecture (Batch + Streaming)

Verdict: Keep streaming as future enhancement, not initial architecture

2. Single-Hop ETL (Direct to Star Schema)

Verdict: Short-term savings, long-term pain. Rejected.

3. Data Vault 2.0

Verdict: Right for enterprise MDM, wrong for our use case

4. Lakehouse (Delta Lake / Iceberg)

Verdict: Adopt the Medallion pattern without the Delta/Iceberg dependency; a table format can be layered in later.

Implementation Notes

Directory Structure

/lakehouse/
├── bronze/
│   ├── erp/
│   │   ├── inventory/
│   │   │   ├── extract_date=2024-01-15/
│   │   │   │   └── data.parquet
│   │   │   └── extract_date=2024-01-16/
│   │   │       └── data.parquet
│   │   └── sales/
│   │       └── ...
│   ├── wms/
│   └── crm/
├── silver/
│   ├── inventory/
│   │   ├── current_stock/
│   │   └── stock_movements/
│   └── sales/
│       └── transactions/
└── gold/
    ├── facts/
    │   ├── fact_inventory/
    │   ├── fact_sales/
    │   └── fact_orders/
    └── dimensions/
        ├── dim_product/
        ├── dim_location/
        └── dim_time/

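Because the layout is fully deterministic, Bronze partition paths can be derived rather than configured. A minimal sketch (the helper name is an assumption):

```python
from pathlib import Path


def bronze_path(root: str, source_system: str, entity: str, extract_date: str) -> Path:
    """Build the Hive-style partition path for one Bronze extract."""
    return (
        Path(root)
        / "bronze"
        / source_system
        / entity
        / f"extract_date={extract_date}"  # Hive-style partition key
        / "data.parquet"
    )
```

Centralizing path construction keeps readers and writers agreeing on the partition scheme, and lets engines that understand Hive partitioning prune by `extract_date`.
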
Layer Transition Validation

import polars as pl


class SchemaViolation(Exception):
    """Raised when a Bronze frame does not satisfy the Silver schema."""


class IntegrityViolation(Exception):
    """Raised when a Silver frame references keys missing from a dimension."""


# Bronze → Silver: Schema validation
def validate_bronze_to_silver(bronze_df: pl.DataFrame, silver_schema: dict) -> bool:
    for col, expected_type in silver_schema.items():
        if col not in bronze_df.columns:
            raise SchemaViolation(f"Missing required column: {col}")
        if bronze_df.schema[col] != expected_type:
            raise SchemaViolation(
                f"Column {col}: expected {expected_type}, got {bronze_df.schema[col]}"
            )
    return True


# Silver → Gold: Referential integrity
# dim_keys maps each foreign-key column to the Series of valid keys in its dimension.
def validate_silver_to_gold(silver_df: pl.DataFrame, dim_keys: dict) -> bool:
    for fk_col, valid_keys in dim_keys.items():
        orphans = silver_df.filter(~pl.col(fk_col).is_in(valid_keys))
        if orphans.height > 0:
            raise IntegrityViolation(f"{orphans.height} orphan record(s) for {fk_col}")
    return True

References

Revision History

Date         Change                             Author
2024-02-15   Initial decision                   Godson K.
2024-05-01   Added storage retention policies   Godson K.