ADR-003: Adopt Medallion (Bronze/Silver/Gold) Architecture
Status
Accepted (February 2024)
Context
We needed an architecture for organizing data as it flows from source systems to analytics consumption. The existing state was:
- No separation of concerns: Raw and transformed data mixed in the same directories
- No data lineage: Impossible to trace issues back to source
- Reprocessing nightmares: Schema changes required re-extracting from source
- Quality issues discovered late: Bad data reached reports before detection
Requirements:
- Clear separation between raw inputs and processed outputs
- Ability to reprocess without re-extracting from source
- Incremental quality improvement at each stage
- Support for both batch and (future) streaming patterns
- Self-documenting structure
Decision
Adopt the Medallion Architecture with three distinct layers:
```
Bronze (Raw)        →   Silver (Validated)       →   Gold (Business-Ready)
Immutable raw           Cleaned, deduplicated        Star schema models
Full audit trail        Schema enforced              Optimized for queries
Source format           Standard types               Dimensional models
```
Layer Specifications
Bronze Layer
- Purpose: Raw data landing zone, immutable historical record
- Contents: Exact copy of source data with minimal additions
- Transformations: None (except adding metadata columns)
- Schema: Inferred from source (flexible)
- Retention: Indefinite (archive to cold storage after 90 days)
- Path Pattern: `/lakehouse/bronze/{source}/{table}/extract_date={YYYY-MM-DD}/`
Added Metadata:
```python
_source_system: str          # e.g., "erp", "wms"
_extract_timestamp: datetime
_extract_date: str           # Partition key
_row_hash: str               # MD5 hash, used for deduplication
_file_name: str              # Source file, if applicable
```
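As a minimal sketch of how Bronze ingestion could attach these columns (assuming polars; `land_to_bronze` and the pipe-joined hashing scheme are illustrative, not a fixed contract):

```python
import hashlib
from datetime import datetime, timezone

import polars as pl

def land_to_bronze(df: pl.DataFrame, source_system: str, file_name: str) -> pl.DataFrame:
    """Append the standard Bronze metadata columns; source columns stay untouched."""
    now = datetime.now(timezone.utc)
    # MD5 over the pipe-joined, stringified row values; feeds later deduplication
    row_hash = pl.concat_str(
        pl.all().cast(pl.Utf8).fill_null(""), separator="|"
    ).map_elements(lambda s: hashlib.md5(s.encode()).hexdigest(), return_dtype=pl.Utf8)
    return df.with_columns(
        _source_system=pl.lit(source_system),
        _extract_timestamp=pl.lit(now),
        _extract_date=pl.lit(now.strftime("%Y-%m-%d")),
        _row_hash=row_hash,
        _file_name=pl.lit(file_name),
    )
```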
Silver Layer
- Purpose: Validated, cleaned, standardized data
- Contents: Business entities with enforced schema
- Transformations: Type casting, null handling, deduplication, validation (sketched below, after the path pattern)
- Schema: Strictly enforced (fail on mismatch)
- Retention: Latest + 30 days history
- Path Pattern: `/lakehouse/silver/{domain}/{entity}/`
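A minimal sketch of those transformations, assuming polars and illustrative column names (`quantity`, `unit_price`, `product_code`); `_row_hash` is the Bronze metadata column defined above:

```python
import polars as pl

def bronze_to_silver(bronze_df: pl.DataFrame) -> pl.DataFrame:
    return (
        bronze_df
        # Type casting to the enforced Silver schema
        .with_columns(
            pl.col("quantity").cast(pl.Int64),
            pl.col("unit_price").cast(pl.Float64),
        )
        # Null handling: drop rows missing mandatory keys
        .drop_nulls(subset=["product_code"])
        # Deduplication on the Bronze row hash, keeping the first occurrence
        .unique(subset=["_row_hash"], keep="first")
    )
```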
Quality Gates:
- Schema validation (required columns, data types)
- Null checks (configurable by column)
- Referential integrity (foreign key validation)
- Business rules (range checks, pattern matching)
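As an illustration of how two of these gates might be implemented (the exception and function names are hypothetical):

```python
import polars as pl

class QualityGateError(Exception):
    pass

def check_nulls(df: pl.DataFrame, required: list[str]) -> None:
    # Null checks, configurable per column
    for col in required:
        n = df[col].null_count()
        if n > 0:
            raise QualityGateError(f"{n} nulls in required column {col}")

def check_range(df: pl.DataFrame, col: str, lo: float, hi: float) -> None:
    # Business rule: values must fall within [lo, hi]
    bad = df.filter((pl.col(col) < lo) | (pl.col(col) > hi))
    if len(bad) > 0:
        raise QualityGateError(f"{len(bad)} rows outside [{lo}, {hi}] for {col}")
```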
Gold Layer
- Purpose: Business-ready analytical models
- Contents: Dimensional models (facts + dimensions)
- Transformations: Joins, aggregations, surrogate keys, SCD handling (see the sketch below)
- Schema: Star schema (Kimball methodology)
- Retention: Latest + 90 days history
- Path Pattern: `/lakehouse/gold/{facts|dimensions}/{table}/`
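For illustration, a compact sketch of deriving a dimension with surrogate keys and a conformed fact from Silver (table and column names are hypothetical; SCD handling is omitted):

```python
import polars as pl

def build_dim_product(silver_products: pl.DataFrame) -> pl.DataFrame:
    # Assign a monotonically increasing surrogate key to each distinct product
    return (
        silver_products
        .unique(subset=["product_code"])
        .with_row_index(name="product_sk", offset=1)
    )

def build_fact_sales(silver_sales: pl.DataFrame, dim_product: pl.DataFrame) -> pl.DataFrame:
    # Swap the natural key for the dimension's surrogate key
    return (
        silver_sales
        .join(dim_product.select(["product_code", "product_sk"]),
              on="product_code", how="left")
        .drop("product_code")
    )
```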
Consequences
Positive
- Debuggability: Can trace any Gold issue back to Bronze
- Reprocessing: Schema change? Reprocess from Bronze, not source
- Quality visibility: Know exactly where quality issues originate
- Flexibility: Bronze schema can evolve without breaking downstream
- Self-service: Analysts can query whichever layer is appropriate to their needs
Negative
- Storage overhead: Data exists in 3 copies (mitigated by compression)
- Complexity: More pipelines to maintain than single-hop ETL
- Latency: 3 stages vs 1 adds processing time
Trade-offs We Accept
- ~3x storage cost for auditability (storage is cheap, data loss is expensive)
- Additional pipeline complexity for reduced debugging time
- Slightly higher latency for much higher reliability
Alternatives Considered
1. Lambda Architecture (Batch + Streaming)
- ✅ Real-time + batch capabilities
- ❌ Complexity: Two codepaths for same logic
- ❌ Consistency: Hard to keep batch and stream in sync
- ❌ Overkill: We don’t have real-time requirements yet
Verdict: Keep streaming as a future enhancement, not part of the initial architecture
2. Single-Hop ETL (Direct to Star Schema)
- ✅ Simpler pipeline
- ✅ Less storage
- ❌ No recovery: Source system issue = re-extract everything
- ❌ No debugging: Can’t see what data looked like before transformation
- ❌ Tight coupling: Schema changes break everything
Verdict: Short-term savings, long-term pain. Rejected.
3. Data Vault 2.0
- ✅ Excellent for enterprise-scale MDM
- ✅ Full historical tracking
- ❌ Complexity: Hubs, links, satellites add modeling overhead
- ❌ Query complexity: Analysts need joins for simple queries
- ❌ Overkill: We’re not doing MDM across hundreds of sources
Verdict: Right for enterprise MDM, wrong for our use case
4. Lakehouse (Delta Lake / Iceberg)
- ✅ ACID transactions on data lake
- ✅ Time travel built-in
- ❌ Infrastructure: Requires Spark/distributed compute
- ❌ Cost: Cluster costs exceed our budget
- ❌ Overkill: Our volume doesn’t need distributed processing
Verdict: Adopt the medallion pattern without Delta/Iceberg dependencies; they can be added later.
Implementation Notes
Directory Structure
```
/lakehouse/
├── bronze/
│   ├── erp/
│   │   ├── inventory/
│   │   │   ├── extract_date=2024-01-15/
│   │   │   │   └── data.parquet
│   │   │   └── extract_date=2024-01-16/
│   │   │       └── data.parquet
│   │   └── sales/
│   │       └── ...
│   ├── wms/
│   └── crm/
├── silver/
│   ├── inventory/
│   │   ├── current_stock/
│   │   └── stock_movements/
│   └── sales/
│       └── transactions/
└── gold/
    ├── facts/
    │   ├── fact_inventory/
    │   ├── fact_sales/
    │   └── fact_orders/
    └── dimensions/
        ├── dim_product/
        ├── dim_location/
        └── dim_time/
```
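Assuming polars and its Hive-partition support, a day-partitioned Bronze table can then be read back with a glob scan, for example:

```python
import polars as pl

# 'extract_date' is recovered from the Hive-style directory names
inventory = pl.scan_parquet(
    "/lakehouse/bronze/erp/inventory/extract_date=*/data.parquet",
    hive_partitioning=True,
).collect()
```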
Layer Transition Validation
```python
import polars as pl

class SchemaViolation(Exception): ...
class IntegrityViolation(Exception): ...

# Bronze → Silver: schema validation
def validate_bronze_to_silver(bronze_df: pl.DataFrame, silver_schema: dict) -> bool:
    for col, expected_type in silver_schema.items():
        if col not in bronze_df.columns:
            raise SchemaViolation(f"Missing required column: {col}")
        if bronze_df.schema[col] != expected_type:
            raise SchemaViolation(
                f"Type mismatch for {col}: expected {expected_type}, got {bronze_df.schema[col]}"
            )
    return True

# Silver → Gold: referential integrity
def validate_silver_to_gold(silver_df: pl.DataFrame, dim_tables: dict) -> bool:
    # dim_tables maps each foreign-key column to a Series of valid dimension keys
    for fk_col, dim_keys in dim_tables.items():
        orphans = silver_df.filter(~pl.col(fk_col).is_in(dim_keys))
        if len(orphans) > 0:
            raise IntegrityViolation(f"{len(orphans)} orphan records for {fk_col}")
    return True
```
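A brief usage sketch, with tiny in-memory frames standing in for real Silver and Gold tables:

```python
import polars as pl

bronze_df = pl.DataFrame({"product_code": ["A1", "B2"], "quantity": [3, 5]})
validate_bronze_to_silver(bronze_df, {"product_code": pl.Utf8, "quantity": pl.Int64})

dim_product = pl.DataFrame({"product_sk": [1, 2]})
fact_df = pl.DataFrame({"product_sk": [1, 2, 2]})
validate_silver_to_gold(fact_df, {"product_sk": dim_product["product_sk"]})
```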
Revision History
| Date | Change | Author |
|------|--------|--------|
| 2024-02-15 | Initial decision | Godson K. |
| 2024-05-01 | Added storage retention policies | Godson K. |