🔍 Data Quality

3-Tier Anomaly Detection System

A multi-layer statistical system that catches data quality issues at ingestion and during processing, and monitors long-term volatility, all before bad data can impact downstream analytics.

Role: Data Engineer
Industry: Retail Supply Chain
Duration: 3 months
Dimensions Monitored: 6+

500+ Anomalies Detected
70% Fewer Incidents
3 Detection Tiers
<5min Alert Latency

The Challenge

In supply chain analytics, data quality issues don't just cause wrong reports—they cause wrong decisions. A single corrupted inventory record can trigger unnecessary purchase orders worth thousands of dollars.

We were drowning in data quality fires.

The Solution

I designed a 3-tier anomaly detection system that operates at different stages of the data pipeline, catching issues at the right level with the right technique:

📥 Raw Data → 1️⃣ Validation → 2️⃣ Outlier Detection → 3️⃣ Volatility Monitoring → Clean Data

The 3 Tiers

Tier 1: Schema Validation

Gate: Bronze → Silver

Rule-based checks that ensure data conforms to expected schemas, constraints, and business rules before entering the Silver layer.

- Null checks on required fields
- Data type validation
- Referential integrity (FK → PK)
- Range constraints (qty > 0)
- Uniqueness validation
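In practice, several of these checks reduce to a boolean mask per row. A minimal sketch with pandas, using hypothetical column names (`sku`, `qty`) rather than the real schema:

```python
import pandas as pd

def validate_inventory(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows that pass all schema checks.

    Column names here (sku, qty) are illustrative, not the real schema.
    """
    return (
        df["sku"].notna()                      # null check on required field
        & (df["qty"] > 0)                      # range constraint: qty > 0
        & ~df["sku"].duplicated(keep="first")  # uniqueness validation
    )

df = pd.DataFrame({
    "sku": ["A1", "A2", None, "A1"],
    "qty": [10, -5, 3, 7],
})
mask = validate_inventory(df)
clean, rejected = df[mask], df[~mask]
```

Rows that fail any check are split off rather than silently dropped, which is what feeds the quarantine flow described later.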
Tier 2: Statistical Outlier Detection

Gate: Silver → Gold

Statistical methods that identify individual records that deviate significantly from expected distributions—catching data that's valid but suspicious.

- Z-Score (|z| > 3)
- Modified Z-Score (robust)
- IQR fencing
- MAD-based detection
- Percentile thresholds
Tier 3: Volatility Monitoring

Continuous: Gold Layer

Time-series analysis that detects aggregate-level anomalies, trend breaks, and gradual drift that individual outlier detection would miss.

- Rolling CV monitoring
- Trend break detection
- Seasonal deviation alerts
- Volume spike detection
- Week-over-week variance
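Rolling CV monitoring is the simplest of these to illustrate: track σ/μ over a trailing window and alert when it exceeds a limit. A sketch with pandas (the window size and `cv_limit` threshold are hypothetical, tuned per metric in practice):

```python
import pandas as pd

def rolling_cv_alerts(daily_totals: pd.Series, window: int = 7,
                      cv_limit: float = 0.5) -> pd.Series:
    """Flag days where the rolling coefficient of variation (sigma / mu)
    over the trailing window exceeds a limit."""
    rolling = daily_totals.rolling(window)
    cv = rolling.std() / rolling.mean()
    return cv > cv_limit

# Stable daily volume, then a sudden spike on the last day.
totals = pd.Series([100.0] * 9 + [500.0])
alerts = rolling_cv_alerts(totals)
```

The spike inflates the window's σ relative to its μ, so only the final day trips the alert; a level shift with low variance would instead show up in trend-break or week-over-week checks.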

Detection Methods

Z-Score

Measures standard deviations from mean. Best for normally distributed data.

z = (x - μ) / σ

Modified Z-Score

Uses median instead of mean. Robust to existing outliers.

M = 0.6745(x - x̃) / MAD

IQR Fencing

Quartile-based bounds. No distribution assumptions.

[Q1 - 1.5×IQR, Q3 + 1.5×IQR]

Rolling CV

Tracks coefficient of variation over time windows.

CV = σ / μ (rolling)

Grubbs Test

Statistical test for single outliers in univariate data.

G = max|x - x̄| / s
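The Grubbs statistic above needs a critical value to compare against; a common formulation derives it from the t-distribution. A self-contained sketch using SciPy (function name and default α are illustrative):

```python
import numpy as np
from scipy import stats

def grubbs_outlier(data: np.ndarray, alpha: float = 0.05) -> bool:
    """Two-sided Grubbs test: is the most extreme point an outlier?"""
    n = len(data)
    g = np.max(np.abs(data - data.mean())) / data.std(ddof=1)
    # Critical value from the t-distribution with n-2 degrees of freedom.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return bool(g > g_crit)
```

Note that Grubbs assumes approximate normality and tests only the single most extreme point, which is why it sits alongside, rather than replacing, the other methods.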

DBSCAN

Density-based clustering for multivariate anomalies.

ε-neighborhood density

Implementation

The anomaly detector is configurable per metric with multiple detection strategies:

anomaly_detector.py
from dataclasses import dataclass
from enum import Enum
import numpy as np

class DetectionMethod(Enum):
    ZSCORE = "zscore"
    MODIFIED_ZSCORE = "modified_zscore"
    IQR = "iqr"

@dataclass
class AnomalyConfig:
    method: DetectionMethod
    threshold: float = 3.0
    min_samples: int = 30

class AnomalyDetector:
    """Multi-method statistical anomaly detection."""
    
    def __init__(self, config: AnomalyConfig):
        self.config = config
        self._methods = {
            DetectionMethod.ZSCORE: self._zscore,
            DetectionMethod.MODIFIED_ZSCORE: self._modified_zscore,
            DetectionMethod.IQR: self._iqr_fence,
        }
    
    def detect(self, data: np.ndarray) -> np.ndarray:
        """Return boolean mask of anomalies."""
        if len(data) < self.config.min_samples:
            return np.zeros(len(data), dtype=bool)
        return self._methods[self.config.method](data)
    
    def _zscore(self, data: np.ndarray) -> np.ndarray:
        std = np.std(data)
        if std == 0:  # constant series: nothing deviates, avoid div-by-zero
            return np.zeros(len(data), dtype=bool)
        z = (data - np.mean(data)) / std
        return np.abs(z) > self.config.threshold
    
    def _modified_zscore(self, data: np.ndarray) -> np.ndarray:
        median = np.median(data)
        mad = np.median(np.abs(data - median))
        m = 0.6745 * (data - median) / (mad + 1e-10)
        return np.abs(m) > self.config.threshold
    
    def _iqr_fence(self, data: np.ndarray) -> np.ndarray:
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        return (data < q1 - 1.5*iqr) | (data > q3 + 1.5*iqr)
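Why keep both z-score variants? When several corrupted records land at once, they inflate σ enough to mask one another; the median/MAD-based modified z-score is unaffected. A synthetic illustration (the data is made up):

```python
import numpy as np

# ~13 units/day, plus three corrupted loads near 400.
data = np.array([12.0, 13, 14, 13, 12, 15, 13, 14, 12, 13,
                 14, 13, 12, 15, 13, 14, 13, 400, 410, 395])

# Plain z-score: the three outliers inflate sigma enough to mask themselves.
z = (data - data.mean()) / data.std()
plain_flags = np.where(np.abs(z) > 3)[0]        # nothing flagged

# Modified z-score: median and MAD ignore the corrupted points.
med = np.median(data)
mad = np.median(np.abs(data - med))
m = 0.6745 * (data - med) / mad
robust_flags = np.where(np.abs(m) > 3.5)[0]     # indices 17, 18, 19 flagged
```

This masking effect is a large part of why the tiered design outperforms any single method.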

Tech Stack

Python · NumPy · Pandas · SciPy · DuckDB · Great Expectations · Streamlit

Key Learnings

Layers Beat Single Methods

No single detection method catches everything. The 3-tier approach caught 3x more issues than Z-score detection alone.

Context Matters

A 50% sales spike on Black Friday is expected. Time-aware thresholds reduced false positives by 60%.
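One way to make thresholds time-aware is to score each day against a baseline for its own seasonality bucket (day of week, holiday calendar, etc.) instead of a global mean. A sketch, assuming a daily series with a DatetimeIndex and using day-of-week as the only bucket:

```python
import pandas as pd

def seasonal_zscore(daily: pd.Series) -> pd.Series:
    """Score each day against its own day-of-week baseline, so a
    routine weekend spike is not compared to a weekday mean."""
    groups = daily.groupby(daily.index.dayofweek)
    return (daily - groups.transform("mean")) / groups.transform("std")

# Illustrative data: weekdays ~100, weekends ~300, mild weekly noise.
idx = pd.date_range("2024-01-01", periods=56, freq="D")
values = [(300 if d >= 5 else 100) + (-1) ** (i // 7)
          for i, d in enumerate(idx.dayofweek)]
sales = pd.Series(values, index=idx, dtype=float)
sales.iloc[30] = 180          # inject one corrupted weekday record

z = seasonal_zscore(sales)
# The large (but routine) weekend values score ~0.9; only the injected
# record stands out against its weekday baseline.
```

A global z-score on this series would flag every weekend; grouping by day of week is what removes that whole class of false positives.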

Quarantine, Don't Delete

Anomalies go to a quarantine table for review—sometimes "anomalies" are real events worth investigating.

Alert Fatigue is Real

We started with 100+ daily alerts; tuning thresholds and adding severity levels cut that to ~10 actionable alerts per day.
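Severity levels need not be elaborate: a mapping from anomaly score to action is enough to separate "wake someone up" from "just record it". A sketch with made-up cutoffs:

```python
def severity(score: float) -> str:
    """Map an anomaly score to an alert tier (illustrative cutoffs)."""
    if score > 6:
        return "page"     # immediate notification
    if score > 4:
        return "ticket"   # triage next business day
    return "log"          # recorded, no notification sent
```

Only the top tier interrupts anyone, which is what makes the remaining alerts actionable rather than noise.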

Want to Build Better Data Quality?

Let's discuss how layered anomaly detection can save your analytics from bad data.