🔍 Data Quality

3-Tier Anomaly Detection System

A multi-layer statistical system that catches data quality issues at ingestion and during processing, and monitors long-term volatility, all before bad data can impact downstream analytics.

Role: Data Engineer
Industry: Retail Supply Chain
Duration: 3 months
Dimensions Monitored: 6+

500+ Anomalies Detected
70% Fewer Incidents
3 Detection Tiers
<5min Alert Latency

The Challenge

In supply chain analytics, data quality issues don't just cause wrong reports—they cause wrong decisions. A single corrupted inventory record can trigger unnecessary purchase orders worth thousands of dollars.

We were drowning in data quality fires.

The Solution

I designed a 3-tier anomaly detection system that operates at different stages of the data pipeline, catching issues at the right level with the right technique:

📥 Raw Data → 1️⃣ Validation → 2️⃣ Outlier Detection → 3️⃣ Volatility Monitoring → Clean Data

The 3 Tiers

Tier 1: Schema Validation

Gate: Bronze → Silver

Rule-based checks that ensure data conforms to expected schemas, constraints, and business rules before entering the Silver layer.

- Null checks on required fields
- Data type validation
- Referential integrity (FK → PK)
- Range constraints (qty > 0)
- Uniqueness validation
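In practice, several of these checks reduce to a boolean mask per row. A minimal sketch with pandas, using hypothetical column names (`sku`, `qty`) rather than the real schema:

```python
import pandas as pd

def validate_inventory(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows that pass all schema checks.

    Column names here (sku, qty) are illustrative, not the real schema.
    """
    return (
        df["sku"].notna()                      # null check on required field
        & (df["qty"] > 0)                      # range constraint: qty > 0
        & ~df["sku"].duplicated(keep="first")  # uniqueness validation
    )

df = pd.DataFrame({
    "sku": ["A1", "A2", None, "A1"],
    "qty": [10, -5, 3, 7],
})
mask = validate_inventory(df)
clean, rejected = df[mask], df[~mask]
```

Rows that fail any check are split off rather than silently dropped, which is what feeds the quarantine flow described later.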
Tier 2: Statistical Outlier Detection

Gate: Silver → Gold

Statistical methods that identify individual records that deviate significantly from expected distributions—catching data that's valid but suspicious.

- Z-Score (|z| > 3)
- Modified Z-Score (robust)
- IQR fencing
- MAD-based detection
- Percentile thresholds
Tier 3: Volatility Monitoring

Continuous: Gold Layer

Time-series analysis that detects aggregate-level anomalies, trend breaks, and gradual drift that individual outlier detection would miss.

- Rolling CV monitoring
- Trend break detection
- Seasonal deviation alerts
- Volume spike detection
- Week-over-week variance
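Rolling CV monitoring is the simplest of these to illustrate: track σ/μ over a trailing window and alert when it exceeds a limit. A sketch with pandas (the window size and `cv_limit` threshold are hypothetical, tuned per metric in practice):

```python
import pandas as pd

def rolling_cv_alerts(daily_totals: pd.Series, window: int = 7,
                      cv_limit: float = 0.5) -> pd.Series:
    """Flag days where the rolling coefficient of variation (sigma / mu)
    over the trailing window exceeds a limit."""
    rolling = daily_totals.rolling(window)
    cv = rolling.std() / rolling.mean()
    return cv > cv_limit

# Stable daily volume, then a sudden spike on the last day.
totals = pd.Series([100.0] * 9 + [500.0])
alerts = rolling_cv_alerts(totals)
```

The spike inflates the window's σ relative to its μ, so only the final day trips the alert; a level shift with low variance would instead show up in trend-break or week-over-week checks.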

Detection Methods

Z-Score

Measures standard deviations from mean. Best for normally distributed data.

z = (x - μ) / σ

Modified Z-Score

Uses median instead of mean. Robust to existing outliers.

M = 0.6745(x - x̃) / MAD

IQR Fencing

Quartile-based bounds. No distribution assumptions.

[Q1 - 1.5×IQR, Q3 + 1.5×IQR]

Rolling CV

Tracks coefficient of variation over time windows.

CV = σ / μ (rolling)

Grubbs Test

Statistical test for single outliers in univariate data.

G = max|x - x̄| / s
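The Grubbs statistic above needs a critical value to compare against; a common formulation derives it from the t-distribution. A self-contained sketch using SciPy (function name and default α are illustrative):

```python
import numpy as np
from scipy import stats

def grubbs_outlier(data: np.ndarray, alpha: float = 0.05) -> bool:
    """Two-sided Grubbs test: is the most extreme point an outlier?"""
    n = len(data)
    g = np.max(np.abs(data - data.mean())) / data.std(ddof=1)
    # Critical value from the t-distribution with n-2 degrees of freedom.
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t**2 / (n - 2 + t**2))
    return bool(g > g_crit)
```

Note that Grubbs assumes approximate normality and tests only the single most extreme point, which is why it sits alongside, rather than replacing, the other methods.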

DBSCAN

Density-based clustering for multivariate anomalies.

ε-neighborhood density

Implementation

The anomaly detector is configurable per metric with multiple detection strategies:

anomaly_detector.py
from dataclasses import dataclass
from enum import Enum
import numpy as np

class DetectionMethod(Enum):
    ZSCORE = "zscore"
    MODIFIED_ZSCORE = "modified_zscore"
    IQR = "iqr"

@dataclass
class AnomalyConfig:
    method: DetectionMethod
    threshold: float = 3.0
    min_samples: int = 30

class AnomalyDetector:
    """Multi-method statistical anomaly detection."""
    
    def __init__(self, config: AnomalyConfig):
        self.config = config
        self._methods = {
            DetectionMethod.ZSCORE: self._zscore,
            DetectionMethod.MODIFIED_ZSCORE: self._modified_zscore,
            DetectionMethod.IQR: self._iqr_fence,
        }
    
    def detect(self, data: np.ndarray) -> np.ndarray:
        """Return boolean mask of anomalies."""
        if len(data) < self.config.min_samples:
            return np.zeros(len(data), dtype=bool)
        return self._methods[self.config.method](data)
    
    def _zscore(self, data: np.ndarray) -> np.ndarray:
        std = np.std(data)
        if std == 0:  # constant series: nothing deviates, avoid div-by-zero
            return np.zeros(len(data), dtype=bool)
        z = (data - np.mean(data)) / std
        return np.abs(z) > self.config.threshold
    
    def _modified_zscore(self, data: np.ndarray) -> np.ndarray:
        median = np.median(data)
        mad = np.median(np.abs(data - median))
        m = 0.6745 * (data - median) / (mad + 1e-10)
        return np.abs(m) > self.config.threshold
    
    def _iqr_fence(self, data: np.ndarray) -> np.ndarray:
        q1, q3 = np.percentile(data, [25, 75])
        iqr = q3 - q1
        return (data < q1 - 1.5*iqr) | (data > q3 + 1.5*iqr)
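Why keep both z-score variants? When several corrupted records land at once, they inflate σ enough to mask one another; the median/MAD-based modified z-score is unaffected. A synthetic illustration (the data is made up):

```python
import numpy as np

# ~13 units/day, plus three corrupted loads near 400.
data = np.array([12.0, 13, 14, 13, 12, 15, 13, 14, 12, 13,
                 14, 13, 12, 15, 13, 14, 13, 400, 410, 395])

# Plain z-score: the three outliers inflate sigma enough to mask themselves.
z = (data - data.mean()) / data.std()
plain_flags = np.where(np.abs(z) > 3)[0]        # nothing flagged

# Modified z-score: median and MAD ignore the corrupted points.
med = np.median(data)
mad = np.median(np.abs(data - med))
m = 0.6745 * (data - med) / mad
robust_flags = np.where(np.abs(m) > 3.5)[0]     # indices 17, 18, 19 flagged
```

This masking effect is a large part of why the tiered design outperforms any single method.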

Tech Stack

Python · NumPy · Pandas · SciPy · DuckDB · Great Expectations · Streamlit

Key Learnings

Layers Beat Single Methods

No single detection method catches everything. The 3-tier approach caught 3x more issues than Z-score detection alone.

Context Matters

A 50% sales spike on Black Friday is expected. Time-aware thresholds reduced false positives by 60%.
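One way to make thresholds time-aware is to score each day against a baseline for its own seasonality bucket (day of week, holiday calendar, etc.) instead of a global mean. A sketch, assuming a daily series with a DatetimeIndex and using day-of-week as the only bucket:

```python
import pandas as pd

def seasonal_zscore(daily: pd.Series) -> pd.Series:
    """Score each day against its own day-of-week baseline, so a
    routine weekend spike is not compared to a weekday mean."""
    groups = daily.groupby(daily.index.dayofweek)
    return (daily - groups.transform("mean")) / groups.transform("std")

# Illustrative data: weekdays ~100, weekends ~300, mild weekly noise.
idx = pd.date_range("2024-01-01", periods=56, freq="D")
values = [(300 if d >= 5 else 100) + (-1) ** (i // 7)
          for i, d in enumerate(idx.dayofweek)]
sales = pd.Series(values, index=idx, dtype=float)
sales.iloc[30] = 180          # inject one corrupted weekday record

z = seasonal_zscore(sales)
# The large (but routine) weekend values score ~0.9; only the injected
# record stands out against its weekday baseline.
```

A global z-score on this series would flag every weekend; grouping by day of week is what removes that whole class of false positives.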

Quarantine, Don't Delete

Anomalies go to a quarantine table for review—sometimes "anomalies" are real events worth investigating.

Alert Fatigue is Real

We started with 100+ daily alerts; tuning thresholds and adding severity levels cut that to ~10 actionable alerts per day.
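Severity levels need not be elaborate: a mapping from anomaly score to action is enough to separate "wake someone up" from "just record it". A sketch with made-up cutoffs:

```python
def severity(score: float) -> str:
    """Map an anomaly score to an alert tier (illustrative cutoffs)."""
    if score > 6:
        return "page"     # immediate notification
    if score > 4:
        return "ticket"   # triage next business day
    return "log"          # recorded, no notification sent
```

Only the top tier interrupts anyone, which is what makes the remaining alerts actionable rather than noise.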

Want to Build Better Data Quality?

Let's discuss how layered anomaly detection can save your analytics from bad data.