Model Watchtower

ML Eng · MLOps Observability
Example · demo data, not live
monitor  prod-fleet · 23 models
Fleet Vitals · flagship: churn_v3
Models in Production
23
22 Healthy · 1 Watch
Prediction Drift · PSI
0.07
thr 0.20 ✓ well under
Feature Drift
2 / 84
drifting non-critical
Model Quality · ROC-AUC
0.913
baseline 0.908 · F1 0.88
Inference Volume / Latency
1.42M /24h
p95 38ms · 0 errors
Prediction Drift
PSI · churn_v3 output
▲ stable
0.07
threshold 0.20 · 30d range 0.05–0.09
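
For readers who want to see what sits behind the PSI tile, here is a minimal sketch of a Population Stability Index check on model scores. It is not the dashboard's own implementation: the equal-width bins, the `eps` floor, and the function name `psi` are assumptions chosen for illustration. Only the 0.20 threshold comes from the tile above.

```python
import math

PSI_THRESHOLD = 0.20  # threshold shown on the tile


def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1].

    Assumption: equal-width bins over [0, 1]. Quantile bins on the
    reference sample are a common alternative.
    """
    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = min(int(v * bins), bins - 1)  # clamp v == 1.0 into the last bin
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / total, eps) for c in counts]

    ref, prod = shares(reference), shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


# Hypothetical data: a uniform reference and a copy shifted up by 0.15
reference = [i / 200 for i in range(200)]
shifted = [min(v + 0.15, 1.0) for v in reference]

print("PSI vs itself:", psi(reference, reference))
print("PSI vs shifted:", psi(reference, shifted))
print("breach:", psi(reference, shifted) > PSI_THRESHOLD)
```

PSI is zero when the two binned distributions match and grows as mass moves between bins, so a single threshold can be compared against the 30-day range.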
Data Drift
input feature set
↗ slight
0.09 · 2 of 84 features above 0.05
Drifting Share
% features
2.4%
2 / 84 · 30d max 3.6%
Null
0.3%
steady
Target Drift
label-lagged
lagging
n/a · labels arrive +7d
Prediction Distribution
train vs prod scores
KS p=0.18
train μ 0.41 · prod μ 0.43 · no shift
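
The KS figure on this tile compares the training and production score distributions. As a rough sketch of what such a test does, the code below computes the two-sample Kolmogorov-Smirnov statistic and an asymptotic p-value using only the standard library. The function name `ks_two_sample` and the small-lambda cutoff are assumptions; a production setup would more likely call an established statistics library.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, p) for a two-sample KS test with an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs,
    # evaluated at every observed value
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:
        # The series below converges poorly here; the true value is close to 1
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical samples: identical scores, then fully separated scores
train = [i / 100 for i in range(100)]
print(ks_two_sample(train, train))
print(ks_two_sample(train, [1 + v for v in train]))
```

A large p-value, such as the 0.18 on the tile, means the test found no strong evidence that the two score distributions differ.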
Throughput
calls/min
986/min · p95 38ms
p95
38ms
flat
Concept Drift Proxy
confidence entropy
low
entropy 0.41 · 30d band ±0.03
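
One plausible reading of "confidence entropy" is the mean binary entropy of the predicted probabilities, tracked against a baseline band. The sketch below assumes that reading. The choice of bits (log base 2), the function names, and the baseline argument are illustrative assumptions, and only the ±0.03 band width comes from the tile.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def mean_confidence_entropy(scores):
    """Average per-prediction entropy over a window of scores."""
    return sum(binary_entropy(p) for p in scores) / len(scores)


def outside_band(current, baseline, half_width=0.03):
    """True when the current entropy leaves the baseline ± half_width band."""
    return abs(current - baseline) > half_width


# Hypothetical window of churn scores
window = [0.05, 0.10, 0.92, 0.50, 0.08]
current = mean_confidence_entropy(window)
print("entropy:", round(current, 3), "outside band:", outside_band(current, 0.41))
```

Rising entropy means the model is less sure of its predictions, which is why it can serve as a proxy while labels are still lagging.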
AI · Pre-label Degradation Forecast
Input drift on tenure_days (PSI 0.21) projects an AUC dip to 0.89 in ~6 days — before ground-truth labels arrive. New-tenure cohort drives 71% of the shift.

Recommended: scheduled refresh on the churn pipeline this weekend.

Drift Method Panel · per-feature

AI picks the most sensitive test per feature → PSI for tenure_days

Test Suite · Evidently presets · 134 checks

Data Quality 32/32
Data Drift 82/84
Regression 18/18
Classification 2 warn

Performance Over Time

ROC-AUC · 30-day window · ±0.006 confidence band · 0.90 floor
Legend: AUC · confidence band · 0.90 floor · F1 0.88

Prediction Distribution

train vs prod score histogram overlay · KS p=0.18
train μ 0.41 · prod μ 0.43
Model Fleet · 23 in production
Model · Version · Status · Drift (PSI) · AUC · Inference 24h · Last Eval · Owner

How to use this

Reading the watchtower like the MLOps owner it's built for.

  • 01
    This view drives the refresh-or-wait decision. The whole point is catching decay before labels land, so when input or prediction drift trends up, you act now instead of discovering a bad model in next month's accuracy report.
  • 02
    Watch first: the 1 Watch model in the KPI band and any amber bar in the Drift Method Panel. One drifting feature above its threshold is your earliest tripwire — here it's tenure_days at PSI 0.21.
  • 03
    Read the signature Drift Wall by scanning for tiles whose glowing sparkline bends up and to the right. Flat lines are healthy; a rising data-drift or concept-drift line is what to click into.
  • 04
    The AI features do the triage: the degradation forecast projects AUC forward from drift alone, the method ensembler tells you which statistical test to trust per feature, and the ticket writer turns an alert into a ready-to-assign retraining task.
  • 05
    Cross-check the Performance Over Time band: if AUC is inside its expected band you have runway; if drift is rising but AUC still holds, you're in the pre-degradation window — the cheapest time to retrain.
  • 06
    Think about your own org: how many days of label latency sit between a model going bad and you finding out? That gap is exactly what a drift-first view is built to close.
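
The reading order in steps 01 to 05 can be sketched as a small triage rule. This is an illustration of the reasoning, not the dashboard's logic: the function name `triage` and the three status labels are hypothetical, while the default thresholds (PSI 0.20, AUC floor 0.90) are the ones shown on the panels.

```python
def triage(drift_psi, drift_rising, auc, psi_threshold=0.20, auc_floor=0.90):
    """Map drift and quality signals to a refresh-or-wait status.

    Status labels are hypothetical names used only in this sketch.
    """
    if auc < auc_floor:
        # Quality has already fallen through the floor
        return "retrain-now"
    if drift_psi > psi_threshold or drift_rising:
        # Drift is up but AUC still holds: the cheapest time to retrain
        return "pre-degradation"
    # Drift is flat and AUC is above the floor
    return "runway"


# Values from the demo data: tenure_days PSI 0.21, champion AUC 0.913
print(triage(drift_psi=0.21, drift_rising=True, auc=0.913))
# Fleet-level prediction drift PSI 0.07 with a flat trend
print(triage(drift_psi=0.07, drift_rising=False, auc=0.913))
```

Checking the AUC floor first keeps an already degraded model from being reported as merely drifting.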

Watch the walkthrough

Four AI agents review every panel live.

In context · Sample feed

Illustrative — wire to your feature store + serving-layer telemetry feed.

Feature store freshness · offline ↔ online skew
2.1 min · ▼ 0.6 min
tenure_days · serving PSI · rolling 7d vs reference
0.21 · ▲ 0.04
Champion AUC (windowed) · churn_v3 · last 50k preds
0.913 · ▲ 0.005
Challenger AUC (churn_v4) · shadow traffic · 12% split
0.921 · ▲ 0.008
GPU serving utilization · inference cluster
64% · ▲ 7%
Nightly eval coverage · models with passing suite
22 / 23 · ▲ 1