Model Watchtower — ML Eng

Fleet Vitals · flagship: churn_v3

Models in Production

22 Healthy · 1 Watch

Prediction Drift · PSI

0.07

thr 0.20 ✓ well under

Feature Drift

2 / 84

drifting non-critical

Model Quality · ROC-AUC

0.913

baseline 0.908 ▲ · F1 0.88

Inference Volume / Latency

1.42M /24h

p95 38ms · 0 errors

Drift Wall · rolling 7-day windows

Prediction Drift

PSI · churn_v3 output

▲ stable

0.07

threshold 0.20 · 30d range 0.05–0.09

Data Drift

input feature set

↗ slight

0.09 · 2 of 84 features above 0.05

Drifting Share

% features

2.4%

2 / 84 · 30d max 3.6%

Null

0.3%

steady

Target Drift

label-lagged

lagging

n/a · labels arrive +7d

Prediction Distribution

train vs prod scores

KS p=0.18

train μ 0.41 · prod μ 0.43 · no shift

Throughput

calls/min

986/min · p95 38ms

p95

38ms

flat

Concept Drift Proxy

confidence entropy

low

entropy 0.41 · 30d band ±0.03

AI · Pre-label Degradation Forecast

Input drift on tenure_days (PSI 0.21) projects an AUC dip to 0.89 in ~6 days — before ground-truth labels arrive. New-tenure cohort drives 71% of the shift.

Recommended: scheduled refresh on the churn pipeline this weekend.

Drift Method Panel · per-feature

AI picks the most sensitive test per feature → PSI for tenure_days

Test Suite · Evidently presets · 134 checks

Data Quality 32/32

Data Drift 82/84

Regression 18/18

Classification 2 warn

Performance Over Time

ROC-AUC · 30-day window · ±0.006 confidence band · 0.90 floor

Legend: AUC · confidence band · 0.90 floor · F1 0.88

Prediction Distribution

train vs prod score histogram overlay · KS p=0.18

train μ 0.41 · prod μ 0.43

Model Fleet · 23 in production

Model	Version	Status	Drift (PSI)	AUC	Inference 24h	Last Eval	Owner

How to use this

Reading the watchtower like the MLOps owner it's built for.

This view drives the refresh-or-wait decision. The whole point is catching decay before labels land, so when input or prediction drift trends up, you act now instead of discovering a bad model in next month's accuracy report.

Watch first: the 1 Watch model in the KPI band and any amber bar in the Drift Method Panel. One drifting feature above its threshold is your earliest tripwire — here it's tenure_days at PSI 0.21.
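If you want to sanity-check a PSI reading like the tenure_days figure yourself, a minimal sketch follows. The dashboard does not say how it bins values, so the ten equal-width bins over the reference range and the eps floor for empty bins are assumptions; only the 0.20 threshold comes from this view.

```python
import math
from typing import List, Sequence

PSI_THRESHOLD = 0.20  # threshold shown on the dashboard


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index over equal-width bins of the reference range.

    The binning scheme and eps floor are illustrative assumptions.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def is_tripwire(reference: Sequence[float], current: Sequence[float]) -> bool:
    """True when the feature's PSI exceeds the dashboard threshold."""
    return psi(reference, current) > PSI_THRESHOLD
```

Identical samples give a PSI of zero, and the value grows as the production distribution moves away from the reference. Quantile bins are a common alternative when a feature is heavily skewed.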

Read the signature Drift Wall by scanning for tiles whose glowing sparkline bends up and to the right. Flat lines are healthy; a rising data-drift or concept-drift line is what to click into.
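"Bends up and to the right" can also be checked numerically. The sketch below fits a least-squares slope to one tile's rolling values; the function names and the min_slope cutoff are hypothetical, not something the dashboard exposes.

```python
from typing import Sequence


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def is_rising(values: Sequence[float], min_slope: float = 0.005) -> bool:
    """True when the series climbs faster than min_slope per step.

    min_slope is an assumed cutoff; tune it per metric.
    """
    return trend_slope(values) > min_slope
```

A flat series returns a slope near zero, while a series that climbs steadily across the window returns a positive slope you can compare against the cutoff.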

The AI features do the triage: the degradation forecast projects AUC forward from drift alone, the method ensembler tells you which statistical test to trust per feature, and the ticket writer turns an alert into a ready-to-assign retraining task.

Cross-check the Performance Over Time band: if AUC is inside its expected band you have runway; if drift is rising but AUC still holds, you're in the pre-degradation window — the cheapest time to retrain.
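The cross-check above reduces to a small decision rule, sketched here. It simplifies "inside its expected band" to "at or above the 0.90 floor" shown on the chart, and the below-floor branch is an added assumption; the Action names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RUNWAY = "keep monitoring"
    PRE_DEGRADATION = "schedule retrain"
    DEGRADED = "retrain now"


def refresh_or_wait(auc: float, drift_rising: bool,
                    auc_floor: float = 0.90) -> Action:
    """Map windowed AUC and drift trend to a refresh-or-wait call."""
    if auc < auc_floor:
        # Assumed branch: quality has already fallen through the floor.
        return Action.DEGRADED
    if drift_rising:
        # Drift is up but AUC still holds: the cheapest time to retrain.
        return Action.PRE_DEGRADATION
    return Action.RUNWAY
```

With the numbers on this page (AUC 0.913, tenure_days drift rising), the rule lands on the pre-degradation branch, which matches the recommended scheduled refresh.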

Think about your own org: how many days of label latency sit between a model going bad and you finding out? That gap is exactly what a drift-first view is built to close.

In context · Sample feed

Illustrative — wire to your feature store + serving-layer telemetry feed.

Feature store freshness · offline ↔ online skew

2.1 min ▼ 0.6 min

tenure_days · serving PSI · rolling 7d vs reference

0.21 ▲ 0.04

Champion AUC (windowed) · churn_v3 · last 50k preds

0.913 ▲ 0.005

Challenger AUC (churn_v4) · shadow traffic · 12% split

0.921 ▲ 0.008

GPU serving utilization · inference cluster

64% ▲ 7%

Nightly eval coverage · models with passing suite

22 / 23 ▲ 1
