Reliability Control Room — SRE · SLO & Error Budget

Golden SLOs · live ring wall · click any ring → SLO detail

Error budget

Error-Budget Burn-Down — fleet tier-1 · Burn-Rate Forecast

Legend: Consumed · Ideal pace · Forecast cone (EOM) · Deploy marker

⌁ Budget Guardian

At the current burn rate the remaining monthly error budget lasts another 22 days, so it is safe through month-end. The forecast lands at 71% consumed by end of month (EOM), up from 32% at day 19, leaving 29% headroom. No window is burning faster than 1×.
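The kind of projection Budget Guardian reports can be sketched as a straight-line extrapolation. This is a minimal sketch, not the dashboard's actual model: the function name `forecast_budget`, the `BudgetForecast` type, the 30-day month, and the 0.9× burn-rate input are all assumptions made for illustration. A burn rate of 1× is taken to mean "spends exactly the whole budget over the month".

```python
from dataclasses import dataclass


@dataclass
class BudgetForecast:
    days_until_exhausted: float   # days from today until the budget hits 100%
    consumed_at_month_end: float  # projected fraction consumed at EOM (capped at 1.0)
    safe_through_month_end: bool  # True if the budget outlasts the month


def forecast_budget(consumed: float, day_of_month: int,
                    days_in_month: int, burn_rate: float) -> BudgetForecast:
    """Linear projection (hypothetical helper, not the dashboard's API).

    consumed  -- fraction of the monthly budget already spent (0.0 to 1.0)
    burn_rate -- current burn multiplier; 1.0 spends the full budget in one month
    """
    if not 0.0 <= consumed <= 1.0:
        raise ValueError("consumed must be between 0 and 1")
    if burn_rate < 0:
        raise ValueError("burn_rate must be non-negative")

    daily_burn = burn_rate / days_in_month       # budget fraction spent per day
    days_left = days_in_month - day_of_month     # days remaining in the month
    remaining = 1.0 - consumed                   # budget fraction still unspent

    days_until_exhausted = remaining / daily_burn if daily_burn > 0 else float("inf")
    consumed_at_month_end = min(1.0, consumed + daily_burn * days_left)
    return BudgetForecast(days_until_exhausted, consumed_at_month_end,
                          days_until_exhausted > days_left)


if __name__ == "__main__":
    # 32% consumed at day 19 (from the panel); 0.9x is an assumed input.
    print(forecast_budget(consumed=0.32, day_of_month=19,
                          days_in_month=30, burn_rate=0.9))
```

With these assumed inputs the sketch gives roughly 22.7 days of budget left and about 65% consumed at EOM. The panel's own figures come from the dashboard's forecast cone, so a straight-line estimate like this one need not match them exactly.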

Multi-Window Burn-Rate Strip · <1× safe

⌁ Deploy-blame

Peak 0.9× at Tue 14:20 auto-pinned to deploy #4471 (checkout-api). Recovered within 18 min; budget impact negligible.

Golden signals · RED method · top services · click an errors series → log slice

Traffic Throughput — fleet · no anomalies

req/s · learned baseline

avg: 12,400 rps

peak: 18,900 rps

anomalies

Latency Heatmap — tier-1 traffic · faint tail 14:20

Heatmap axes: time of day (00:00–23:59) × latency bucket; shading shows density (low → high). Bulk of traffic sits in the 20–60 ms band.

Service health & saturation

Service Health Ribbon — 14 SLOs · 13 green · 1 amber · sort: budget ↑

Saturation — CPU / Mem · headroom healthy

Slow-Trace Leaderboard · Trace cluster summary

Field Guide

How to use this view

This is a go / no-go on-call pane. Land your eye on the ring wall first: if all five are green and burn-rate dials read <1×, no SLO is in danger and you can step away from the keyboard.
Read the burn-down chart as a race: the cyan "consumed" line should stay below the grey "ideal pace" line. When consumed crosses ideal, the dashed forecast cone shows where the budget is projected to land at end of month (EOM), before you breach.
Multi-window burn rate is your fast-vs-slow alarm: a 1h spike with a calm 6h window is a transient; both windows hot together is a real, sustained burn and worth a page.
The AI features do the correlation for you: Budget Guardian projects the depletion date, Deploy-blame auto-pins the deploy behind a dip, and the Anomaly highlighter flags golden-signal series that drift from their learned baseline.
Drill into any ring for the bad-minutes ledger (exact violation windows and cause tags) before you write the incident note.
For your own org: are your SLO targets set from realized 90-day performance, or aspirational round numbers? Targets that nobody can hit just train the team to ignore the page.
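The fast-vs-slow reading of the multi-window strip described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names `burn_rate` and `classify_burn`, the 1× threshold (taken from the strip's "<1× safe" label), and the extra "cooling" label for a calm 1h window with a hot 6h window are illustrative choices, not the dashboard's own logic. Burn rate here is the usual ratio of observed error rate to the error rate the SLO allows.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A result of 1.0 means the window is spending budget exactly at the
    pace that would use it up over the full SLO period.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def classify_burn(short_window: float, long_window: float,
                  threshold: float = 1.0) -> str:
    """Combine a short (e.g. 1h) and long (e.g. 6h) burn rate into one label."""
    short_hot = short_window >= threshold
    long_hot = long_window >= threshold
    if short_hot and long_hot:
        return "sustained"   # both windows hot: real burn, page-worthy
    if short_hot:
        return "transient"   # spike in the short window only
    if long_hot:
        return "cooling"     # assumed label: long window still hot, short one calm
    return "safe"


if __name__ == "__main__":
    one_hour = burn_rate(bad_events=20, total_events=10_000, slo_target=0.999)
    six_hour = burn_rate(bad_events=24, total_events=60_000, slo_target=0.999)
    print(classify_burn(one_hour, six_hour))
```

Requiring both windows to be hot before paging trades a little detection delay for far fewer pages on short spikes, which is the behaviour the guide describes.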

In Context · Sample Feed

Illustrative — wire to your observability + status-page feed.

Fleet availability (rolling 30d) · across 14 tier-1 SLOs

99.972%

▲0.004

Error budget burned MTD · 43.2 min monthly allowance

32%

▼12 vs ideal
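For reference, a 43.2-minute monthly allowance is what a 99.9% availability target yields over a 30-day window. The card does not state the target or the window length, so both are assumptions in this short sketch, as are the helper names `budget_minutes` and `burned_minutes`.

```python
MINUTES_PER_DAY = 24 * 60


def budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def burned_minutes(allowance_minutes: float, fraction_burned: float) -> float:
    """Convert a burned percentage (as a fraction) into minutes of budget spent."""
    return allowance_minutes * fraction_burned


if __name__ == "__main__":
    # Assumed: 99.9% target, 30-day window. 32% burned is taken from the card.
    allowance = budget_minutes(slo_target=0.999, window_days=30)
    print(f"allowance: {allowance:.1f} min")
    print(f"burned:    {burned_minutes(allowance, 0.32):.2f} min")
```

Under these assumptions, 32% burned corresponds to a little under 14 minutes of the allowance.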

Median deploy → recovery · auto-rollback enabled

18m

▼5m

Upstream provider health: CDN · DNS · payments gateway

All OK

0 SEV

p99 latency vs ≤300ms SLO · checkout-api hot path

240ms

▼6ms

Watch the Walkthrough

Four AI agents walk this dashboard.