Reliability Control Room

SRE · SLO & Error Budget
Example · demo data, not live
Service: All tier-1 · Env: prod · Region: us-east-1, eu-west-1 · last updated 9s ago
Golden SLOs · live ring wall · click any ring → SLO detail
Error budget
Error-Budget Burn-Down (fleet tier-1) · Burn-Rate Forecast
Legend: Consumed · Ideal pace · Forecast cone (EOM) · Deploy marker
⌁ Budget Guardian

At the current burn rate the monthly error budget lasts 22 days (safe through month-end). Forecast lands at 71% consumed by EOM (vs 32% at day 19), leaving 29% headroom. No window is burning faster than 1×.
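
The kind of projection Budget Guardian reports can be sketched with a naive linear extrapolation. This is only an illustration: the panel's own forecasting model is not specified, the function names are hypothetical, and a 30-day month is assumed. Because it averages the burn over the whole month so far rather than weighting recent windows, its outputs are not expected to match the 22-day and 71% figures shown above.

```python
# Naive linear error-budget projection (illustrative sketch, not the dashboard's model).
# All names are hypothetical; budget values are fractions of the monthly budget (0.0 to 1.0).


def average_daily_burn(consumed: float, day_of_month: int) -> float:
    """Average fraction of the monthly budget burned per day so far."""
    if day_of_month <= 0:
        raise ValueError("day_of_month must be positive")
    return consumed / day_of_month


def projected_eom_consumption(consumed: float, daily_burn: float, days_left: int) -> float:
    """Fraction of budget consumed at end of month if daily_burn stays constant."""
    return consumed + daily_burn * days_left


def days_until_exhaustion(consumed: float, daily_burn: float) -> float:
    """Days until the remaining budget is gone; infinite if nothing is burning."""
    if daily_burn <= 0:
        return float("inf")
    return (1.0 - consumed) / daily_burn


if __name__ == "__main__":
    consumed, day, days_in_month = 0.32, 19, 30  # 30-day month is an assumption
    burn = average_daily_burn(consumed, day)
    eom = projected_eom_consumption(consumed, burn, days_in_month - day)
    print(f"projected EOM consumption: {eom:.1%}")
    print(f"days until exhaustion: {days_until_exhaustion(consumed, burn):.1f}")
```

A forecaster that weights recent windows more heavily would project a steeper finish than this whole-month average whenever the last few days burned faster than the month as a whole.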

Multi-Window Burn-Rate Strip · <1× safe
⌁ Deploy-blame

Peak 0.9× at Tue 14:20 auto-pinned to deploy #4471 (checkout-api). Recovered within 18 min; budget impact negligible.

Golden signals · RED method · top services · click an errors series → log slice
Traffic Throughput (fleet) · no anomalies
req/s · learned baseline
avg: 12,400 rps · peak: 18,900 rps · anomalies: 0
Latency Heatmap (tier-1 traffic) · faint tail at 14:20
Axes: time (24h, 00:00 to 23:59) × latency bucket; shading shows density from low to high · bulk in 20–60 ms band
Service health & saturation
Service Health Ribbon (14 SLOs) · 13 green · 1 amber · sort: budget ↑
Saturation (CPU / Mem) · headroom healthy
Slow-Trace Leaderboard · Trace cluster summary
Field Guide

How to use this view

  • This is a go / no-go on-call pane. Land your eye on the ring wall first: if all five are green and burn-rate dials read <1×, no SLO is in danger and you can step away from the keyboard.
  • Read the burn-down chart as a race — the cyan "consumed" line should stay below the grey "ideal pace" line. When consumed crosses ideal, the dashed forecast cone tells you the EOM landing before you breach.
  • Multi-window burn rate is your fast-vs-slow alarm: a 1h spike with a calm 6h means a transient; both hot together is a real, sustained burn — page-worthy.
  • The AI features do the correlation for you: Budget Guardian projects the depletion date, Deploy-blame auto-pins the deploy behind a dip, and the Anomaly highlighter flags golden-signal series that drift from their learned baseline.
  • Drill any ring for the bad-minutes ledger — exact violation windows and cause tags — before you write the incident note.
  • For your own org: are your SLO targets set from realized 90-day performance, or aspirational round numbers? Targets that nobody can hit just train the team to ignore the page.
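
The fast-vs-slow reading of the multi-window strip described above can be sketched as a small classifier. This is a minimal illustration under stated assumptions: the 1× threshold mirrors the "<1× safe" label on the strip, the function names and the returned labels are hypothetical, and the slow-only case is an addition that the guide does not describe.

```python
# Multi-window burn-rate classification (illustrative sketch; names and labels are hypothetical).


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the sustainable pace.
    """
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed


def classify(burn_1h: float, burn_6h: float, threshold: float = 1.0) -> str:
    """Combine a fast (1h) and a slow (6h) window into one verdict."""
    fast_hot = burn_1h > threshold
    slow_hot = burn_6h > threshold
    if fast_hot and slow_hot:
        return "sustained"   # both windows hot: page-worthy
    if fast_hot:
        return "transient"   # short spike, longer window still calm
    if slow_hot:
        return "slow-burn"   # assumption: not covered by the guide
    return "safe"


if __name__ == "__main__":
    print(classify(burn_1h=0.9, burn_6h=0.4))
    print(classify(burn_1h=3.0, burn_6h=0.5))
    print(classify(burn_1h=3.0, burn_6h=2.0))
```

Real paging policies usually use higher thresholds for the fast window than for the slow one; a single shared threshold is used here only to keep the sketch short.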
In Context · Sample Feed

Illustrative — wire to your observability + status-page feed.

Fleet availability (rolling 30d) · across 14 tier-1 SLOs: 99.972% (▲0.004)
Error budget burned MTD · 43.2 min monthly allowance: 32% (▼12 vs ideal)
Median deploy → recovery · auto-rollback enabled: 18m (▼5m)
Upstream provider health · CDN, DNS, payments gateway: All OK (0 SEV)
p99 latency vs ≤300ms SLO · checkout-api hot path: 240ms (▼6ms)
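
The error-budget tile in the feed above can be checked with a little arithmetic. The sketch below assumes a 99.9% availability target over a 30-day window, which is what a 43.2-minute allowance is consistent with; the feed itself does not state the target, and the function names are hypothetical.

```python
# Error-budget allowance arithmetic (illustrative sketch; the 99.9% target is an assumption).


def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    return (1.0 - slo_target) * window_days * 24 * 60


def burned_minutes(allowance_minutes: float, fraction_burned: float) -> float:
    """Minutes of budget already spent, given the fraction burned so far."""
    return allowance_minutes * fraction_burned


if __name__ == "__main__":
    allowance = budget_minutes(0.999)          # about 43.2 minutes over 30 days
    spent = burned_minutes(allowance, 0.32)    # 32% burned month to date
    print(f"allowance: {allowance:.1f} min, spent: {spent:.2f} min, "
          f"remaining: {allowance - spent:.2f} min")
```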
Watch the Walkthrough

Four AI agents walk this dashboard.