Active Incidents
1
0 SEV-1 · 0 SEV-2 · 1 SEV-3 ✓
MTTA
2.4m
▼ 0.6m vs last 30d
MTTR
28m
▼ 5m trending down
On-Call Health
6/wk
interruptions · ack 99.1% · esc 4%
Reliability Streak
31d
since last SEV-1 · MTBF 19d
Auto-Resolved
41%
fleet-wide · noise suppression on
Who's On Call Now
Live rotation
PA
Primary
@priya
handoff in 3h 22m
DG
Secondary
@diego
backup · idle
SM
Manager
@sam
escalation L3
89% of errors on search-svc share region=us-east-1 · build=4471
Likely a regional cache-warm regression from deploy #4471. Blast-radius prediction: autocomplete + recommendations downstream. Rollback recommended — similar to INC-2104 (resolved in 11m by rollback).
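A minimal sketch of how a "shared dimension" reading like the one above could be derived: count, for each dimension, the most common value across the error events and report its share. The event shape, the `shared_dimensions` helper, and the sample data are assumptions for illustration, not the Copilot's actual method. A real version would also compare against healthy traffic, since a value that dominates all traffic is not anomalous.

```python
from collections import Counter

def shared_dimensions(events, keys):
    """For each key, return (key, most common value, share of all events),
    sorted with the most widely shared dimension first."""
    total = len(events)
    if total == 0:
        return []
    results = []
    for key in keys:
        counts = Counter(e[key] for e in events if key in e)
        if not counts:
            continue
        value, n = counts.most_common(1)[0]
        results.append((key, value, n / total))
    return sorted(results, key=lambda r: r[2], reverse=True)

# Hypothetical sample: 100 error events, 89 from one region and build.
events = (
    [{"region": "us-east-1", "build": "4471", "host": f"h{i}"} for i in range(89)]
    + [{"region": "eu-west-1", "build": "4470", "host": f"h{i}"} for i in range(89, 100)]
)

top = shared_dimensions(events, ["region", "build", "host"])
for key, value, share in top:
    print(f"{key}={value}: {share:.0%} of errors")
```

In this sample, region and build each cover 89% of errors while no single host stands out, which points at the deploy rather than a bad machine.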
Live Incident Board
3 open
Triggered
No newly triggered alerts
Acknowledged
SEV-3 · INC-2291 · Ack
search-svc · cache warm
WATCH · INC-2289 · Ack
image-cdn · elevated p99
Mitigated
SEV-3 · INC-2286 · Mitigated
notify-svc · queue backlog
✓ 3 incidents resolved in last 24h
Service Dependency Map
23 nodes · 22 ✓ · 1 ⚠
Healthy
Degraded
Critical
Recent deploy
— call path edge
Incident Timeline
Streaming
MTTA / MTTR Over Time
30d
MTTR 41m → 28m
MTTA ~2.4m flat
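For reference, a minimal sketch of how the two averages could be computed from incident timestamps. It assumes MTTA is measured from trigger to acknowledgement and MTTR from trigger to resolution (some teams start the MTTR clock at acknowledgement instead). The field names and sample incidents are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

def mean_minutes(incidents, start_field, end_field):
    """Mean elapsed minutes between two timestamps, skipping incidents
    where either timestamp is missing (for example, still open)."""
    durations = [
        (i[end_field] - i[start_field]).total_seconds() / 60
        for i in incidents
        if i.get(start_field) and i.get(end_field)
    ]
    return mean(durations) if durations else None

# Hypothetical incidents; the date is arbitrary.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    {"triggered": t0, "acked": t0 + timedelta(minutes=2), "resolved": t0 + timedelta(minutes=30)},
    {"triggered": t0, "acked": t0 + timedelta(minutes=3), "resolved": t0 + timedelta(minutes=26)},
    {"triggered": t0, "acked": None, "resolved": None},  # still open, excluded
]

mtta = mean_minutes(incidents, "triggered", "acked")
mttr = mean_minutes(incidents, "triggered", "resolved")
print(f"MTTA {mtta:.1f}m, MTTR {mttr:.1f}m")
```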
Response Vitals
tier-1
Days since SEV-1
31
MTBF tier-1 services: 19 days · best streak this quarter
On-Call Interruptions
6/wk
Ack rate 99.1% · escalation 4% · load low
Noisiest Services
by alert volume

| Service | Pages | Noise score | Auto-res |
|---|---|---|---|
| search-svc | 9 | 33% | |
| image-cdn | 4 | 75% | |
| payments | 2 | 50% | |
| notify-svc | 2 | 100% | |
| checkout | 1 | 0% | |
◎ How to use this room
What an incident commander should read first, in order.
- Glance at the status strip and KPI row first. "All systems nominal · 0 SEV-1" plus a green active-incident count is your two-second go/no-go. Anything red here means stop scrolling and open the board.
- The dependency map is your blast-radius decision. An amber node alone is noise; an amber node with healthy downstream means contained. Hover to see which services a failure can drag down — that's what decides whether you page wider.
- Read the timeline top-down for narrative. Newest event is pinned with a LIVE tag. The dot colors (alert → ack → comment → mitigate → resolve) tell you the response is progressing without reading a word.
- Let Root-Cause Copilot pre-frame the page. It surfaces the shared anomalous dimension (region, build, host) on incident open — so you start from "what's common to the errors" instead of guessing. Treat it as a strong hypothesis, not a verdict.
- Watch for flat MTTA + falling MTTR. Acknowledgement time holding steady while resolution time drops is the signature of a maturing on-call practice: automation and runbooks are working.
- For your own org: if a single service tops "Noisiest" three weeks running, the fix usually isn't the alert — it's the SLO or the dependency. What's your loudest service telling you?
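The blast-radius reading from the dependency-map step above can be sketched as a breadth-first walk over "who calls this service" edges. This is a minimal illustration under assumptions: the `blast_radius` and `is_contained` helpers are hypothetical, the search-svc edges follow the downstream services named in the Copilot note, and the `homepage` edge is invented for the example.

```python
from collections import deque

def blast_radius(dependents, failed):
    """Return every service that depends on `failed`, directly or transitively."""
    seen = set()
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for caller in dependents.get(service, ()):
            if caller != failed and caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

def is_contained(dependents, health, degraded):
    """A degraded node counts as contained when everything downstream is healthy."""
    return all(health.get(s) == "healthy" for s in blast_radius(dependents, degraded))

# Maps a service to the services that call it.
dependents = {
    "search-svc": ["autocomplete", "recommendations"],
    "recommendations": ["homepage"],  # hypothetical edge
}
health = {
    "search-svc": "degraded",
    "autocomplete": "healthy",
    "recommendations": "healthy",
    "homepage": "healthy",
}

radius = blast_radius(dependents, "search-svc")
contained = is_contained(dependents, health, "search-svc")
print(sorted(radius), "contained" if contained else "spreading")
```

If any service in the radius turns amber or red, `is_contained` returns `False`, which is the cue to page wider.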
▶ Watch the walkthrough
Four AI agents walk this dashboard — every panel, every decision surface.
⛁ In context
Sample feed
What a live status-page + dependency integration could surface alongside your incidents.
us-east-1 · EC2 elevated API error rate (AWS Health) · ▲ degraded
Stripe · Payments API: all systems operational · ● nominal
Cloudflare · Edge cache hit-rate recovered (LHR/IAD) · ▼ 0.3% miss
Datadog · Watchdog: search-svc latency anomaly cleared · ▼ resolved
GitHub · Actions runners: operational, queue 0 · ● nominal
PagerDuty · Escalation policies synced, 0 coverage gaps · ▼ healthy
Illustrative — wire to your status-page, cloud-provider health, and observability feeds.