Incident War Room
On-Call · Incident & Response
All systems nominal · 0 SEV-1
EXAMPLE · demo data, not live
Active Incidents
1
0 SEV-1 · 0 SEV-2 · 1 SEV-3
MTTA
2.4m
▼ 0.6m vs last 30d
MTTR
28m
▼ 5m trending down
On-Call Health
6/wk
interruptions · ack 99.1% · esc 4%
Reliability Streak
31d
since last SEV-1 · MTBF 19d
Auto-Resolved
41%
fleet-wide · noise suppression on

Who's On Call Now

Live rotation
PA
Primary
@priya
handoff in 3h 22m
DG
Secondary
@diego
backup · idle
SM
Manager
@sam
escalation L3
⌁ Root-Cause Copilot

89% of errors on search-svc share region=us-east-1 · build=4471

Likely a regional cache-warm regression from deploy #4471. Blast-radius prediction: autocomplete + recommendations downstream. Rollback recommended — similar to INC-2104 (resolved in 11m by rollback).

Investigate →
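
One way a shared-dimension check like the one above could work, as a minimal sketch: count how often each dimension=value pair appears across the error events and keep the pairs that cover most of them. The function name, the 0.8 threshold, and the sample events are all hypothetical, and the sketch scores each dimension independently rather than jointly.

```python
from collections import Counter


def shared_dimensions(error_events, min_share=0.8):
    """Return (dimension, value, share) for pairs present in at least min_share of events."""
    total = len(error_events)
    if total == 0:
        return []
    counts = Counter()
    for event in error_events:
        # Each event is a dict, so every dimension=value pair counts once per event.
        counts.update(event.items())
    hits = [
        (key, value, n / total)
        for (key, value), n in counts.items()
        if n / total >= min_share
    ]
    # Highest share first; ties broken by dimension name for stable output.
    return sorted(hits, key=lambda h: (-h[2], h[0]))


# Hypothetical events shaped to mirror the 89% example above.
events = (
    [{"region": "us-east-1", "build": "4471", "host": f"h{i}"} for i in range(89)]
    + [{"region": "eu-west-1", "build": "4470", "host": f"h{i}"} for i in range(89, 100)]
)
common = shared_dimensions(events)
```

Here `common` holds the build and region pairs at a 0.89 share, while per-host values fall well below the threshold. A dimension that passes is still only a hypothesis to check against the deploy log.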

Live Incident Board

3 open
Triggered
No newly triggered alerts
Acknowledged
SEV-3 · INC-2291 · Ack
search-svc · cache warm
PA @priya 06:14
WATCH · INC-2289 · Ack
image-cdn · elevated p99
DG @diego 17:22
Mitigated
SEV-3 · INC-2286 · Mitigated
notify-svc · queue backlog
SM @sam resolved 22m
✓ 3 incidents resolved in last 24h

Service Dependency Map

23 nodes · 22 ✓ · 1 ⚠
Hover a node to highlight its blast-radius · click to inspect
Legend: Healthy · Degraded · Critical · Recent deploy · call path edge
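
The blast-radius highlight can be thought of as a graph walk. The sketch below is an assumption about how it might be computed, not the map's actual implementation: a breadth-first traversal from the failing service over "who calls me" edges. The edge list is hypothetical apart from search-svc feeding autocomplete and recommendations, and the `checkout` edge in particular is invented for illustration.

```python
from collections import deque


def blast_radius(dependents, failing):
    """Return every service downstream of `failing`, in breadth-first order."""
    seen = {failing}
    queue = deque([failing])
    order = []
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def is_contained(dependents, status, failing):
    """True when every downstream service is still reported healthy."""
    return all(
        status.get(service, "healthy") == "healthy"
        for service in blast_radius(dependents, failing)
    )


# Hypothetical edges: key = service, value = services that depend on it.
DEPENDENTS = {
    "search-svc": ["autocomplete", "recommendations"],
    "recommendations": ["checkout"],  # invented edge, for illustration only
}

affected = blast_radius(DEPENDENTS, "search-svc")
contained = is_contained(DEPENDENTS, {"search-svc": "degraded"}, "search-svc")
```

With these edges, `affected` lists autocomplete, recommendations, then checkout, and `contained` is true because none of them is marked degraded.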

Incident Timeline

Streaming

MTTA / MTTR Over Time

30d
MTTR 41m → 28m
MTTA ~2.4m flat
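
For reference, a minimal sketch of how the two averages could be computed from incident timestamps. It assumes MTTA is measured from trigger to acknowledgement and MTTR from trigger to resolution (some teams measure MTTR from acknowledgement instead). The field names and sample incidents are hypothetical, not this dashboard's data.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Mean gap in minutes between two timestamps, skipping incidents missing either one."""
    gaps = [
        (inc[end_key] - inc[start_key]).total_seconds() / 60
        for inc in incidents
        if inc.get(start_key) is not None and inc.get(end_key) is not None
    ]
    return mean(gaps) if gaps else None


# Hypothetical incidents.
t0 = datetime(2024, 1, 1, 6, 0)
incidents = [
    {"triggered": t0, "acked": t0 + timedelta(minutes=2), "resolved": t0 + timedelta(minutes=30)},
    {"triggered": t0, "acked": t0 + timedelta(minutes=3), "resolved": t0 + timedelta(minutes=26)},
    {"triggered": t0, "acked": t0 + timedelta(minutes=1), "resolved": None},  # still open
]

mtta = mean_minutes(incidents, "triggered", "acked")     # mean of 2, 3, 1 minutes
mttr = mean_minutes(incidents, "triggered", "resolved")  # open incident is skipped
```

Skipping open incidents keeps MTTR from being skewed by work still in progress, at the cost of hiding long-running ones until they close.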

Response Vitals

tier-1
Days since SEV-1: 31

MTBF tier-1 services: 19 days · best streak this quarter

On-Call Interruptions: 6/wk

Ack rate 99.1% · escalation 4% · load low

Noisiest Services

by alert volume
Service       Pages   Noise score   Auto-res
search-svc    9       [bar]         33%
image-cdn     4       [bar]         75%
payments      2       [bar]         50%
notify-svc    2       [bar]         100%
checkout      1       [bar]         0%

How to use this room

What an incident commander should read first, in order.

  • Glance at the status strip and KPI row first. "All systems nominal · 0 SEV-1" plus a green active-incident count is your two-second go/no-go. Anything red here means stop scrolling and open the board.
  • The dependency map is your blast-radius decision. A single amber node is not, by itself, a reason to escalate; an amber node whose downstream services are all healthy means the problem is contained. Hover to see which services a failure can drag down: that is what decides whether you page wider.
  • Read the timeline top-down for the narrative. The newest event is pinned with a LIVE tag. The dot colors (alert → ack → comment → mitigate → resolve) tell you the response is progressing without reading a word.
  • Let Root-Cause Copilot pre-frame the page. It surfaces the shared anomalous dimension (region, build, host) when an incident opens, so you start from "what's common to the errors" instead of guessing. Treat it as a strong hypothesis, not a verdict.
  • Watch for flat MTTA with falling MTTR. Acknowledgement time holding steady while resolution time drops is the signature of a maturing on-call practice: automation and runbooks are working.
  • For your own org: if a single service tops "Noisiest" three weeks running, the fix usually isn't the alert — it's the SLO or the dependency. What's your loudest service telling you?
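
To make the "three weeks running" check from the last point concrete, here is a minimal sketch. It assumes you keep weekly page counts per service, oldest week first; the function name and the two earlier weeks of data are hypothetical, and only the most recent week mirrors the Noisiest Services panel.

```python
def tops_noisiest_for(weekly_pages, weeks=3):
    """Return the service with the most pages in each of the last `weeks` weeks, else None."""
    recent = weekly_pages[-weeks:]
    if len(recent) < weeks or not all(recent):
        return None  # not enough history, or a week with no pages at all
    # max() keeps the first service seen on a tie, so ties resolve by dict order.
    leaders = [max(week, key=week.get) for week in recent]
    return leaders[0] if len(set(leaders)) == 1 else None


weekly_pages = [
    {"search-svc": 7, "image-cdn": 5},  # hypothetical earlier week
    {"search-svc": 8, "payments": 3},   # hypothetical earlier week
    {"search-svc": 9, "image-cdn": 4, "payments": 2, "notify-svc": 2, "checkout": 1},
]
repeat_offender = tops_noisiest_for(weekly_pages)
```

Here `repeat_offender` is `search-svc`. A non-None result is the cue to review that service's SLO and dependencies rather than tune the alert again.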

Watch the walkthrough

Four AI agents walk this dashboard — every panel, every decision surface.

In context

Sample feed

What a live status-page + dependency integration could surface alongside your incidents.

us-east-1 · EC2 elevated API error rate (AWS Health) · ▲ degraded
Stripe · Payments API, all systems operational · ● nominal
Cloudflare · Edge cache hit-rate recovered (LHR/IAD) · ▼ 0.3% miss
Datadog · Watchdog: search-svc latency anomaly cleared · ▼ resolved
GitHub · Actions runners operational, queue 0 · ● nominal
PagerDuty · Escalation policies synced, 0 coverage gaps · ▼ healthy

Illustrative — wire to your status-page, cloud-provider health, and observability feeds.