Incident War Room — On-Call · Incident & Response

Active Incidents

0 SEV-1 · 0 SEV-2 · 1 SEV-3 ✓

MTTA

2.4m

▼ 0.6m vs last 30d

MTTR

28m

▼ 5m trending down

On-Call Health

6/wk

interruptions · ack 99.1% · esc 4%

Reliability Streak

31d

since last SEV-1 · MTBF 19d

Auto-Resolved

41%

fleet-wide · noise suppression on

Who's On Call Now

Live rotation

Primary

@priya

handoff in 3h 22m

Secondary

@diego

backup · idle

Manager

@sam

escalation L3

⌁ Root-Cause Copilot

89% of errors on search-svc share `region=us-east-1 · build=4471`

Likely a regional cache-warm regression from deploy #4471. Blast-radius prediction: autocomplete + recommendations downstream. Rollback recommended — similar to INC-2104 (resolved in 11m by rollback).
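The "89% of errors share one region and build" finding can be approximated by counting dimension combinations across error events. The following is a minimal sketch, not the Copilot's actual method: the event shape, the `dominant_dimensions` name, and the 0.8 threshold are all assumptions made for illustration.

```python
from collections import Counter

def dominant_dimensions(error_events, keys, threshold=0.8):
    """Return (combo, share) for the most common value combination of `keys`
    if it covers at least `threshold` of the events, else None."""
    if not error_events:
        return None
    combos = Counter(tuple(e.get(k) for k in keys) for e in error_events)
    combo, count = combos.most_common(1)[0]
    share = count / len(error_events)
    if share < threshold:
        return None
    return dict(zip(keys, combo)), share

# Hypothetical events: 89 of 100 errors carry the same region and build.
events = (
    [{"region": "us-east-1", "build": 4471}] * 89
    + [{"region": "eu-west-1", "build": 4470}] * 11
)
result = dominant_dimensions(events, ["region", "build"])
```

A shared dimension is only a correlation, so the output is a starting hypothesis to check against deploy history rather than a root cause.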

Investigate →

Live Incident Board

3 open

Triggered

No newly triggered alerts

Acknowledged

SEV-3 · INC-2291 · Ack

search-svc · cache warm

PA · @priya · 06:14

WATCH · INC-2289 · Ack

image-cdn · elevated p99

DG · @diego · 17:22

Mitigated

SEV-3 · INC-2286 · Mitigated

notify-svc · queue backlog

SM · @sam · resolved 22m

✓ 3 incidents resolved in last 24h

Service Dependency Map

23 nodes · 22 ✓ · 1 ⚠

Hover a node to highlight its blast-radius · click to inspect

Legend: Healthy · Degraded · Critical · Recent deploy · call path edge

Incident Timeline

Streaming

MTTA / MTTR Over Time

30d

MTTR 41m → 28m

MTTA ~2.4m flat
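For reference, MTTA and MTTR are averages over per-incident durations. The sketch below assumes both are measured from the trigger time and that each incident is a dict with `triggered`, `acked`, and `resolved` timestamps. Those field names and the sample records are hypothetical, and some teams measure MTTR from acknowledgment instead.

```python
from datetime import datetime
from statistics import mean

def mean_minutes(incidents, start_key, end_key):
    """Average minutes between two timestamps, skipping incidents missing either one."""
    durations = [
        (inc[end_key] - inc[start_key]).total_seconds() / 60
        for inc in incidents
        if inc.get(start_key) and inc.get(end_key)
    ]
    return mean(durations) if durations else None

# Hypothetical records; the second incident is still unresolved.
incidents = [
    {"triggered": datetime(2024, 1, 1, 6, 0),
     "acked": datetime(2024, 1, 1, 6, 2),
     "resolved": datetime(2024, 1, 1, 6, 30)},
    {"triggered": datetime(2024, 1, 2, 17, 0),
     "acked": datetime(2024, 1, 2, 17, 3),
     "resolved": None},
]
mtta = mean_minutes(incidents, "triggered", "acked")     # 2.5
mttr = mean_minutes(incidents, "triggered", "resolved")  # 30.0
```

Open incidents are excluded from MTTR here, which flatters the number while a long incident is still running; a dashboard may want to show that caveat.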

Response Vitals

tier-1

Days since SEV-1: 31

MTBF tier-1 services: 19 days · best streak this quarter

On-Call Interruptions: 6/wk

Ack rate 99.1% · escalation 4% · load low

Noisiest Services

by alert volume

Service	Pages	Auto-res
search-svc	9	33%
image-cdn	4	75%
payments	2	50%
notify-svc	2	100%
checkout	1	0%

◎ How to use this room

What an incident commander should read first, in order.

Glance at the status strip + KPI row first. "All systems nominal · 0 SEV-1" plus a green active-incident count is your two-second go/no-go. Anything red here means stop scrolling and open the board.
The dependency map is your blast-radius decision. An amber node alone is noise; an amber node with healthy downstream means contained. Hover to see which services a failure can drag down — that's what decides whether you page wider.
Read the timeline top-down for narrative. Newest event is pinned with a LIVE tag. The dot colors (alert → ack → comment → mitigate → resolve) tell you the response is progressing without reading a word.
Let Root-Cause Copilot pre-frame the page. It surfaces the shared anomalous dimension (region, build, host) on incident open — so you start from "what's common to the errors" instead of guessing. Treat it as a strong hypothesis, not a verdict.
Watch MTTA flat + MTTR falling. Acknowledgment speed holding steady while resolve time drops is the signature of a maturing on-call practice: automation and runbooks are working.
For your own org: if a single service tops "Noisiest" three weeks running, the fix usually isn't the alert — it's the SLO or the dependency. What's your loudest service telling you?
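The blast-radius reading described in the dependency-map step amounts to walking the graph from a failing service to everything that calls it, directly or transitively. Here is a minimal sketch using breadth-first search; the `dependents` mapping and the `homepage` node are hypothetical, and only the search-svc, autocomplete, and recommendations names come from this dashboard.

```python
from collections import deque

def blast_radius(dependents, failed):
    """Return every service that depends on `failed`, directly or transitively.

    `dependents` maps a service to the services that call it.
    """
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in dependents.get(node, []):
            if caller != failed and caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

# Hypothetical call graph.
dependents = {
    "search-svc": ["autocomplete", "recommendations"],
    "recommendations": ["homepage"],
}
affected = blast_radius(dependents, "search-svc")
```

If `affected` is empty or every service in it is healthy, the failure is likely contained; a growing set of degraded callers is the signal to page wider.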

▶ Watch the walkthrough

Four AI agents walk this dashboard — every panel, every decision surface.


⛁ In context · Sample feed

What a live status-page + dependency integration could surface alongside your incidents.

us-east-1 · EC2 elevated API error rate (AWS Health) · ▲ degraded

Stripe · Payments API: all systems operational · ● nominal

Cloudflare · Edge cache hit-rate recovered (LHR/IAD) · ▼ 0.3% miss

Datadog · Watchdog: search-svc latency anomaly cleared · ▼ resolved

GitHub · Actions runners: operational, queue 0 · ● nominal

PagerDuty · Escalation policies synced, 0 coverage gaps · ▼ healthy

Illustrative — wire to your status-page, cloud-provider health, and observability feeds.