Reliability Command

Can I trust the numbers right now — what broke, who owns it, and what's downstream?

Synced 2m ago · Monte Carlo · dbt freshness · Atlan lineage

96 / 100

Data Health

Composite Trust

▲ 2 over 7 days

Fresh

Volume

Schema

Quality

Tables Monitored + coverage

2,847

78% certified ▲ 3%

Active Incidents · MTTR

2 · both Low

MTTR 1h 47m ▼ 22%

Freshness SLA

99.1%

met · 3 volume anomalies (7d)

Test Pass Rate

98.6%

12,904 / 13,082 passed

Incident Feed

2 active · 4 resolved 24h

Column-Level Lineage

quality badges propagate downstream →

Blast radius

14 downstream assets · 3 dashboards in path of stg_payments

Legend: Healthy & certified | Watch · anomaly | BI dashboard

Reliability Over Time

30 days · band 89–97

91 · 30d ago → 96 today · no breach in 9d

Certification & Glossary

data products

Anomaly Timeline

expected band vs actual row counts

orders_daily · row volume · last 12h

expected 1.18M–1.24M

BREACH 02:40 · actual dipped to 0.97M rows, recovered 03:05; auto-clustered to an upstream late-arriving partition.
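The breach above can be read as a simple band check. The following is a minimal sketch, not the monitor's actual logic: it assumes a fixed expected band, and only the 1.18M–1.24M band, the 02:40 dip to 0.97M, and the 03:05 recovery time come from the panel. The other readings and the names `Reading` and `find_breaches` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    time: str    # HH:MM label of the sample
    rows: float  # row count, in millions


def find_breaches(readings, low, high):
    """Return (start, end) windows where rows fall outside [low, high].

    `end` is the time of the first reading back inside the band,
    or None if the series ends while still in breach.
    """
    windows = []
    start = None
    for r in readings:
        inside = low <= r.rows <= high
        if not inside and start is None:
            start = r.time                    # breach opens
        elif inside and start is not None:
            windows.append((start, r.time))   # breach closes on recovery
            start = None
    if start is not None:
        windows.append((start, None))         # still breaching at end of data
    return windows


readings = [
    Reading("02:15", 1.21),  # hypothetical
    Reading("02:40", 0.97),  # dip reported on the panel
    Reading("03:05", 1.19),  # recovery time from the panel; value hypothetical
    Reading("03:30", 1.22),  # hypothetical
]
breaches = find_breaches(readings, low=1.18, high=1.24)
print(breaches)
```

A production monitor would normally learn the band from history rather than hard-code it; the fixed bounds here only mirror what the panel displays.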

Monitor Coverage by Domain

monitors · freshness · owner

Domain	Monitors	Freshness	Owner
Finance	412	99%	@finance-eng
Growth	388	98%	@growth-data
Product	521	99%	@product-data
Payments	274	96%	@payments-eng
Marketing	196	99%	@mktg-analytics

How to use this

Reading the reliability board

What an Analytics Engineering / Data Reliability lead does with this view, top to bottom.

1. Start at the Data Health Score. It rolls freshness, volume, schema, and quality into one 0–100 number. A drop here is your "trust is at risk" alarm before any single incident looks scary. Drill into the score components to see which axis moved.
2. Triage the Incident Feed left to right. Cards are ordered by severity, then by time-to-detect (TTD). A low TTD on a high-severity card is good (you caught it fast); a stale card with no owner avatar is the one to escalate.
3. Trace the lineage graph to scope impact. Quality badges flow downstream along the Bézier edges: a red dot on stg_payments tints everything to its right. Follow the curves to see exactly which dashboards inherit the problem.
4. Let the AI do the clustering. The root-cause overlay ties an anomaly to a likely commit and drafts the incident note; the blast-radius estimator counts downstream assets and affected viewers so you notify the right people first, not everyone.
5. Certify what has earned it. The auto-certification recommender promotes only tables with a clean freshness and test history. Certification is a contract with consumers, not a vanity badge.
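Steps 1 and 3 can be sketched in a few lines. This is an illustration under stated assumptions, not how the board computes anything: the equal axis weights, the per-axis values, and every asset name in the lineage graph other than stg_payments are made up for the example.

```python
from collections import deque

# Hypothetical equal weights; the board does not state its weighting.
WEIGHTS = {"freshness": 0.25, "volume": 0.25, "schema": 0.25, "quality": 0.25}


def health_score(axes):
    """Weighted average of per-axis scores (each 0-100), rounded to an int."""
    total = sum(WEIGHTS.values())
    return round(sum(WEIGHTS[name] * axes[name] for name in WEIGHTS) / total)


def weakest_axis(axes):
    """Axis with the lowest score: the first place to drill when the score drops."""
    return min(WEIGHTS, key=lambda name: axes[name])


def downstream(graph, start):
    """Breadth-first walk: every asset reachable from `start`, excluding `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child != start and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical axis scores and lineage edges (parent -> children).
axes = {"freshness": 99, "volume": 94, "schema": 97, "quality": 94}
LINEAGE = {
    "stg_payments": ["fct_payments", "fct_refunds"],
    "fct_payments": ["revenue_dashboard", "finance_mart"],
    "fct_refunds": ["finance_mart"],
    "finance_mart": ["cfo_dashboard"],
}

score = health_score(axes)
affected = downstream(LINEAGE, "stg_payments")
dashboards = sorted(a for a in affected if a.endswith("_dashboard"))
print(score, weakest_axis(axes))
print(len(affected), dashboards)
```

The same traversal is what makes a badge "tint everything to its right": any asset in the reachable set inherits the upstream warning, and filtering that set by asset type gives the dashboards to notify first.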

For your own org: if a board exec pulled a number right now and it was wrong, how many hops back would you have to walk by hand to find the cause — and would you even know which dashboards were affected?

In context

Sample feed

Data reliability signal

Cross-stack reliability indicators a live integration could surface beside your own.

dbt · Source freshness: 41 of 42 sources within SLA; shopify_orders 12m past warn · ▼ 1

Snowflake · Query failure rate: 0.18% over 24h, well below the 0.5% guardrail · ▲ stable

Atlan · Undocumented columns: 23 new fields detected, glossary linker auto-mapped 19 · ▲ 19

Monte Carlo · Anomalies caught: 7 this week, median MTTD 5m 20s · ▲ fast

Airflow · Upstream DAG health: 99.2% on-time; nightly_core landed 02:11 · on SLA

PagerDuty · Data on-call: 0 active pages, last incident acknowledged 3h ago · ▲ quiet

Illustrative — wire to your dbt / Monte Carlo / Atlan reliability feed.
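As one way to wire up a feed line like the dbt entry above, the sketch below condenses per-source freshness into a single status string. The record shape (`SourceFreshness`, `age_minutes`, `warn_after_minutes`) is hypothetical and is not dbt's artifact format, so map your own tool's output onto it. Apart from shopify_orders, the source names are invented, and the ages are chosen so that one source is 12 minutes past its warn threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFreshness:
    name: str
    age_minutes: int         # minutes since the source last loaded
    warn_after_minutes: int  # warn threshold for this source

    @property
    def overrun(self):
        """Minutes past the warn threshold, or 0 if within SLA."""
        return max(0, self.age_minutes - self.warn_after_minutes)


def freshness_line(sources):
    """Summarise freshness as 'N of M sources within SLA', naming the worst offender."""
    late = [s for s in sources if s.overrun > 0]
    within = len(sources) - len(late)
    line = f"{within} of {len(sources)} sources within SLA"
    if late:
        worst = max(late, key=lambda s: s.overrun)
        line += f"; {worst.name} {worst.overrun}m past warn"
    return line


# Hypothetical sample data.
sources = [
    SourceFreshness("shopify_orders", age_minutes=72, warn_after_minutes=60),
    SourceFreshness("stripe_charges", age_minutes=15, warn_after_minutes=60),
    SourceFreshness("crm_accounts", age_minutes=40, warn_after_minutes=120),
]
summary = freshness_line(sources)
print(summary)
```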

Watch the walkthrough

See it in action

Four AI agents walk this dashboard — lineage tracing, blast-radius estimation, and auto-certification in one take.
