Exclusive: American Express invests in Traversal, rolling out AI SRE across its global infrastructure

American Express invests $5M in Traversal and rolls out its AI SRE platform globally

UPDATED 09:00 EST / MARCH 04 2026

American Express is expanding its use of Traversal's AI-driven site reliability platform and making a $5 million strategic investment through Amex Ventures. The software will be deployed across AmEx's global technology infrastructure to accelerate incident diagnosis and reduce outage time.

The move speaks to a broader shift in large financial institutions: less manual war-room firefighting, more automated root cause analysis across fragmented observability stacks. Reliability isn't a "nice to have" at this scale; it's the backbone of customer trust.

"American Express operates at a massive scale; reliability and performance are foundational to delivering a seamless customer experience," said Kevin Weber, managing director at Amex Ventures. "In such a complex, distributed infrastructure environment, the focus is always on advancing how operational events are detected, understood and resolved."

What Traversal brings to the table

Traversal, founded by researchers from MIT, Columbia and Cornell, is building what it calls an AI-powered site reliability engineer. The platform ingests logs, metrics and traces across your existing tools to surface root causes and guide engineers to faster recovery.

Co-founder and CEO Anish Agarwal put it plainly: "Observability helps you visualize the data, but finding the root cause is still very labor-intensive. At Fortune 100 enterprises, you may have 50 or 100 engineers jumping into a war room to figure out what happened."

Tool sprawl is a big part of the problem. "Splunk will never give you insight on data stored on Datadog, and Datadog will never give you insight on data stored on Splunk," Agarwal said. Traversal's approach cuts across those silos, applying large language models, AI agents and causal machine learning to infer cause-and-effect instead of chasing noisy correlations.
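
To make the cause-versus-symptom distinction concrete, here is a minimal sketch of a dependency-graph heuristic: among services that are alerting, keep only those with no alerting dependency upstream. This is an illustration only, not Traversal's method (the article does not describe its internals), and it is far simpler than causal machine learning. The service names, the `depends_on` map and the `root_candidates` function are all hypothetical.

```python
from typing import Dict, List, Set


def root_candidates(depends_on: Dict[str, List[str]], alerting: Set[str]) -> Set[str]:
    """Return alerting services that have no alerting transitive dependency."""
    candidates: Set[str] = set()
    for service in alerting:
        # Walk everything this service transitively depends on.
        stack = list(depends_on.get(service, []))
        seen: Set[str] = set()
        upstream_alerting = False
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting and dep != service:
                # Something this service relies on is also broken,
                # so this service is more likely a symptom.
                upstream_alerting = True
                break
            stack.extend(depends_on.get(dep, []))
        if not upstream_alerting:
            candidates.add(service)
    return candidates


# Hypothetical topology: checkout depends on payments and cache; payments depends on db.
depends_on = {
    "checkout": ["payments", "cache"],
    "payments": ["db"],
    "db": [],
    "cache": [],
}
alerting = {"checkout", "payments", "db"}
result = root_candidates(depends_on, alerting)
print(result)  # {'db'}
```

All three services show spikes, but only `db` has nothing broken beneath it. A real system would also need timing, change events and a dependency map it can trust, which is where the harder inference work lies.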

"What typical correlation engines pick up are spikes," Agarwal said. "But understanding which is the root cause versus something that happened because something else broke requires causal reasoning."

Why this matters for operations leaders

Incidents don't wait for clean handoffs. Teams juggle multiple monitoring platforms, partial dashboards and tribal knowledge scattered across chat threads. When a P0 hits, every minute saved on finding the root cause is a minute less of customer impact.

Traversal's pitch is simple: unify telemetry context, reduce the number of people needed per incident, and cut mean time to restore without replacing your current observability stack. It complements existing tools by interpreting them together.

Inside the AmEx partnership

The collaboration includes a commercial deployment and a $5 million strategic investment from Amex Ventures. Selection drivers included Traversal's causal inference engine, AI agents for incident workflow, and a security posture aligned to regulated industries.

AmEx's interest also reflects a bigger push to improve operational resilience with AI across large-scale environments. Traditional observability is necessary, but interpretation speed and accuracy are becoming the deciding factors.

How to evaluate AI-driven incident diagnosis (practical checklist)

Data access model: Can the platform read from all key telemetry sources (logs, metrics, traces) across vendors without duplicating sensitive data? Is data residency respected?
Security and compliance: Validate isolation options (on-prem/VPC), encryption, and controls aligned to PCI-DSS, SOC 2 and ISO 27001. Ensure fine-grained RBAC and audit trails.
Root cause quality: Look for causal explanations, not just correlations or spikes. Demand example traces of "why" and "how" an issue cascaded.
Human-in-the-loop: Require approvals for any action that touches production. Log every recommendation and resolution step.
Integrations: Confirm connectors to your chat, paging and ITSM tools for seamless handoffs and automated context sharing.
Change awareness: The platform should understand deploys, feature flags and config changes to separate signal from noise.
Time-to-value: Pilot on a constrained but noisy service first. Track improvements in MTTR and engineer-hours per incident before scaling.

Metrics that prove ROI

Mean time to detect (MTTD) and mean time to restore (MTTR)
Time to first plausible root cause hypothesis
Engineers per incident and total incident-hours
Percentage of incidents auto-triaged or auto-diagnosed
False-positive rate from alerts and suggested root causes
Change failure rate tied to releases and config updates
Toil hours reduced (recurring manual work eliminated)
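
A few of these metrics can be computed from basic incident records. The sketch below is one way to do it under stated assumptions: MTTR is measured from the start of impact (some teams measure from detection), and incident-hours assumes every engineer is engaged from detection to restoration. The `Incident` fields and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass
class Incident:
    started: datetime   # when customer impact began
    detected: datetime  # when the first alert fired
    restored: datetime  # when service was restored
    engineers: int      # people pulled into the response


def mean_minutes(deltas: List[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60


def summarize(incidents: List[Incident]) -> Dict[str, float]:
    return {
        # Assumption: both clocks start when impact begins.
        "mttd_min": mean_minutes([i.detected - i.started for i in incidents]),
        "mttr_min": mean_minutes([i.restored - i.started for i in incidents]),
        "engineers_per_incident": mean([i.engineers for i in incidents]),
        # Assumption: all engineers work from detection until restoration.
        "incident_hours": sum(
            i.engineers * (i.restored - i.detected).total_seconds()
            for i in incidents
        ) / 3600,
    }


# Hypothetical sample data.
t0 = datetime(2026, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=70), 4),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=50), 2),
]
summary = summarize(incidents)
print(summary)
```

Capturing the same numbers before and after a pilot gives a baseline to compare against, provided the definitions stay fixed across both periods.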

Risk, governance and controls

AI in production operations needs guardrails. Keep sensitive data out of model training paths, lock down prompts and responses, and enforce identity-scoped actions with approvals.

Demand transparency: every inference should come with evidence, including the queries run, the datasets referenced, and the causal chain it believes led to impact. That audit trail is your safety net during postmortems and regulator reviews.
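
One way to picture that requirement is as a structured record that refuses to enter the audit log unless the evidence is attached. The sketch below is a hypothetical shape, not any vendor's schema; the `Inference` fields and the sample values are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Inference:
    incident_id: str
    root_cause: str
    queries_run: List[str] = field(default_factory=list)
    datasets: List[str] = field(default_factory=list)
    causal_chain: List[str] = field(default_factory=list)

    def is_auditable(self) -> bool:
        # Require at least one entry for each kind of evidence.
        return bool(self.queries_run and self.datasets and self.causal_chain)


def to_audit_line(inference: Inference) -> str:
    """Serialize an inference as one JSON line, rejecting unsupported claims."""
    if not inference.is_auditable():
        raise ValueError("inference is missing supporting evidence")
    return json.dumps(asdict(inference), sort_keys=True)


# Hypothetical example record.
record = Inference(
    incident_id="INC-0001",
    root_cause="connection pool exhaustion in db",
    queries_run=["error rate by service, last 30 minutes"],
    datasets=["application logs", "database metrics"],
    causal_chain=["config change", "db pool exhausted", "payments timeouts"],
)
line = to_audit_line(record)
print(line)
```

Writing one self-contained JSON line per inference keeps the trail easy to search during a postmortem.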

Agentic incident response: where this is headed

Traversal has raised about $53 million to date and is positioning its platform as a base layer for "agentic incident response," where AI agents diagnose first and, over time, remediate with strict guardrails. Think: safe automation that handles the repetitive work while engineers retain control over risky changes.

If you're building toward that future, align on two pillars: high-fidelity causal diagnosis and a clear policy for what agents can do without a human nudge. The second is impossible without the first.
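
The dependency between the two pillars can be sketched as a small policy gate: an agent acts on its own only when the action is low-risk and the diagnosis is confident enough, and everything else waits for a named approver. The risk tiers, the threshold and the `decide` function are hypothetical choices for illustration, not a recommended policy.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    READ_ONLY = 1   # e.g. run a query, gather context
    REVERSIBLE = 2  # e.g. restart a stateless worker
    RISKY = 3       # e.g. roll back a release, change config


# Assumed policy knobs; each organization would set its own.
MAX_AUTONOMOUS_RISK = Risk.READ_ONLY
MIN_CONFIDENCE = 0.9


def decide(risk: Risk, confidence: float, approved_by: Optional[str] = None) -> str:
    """Return 'execute' or 'needs_approval' for a proposed agent action."""
    if approved_by is not None:
        # A human has signed off, so the action may proceed.
        return "execute"
    if risk.value <= MAX_AUTONOMOUS_RISK.value and confidence >= MIN_CONFIDENCE:
        # Low-risk action backed by a confident diagnosis.
        return "execute"
    return "needs_approval"


print(decide(Risk.READ_ONLY, 0.95))                 # execute
print(decide(Risk.RISKY, 0.99))                     # needs_approval
print(decide(Risk.RISKY, 0.99, approved_by="sre"))  # execute
```

The confidence input is only meaningful if the diagnosis behind it is trustworthy, which is why the policy cannot stand without high-fidelity diagnosis.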

For deeper background, the Google SRE incident response guide is a solid reference, and a quick primer on causal inference helps clarify why correlation-based alerting so often misleads.

Bottom line

AmEx's deployment and investment signal a clear message to ops leaders: the bottleneck isn't data collection; it's interpretation at the speed of impact. Tools that explain "why" across fragmented stacks will set the pace for reliability at scale.

