American Express invests $5M in Traversal and rolls out its AI SRE platform globally
UPDATED 09:00 EST / MARCH 04 2026
American Express is expanding its use of Traversal's AI-driven site reliability platform and making a $5 million strategic investment through Amex Ventures. The software will be deployed across AmEx's global technology infrastructure to accelerate incident diagnosis and reduce outage time.
The move speaks to a broader shift in large financial institutions: less manual war-room firefighting, more automated root cause analysis across fragmented observability stacks. Reliability isn't a "nice to have" at this scale; it's the backbone of customer trust.
"American Express operates at a massive scale; reliability and performance are foundational to delivering a seamless customer experience," said Kevin Weber, managing director at Amex Ventures. "In such a complex, distributed infrastructure environment, the focus is always on advancing how operational events are detected, understood and resolved."
What Traversal brings to the table
Traversal, founded by researchers from MIT, Columbia and Cornell, is building what it calls an AI-powered site reliability engineer. The platform ingests logs, metrics and traces across your existing tools to surface root causes and guide engineers to faster recovery.
Co-founder and CEO Anish Agarwal put it plainly: "Observability helps you visualize the data, but finding the root cause is still very labor-intensive. At Fortune 100 enterprises, you may have 50 or 100 engineers jumping into a war room to figure out what happened."
Tool sprawl is a big part of the problem. "Splunk will never give you insight on data stored on Datadog, and Datadog will never give you insight on data stored on Splunk," Agarwal said. Traversal's approach cuts across those silos, applying large language models, AI agents and causal machine learning to infer cause-and-effect instead of chasing noisy correlations.
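The difference between a root cause and a downstream symptom can be illustrated with a toy example. The sketch below is not Traversal's method and involves no machine learning; it assumes a known service dependency map and a set of services currently showing anomalies (both hypothetical), and keeps only the anomalous services whose direct dependencies all look healthy.

```python
from typing import Dict, Set


def root_cause_candidates(depends_on: Dict[str, Set[str]], anomalous: Set[str]) -> Set[str]:
    """Return anomalous services that have no anomalous direct dependency.

    A service that is alerting while one of its dependencies is also alerting
    is treated as a likely symptom rather than a likely cause.
    """
    candidates = set()
    for service in anomalous:
        upstream = depends_on.get(service, set())
        if not (upstream & anomalous):
            candidates.add(service)
    return candidates


# Hypothetical topology: each service maps to the services it calls.
DEPENDS_ON = {
    "checkout": {"payments", "cart"},
    "payments": {"ledger-db"},
    "cart": {"cache"},
    "ledger-db": set(),
    "cache": set(),
}

# Hypothetical alert state: three services are spiking at once.
ANOMALOUS = {"checkout", "payments", "ledger-db"}

print(sorted(root_cause_candidates(DEPENDS_ON, ANOMALOUS)))
```

A correlation-only view would surface all three spiking services. The dependency check narrows the list to the one service with nothing unhealthy beneath it. Real systems rarely have a complete, accurate dependency map, which is part of why the problem is hard.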
"What typical correlation engines pick up are spikes," Agarwal said. "But understanding which is the root cause versus something that happened because something else broke requires causal reasoning."
Why this matters for operations leaders
Incidents don't wait for clean handoffs. Teams juggle multiple monitoring platforms, partial dashboards and tribal knowledge scattered across chat threads. When a P0 hits, every minute saved on finding the root cause is a minute less of customer impact.
Traversal's pitch is simple: unify telemetry context, reduce the number of people needed per incident, and cut mean time to restore without replacing your current observability stack. It complements existing tools by interpreting them together.
Inside the AmEx partnership
The collaboration includes a commercial deployment and a $5 million strategic investment from Amex Ventures. Selection drivers included Traversal's causal inference engine, AI agents for incident workflow, and a security posture aligned to regulated industries.
AmEx's interest also reflects a bigger push to improve operational resilience with AI across large-scale environments. Traditional observability is necessary, but interpretation speed and accuracy are becoming the deciding factors.
How to evaluate AI-driven incident diagnosis (practical checklist)
- Data access model: Can the platform read from all key telemetry sources (logs, metrics, traces) across vendors without duplicating sensitive data? Is data residency respected?
- Security and compliance: Validate isolation options (on-prem/VPC), encryption, and controls aligned to PCI-DSS, SOC 2 and ISO 27001. Ensure fine-grained RBAC and audit trails.
- Root cause quality: Look for causal explanations, not just correlations or spikes. Demand example traces of "why" and "how" an issue cascaded.
- Human-in-the-loop: Require approvals for any action that touches production. Log every recommendation and resolution step.
- Integrations: Confirm connectors to your chat, paging and ITSM tools for seamless handoffs and automated context sharing.
- Change awareness: The platform should understand deploys, feature flags and config changes to separate signal from noise.
- Time-to-value: Pilot on a constrained but noisy service first. Track improvements in MTTR and engineer-hours per incident before scaling.
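The change-awareness item in the checklist above can be made concrete with a small sketch. It assumes a hypothetical `ChangeEvent` record and a 30-minute lookback window (both illustrative choices, not a vendor API), and lists the deploys, flag flips and config changes that landed shortly before an incident began.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ChangeEvent:
    kind: str      # hypothetical values: "deploy", "feature_flag", "config"
    service: str
    at: datetime


def recent_changes(
    changes: List[ChangeEvent],
    incident_start: datetime,
    lookback: timedelta = timedelta(minutes=30),  # assumed window, tune per service
) -> List[ChangeEvent]:
    """Return changes inside the lookback window, most recent first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Hypothetical data for illustration only.
INCIDENT_START = datetime(2025, 1, 15, 9, 0)
CHANGES = [
    ChangeEvent("deploy", "payments", datetime(2025, 1, 15, 8, 50)),
    ChangeEvent("config", "cache", datetime(2025, 1, 15, 6, 0)),
    ChangeEvent("feature_flag", "checkout", datetime(2025, 1, 15, 8, 58)),
    ChangeEvent("deploy", "cart", datetime(2025, 1, 15, 9, 5)),
]

for change in recent_changes(CHANGES, INCIDENT_START):
    print(change.kind, change.service, change.at.isoformat())
```

A change that falls inside the window is a lead to investigate, not a verdict. The config change from three hours earlier and the deploy after the incident started are both excluded here.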
Metrics that prove ROI
- Mean time to detect (MTTD) and mean time to restore (MTTR)
- Time to first plausible root cause hypothesis
- Engineers per incident and total incident-hours
- Percentage of incidents auto-triaged or auto-diagnosed
- False-positive rate from alerts and suggested root causes
- Change failure rate tied to releases and config updates
- Toil hours reduced (recurring manual work eliminated)
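Several of the metrics above fall out of a handful of timestamps per incident. The sketch below assumes a hypothetical `Incident` record and measures both MTTD and MTTR from the moment the incident started; some teams measure MTTR from detection instead, so treat the definitions as assumptions to align on before comparing numbers.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Incident:
    started: datetime    # when impact began
    detected: datetime   # when the first alert or report arrived
    restored: datetime   # when service was back to normal
    engineers: int       # people pulled into the response


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def summarize(incidents: List[Incident]) -> Dict[str, float]:
    """Compute baseline reliability metrics for a non-empty list of incidents."""
    return {
        "mttd_min": mean(minutes_between(i.started, i.detected) for i in incidents),
        "mttr_min": mean(minutes_between(i.started, i.restored) for i in incidents),
        "engineers_per_incident": mean(i.engineers for i in incidents),
        # Engineer-hours spent from detection to restoration, summed over incidents.
        "incident_hours": sum(
            i.engineers * minutes_between(i.detected, i.restored) / 60 for i in incidents
        ),
    }


# Hypothetical incidents for illustration only.
INCIDENTS = [
    Incident(datetime(2025, 1, 15, 10, 0), datetime(2025, 1, 15, 10, 5),
             datetime(2025, 1, 15, 10, 45), engineers=10),
    Incident(datetime(2025, 1, 16, 14, 0), datetime(2025, 1, 16, 14, 15),
             datetime(2025, 1, 16, 15, 15), engineers=4),
]

print(summarize(INCIDENTS))
```

Capturing these numbers before a pilot starts gives you a baseline to compare against once the tool is in place.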
Risk, governance and controls
AI in production operations needs guardrails. Keep sensitive data out of model training paths, lock down prompts and responses, and enforce identity-scoped actions with approvals.
Demand transparency: every inference should come with evidence, including the queries run, the datasets referenced, and the causal chain it believes led to impact. That audit trail is your safety net during postmortems and regulator reviews.
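One way to hold a tool to that standard is to define the record you expect back for every diagnosis. The sketch below is a hypothetical schema, not any vendor's output format; the field names and example values are assumptions chosen to mirror the evidence described above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DiagnosisRecord:
    incident_id: str
    hypothesis: str
    queries_run: List[str] = field(default_factory=list)   # exact queries issued
    datasets: List[str] = field(default_factory=list)      # sources consulted
    causal_chain: List[str] = field(default_factory=list)  # ordered cause-to-impact steps

    def is_auditable(self) -> bool:
        """A diagnosis counts as auditable only if all three kinds of evidence are present."""
        return bool(self.queries_run and self.datasets and self.causal_chain)

    def to_json(self) -> str:
        """Serialize for an append-only audit log."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example for illustration only.
record = DiagnosisRecord(
    incident_id="INC-0001",
    hypothesis="Connection pool exhaustion in ledger-db after a config change",
    queries_run=["service=ledger-db error_rate last 30m"],
    datasets=["logs", "metrics"],
    causal_chain=[
        "pool size lowered in config",
        "ledger-db connections exhausted",
        "payments requests time out",
        "checkout error rate rises",
    ],
)

print(record.is_auditable())
print(record.to_json())
```

Rejecting any recommendation where `is_auditable()` is false is a simple gate that keeps unexplained conclusions out of postmortems.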
Agentic incident response: where this is headed
Traversal has raised about $53 million to date and is positioning its platform as a base layer for "agentic incident response," where AI agents diagnose first and, over time, remediate with strict guardrails. Think: safe automation that handles the repetitive work while engineers retain control over risky changes.
If you're building toward that future, align on two pillars: high-fidelity causal diagnosis and a clear policy for what agents can do without a human nudge. The second is impossible without the first.
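The dependency between those two pillars can be written down as a policy. The sketch below is a minimal illustration under stated assumptions: the risk tiers, the confidence score and the 0.9 threshold are all hypothetical, and anything that touches production still requires human approval.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"      # e.g. running a query or fetching a dashboard
    REVERSIBLE = "reversible"    # e.g. toggling a feature flag back
    DESTRUCTIVE = "destructive"  # e.g. deleting data or failing over a database


def decide(action_risk: Risk, diagnosis_confidence: float, min_confidence: float = 0.9) -> str:
    """Return "auto", "needs_approval" or "blocked" for a proposed agent action.

    Read-only diagnostics run without a human. Any action that changes
    production is proposed for approval only when the diagnosis behind it is
    strong enough; otherwise it is not proposed at all.
    """
    if action_risk is Risk.READ_ONLY:
        return "auto"
    if diagnosis_confidence < min_confidence:
        return "blocked"
    return "needs_approval"


print(decide(Risk.READ_ONLY, 0.2))
print(decide(Risk.REVERSIBLE, 0.95))
print(decide(Risk.DESTRUCTIVE, 0.5))
```

Widening what counts as "auto" over time is a policy decision, and this structure keeps it explicit and reviewable rather than buried in agent prompts.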
For deeper background, the Google SRE incident response guide is a solid reference, and a quick primer on causal inference helps clarify why correlation-based alerting so often misleads.
Bottom line
AmEx's deployment and investment send a clear message to ops leaders: the bottleneck isn't data collection; it's interpretation at the speed of impact. Tools that explain "why" across fragmented stacks will set the pace for reliability at scale.