Coinbase suffers multi-hour trading outage after AWS data center cooling failure

Coinbase went down for hours on May 7 after cooling systems failed at an AWS data center, knocking out spot trading, derivatives, and account access. The outage hit as the company was cutting 700 staff to expand AI-led operations.

Published on: May 11, 2026

Multi-hour outage raises questions about Coinbase's AI-led operations pivot

Coinbase experienced a multi-hour trading outage on May 7 after cooling systems failed inside an Amazon Web Services data center, exposing vulnerabilities in the exchange's infrastructure strategy. The failure disabled spot trading, derivatives, and account access across Retail, Advanced, Institutional, and Prime.

Internal monitors detected widespread quote failures at 23:50 UTC. Engineers immediately escalated multiple Sev1 incidents as customers lost access to trading, balance updates, and exchange functionality.
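
Exchanges generally catch this class of failure with staleness checks on their market-data feeds: if quotes stop updating across many symbols at once, the monitor pages on-call engineers. Below is a minimal sketch of that idea in Python; the thresholds and function names are illustrative assumptions, not details from Coinbase's monitoring stack.

```python
import time

# Hypothetical thresholds -- not Coinbase's actual values.
STALE_THRESHOLD_S = 5.0   # a quote older than this counts as stale
SEV1_STALE_RATIO = 0.5    # widespread failure: over half of symbols stale

def find_stale_symbols(last_quote_ts: dict[str, float], now: float) -> list[str]:
    """Return symbols whose most recent quote is older than the threshold."""
    return [sym for sym, ts in last_quote_ts.items()
            if now - ts > STALE_THRESHOLD_S]

def check_quote_health(last_quote_ts: dict[str, float]) -> str:
    """Classify feed health based on the fraction of stale symbols."""
    stale = find_stale_symbols(last_quote_ts, time.time())
    ratio = len(stale) / max(len(last_quote_ts), 1)
    if ratio >= SEV1_STALE_RATIO:
        return "SEV1"   # widespread quote failure: page on-call immediately
    if stale:
        return "WARN"   # isolated staleness: investigate
    return "OK"
```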

CEO Brian Armstrong said on X that a room overheated in an AWS data center due to multiple chillers failing. He called the outage "never acceptable."

The infrastructure problem

Coinbase designed most of its services to survive the failure of a single AWS availability zone. The exchange itself runs on different infrastructure, concentrated in a single availability zone to minimize latency - a trade-off that places the entire trading system in one physical location.

Two critical systems failed during the thermal event. Hardware supporting the matching engine - the system that processes orders and maintains order books - malfunctioned. The distributed Kafka cluster that routes terabytes of data across all of Coinbase's systems also went down.

Rob Witoff, who heads Coinbase's Platform division, said the matching engine requires quorum across multiple nodes before it can process trades safely. When the data center overheated, not enough nodes remained healthy to achieve quorum, halting all trading.
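
Witoff didn't specify the consensus protocol, but majority quorum is the standard pattern: a cluster of N replicas can make progress only while more than N/2 of them are healthy. A minimal sketch, with the node counts purely illustrative:

```python
def has_quorum(healthy_nodes: int, total_nodes: int) -> bool:
    """Majority quorum: more than half of the replicas must be healthy.

    With 5 nodes, quorum is 3; if overheating leaves only 2 healthy,
    the engine must halt rather than risk processing trades unsafely.
    """
    return healthy_nodes >= total_nodes // 2 + 1

# During a thermal event: 5-node cluster, only 2 nodes still healthy.
assert not has_quorum(healthy_nodes=2, total_nodes=5)  # trading halts
assert has_quorum(healthy_nodes=3, total_nodes=5)      # minimum to resume
```

Halting is the safe behavior here: with fewer than a majority of nodes, the engine can no longer guarantee that every replica agrees on order-book state.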

Recovering Kafka proved especially difficult. The system manages thousands of terabytes daily across a partitioned architecture, requiring extensive manual recovery work. Engineers had to rebuild quorum on new hardware while the broader outage continued.
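
Coinbase hasn't described those recovery steps in detail. In general, one way operators gauge how badly a Kafka cluster is degraded is to scan for partitions that are under-replicated or have lost their leader. A sketch using the open-source confluent_kafka admin client, with a placeholder bootstrap address:

```python
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap address -- not a real Coinbase endpoint.
admin = AdminClient({"bootstrap.servers": "kafka-internal:9092"})

metadata = admin.list_topics(timeout=10)
for topic in metadata.topics.values():
    for p in topic.partitions.values():
        # An in-sync-replica set smaller than the replica set means the
        # partition is under-replicated; leader == -1 means no leader at
        # all, so the partition cannot serve reads or writes.
        if len(p.isrs) < len(p.replicas) or p.leader == -1:
            print(f"{topic.topic} partition {p.id}: "
                  f"leader={p.leader}, isr={p.isrs}, replicas={p.replicas}")
```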

The AI staffing connection

Coinbase announced plans to eliminate 700 employees - roughly 14% of its workforce - and replace manual processes with AI. The timing raises questions about whether those staffing and automation decisions affected the company's capacity to respond to the outage.

Witoff's account suggests the incident required significant manual engineering work. Teams had to develop, test, deploy, and validate solutions under pressure while managing cascading failures across multiple systems.

When the matching engine returned, engineers didn't immediately resume trading. They switched all products to cancel-only mode, verified system health, moved to auction mode, and only then re-enabled trading on the main exchange.
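
That sequence maps naturally onto a small state machine that only permits forward transitions when health checks pass. A sketch with hypothetical state names - Coinbase hasn't published its actual recovery tooling:

```python
from enum import Enum, auto

class MarketState(Enum):
    HALTED = auto()
    CANCEL_ONLY = auto()   # customers may cancel orders but not place them
    AUCTION = auto()       # orders collect and cross at a single price
    TRADING = auto()

# Only forward transitions along the described sequence are allowed,
# plus a halt from any state if health checks regress.
ALLOWED = {
    MarketState.HALTED: {MarketState.CANCEL_ONLY},
    MarketState.CANCEL_ONLY: {MarketState.AUCTION, MarketState.HALTED},
    MarketState.AUCTION: {MarketState.TRADING, MarketState.HALTED},
    MarketState.TRADING: {MarketState.HALTED},
}

def transition(current: MarketState, target: MarketState,
               system_healthy: bool) -> MarketState:
    """Advance one recovery stage, but only if health checks pass."""
    if not system_healthy:
        return MarketState.HALTED
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```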

What operations teams should watch

The outage illustrates a core operations challenge: infrastructure designed for speed can create single points of failure. Coinbase concentrated its exchange in one availability zone specifically to reduce latency, but this choice meant one thermal event could disable the entire trading system.

The recovery process required human judgment at multiple stages - assessing which systems were healthy enough to bring online, deciding when to move between cancel-only and auction modes, and manually recovering partitioned data systems.

Coinbase said no data was lost. The company plans to publish a detailed incident report in several weeks.

