What AWS' Disruption Reveals About AI Infrastructure
Millions of people are feeling the impact of an ongoing AWS disruption. It's a sharp reminder: the backbone of our digital and AI systems is strong until a single region blinks, and then everything connected to it starts to wobble.
AWS confirmed the issue: "We can confirm significant error rates for requests made to the DynamoDB endpoint in the US-EAST-1 Region. This issue also affects other AWS Services in the US-EAST-1 Region as well. During this time, customers may be unable to create or update Support Cases. Engineers were immediately engaged and are actively working on both mitigating the issue and fully understanding the root cause."
Cloud dependency and AI infrastructure
This disruption originated in Northern Virginia (US-EAST-1), touching core services like DynamoDB and EC2, the database and compute layers many teams rent to run apps, pipelines and AI workloads. When those pillars shake, the ripple hits everything.
A partial list of services affected:
- Snapchat
- Fortnite
- Duolingo
- Canva
- Wordle
- Lloyds
- Slack
- monday.com
- Bank of Scotland
- HMRC
- Zoom
- Barclays
- Vodafone
For AI and ML teams, this stalls data ingestion, model inference and automated decisions that depend on real-time signals. If your training jobs, feature store or model endpoints sit in the affected region, you're likely down or degraded.
A pattern of disruption
This isn't new. AWS has had notable incidents: the 2012 Christmas Eve outage that hit Netflix, a major December 2021 disruption, and a June 2023 outage that increased error rates across multiple services, affecting organizations like The Boston Globe and the Associated Press. Even the best-engineered platforms have weak points.
Microsoft Azure has had its moments, too. In January 2023, a network issue took down Teams and Outlook for many users. For regulated sectors, downtime isn't just inconvenient; it can break audit trails and complicate compliance.
Status pages: AWS Service Status · Azure Status
Interconnectivity and business resilience
George Foley, Technical Advisor at ESET Ireland, put it plainly: "When one of the major cloud platforms goes down, it reminds everyone how interconnected modern business systems have become.
"Even if your own website or app isn't hosted on AWS, there's a good chance something you use-from your CRM to your payment processor-is.
"Outages like this highlight the importance of having resilience plans in place, including backups and alternative routes for essential data and services."
For AI teams, the chain is long: data sources, feature stores, training infrastructure, model registry, inference endpoints and downstream integrations. If one link fails, the whole thing can stall.
What to do now: practical steps for IT, dev and data teams
- Design for failure by default: Use multi-AZ, plan for multi-region. Test region evacuation and failover, not just backups. Budget for the extra capacity (see the failover-read sketch after this list).
- Decouple aggressively: Add queues, retries, timeouts, circuit breakers and idempotent writes. Make every external call safe to fail (see the retry and circuit-breaker sketch after this list).
- Protect the data layer: Cross-region replication for databases and object storage. Read-only degrade modes. Backups that are tested, not assumed (see the replication sketch after this list).
- Make AI workloads portable: Containerize models, pin dependencies, script infra with Terraform. Mirror artifacts and feature stores to a second region. Support batch fallbacks when real-time is down.
- Be selective with multi-cloud: Use a second provider for the few things that must stay up (DNS, auth, payments, status page). Keep interfaces provider-neutral where it matters.
- Cache and degrade gracefully: Serve from CDN and local caches. Offer limited functionality, queue writes for later, and keep critical reads online (see the cache-fallback sketch after this list).
- Improve observability: Track SLOs and error budgets. Add synthetic checks for third-party dependencies. Centralize logs and traces across regions (see the synthetic-check sketch after this list).
- Drill incidents like you mean it: Maintain runbooks, on-call rotations and a clear RACI. Run chaos tests and failover game days. Prepare customer comms templates.
- Cover compliance and contracts: Map BCDR requirements to your architecture. Review vendor SLAs and DPAs. Document workarounds for audit continuity.
- Quantify the cost of downtime: Put a number on lost revenue and productivity. Use it to justify redundancy and reset your RTO/RPO targets.
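To make the failover point concrete, here is a minimal Python sketch of a read path that tries a primary region and falls back to a secondary one. It assumes a DynamoDB global table called "features" replicated to both regions; the table name, key name and regions are placeholders, not a prescription.

```python
# Minimal sketch: fail a read over from a primary to a secondary region.
# Assumes a DynamoDB global table named "features" in both regions
# (table, key and region names are hypothetical).
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "eu-west-1"]  # primary first, then fallback

# Short timeouts and a single attempt so a sick region fails fast.
_cfg = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})
_tables = [
    boto3.resource("dynamodb", region_name=r, config=_cfg).Table("features")
    for r in REGIONS
]

def get_features(entity_id: str):
    """Try each region in order; return the first successful read."""
    last_error = None
    for table in _tables:
        try:
            return table.get_item(Key={"entity_id": entity_id}).get("Item")
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # remember the failure and try the next region
    raise RuntimeError("all regions failed") from last_error
```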
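For decoupling, the sketch below shows one way to make an outbound call safe to fail: bounded retries with exponential backoff, a hard timeout, a simple in-process circuit breaker and an idempotency key so retried writes aren't applied twice. The endpoint URL and header name are hypothetical; adapt them to whatever your provider actually supports.

```python
# Minimal sketch: retries, timeout, circuit breaker and idempotency key.
import time
import uuid
import requests

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Allow a trial call once the cool-down period has passed.
        return time.monotonic() - self.opened_at > self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def post_payment(payload: dict, retries: int = 3) -> dict:
    if not breaker.allow():
        raise RuntimeError("circuit open: skip the call and serve a fallback")
    # One key per logical request, reused across retries so the write
    # is applied at most once (header name is provider-specific).
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(retries):
        try:
            resp = requests.post("https://api.example.com/payments",
                                 json=payload, headers=headers, timeout=3)
            resp.raise_for_status()
            breaker.record(ok=True)
            return resp.json()
        except requests.RequestException:
            breaker.record(ok=False)
            time.sleep(2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError("payment call failed after retries")
```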
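For the data layer, this is a rough sketch of enabling S3 cross-region replication with boto3. The bucket names, account ID and IAM role are placeholders; both buckets need versioning enabled, and the role needs replication permissions, before this call will succeed.

```python
# Minimal sketch: mirror an object-storage bucket to a second region.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_replication(
    Bucket="my-feature-store-us-east-1",          # placeholder source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder role
        "Rules": [{
            "ID": "mirror-everything",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter = replicate the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::my-feature-store-eu-west-1"},
        }],
    },
)
```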
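For graceful degradation, a small read-through cache that serves the last known value when the primary store is unreachable can keep critical reads online. In this sketch the `fetch` callable stands in for whatever your real data access looks like.

```python
# Minimal sketch: serve fresh data when possible, stale data when the
# primary store is down, and only fail when there is nothing to serve.
import time
from typing import Callable

CACHE: dict[str, tuple[float, dict]] = {}   # key -> (timestamp, value)
TTL_SECONDS = 60

def get_with_fallback(key: str, fetch: Callable[[str], dict]) -> dict:
    now = time.time()
    cached = CACHE.get(key)
    if cached and now - cached[0] < TTL_SECONDS:
        return cached[1]                 # fresh enough: skip the network call
    try:
        value = fetch(key)               # e.g. a database or internal API read
        CACHE[key] = (now, value)
        return value
    except Exception:
        if cached:
            return cached[1]             # degrade: stale data beats an error page
        raise                            # nothing cached: surface the failure
```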
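And for observability, a basic synthetic check might look like the loop below: probe each third-party dependency on a schedule, record status and latency, and push the results to your metrics backend. The dependency names and URLs are placeholders.

```python
# Minimal sketch: synthetic checks for third-party dependencies.
import time
import requests

DEPENDENCIES = {                          # placeholder names and URLs
    "payments": "https://api.example-payments.com/health",
    "crm": "https://api.example-crm.com/health",
    "model-endpoint": "https://inference.example.com/ping",
}

def probe(name: str, url: str) -> dict:
    started = time.monotonic()
    try:
        resp = requests.get(url, timeout=5)
        ok = resp.status_code < 400
    except requests.RequestException:
        ok = False
    return {"name": name, "ok": ok,
            "latency_ms": round((time.monotonic() - started) * 1000)}

if __name__ == "__main__":
    while True:
        for name, url in DEPENDENCIES.items():
            print(probe(name, url))      # replace with a push to your metrics store
        time.sleep(60)                   # one-minute probe interval
```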
Questions to ask your vendors today
- Which regions and AZs do you depend on? What's the single point of failure?
- What are your RTO/RPO for each critical dependency?
- What's the plan if US-EAST-1 is unavailable for 24-72 hours?
- Can the product run in a degraded or offline mode? What does that look like?
- Where are model artifacts, secrets and keys stored and replicated?
- How do we fail back safely, and who signs off?
The takeaway
Outages will happen. The difference between a blip and a brand hit is decided long before the status page turns red. Build for failure, test your assumptions and keep the core of your stack portable.
If you're upskilling teams on cloud, MLOps and incident readiness, browse industry-backed programs here: Popular AI and Cloud Certifications.