US-UK Red Teaming Exposes AI Agent Hijacks and Universal Jailbreaks at OpenAI and Anthropic

US and UK labs probed OpenAI and Anthropic, exposing agent hijacks, prompt injection, and guardrail gaps. Agencies need red-team access, context security, and incident SLAs.

Categorized in: AI News Government
Published on: Sep 16, 2025
US-UK Red Teaming Exposes AI Agent Hijacks and Universal Jailbreaks at OpenAI and Anthropic

US and UK researchers quietly stress-tested commercial AI: what government teams need to know

OpenAI and Anthropic spent the past year giving U.S. and U.K. government labs deep access to their systems. The goal: probe for failure modes that criminals, foreign intelligence, or insiders could exploit.

According to the companies, researchers at NIST's U.S. Center for AI Standards for Innovation and the U.K. AI Security Institute tested models, classifiers, and even guardrail-free prototypes. The focus was abuse resistance in high-risk domains and how easily agents can be hijacked via context poisoning and prompt injection.

What the testing covered

  • OpenAI: Evaluations of ChatGPT and newer agent products across cyber and chemical-biological risk areas. Work expanded to red-teaming agent tooling and new pipelines to find and fix vulnerabilities with external evaluators.
  • Anthropic: Ongoing access to Claude models and a classifier used to detect jailbreaks. Testing targeted prompt injections, hidden instructions in context, and universal jailbreak methods.

Key findings government leaders should internalize

  • Compound vulnerabilities matter: OpenAI reports NIST surfaced two novel issues that, chained with a known AI hijacking technique, let testers take over another user's agent about 50% of the time, potentially controlling the agent's accessible computer session and impersonating the user on logged-in sites.
  • Agent context is a critical attack surface: Multiple exploit paths relied on poisoning the data the model or agent uses to decide actions, not breaking the base model weights.
  • Guardrail bypasses evolve: Anthropic says a universal jailbreak technique slipped past standard detection, prompting an overhaul of their safeguard architecture rather than a simple patch.
  • Security maturity is improving: Independent researchers report newer commercial models are harder to jailbreak than earlier releases. Coding models and some open-source systems, however, remain easier to steer into unsafe outputs.

Why this matters for public-sector programs

AI use in government is shifting from prototypes to production systems that touch sensitive data and mission workflows. The findings show that:

  • Attackers target the context layer (files, tools, browsing, APIs) more than the base model.
  • Red-teaming access-to agents, tools, and safety filters-is required to see real risk, not just demo risk.
  • Point fixes age fast; vendors need architecture-level responses and continuous evaluation cycles.

Immediate actions for agencies and programs

  • Adopt a standard: Map AI projects to the NIST AI Risk Management Framework and require vendors to show alignment in documentation and testing. NIST AI RMF
  • Contract for red-team access: Bake into SOWs the right to conduct or commission independent red-teaming against agents, tools, retrieval systems, and guardrails, including access to non-production builds and evaluation APIs.
  • Demand evaluation artifacts: Require structured reports on jailbreak resistance, prompt injection defenses, bio/cyber misuse tests, and incident postmortems with remediation timelines.
  • Secure the context layer: Gate agent tool use with allowlists, sandbox execution, strong auth, scoped tokens, and egress controls. Treat RAG sources and plugins as high-trust dependencies.
  • Set incident SLAs: Define vendor obligations for vuln disclosure, temporary mitigations, model or guardrail rollbacks, and notification windows.
  • Threat-model agents: Include impersonation, session hijack, and lateral movement objectives in tabletop exercises and penetration tests.
  • Train users and builders: Teach staff to spot data-poisoning and prompt-injection patterns. Provide safe prompting norms and approval flows for new tools and data connectors.

Policy signals vs. on-the-ground work

Some leaders have deprioritized public messaging on AI safety, and both U.S. and U.K. institutes dropped "safety" from their names. Despite that, the technical collaborations show a steady push to test and harden models in areas that intersect with national security, infrastructure, and public services.

What to ask vendors now

  • What agent-level red-team results can you share from the past 90 days? What changed because of those findings?
  • How do you detect and block prompt injection and context poisoning across RAG, tools, and browsing?
  • Do you support sandboxed execution with clear permissioning, audit logs, and kill switches for agent actions?
  • Can you provide guardrail-free test builds in a controlled environment for government evaluators?
  • What is your vulnerability disclosure policy and rollback plan for faulty safeguards or models?
  • How are non-production prototypes and safety filters validated before they reach mission environments?

Additional resources

Upskill your team

If you're building AI-enabled services or running evaluations, structured training helps standardize safety practices across program, security, and acquisition teams. See curated options here: AI certifications and training.


Get Daily AI News

Your membership also unlocks:

700+ AI Courses
700+ Certifications
Personalized AI Learning Plan
6500+ AI Tools (no Ads)
Daily AI News by job industry (no Ads)