The Pentagon vs. Anthropic: What's Really at Stake for Government
In 2025, the Pentagon put Anthropic's model, Claude, to work on classified systems under clear guardrails: no fully autonomous weapons and no domestic mass surveillance. In early 2026, senior officials pushed to replace those limits with "all lawful uses." That's where the relationship broke.
This is more than a contract fight. It's a live test of who sets boundaries for AI that can influence targeting, surveillance, and national decision-making-the vendor that built it, or the government that buys it. If you work in government, this clash previews the next decade of procurement, oversight, and operational risk.
What actually happened
- Anthropic agreed to support classified work, with red lines against autonomous weapons and domestic bulk-data analysis.
- Officials led by Emil Michael sought broader rights. Tensions rose as xAI's Grok joined GenAI.mil and an OpenAI deal took shape.
- DoD threatened to cancel contracts and floated supply-chain restrictions that could bar contractors from using Claude at all.
- Anthropic refused to cross its last red lines. Public blasts followed. DoD then labeled Anthropic a supply-chain risk. Lawsuits are now underway.
Underneath the drama sits a hard problem: large language models don't act like normal software. They can refuse, improvise, or make judgment calls-useful in analysis, risky in command.
Why this matters to government work
Control vs. capability. AI speeds targeting analysis, triage, and intel synthesis. But these models are a de facto counterparty, not a static tool. They can decline tasks or frame answers with "judgment," especially on political or lethal questions.
Human-in-the-loop is policy, not a cure-all. If a model's output drives decisions at machine speed, the person in the loop can become a rubber stamp. Guardrails need to be technical, procedural, and contractual-together.
Domestic surveillance gray zone. Even if an agency avoids "surveillance," purchasing troves of commercial data can create surveillance in effect. An AI that can stitch those datasets together makes scale and precision trivial. One model can do what would take thousands of analysts.
Two frames for AI: your choice sets your risk
AI as normal software. If you see models as tools, current law and standard clauses feel sufficient. You focus on uptime, SLAs, and feature controls. Alignment is "just engineering."
AI as special tech. If you see models as semi-autonomous agents, you bake in hard constraints, refusal behaviors, and non-negotiable limits on sensitive uses. You test for misalignment like you test for adversarial cyber-because failure modes are weird, fast, and public.
Practical implications for agencies and contractors
- Be explicit in contracts. Define "autonomous weapon," "bulk domestic data analysis," and "kill chain" boundaries. Remove vague modifiers like "as appropriate."
- Set mission-scoped use. Allow classified use for intel and defensive analysis while excluding lethal autonomy and bulk domestic dossiers. Put that in plain language.
- Codify refusal behavior. Document when the model must decline (e.g., partisan advocacy, unlawful orders, lethal action without human authorization) and how escalation works.
- Time-critical ops policy. For sub-minute windows (missile defense, kinetic fires), prefer validated systems over LLMs. Use models for pre-planning, not trigger-time calls.
- Data governance. Specify sources allowed, brokered data rules, minimization, retention, and audit trails. Require model-logged data lineage.
- Safety stack + process. Pair technical rails (filters, policy classifiers, tool-use limits) with procedural rails (two-person rule, approvals, red-team signoff).
- Red-teaming for misalignment. Test not just for jailbreaks but for deception, goal drift, privacy leakage, and unapproved tool use. Require periodic third-party audits.
- Continuity planning. Assume model swaps. Maintain a multi-model strategy, abstraction layers, and exit ramps if a vendor is restricted or deauthorized.
- Supply-chain clauses. Clarify whether "risk" designations apply only to government workloads or to all vendor operations. Plan for second-order effects on primes and subs.
- Dispute and escalation. Predefine how conflicts over prohibited uses are resolved under time pressure, who decides, and how logs are preserved for oversight.
10 questions to ask before an LLM touches mission-critical workflows
- What tasks are permitted, prohibited, and under what legal authorities?
- Where can the model refuse, and what is the human escalation path?
- What data sources feed it, and how is domestic data brokered or filtered?
- What tools can it call (search, databases, fires systems), and who approves access?
- How are prompts, outputs, and tool calls logged and reviewed?
- How do we test for deception, hallucination, and silent failure?
- What is the fallback when the model is down, restricted, or compromised?
- How do we isolate the model from lethal authority without explicit human action?
- What are the conditions to pause usage, and who has that authority?
- How do we rotate vendors without crippling continuity or security?
Where this likely lands
Courts will sort the supply-chain designation. Regardless of the verdict, expect vendors to push "safety stacks" that enforce limits technically rather than contractually. Agencies will still need unambiguous language to protect against overreach and to keep models away from time-critical lethal decisions.
The deeper issue won't go away: the state wants obedient systems; advanced models will sometimes talk back. Your job is to set rules that keep speed and capability without handing judgment to software in the moments that matter most.
Resources
- Claude
- AI for Government
- FISA Court overview (U.S. Courts)
- Defense Production Act (50 U.S.C. Chapter 55)
Your membership also unlocks: