Army moves to assess AI's "unpredictable behaviors" and safeguard autonomous systems
The Army has awarded an approximately $6.3 million other transaction agreement to develop GUARD (Generative Unwanted Activity Recognition and Defense), a prototype software effort to detect and analyze "unpredictable" AI behavior in autonomous and AI-enabled systems.
Advanced Technology International, Inc. is listed as the managing prime, with Battelle Memorial Institute performing the work. The goal is straightforward: evaluate risk before models hit the field, and keep autonomous capabilities trustworthy and effective in real operations.
What GUARD is building
GUARD focuses on identifying risky or emergent behaviors from AI models that might slip past traditional testing. The software aims to generate AI "risk profiles" and flag behaviors that deviate from policy or mission intent.
According to the award notice, the program will draw on neural network explainability, AI cognitive research, and game AI methods. Those inputs feed Behavior-Event Graph data structures that map policies, inputs, actions, and events to analyze potential emergent behavior before, during, and after field tests.
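The award notice doesn't describe the data model in detail, but a Behavior-Event Graph can be pictured as a typed graph linking policies, inputs, actions, and events. Here is a minimal sketch in Python; the node kinds, field names, and the coverage check are illustrative assumptions, not the GUARD implementation:

```python
from dataclasses import dataclass, field

# Minimal Behavior-Event Graph sketch. Node kinds, fields, and the
# coverage check below are illustrative assumptions, not the GUARD data model.
@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str    # "policy", "input", "action", or "event"
    label: str

@dataclass
class BehaviorEventGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: set = field(default_factory=set)    # (src_id, dst_id, relation)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def link(self, src: str, dst: str, relation: str) -> None:
        self.edges.add((src, dst, relation))

    def ungoverned_actions(self):
        # Flag actions with no "governs" edge from any policy node:
        # a crude proxy for behavior that deviates from stated intent.
        governed = {dst for src, dst, rel in self.edges
                    if self.nodes[src].kind == "policy" and rel == "governs"}
        return [n for n in self.nodes.values()
                if n.kind == "action" and n.node_id not in governed]

# Map one event-triggered action and check whether any policy covers it.
beg = BehaviorEventGraph()
beg.add_node(Node("p1", "policy", "Stay within the designated corridor"))
beg.add_node(Node("e1", "event", "GPS degradation detected"))
beg.add_node(Node("a1", "action", "Vehicle replans route off-corridor"))
beg.link("e1", "a1", "triggered")
print([n.label for n in beg.ungoverned_actions()])
# ['Vehicle replans route off-corridor']
```

Even this toy version shows the appeal: once policies, inputs, actions, and events live in one structure, "is this behavior covered by intent?" becomes a query rather than a judgment call buried in test notes.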
Why this matters now
As AI models move from labs into units, small failures can scale. Unpredictable behavior in targeting, navigation, or communications can undermine trust - or worse, create unsafe outcomes.
This effort signals growing interest in practical guardrails, not just policy. While government-wide "AI assurance" strategies are still maturing, the test and evaluation community is already tracking unexpected behaviors in AI-enabled and autonomous systems.
NIST's AI Risk Management Framework and the DoD's Responsible AI principles offer useful anchors for this kind of work. You can expect GUARD to align with both the measurement mindset and operational test realities.
What's included in the prototype scope
Beyond the core software, the notice references associated prototype activities tied to ground-launched capabilities. That includes custom launchers, mobile containers, warhead solutions and customization, and component fabrication to tune lethality, impulse, and fragmentation signatures while improving manufacturability.
Read that as a signal: GUARD isn't just academic tooling. It's meant to run where real systems are tested and fielded.
Contract details and next steps
The OTA names Advanced Technology International, Inc. as managing prime, with work in partnership with Battelle. If milestones are met, the Army anticipates follow-on awards without further competition.
Deliverables center on software that can create actionable risk profiles for AI models and systems. Expect iterative field-testing and analysis, with emphasis on pre-mission validation and ongoing anomaly detection.
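The notice doesn't define what a risk profile looks like, but it is reasonable to picture a structured record that ties observed behaviors to the policies they conflict with and the evidence behind each finding. A sketch under those assumptions (all field names hypothetical):

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative shape for an AI "risk profile" record; the fields and the
# review gate are assumptions, not the GUARD deliverable format.
@dataclass
class RiskFinding:
    behavior: str          # observed or simulated behavior
    policy_violated: str   # constraint or mission intent it conflicts with
    severity: str          # e.g. "low" / "medium" / "high"
    evidence: str          # pointer to test run, telemetry, or analysis artifact

@dataclass
class RiskProfile:
    model_id: str
    model_version: str
    test_phase: str        # "pre-mission", "field test", "post-mission"
    findings: List[RiskFinding] = field(default_factory=list)

    def requires_review(self) -> bool:
        # Simple gate: any high-severity finding blocks fielding until reviewed.
        return any(f.severity == "high" for f in self.findings)

profile = RiskProfile("nav-planner", "1.4.2", "pre-mission")
profile.findings.append(RiskFinding(
    behavior="Re-routes through restricted zone under GPS denial",
    policy_violated="Stay within the designated corridor",
    severity="high",
    evidence="run-0217/telemetry.parquet",
))
print(profile.requires_review())  # True
```

The useful property is traceability: each finding points back to a policy and to evidence, so the same artifact can serve engineers during debugging and commanders during fielding decisions.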
Implications for government, researchers, and industry
- Program managers: Build AI risk profiling into your test plans early. Treat "unexpected behavior" tracking as a standing requirement, not a one-time gate.
- Test and evaluation teams: Prepare for Behavior-Event Graphs or similar structures in your data pipelines. You'll need clean mappings from policy to inputs, actions, and outcomes.
- Data and model leads: Document model intents, constraints, and known failure modes. GUARD-like tooling will be far more effective with clear policy-to-model traceability.
- Policy and oversight: Connect assurance metrics to operational risk. A shared language between engineers, testers, and commanders will make these tools matter.
- Research partners: There's room to push on explainability methods that translate to field conditions - under noise, adversarial pressure, and incomplete data.
How Behavior-Event Graphs help in practice
They force clarity. Policies, constraints, and mission tasks are turned into graph nodes, then linked to model inputs and system actions. That gives analysts a structured way to spot conflicts and emergent behavior.
In testing, those same graphs highlight when a model "works," but for the wrong reason. That's where many of the nastiest surprises live.
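One way to make "works for the wrong reason" concrete is to compare the input that actually triggered an action against the inputs the policy intended to trigger it. A toy check along those lines, assuming graph edges like the earlier sketch (the trigger/intent split is an assumption, not GUARD's analysis method):

```python
# Toy "right outcome, wrong reason" check. Edge relations and the
# trigger/intent distinction are illustrative assumptions.
def wrong_reason_actions(edges, intended_triggers):
    """edges: set of (src, dst, relation); intended_triggers: {action: {allowed inputs}}."""
    flagged = []
    for action, allowed_inputs in intended_triggers.items():
        actual = {src for src, dst, rel in edges
                  if dst == action and rel == "triggered"}
        if actual and not actual & allowed_inputs:
            flagged.append((action, actual))
    return flagged

edges = {("glare_on_sensor", "target_confirmed", "triggered")}
intended = {"target_confirmed": {"radar_track", "operator_confirmation"}}
print(wrong_reason_actions(edges, intended))
# [('target_confirmed', {'glare_on_sensor'})] -> outcome reached via an unintended cue
```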
What to watch
- Integration with existing test ranges and instrumentation, including telemetry and post-mission analysis workflows.
- How GUARD handles multimodal inputs and multi-agent scenarios where emergent behavior is most likely.
- Whether outputs become standard artifacts in Army acquisition - e.g., required AI risk profiles prior to fielding.
- Alignment with DoD Responsible AI guidance and cross-government assurance efforts.
For policy context, see the DoD Responsible AI resources from the Chief Digital and AI Office: ai.mil/policies. For risk management practices, NIST's framework remains a useful baseline: NIST AI RMF.
If you're building team capability around AI assurance and testing skills, browse curated programs by role at Complete AI Training.