AI Agents May Redefine Risk in Industrial Operations
Cyber threats already jeopardize uptime and output. Add AI agents that can change temperatures, timings and setpoints on their own, and the risk shifts from downtime to physical harm.
Gartner's Wam Voster warns that agent-based systems could execute the wrong decision at machine speed. His example is blunt: if there's a configuration mistake and an agent increases a temperature by 200 degrees instead of 2, "the machines might explode or kill people."
Why agentic AI raises the stakes
These systems don't just analyze; they act. They pull from external data (weather, material quality, upstream process status) and adjust process variables in real time.
That feedback loop is powerful and fragile. Bad inputs, bad objectives or bad configs can cascade into unsafe states before a human can react.
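A minimal sketch of that loop makes the fragility concrete: one implausible external reading, passed straight through, becomes a real actuation. All names, ranges and the gain here are hypothetical, not values for any real process.

```python
def sanity_check(reading, low, high):
    """Reject out-of-range external inputs before they reach the agent."""
    return low <= reading <= high

def next_setpoint(current, external_temp_c):
    # Without this gate, a faulty sensor or bad config cascades
    # directly into the process variable the agent controls.
    if not sanity_check(external_temp_c, -40.0, 60.0):
        raise ValueError("implausible external reading; hold last safe setpoint")
    # Illustrative gain: nudge the setpoint toward ambient conditions.
    return current + 0.1 * (external_temp_c - 20.0)

print(next_setpoint(180.0, 25.0))  # small, bounded move: 180.5
```

The point is where the check sits: before the agent's decision, not after the damage.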
The real root cause: human error and configuration drift
Most cyber-physical incidents still trace back to people and process. AI doesn't remove that risk; it multiplies it if guardrails are missing.
- Enforce change control: approvals, peer review and rollback plans for every model, prompt, rule set and integration.
- Set hard safety limits in PLC/SIS that an AI agent cannot override (temperature, pressure, speed, torque, valve position).
- Use dual-control for high-impact actions. Require human confirmation or a second service check before execution.
- Test in a digital twin or simulation first. Then run "shadow mode" where the agent recommends while humans decide.
- Apply step-change and rate limits to prevent large, sudden moves even if the agent requests them.
- Provide a physical and logical kill switch. Default to a known safe state on anomaly or comms loss.
- Log everything: inputs, decisions, actions and outcomes. Audit weekly; spot trends early.
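Two of the guardrails above, hard limits and step-change caps, can be sketched in a few lines. The bounds here are made-up examples; in practice the same limits must also live in the PLC/SIS, outside anything the agent can touch.

```python
HARD_MIN, HARD_MAX = 20.0, 250.0  # enforced independently in PLC/SIS too
MAX_STEP = 2.0                     # per-cycle change limit

def apply_request(current, requested):
    """Return the setpoint actually applied, never exceeding guardrails."""
    # Cap the size of any single move, regardless of what was asked for.
    step = max(-MAX_STEP, min(MAX_STEP, requested - current))
    # Clamp the result to the hard safety envelope.
    return max(HARD_MIN, min(HARD_MAX, current + step))

# Voster's "200 instead of 2" misconfiguration becomes a 2-degree move:
print(apply_request(100.0, 300.0))  # -> 102.0
```

The agent can still be wrong, but it can only be wrong by two degrees per cycle, which buys the time a human needs to react.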
Zero trust has limits in OT - use layered defenses
Full zero trust is hard on legacy lines and latency-sensitive controls. That doesn't mean do nothing. Adapt the principles.
- Segment networks by cell/zone. Use allowlists instead of open VLANs. Consider unidirectional gateways where possible.
- Lock down management access with least privilege and MFA. Rotate and vault credentials tied to agents and service accounts.
- Use device identity, signed configs and code. Block unsigned model or policy updates.
- Enforce protocol allowlists on firewalls. Limit agent APIs to explicit verbs and bounded parameters.
- Design local safety interlocks independent of the network or AI service. Safety first, connectivity second.
- Plan patch windows and compensating controls for systems you can't update quickly.
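One way to express "explicit verbs and bounded parameters" is an allowlist the gateway checks before anything reaches OT. The verb names and bounds below are invented for illustration.

```python
ALLOWED = {
    "set_temperature": {"min": 20.0, "max": 250.0},
    "set_speed":       {"min": 0.0,  "max": 1200.0},
}

def validate(verb, value):
    """Reject any verb or parameter outside the allowlist."""
    spec = ALLOWED.get(verb)
    if spec is None:
        return False, f"verb not allowlisted: {verb}"
    if not (spec["min"] <= value <= spec["max"]):
        return False, f"{verb} value {value} out of bounds"
    return True, "ok"

print(validate("open_valve", 1.0))    # rejected: unknown verb
print(validate("set_speed", 9000.0))  # rejected: out of bounds
```

Default-deny is the design choice: a verb the table doesn't name simply does not exist for the agent.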
Governance and asset visibility: start here
You can't govern what you can't see. Map your assets and the flows AI will touch before you turn anything on.
- Build a living inventory: sensors, controllers, HMIs, historians, agent services, models, prompts and data sources.
- Trace data lineage from source to actuation. Add quality checks and outlier guards at each hop.
- Create an AI risk board with operations, safety, engineering and security. Define decision rights and escalation paths.
- Version everything (models, prompts, guardrails). Track who changed what, when and why. Support one-click rollback.
- Run HAZOP/FMEA for AI-driven changes. Separate safety functions (SIS/SIL-rated) from optimization logic.
- Drill emergency procedures. Measure detection-to-safe-state times quarterly.
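A minimal sketch of "version everything" with who/when/why and one-click rollback, assuming an append-only history (class and field names are illustrative):

```python
import datetime

class VersionedConfig:
    def __init__(self):
        self.history = []  # append-only: nothing is ever overwritten

    def commit(self, content, author, reason):
        self.history.append({
            "version": len(self.history) + 1,
            "content": content,
            "author": author,
            "reason": reason,
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })

    def rollback(self, version):
        """Rollback is itself a new commit, so the audit trail stays intact."""
        old = self.history[version - 1]
        self.commit(old["content"], "system", f"rollback to v{version}")

cfg = VersionedConfig()
cfg.commit("max_step=2.0", "alice", "initial guardrail")
cfg.commit("max_step=5.0", "bob", "throughput test")
cfg.rollback(1)
print(cfg.history[-1]["content"])  # -> max_step=2.0
```

Making rollback a forward commit (rather than deletion) is what keeps "who changed what, when and why" answerable later.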
Safe deployment pattern for AI agents in plants
- Define a narrow job for the agent with measurable bounds and clear no-go zones.
- Simulate → shadow mode → limited autonomy → supervised autonomy. Promote only after hitting stability thresholds.
- Set guardrails in code and hardware. Don't rely on prompts for safety.
- Monitor live with anomaly detection on inputs and actions. Alert on rule breaches and near-misses.
- Red-team the agent: bad data, sensor faults, drifted models, time sync issues and loss of connectivity.
- Lock vendor access. Require signed updates, SBOMs and incident response SLAs.
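The shadow-mode gate in the pattern above can be reduced to a simple rule: the agent earns limited autonomy only after its recommendations track operator decisions within tolerance for an unbroken streak of cycles. Both thresholds here are placeholders to be set per process.

```python
TOLERANCE = 0.5        # max allowed deviation from the human decision
REQUIRED_STREAK = 100  # consecutive in-tolerance cycles before promotion

def ready_to_promote(pairs):
    """pairs: list of (agent_recommendation, human_decision) from shadow mode."""
    streak = 0
    for agent, human in pairs:
        # One out-of-tolerance cycle resets the streak to zero.
        streak = streak + 1 if abs(agent - human) <= TOLERANCE else 0
    return streak >= REQUIRED_STREAK

# A single recent disagreement blocks promotion:
ready_to_promote([(1.0, 1.0)] * 99 + [(1.0, 5.0)])  # False
```

Requiring a consecutive streak, not an average, is deliberate: an agent that is usually right but occasionally wild is exactly the one that must stay in shadow mode.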
Metrics operations leaders should track
- Near-miss count and severity tied to agent decisions.
- Human override rate and reasons (safety, quality, throughput).
- Mean time to detect and contain unsafe agent behavior.
- Step-change violations prevented by guardrails.
- Safety-related downtime vs. throughput/quality gains from the agent.
- Audit coverage: percent of agent actions reviewed each week.
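Two of these metrics fall straight out of the action log, if logging captures the right fields. The log schema below is a hypothetical example.

```python
def override_rate(actions):
    """Fraction of agent actions a human overrode."""
    overridden = sum(1 for a in actions if a["overridden"])
    return overridden / len(actions) if actions else 0.0

def audit_coverage(actions):
    """Percent of agent actions reviewed this week."""
    reviewed = sum(1 for a in actions if a["reviewed"])
    return 100.0 * reviewed / len(actions) if actions else 0.0

log = [
    {"overridden": True,  "reviewed": True},
    {"overridden": False, "reviewed": True},
    {"overridden": False, "reviewed": False},
    {"overridden": False, "reviewed": True},
]
print(override_rate(log))   # -> 0.25
print(audit_coverage(log))  # -> 75.0
```

The override *reasons* (safety, quality, throughput) belong in the same records, so trends can be split by cause rather than guessed at.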
What this means for plant managers
AI agents can squeeze waste, stabilize quality and free up operators. But they act fast, and physics doesn't forgive. Treat them like junior operators with superpowers: limit their scope, check their work and keep your hand on the stop button.
Voster's message is simple: governance and visibility first, then automation. Build the safety rails now so your future gains don't come with hidden downside.
Frameworks and further reading
- NIST SP 800-82: Guide to ICS Security
- ISA/IEC 62443: Industrial Automation and Control System Security
Next steps for operations teams
- Scope a small, reversible use case (e.g., setpoint recommendations, not direct actuation).
- Write guardrails and interlocks before you train or deploy a model.
- Stand up logging, dashboards and weekly reviews before flipping to autonomy.
- Train operators on new failure modes and the exact steps to safe state.
If you're building an adoption plan, this resource can help: AI Learning Path for Plant Managers. For broader use cases and practices across ops, see AI for Operations.