Increased model testing and agentic AI reshape enterprise observability
Dynatrace built its name on observability and security for classic workloads. Now it's extending that discipline to AI workloads, giving teams a clearer picture of how modern applications behave under real use.
Speaking at KubeCon + CloudNativeCon NA, Alois Reitbauer, chief technology strategist at Dynatrace, summed it up: "We see a change in how people are building applications. In the past it was basically OpenAI, you used OpenAI and then it started to switch to other models. Now, we see people experimenting way more, like A-B testing models and the practice of … AI native engineering."
From single-model usage to model experimentation
Enterprises aren't sticking to one foundation model anymore. Teams are trialing multiple models, routing by task, and running A/B tests to measure quality, latency, and cost.
That shift demands a new level of traceability: which model was used, why the router picked it, what context was injected, and how it performed against a specific goal.
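To make that concrete, here is a minimal sketch of what recording a routing decision could look like. It uses only plain Python and standard-library logging; the model names, the 50/50 split, and the call_model stub are illustrative assumptions, not any vendor's API.

```python
import json
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model_router")

# Hypothetical A/B candidates -- the model names are placeholders.
CANDIDATES = {"variant_a": "large-general-model", "variant_b": "small-fast-model"}

def call_model(model: str, prompt: str) -> str:
    """Stub standing in for a real inference client."""
    return f"[{model}] response to: {prompt[:40]}"

def route_and_record(task_type: str, prompt: str) -> dict:
    """Pick a variant, run it, and record which model was used, why it was
    chosen, and how it performed -- the traceability described above."""
    variant = random.choice(list(CANDIDATES))          # simple 50/50 A/B split
    model = CANDIDATES[variant]
    start = time.monotonic()
    output = call_model(model, prompt)
    record = {
        "task_type": task_type,
        "variant": variant,
        "model": model,
        "selection_rationale": "50/50 A/B experiment",
        "latency_ms": round((time.monotonic() - start) * 1000, 2),
        "output_chars": len(output),
    }
    log.info(json.dumps(record))                       # structured and searchable
    return record

route_and_record("summarization", "Summarize last week's incident reports.")
```

Even a record this small answers the questions above: which model ran, why it was picked, and how it performed for a given task.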
Agentic AI changes how you debug
Some newer models expose a "train of thought" that makes it easier to see how they arrived at an output. Helpful, but the job is far from done.
"Debugging AI in agentic applications is kind of different," Reitbauer said. "The more we move into more dynamic systems, like going more into this agentic world, the more the individual transactions will be different." In short: fewer repeatable paths, more unique runs. Observability has to capture intent, decisions, and outcomes for each step.
Guardrails and goal tracking over shoulder-watching
Agentic systems act on tasks, not step-by-step instructions. You won't inspect every move. You set constraints, define success criteria, and review outcomes.
"Guardrails are a key and guardrails started to emerge very early on," Reitbauer noted. "Really thinking to the next step about agentic, we have to track against goals and I think that's where business observability comes in. You're delegating a task, you're not looking AI over the shoulder."
Dynatrace's Azure move
Dynatrace announced its next-generation cloud operations solution for Microsoft Azure, including support for agent-driven patterns. That matters because many teams will build agents, tools, and model routing directly on Azure services.
If you're evaluating this path, review Microsoft's guidance on agent services for architecture and security baselines (Microsoft Learn: Azure AI agent services).
What to instrument now (practical checklist)
- Prompt, context, and tool calls: Log prompts, system instructions, retrieved context, and the exact tools an agent used. Mask PII and secrets. (A combined instrumentation sketch follows this list.)
- Model routing: Capture model/version, selection rationale, fallback events, and temperature/top-p values.
- Agent steps: Track each step, the tool invoked, input/output, and confidence signals.
- Evaluation signals: Store automatic scores (toxicity, hallucination checks, policy hits) and human ratings (RAG quality, relevance, helpfulness).
- Goal outcomes: Define the task up front and record success/failure, retries, and final business outcome.
- Cost and latency: Token usage, time per step, queue delays, and upstream/downstream service time.
- Guardrail events: Blocked prompts, redactions, policy violations, and safety interventions.
- Data lineage: Where context came from, embedding versions, and index timestamps.
- Version control: Tie model/app/agent config to git commits and feature flags for fast rollback.
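Pulling several of those items together, here is a minimal instrumentation sketch assuming the opentelemetry-api and opentelemetry-sdk Python packages. The span and attribute names (llm.model, agent.goal, and so on) are illustrative choices rather than an official semantic convention, and the redact() helper is a stand-in for real masking.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for the sake of the example.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.observability.sketch")

def redact(text: str) -> str:
    """Placeholder PII/secret masking; swap in your real redaction logic."""
    return text.replace("sk-live-123", "[REDACTED]")

with tracer.start_as_current_span("agent.run") as run:
    run.set_attribute("agent.goal", "draft release notes")        # goal outcomes
    run.set_attribute("app.git_commit", "a1b2c3d")                # version control

    with tracer.start_as_current_span("model.call") as call:
        call.set_attribute("llm.model", "example-model-v2")       # model routing
        call.set_attribute("llm.routing.rationale", "A/B variant b")
        call.set_attribute("llm.temperature", 0.2)
        call.set_attribute("llm.prompt", redact("Summarize commits... sk-live-123"))
        call.set_attribute("llm.tokens.total", 812)               # cost signal
        call.set_attribute("guardrail.violations", 0)             # guardrail events
```

Nesting the model-call span under the run span keeps each call tied to the agent run that triggered it, which makes per-run debugging and cost attribution straightforward.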
For engineering, ops, and product leaders
- Make goals first-class: Every agent action should roll up to a measurable outcome.
- Bias toward experiments: A/B test models, prompts, and routing policies. Keep the best, retire the rest.
- Build a safety net: Policy checks, rate limits, escalation paths, and human-in-the-loop for high-risk flows.
- Close the loop: Feed production outcomes back into prompts, retrieval, and router decisions.
- Standardize telemetry: Use consistent schemas so data can be searched, compared, and audited (one possible event shape is sketched after this list).
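One way to enforce that consistency is to agree on a single event shape up front. The sketch below expresses one possible schema as a Python TypedDict; the field names are assumptions for illustration, not a standard.

```python
from typing import Literal, Optional, TypedDict

class AIEvent(TypedDict):
    """One shared shape for AI telemetry so events can be searched and compared."""
    timestamp: str                 # ISO 8601
    run_id: str                    # ties every step of one agent run together
    event_type: Literal["model_call", "tool_call", "guardrail", "outcome"]
    model: str                     # model and version actually used
    prompt_hash: str               # hash rather than raw text, for auditability
    tokens: int
    latency_ms: float
    goal: str
    goal_met: Optional[bool]       # None until the run finishes
```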
Why this matters now
Traditional observability centers on services, endpoints, and traces. AI adds intent, context, and decisions that change on every run. If you can't see those layers, you'll struggle to diagnose failures, improve quality, or control cost.
The teams that log goals, instrument agent steps, and quantify outcomes will ship faster with fewer surprises. Everyone else will guess.
Skill up your team
If your organization is standing up agent projects, strengthen skills in prompting, evaluation, and AI safety. Start here: AI courses by job role