Databricks sharpens Agent Bricks to push agentic AI from pilot to production
Databricks is rolling out new Agent Bricks capabilities aimed at a problem most IT leaders feel every quarter: agents that look good in demos but stall before production. The update centers on accuracy, governance, and data access, exactly the weak spots that keep projects stuck.
The most notable change is general availability of MLflow for Agent Quality and Observability, built to continuously evaluate and monitor agent behavior. Alongside that, Databricks is previewing a governed AI Gateway, an MCP Catalog for tool/data access, multi-agent supervision with MCP support, and a SQL function to extract context from unstructured content.
Why this matters
Enterprises shifted from basic chatbots to agentic systems in 2024, but building trustworthy autonomous workflows is hard. Poor accuracy, single-vendor lock-in, and governance gaps are common failure points.
According to William McKnight, president of McKnight Consulting, "The new capabilities are a significant update designed to instill confidence in moving AI agent projects from pilots to secure production by focusing on ensuring the AI is governed, open and accurate. A full agent lifecycle is covered."
What's new in Agent Bricks
- MLflow for Agent Quality and Observability (GA): Continuous agent evaluation, run tracking, and metrics to raise accuracy and reduce drift.
- AI Gateway (preview): A governed interface to manage agent connections to models like OpenAI's GPT-5, Google's Gemini, Anthropic's Claude Sonnet, and open source options.
- MCP Catalog in Marketplace (preview): Governance and lifecycle control for connecting agents to external tools and data sources via the Model Context Protocol (MCP).
- MCP support in Multi-Agent Supervisor (beta): Coordinate multi-step workflows across specialized agents with standardized tool access.
- ai_parse_document SQL function (preview): Extracts content from documents and tables so agents can ground decisions in unstructured data, not just rows and columns.
Only MLflow for Agent Quality and Observability is generally available today; the rest are in preview or beta.
Context and momentum
Agent Bricks launched in beta in June to help teams close the gap between prototypes and production. Databricks also made OpenAI models natively available across its platform as part of a $100 million partnership, broadening model choice without custom plumbing.
Devin Pratt, analyst at IDC, summed it up: "Collectively, these updates help organizations move agents from pilot to production with greater control and trust. This is about making enterprise agents trustworthy, accurate, governed and flexible on the data organizations already control."
How Databricks is framing the problem
Databricks points to three recurring blockers: low confidence in agent quality, lock-in to a single model provider, and security/governance exposure. The new releases target all three with evaluation pipelines, policy controls, and standardized tool access.
McKnight sees the biggest near-term upside in the MCP Catalog and ai_parse_document: they address governance, security, and data grounding, which are common reasons pilots stall. Pratt also highlights MLflow's evaluation workflows as critical for regulated or customer-facing uses.
How it stacks up
Competitors like Snowflake (Cortex Agents), Teradata, Informatica, AWS, Google Cloud, and Microsoft are all building agent tooling. Analysts note Databricks' edge is unifying data governance, model control, and agent evaluation inside a lakehouse architecture.
As Pratt puts it, this supports governed, data-centric AI operations while keeping development flexible and well-orchestrated.
Practical rollout plan for IT and engineering leaders
- Stand up evaluation early: Define task suites, golden datasets, and pass/fail thresholds in MLflow before integration work begins. Track regressions per agent and version (see the evaluation sketch after this list).
- Enforce model policy: Route all model calls through AI Gateway. Set guardrails for PII handling, cost ceilings, preferred model lists, and failover providers.
- Standardize tool access via MCP: Catalog external tools and data sources with clear approval paths, scopes, and audit trails. Use MCP in multi-agent workflows to avoid bespoke connectors.
- Ground agents in your data: Use ai_parse_document to extract context from PDFs, docs, and tables (see the parsing sketch after this list). Pair with retrieval policies that log sources and citations for auditability.
- Plan multi-model testing: Benchmark tasks across providers (OpenAI, Google, Anthropic, open source). Select by performance, latency, and cost, not brand.
- Operationalize observability: Monitor task success rates, tool-call accuracy, latency, and cost per task. Alert on drift and roll back to safe versions when needed (see the drift-check sketch after this list).
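A minimal sketch of the evaluation step, assuming MLflow's Python evaluation API, an agent wrapped as a plain callable, and a tiny hand-built golden dataset; the column names, the answer_my_question helper, and the questions themselves are placeholders, and the metrics returned depend on your MLflow version and installed evaluator dependencies.

```python
# Sketch: a baseline evaluation run for an agent with MLflow's evaluation API.
# The golden dataset, column names, and answer_my_question() wrapper are
# placeholders; swap in your own agent endpoint and task suite.
import mlflow
import pandas as pd

golden = pd.DataFrame({
    "question": [
        "What is the refund window for enterprise contracts?",
        "Which region hosts EU customer data?",
    ],
    "ground_truth": [
        "60 days from the invoice date.",
        "eu-west-1.",
    ],
})

def answer_my_question(df: pd.DataFrame) -> pd.Series:
    # Placeholder: call your deployed agent here and return one answer per row.
    return df["question"].apply(lambda q: "60 days from the invoice date.")

with mlflow.start_run(run_name="agent-eval-baseline"):
    results = mlflow.evaluate(
        model=answer_my_question,
        data=golden,
        targets="ground_truth",
        model_type="question-answering",
    )
    # Which metrics appear (exact match, readability, and so on) depends on your
    # MLflow version and optional dependencies; gate promotion on the ones that
    # match your pass/fail thresholds.
    print(results.metrics)
```

Promotion to the next environment can then be gated on these logged run metrics rather than on demo impressions.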
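For the document-grounding step, here is a hedged sketch of what calling the preview ai_parse_document function from a Databricks notebook could look like; the volume path, output table, and shape of the parsed result are assumptions, and the signature and return schema may change while the function is in preview.

```python
# Sketch (Databricks notebook): parse documents from a Unity Catalog volume so
# agents can ground answers in their content. The volume path and table name are
# hypothetical, and ai_parse_document is in preview, so confirm the current
# signature and return schema against the Databricks docs.
parsed = spark.sql("""
    SELECT
      path,                                  -- keep the source path for citations and audit
      ai_parse_document(content) AS parsed   -- preview function: structured content from raw bytes
    FROM read_files(
      '/Volumes/main/agents/contracts/',     -- hypothetical volume of source documents
      format => 'binaryFile'
    )
""")

# Persist extractions next to their source paths so retrieval can log citations.
parsed.write.mode("overwrite").saveAsTable("main.agents.parsed_contracts")
```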
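On the observability side, the check worth automating first is simple: compare a recent window of task outcomes against your agreed baseline and alert before quality drifts far enough to force a rollback. A generic sketch, with placeholder data and thresholds:

```python
# Sketch: a simple drift check on agent task success rate and latency. The
# eval_results DataFrame and thresholds are placeholders; in practice, pull
# these from your logged evaluation runs or serving telemetry.
import pandas as pd

eval_results = pd.DataFrame({
    "agent_version": ["v1", "v1", "v2", "v2", "v2"],
    "task_succeeded": [True, True, True, False, False],
    "latency_s": [1.2, 0.9, 1.1, 3.4, 2.8],
})

BASELINE_SUCCESS = 0.95   # agreed pass rate from your golden-dataset runs
MAX_P95_LATENCY_S = 2.0   # placeholder service-level objective

current = eval_results[eval_results["agent_version"] == "v2"]
success_rate = current["task_succeeded"].mean()
p95_latency = current["latency_s"].quantile(0.95)

if success_rate < BASELINE_SUCCESS or p95_latency > MAX_P95_LATENCY_S:
    # Hook this into your alerting and rollback process instead of printing.
    print(f"ALERT: v2 success={success_rate:.2f}, p95 latency={p95_latency:.1f}s; "
          "consider rolling back to the last safe version.")
```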
Governance and risk checklist
- Document decision boundaries for each agent; prevent actions outside scope.
- Require human-in-the-loop for high-impact or irreversible steps.
- Enable end-to-end audit logs: prompts, tool calls, retrieved content, model responses, and final actions.
- Run red-team tests for data exfiltration, prompt injection, and tool abuse.
- Track total cost per user or workflow to avoid surprises at scale.
Known friction points
Analysts still call out two gaps: ease of use and pricing clarity. If you're evaluating, include UX in your proof of concept and model your total cost of ownership with real workloads, especially integration-heavy pipelines.
Bottom line
Agent Bricks is maturing in the right places: accuracy, governance, and data access. If your agents are stuck at the pilot stage, the GA evaluation tooling plus governed model and tool access are worth testing against your highest-priority workflows.
Next steps
- Identify two production candidate workflows and define success metrics this week.
- Set up MLflow evaluation, route model calls through AI Gateway, and catalog required tools via MCP.
- Pilot ai_parse_document for unstructured data grounding and measure error rate reduction.
- Hold a go/no-go review after two weeks based on accuracy, latency, compliance, and cost.