Judge agents at Lloyds: GenAI that scales personalised, FCA-compliant guidance
Lloyds uses agent-as-judge AI: a generator plus independent reviewers to cut errors and meet FCA rules. Specialist models handle finance; humans review high-impact cases.

Interview: Using AI agents as judges in GenAI workflows
Forty years ago, a branch manager knew your name and your story. That level of personal guidance doesn't scale. As Ranil Boteju, chief data and analytics officer at Lloyds Banking Group, puts it: most people can't afford a financial planner, and there aren't enough advisers to go around.
The bank's answer: agentic AI that can be audited, measured and kept within the guardrails of UK regulation. The goal is simple: wider access to high-quality guidance without compromising accuracy or accountability.
Why "agent-as-judge" matters for finance
Large language models can produce confident but wrong answers. In a regulated sector, that's a hard stop. Boteju's team is tackling this with an "agent-as-judge" pattern: one model generates an answer; separate models review, score and approve or reject it against clear policies and FCA expectations.
This second-line review reduces the risk of hallucinations, checks for bias, and makes decisions traceable. It doesn't replace people. "There is still very much a place for humans in the loop," Boteju says.
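In code terms, the pattern is a small control loop: the generator drafts, an independent judge scores the draft, and anything below threshold is held for a person. A minimal sketch follows, assuming generic model calls; the function names, rubric and 0.9 threshold are illustrative, not Lloyds' implementation.

```python
# Minimal agent-as-judge control loop. Every name here (generate_answer,
# judge_answer, the 0.9 threshold) is a hypothetical illustration, not
# Lloyds' implementation.
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float      # 0.0-1.0 rubric score assigned by the judge
    rationale: str    # the judge's stated reason, kept for audit
    approved: bool

def generate_answer(question: str) -> str:
    # Placeholder for a call to the generator model.
    return f"Draft answer to: {question}"

def judge_answer(question: str, answer: str, threshold: float = 0.9) -> Verdict:
    # Placeholder for an independent judge model scoring the draft
    # against policy: factuality, FCA alignment, bias, tone.
    score = 0.95  # a real judge derives this from a rubric prompt
    return Verdict(score, "Meets policy rubric.", score >= threshold)

def answer_with_review(question: str) -> str:
    draft = generate_answer(question)
    verdict = judge_answer(question, draft)
    if not verdict.approved:
        # Rejected drafts go to a human reviewer, never to the customer.
        return f"[Held for human review: {verdict.rationale}]"
    return draft

print(answer_with_review("Can I overpay my fixed-rate mortgage?"))
```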
Specialist models beat general models for regulated use
General LLMs learn from everything on the internet. That breadth isn't always useful in finance. Lloyds opted to back a financial-services-specific model, FinLLM, developed with Aveni and trained on UK-relevant financial data to cut noise and reduce error.
The bank also wants model choice, not lock-in. An open approach to foundation models supports sovereignty and the ability to select the best tool for each task.
Real deployment: an audit assistant with checks and balances
Lloyds has tested FinLLM in Group Audit & Conduct Investigations. An audit chatbot integrates generative AI with the bank's internal Atlas documentation system to make retrieval faster and more precise.
The flow: FinLLM is tuned on audit knowledge; a generator proposes an answer; independent judge agents score it for compliance and accuracy; humans review edge cases. Outputs must align with FCA guidance and internal policy before they reach users.
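A stubbed sketch of that flow, under stated assumptions: `atlas_search` stands in for the bank's internal Atlas retrieval, which is not public, and the judge scores and pass threshold are invented for illustration.

```python
# Stubbed version of the audit-assistant flow described above: retrieve
# from the documentation store, draft with the domain model, score with
# judge agents, gate on human review. "atlas_search" stands in for the
# internal Atlas system; scores and threshold are assumptions.
PASS_THRESHOLD = 0.9

def atlas_search(query: str, k: int = 3) -> list[str]:
    # Placeholder retrieval: a real system returns approved documents.
    return [f"doc-{i}: excerpt relevant to '{query}'" for i in range(k)]

def draft_with_finllm(query: str, sources: list[str]) -> str:
    # The domain model is prompted with the query plus retrieved text,
    # so every claim can be traced to an approved source.
    return f"Answer to '{query}', citing {len(sources)} sources."

def judge_scores(query: str, answer: str, sources: list[str]) -> dict[str, float]:
    # One independent judge agent per rubric dimension.
    return {"accuracy": 0.97, "policy_alignment": 0.94, "completeness": 0.91}

def audit_assistant(query: str) -> str:
    sources = atlas_search(query)
    answer = draft_with_finllm(query, sources)
    scores = judge_scores(query, answer, sources)
    if min(scores.values()) < PASS_THRESHOLD:
        return "[Edge case: routed to a human reviewer]"
    return answer

print(audit_assistant("What evidence supports this control finding?"))
```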
How agentic AI is orchestrated
Different models play to different strengths. A hyperscaler model (e.g., OpenAI's GPT-5 or Google's Gemini) can parse what a customer actually means. FinLLM handles the regulated, domain-specific reasoning. Other agents break the request into parts and solve each piece.
Judge agents act like a second-line colleague: they verify outcomes, check rationale, reference sources, and flag anything that needs human attention.
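Orchestration of this kind can be as simple as a route table: the general model classifies intent, the request is decomposed into sub-tasks, and each sub-task is sent to the model best suited to it. The routes, model labels and task names below are hypothetical.

```python
# Hypothetical route table for the orchestration described above: a
# general model parses intent, the request is decomposed, and each
# sub-task runs on the model best suited to it. Model and task names
# are invented for illustration.
ROUTES = {
    "intent": "general-hyperscaler-llm",  # language fluency, intent parsing
    "regulated": "finllm",                # domain-specific financial reasoning
}

def parse_intent(utterance: str) -> str:
    # The general model classifies what the customer actually means.
    return "mortgage_overpayment_query"

def decompose(intent: str) -> list[str]:
    # A planner agent splits the request into solvable pieces.
    return ["fetch_product_terms", "compute_overpayment_impact", "draft_reply"]

def route(task: str) -> str:
    # Regulated calculations go to the specialist; wording stays general.
    return ROUTES["regulated"] if task.startswith(("fetch", "compute")) else ROUTES["intent"]

def orchestrate(utterance: str) -> list[tuple[str, str]]:
    intent = parse_intent(utterance)
    return [(task, route(task)) for task in decompose(intent)]

print(orchestrate("Can I pay extra on my mortgage without a fee?"))
```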
What this means for finance leaders
- Accuracy is a control problem, not just a model problem. Treat "agent-as-judge" as part of your second line of defence.
- Use specialist models for regulated reasoning; use general models for intent parsing and language fluency.
- Keep humans in the loop for high-impact advice, vulnerable customers, and novel scenarios.
- Design for auditability: store prompts, retrieved sources, scores from judge agents, and final decisions.
- Align outputs with the FCA's Consumer Duty and fair-value outcomes. Codify these as scoring rubrics for judge agents.
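The last bullet is concrete enough to sketch: regulatory outcomes become rubric dimensions with weights and minimum floors that judge agents must satisfy. The four dimensions below paraphrase the FCA Consumer Duty outcomes; every number is an invented example, not a regulatory figure.

```python
# Consumer Duty outcomes expressed as a judge rubric: each dimension has
# a weight and a floor a response must clear. The dimensions paraphrase
# the FCA's Consumer Duty outcomes; all numbers are invented examples.
RUBRIC = {
    # dimension:              (weight, per-dimension floor)
    "products_and_services":  (0.25, 0.80),
    "price_and_fair_value":   (0.25, 0.80),
    "consumer_understanding": (0.30, 0.90),  # clarity matters most for guidance
    "consumer_support":       (0.20, 0.80),
}
OVERALL_THRESHOLD = 0.85  # assumed weighted pass mark

def passes_rubric(scores: dict[str, float]) -> bool:
    # Fail fast if any single dimension misses its floor...
    if any(scores[d] < floor for d, (_, floor) in RUBRIC.items()):
        return False
    # ...then require the weighted total to clear the overall threshold.
    total = sum(scores[d] * w for d, (w, _) in RUBRIC.items())
    return total >= OVERALL_THRESHOLD

print(passes_rubric({
    "products_and_services": 0.90, "price_and_fair_value": 0.85,
    "consumer_understanding": 0.92, "consumer_support": 0.88,
}))  # True: all floors met and the weighted total is ~0.89
```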
Practical blueprint to get started
- Define your high-risk use cases and exclude them from full automation initially.
- Stand up a retrieval layer that anchors responses in your approved policies, product docs, and rate cards.
- Choose a domain model (e.g., FinLLM-style) for financial reasoning; use a general LLM for intent classification and summarisation.
- Build judge agents with clear rubrics: factuality, policy alignment, bias checks, completeness, and tone. Require a pass threshold.
- Implement human review gates for advice, cross-selling, and vulnerable customer flags.
- Instrument metrics: hallucination rate, judge-pass rate, override rate, time-to-resolve, and customer outcome measures.
- Log everything for audit: prompts, retrieved documents, intermediate steps, judge scores, human overrides, and release approvals (see the sketch after this list).
- Run red-team evaluations with synthetic and historical cases, then retrain or tighten guardrails based on failure modes.
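To make the metrics and logging bullets concrete, here is one way to structure an append-only audit record from which judge-pass rate and override rate fall out directly. The field names and the JSONL file format are assumptions for illustration.

```python
# Append-only audit record per interaction; the metrics bullets above
# (judge-pass rate, override rate) are computed from the same log.
# Field names and the JSONL format are assumptions for illustration.
import json
import time
import uuid

LOG_PATH = "audit_log.jsonl"

def log_interaction(prompt, sources, judge_scores, passed, human_override=None):
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt": prompt,
        "retrieved_sources": sources,      # what grounded the answer
        "judge_scores": judge_scores,      # per-dimension scores for audit
        "judge_passed": passed,
        "human_override": human_override,  # None, "approved", or "rejected"
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

def metrics(path: str = LOG_PATH) -> dict[str, float]:
    with open(path) as f:
        records = [json.loads(line) for line in f]
    n = len(records) or 1  # avoid division by zero on an empty log
    return {
        "judge_pass_rate": sum(r["judge_passed"] for r in records) / n,
        "override_rate": sum(r["human_override"] is not None for r in records) / n,
    }

log_interaction("Summarise control gaps in Q2", ["doc-1"], {"accuracy": 0.96}, True)
print(metrics())
```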
Governance checklist for CFOs, CROs and Heads of Audit
- Policy mapping: tie model outputs to FCA Consumer Duty outcomes and internal conduct rules.
- Model risk management: apply MRM standards to LLMs, including validation, monitoring, change control, and issue remediation.
- Data controls: keep PII segmented, use purpose-bound data access, and rotate secrets/keys on schedule.
- Third-party risk: diversify models to avoid vendor lock-in; require transparency on training data and safety testing.
- Explainability: require sources and step-by-step reasoning artifacts from both generators and judge agents.
Where this is heading
Agentic AI won't replace regulated judgment, but it can scale high-quality guidance to far more people. The pattern is clear: specialized models for domain accuracy, general models for language, independent judges for safety, and humans for accountability.
For finance teams, the advantage goes to those who ship controlled systems early, measure failure modes, and iterate behind strong governance.
Further reading
FCA Consumer Duty
PRA: Model Risk Management principles (SS1/23)