Same Question, Different Verdicts: 12,000 Legal AI Tests Expose a Consistency Gap

Same prompt, different answers: 12,000 legal tests show AI is steady on recall but wobbly on analysis and research. That hurts client trust, so teams need guardrails.

Published on: Dec 06, 2025

Two associates ask the same AI about a non-compete. One memo says enforceable. The other says overbroad. This isn't a one-off glitch. Large-scale testing shows output inconsistency is baked into today's top models.

Legal teams are adopting AI fast, from review to drafting to research. Consistency is the promise: treat like cases alike. Yet when the prompt stays the same, the conclusion often doesn't. That gap matters for equality, predictability, and client trust.

What was tested

A 12,000-run study evaluated four leading language models (GPT-4o, GPT-4.1, Claude Sonnet 4, Claude Opus 4) across three task types: legal knowledge, analysis, and research. Domains included constitutional law, contracts, civil and criminal procedure, and IP. Outputs were scored for semantic consistency: same meaning, not just the same wording.
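
The study's exact scoring pipeline isn't reproduced here, but the idea is mechanical enough to sketch. Below is a minimal illustration of one way to approximate semantic-consistency scoring, assuming the open-source sentence-transformers library; the 0.85 agreement threshold is a hypothetical cutoff, not the study's actual rule.

    # Minimal sketch: approximate semantic consistency across repeated runs.
    # Assumes the sentence-transformers library; the 0.85 threshold is a
    # hypothetical cutoff, not the study's actual scoring rule.
    import itertools
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def consistency_rate(answers: list[str], threshold: float = 0.85) -> float:
        """Fraction of answer pairs whose embeddings agree above the threshold."""
        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(answers, normalize_embeddings=True)  # unit vectors
        pairs = list(itertools.combinations(range(len(answers)), 2))
        agree = sum(1 for i, j in pairs if float(np.dot(emb[i], emb[j])) >= threshold)
        return agree / len(pairs)

    # Example: five runs of the same non-compete prompt, two dissenting.
    runs = [
        "The covenant is likely enforceable as written.",
        "The covenant is likely enforceable as written.",
        "The covenant is likely enforceable, with minor narrowing.",
        "The covenant is likely overbroad and unenforceable.",
        "The covenant is likely overbroad and unenforceable.",
    ]
    print(f"pairwise consistency: {consistency_rate(runs):.0%}")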

The numbers that matter

  • The top performer on complex legal tasks hit 57% consistency (same reasoning and outcome).
  • Lower-complexity tasks improved consistency, but one top model still produced conflicting conclusions in roughly 1 run in 15.
  • By task type (complex conditions):
    • Knowledge (bar-style multiple choice): ~97-100% consistency
    • Analysis (apply law to facts): ~17-61% consistency
    • Research (find and cite authority): below 26% consistency
  • By task type (lower complexity):
    • Knowledge: ~97-100%
    • Analysis: ~62-92%
    • Research: ~21-75%

In short: models are steady at recall, shaky at analysis, and least reliable at research, exactly the areas lawyers rely on most.

Why models disagree on the same prompt

Modern language models generate text by sampling from a probability distribution, so some randomness is inherent, and today's serving stacks can't fully remove it. Vendors also serve creative and consumer use cases where variation is a feature, not a bug. Total uniformity isn't the target for many providers.
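
To see how sampling drives disagreement, here is a toy decoder in plain NumPy, emphatically not any vendor's actual stack: sampling the same next-token distribution repeatedly yields different verdicts, and lowering temperature narrows the spread without eliminating it.

    # Toy illustration of stochastic decoding. The same logits produce
    # different "verdicts" run to run; temperature scales the spread.
    import numpy as np

    VERDICTS = ["enforceable", "overbroad", "needs-narrowing"]

    def sample_verdict(logits, temperature, rng):
        probs = np.exp(logits / temperature)
        probs /= probs.sum()              # softmax over the three options
        return rng.choice(VERDICTS, p=probs)

    rng = np.random.default_rng()
    logits = np.array([2.0, 1.4, 0.3])    # model slightly favors "enforceable"
    for temp in (1.0, 0.3):
        print(f"T={temp}:", [sample_verdict(logits, temp, rng) for _ in range(10)])

Even pinning temperature near zero doesn't guarantee identical outputs from hosted models, since batching and floating-point nondeterminism in serving can still shift results.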

AI inconsistency vs. human inconsistency

Yes, humans are inconsistent too, often in patterned ways tied to fatigue, time pressure, or ideology. AI inconsistency looks different: it shows up as random noise, shifts without clear triggers. Adding AI doesn't just add more inconsistency. It adds a new kind you have to manage.

Why this hits legal practice hard

Most legal AI tools are wrappers around these base models. Inconsistency at the foundation bleeds into the platforms you use. Two partners can get conflicting outputs on similar matters, with downstream risk for advice, strategy, and client expectations.

Adoption keeps climbing, so the risk surface grows. For context, see the ABA's recent overview of AI use in practice.

A playbook to reduce inconsistency

  • Decompose the task: Break a long, multi-issue prompt into single-task steps. Ask about governing law, elements, and application separately. Consistency jumps when complexity drops.
  • Multi-run verification: Run the same prompt 3-5 times and compare the answers (see the sketch after this list). If conclusions diverge, investigate before relying on any single output.
  • Pin down instructions: Specify the standard, burden, jurisdiction, date range, and what "good" looks like (e.g., "Return 3 cases with direct quotes and pinpoint cites").
  • Chain your reasoning: Force structured thinking (issue → rule → application → conclusion). Use checklists and require the model to show its work.
  • Use retrieval when possible: Ground answers in your own memos, model documents, and jurisdictional databases to reduce free-form drift.
  • Human-in-the-loop, always: Treat outputs as drafts. Verify citations, holdings, and facts before relying on them.
  • Price the trade-offs: Decomposition and review add steps. Multi-runs add cost and latency. Bake this into scoping and budgets.
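
The multi-run check promised above is easy to operationalize. In this minimal sketch, ask_model is a stand-in for whatever client your platform exposes, and the 0.8 agreement threshold is illustrative; exact-string comparison only works for tightly constrained prompts, so pair it with semantic scoring (as sketched earlier) for free-form answers.

    # Minimal multi-run verification sketch. `ask_model` is a placeholder;
    # wire it to your vendor's SDK. The 0.8 threshold is illustrative.
    import random
    from collections import Counter

    def ask_model(prompt: str) -> str:
        # Stand-in for a real API call; simulates the observed behavior of
        # mostly one conclusion with an occasional dissent.
        return random.choices(["no", "yes"], weights=[4, 1])[0]

    def verify(prompt: str, runs: int = 5) -> tuple[str, float]:
        """Ask the same question several times; return the modal answer
        and its share of runs. Low agreement = investigate before relying."""
        answers = [ask_model(prompt).strip().lower() for _ in range(runs)]
        top, count = Counter(answers).most_common(1)[0]
        return top, count / runs

    answer, agreement = verify(
        "Under New York law, is a two-year, nationwide non-compete for a "
        "junior analyst likely enforceable? Answer 'yes' or 'no', then explain."
    )
    if agreement < 0.8:
        print("Divergent conclusions: escalate to human review.")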

What to expect from vendors

Prompting alone won't erase inconsistency. Even strict settings can't guarantee identical conclusions on repeated runs. Ask vendors about consistency testing by task type, not just accuracy on benchmarks. Push for logs, comparison views, and tooling that supports multi-run review.

Where AI helps today

Use it confidently for rote recall, summaries with sources, and first-pass drafting. Be cautious on novel analysis and open-ended research. Require citations, verify quotes, and insist on jurisdictional fit.

Practical next steps for your team

  • Set internal standards for prompts, formats, and verification steps.
  • Adopt a two-pass rule on analysis and research: model pass + human pass.
  • Limit high-stakes use to workflows with retrieval and source checking.
  • Track where the model disagrees with itself; that's where you add guardrails first (a simple tracker is sketched after this list).
  • Train your team on task decomposition and structured prompting.
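
A disagreement log doesn't need to be elaborate. The sketch below, with illustrative field names and entries, tallies multi-run divergence by task type so you can rank where guardrails go first.

    # Sketch of a self-disagreement tracker: record each multi-run check,
    # then rank task types by how often the model contradicted itself.
    from collections import defaultdict

    checks: dict[str, list[bool]] = defaultdict(list)

    def record(task_type: str, diverged: bool) -> None:
        checks[task_type].append(diverged)

    # Illustrative entries accumulated from daily verification runs.
    record("research", True)
    record("research", True)
    record("analysis", False)
    record("analysis", True)
    record("knowledge", False)

    ranked = sorted(checks.items(), key=lambda kv: -sum(kv[1]) / len(kv[1]))
    for task, results in ranked:
        rate = sum(results) / len(results)
        print(f"{task}: self-disagreement in {rate:.0%} of checks")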

AI can speed the work and widen access. But consistency isn't a given. Treat it like any legal tool: validate, compare, and document your process. That's how you keep the work reliable and defensible.

