GPT-5 beats judges on rule-following. Should it judge?
New research from University of Chicago law professor Eric Posner and researcher Shivam Saran reports that GPT-5 applied the correct legal rule in every scenario they tested, outperforming a cohort of US federal judges who hit 52 percent. The model showed no hallucinations or obvious logical slips in these tasks.
Google's Gemini 3 Pro matched GPT-5 with a perfect score. Earlier work by the same team found GPT-4o stuck closely to the letter of the law in an ICTY war-crimes appeal simulation, even when sympathy for a party might have nudged a human judge.
What the numbers say
- GPT-5: 100% correct rule application; perfectly formalistic
- Gemini 3 Pro: 100%
- Gemini 2.5 Pro: 92%
- o4-mini: 79%
- Llama 4 Maverick: 75%
- Llama 4 Scout: 50%
- GPT-4.1: 50%
- US federal judges (comparison study): 52%
Important nuance: a 52 percent "rule-following rate" does not mean judges were lax. Many questions turn on standards and guidelines, not hard rules. Discretion is part of the job, especially where policy, equity, or context matters.
Why this matters for legal practice
AI is getting better at mechanical legal tasks. It's also getting more rigid. If you give it a rule-bound question, it will likely snap to the black-letter answer, fast and consistently. That is useful, but it can cut against fairness in cases where the law invites judgment instead of deduction.
The real question isn't whether a model can follow rules. It's who decides when rule-following should yield to standards, policy, or mercy, and how that discretion is documented, made reviewable, and held accountable.
Practical guidance for courts and chambers
- Segment the workload: Use AI for conflicts-of-law triage, cite checks, and consistency reviews. Keep discretionary calls (sentencing, custody, asylum, equitable remedies) human-led.
- Define "AI-eligible" issues: Rules with clear elements, bright lines, or fixed choice-of-law tests are in. Ambiguous standards are out.
- Require human sign-off: Any AI input must be reviewed, edited, and owned by a judicial officer or clerk.
- Set model parameters centrally: Lock system prompts and temperature. Document who controls changes and why.
- Maintain an audit trail: Save prompts, model versions, outputs, and edits. Make them discoverable where appropriate.
- Test blindly: Run historical opinions and bench memos through the system to check recall, precision, and unintended bias before live use.
- Localize law: Preload governing statutes, rules, and controlling precedent for the jurisdiction. Block access to non-authoritative sources.
- Security and confidentiality: Use on-prem or approved tenants. No public endpoints for sealed or sensitive materials.
- Error pathways: Define how to challenge or override AI suggestions, and who has that authority.
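The audit-trail item above can be made concrete with a small logging scheme. This is a minimal sketch, not a prescribed system: the record schema, file name, and field names are all illustrative. Each AI interaction is appended as one JSON line, and a content hash of the line makes later tampering detectable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """One reviewable entry per AI interaction (hypothetical schema)."""
    model_version: str  # exact model identifier used
    prompt: str         # full prompt as sent
    output: str         # raw model output, before any edits
    final_text: str     # text after human review and editing
    reviewer: str       # judicial officer or clerk who signed off
    timestamp: str      # UTC time of the interaction

def append_record(log_path: str, record: AuditRecord) -> str:
    """Append a record to a JSON-lines log; return its content hash."""
    line = json.dumps(asdict(record), sort_keys=True)
    digest = hashlib.sha256(line.encode()).hexdigest()
    with open(log_path, "a") as f:
        f.write(line + "\n")
    return digest

# Example entry; all values are illustrative.
record = AuditRecord(
    model_version="model-x-2025-01",
    prompt="Apply the forum's choice-of-law test to these facts...",
    output="Under the forum's test, the governing law is...",
    final_text="Edited analysis as adopted in the bench memo.",
    reviewer="clerk-a",
    timestamp=datetime.now(timezone.utc).isoformat(),
)
digest = append_record("ai_audit.jsonl", record)
```

Storing the hash alongside (or separately from) the log is what makes the trail auditable rather than merely archived: any edit to a saved line changes its digest.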
Practical guidance for litigators and in-house counsel
- Use AI where formality helps: Choice-of-law factors, element checklists, deadline and rule compliance, and cite validation.
- Guard against overreach: For standards (reasonableness, undue burden, best interests), force the tool to expose competing frameworks and policy trade-offs rather than a single answer.
- Prompt with constraints: Specify controlling jurisdiction, date cutoffs, and binding vs. persuasive authority. Ban speculation.
- Cross-verify: Independently confirm every citation and quote. No exceptions.
- Document provenance: Keep a record of the exact prompt, model version, and edits in the file.
- Prepare to explain: If you use AI-assisted analysis, be ready to justify the reasoning without leaning on "the model says so."
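The prompt-constraint guidance above can be captured in a reusable template. A minimal sketch, assuming a plain-text prompt interface; the function name, field names, and example citations are all illustrative:

```python
def build_constrained_prompt(question: str,
                             jurisdiction: str,
                             authority_cutoff: str,
                             binding_sources: list[str]) -> str:
    """Pin jurisdiction, date cutoff, and binding vs. persuasive
    authority before the question, and forbid speculation."""
    constraints = [
        f"Controlling jurisdiction: {jurisdiction}.",
        f"Do not rely on authority decided after {authority_cutoff}.",
        "Treat only the following as binding authority: "
        + "; ".join(binding_sources) + ".",
        "Label all other authority as persuasive only.",
        "If the cited authority does not determine the answer, say so. "
        "Do not speculate.",
    ]
    return "\n".join(constraints) + "\n\nQuestion: " + question

# Illustrative usage; the citations are placeholders.
prompt = build_constrained_prompt(
    question="Which state's statute of limitations applies?",
    jurisdiction="N.D. Ill. (applying Illinois choice-of-law rules)",
    authority_cutoff="2024-12-31",
    binding_sources=["735 ILCS 5/13-205", "Seventh Circuit precedent"],
)
```

Keeping the template in one function also serves the provenance point: the exact constraint text that accompanied each question is reproducible from the file.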
Open questions policy-makers should settle
- Parameter governance: Who sets model prompts and dials, and under what oversight?
- Transparency: When must parties be told AI assisted a decision or draft?
- Standards of review: How should appellate courts treat AI-influenced reasoning?
- Fairness vs. formalism: When do we permit deviation from rules for policy or equity, and who decides?
- Vendor accountability: What warranties, logs, and audit rights must be in the contract?
Where AI fits today
These findings make a strong case for AI as a co-pilot on rule-heavy tasks: conflicts-of-law screening, element-by-element analysis, and consistency checks across similar fact patterns. Use it to flag deviations and surface controlling authority quickly.
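An element-by-element pass is exactly the kind of mechanical task where this formalism helps. A minimal sketch, with a hypothetical set of negligence elements as input (the element names and findings are illustrative):

```python
def check_elements(elements: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return whether every element is satisfied, plus any that fail."""
    missing = [name for name, met in elements.items() if not met]
    return (not missing), missing

# Hypothetical triage pass over a negligence claim.
satisfied, missing = check_elements({
    "duty": True,
    "breach": True,
    "causation": False,
    "damages": True,
})
# satisfied is False; missing == ["causation"]
```

The value is the flag, not the verdict: a failed element routes the issue to a human, as the next paragraph argues.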
Do not let a model be the last word where standards, context, or moral judgment carry weight. The studies show models can ignore sympathy and stick to rules. That's valuable, right up until the just outcome requires stepping away from the script.
Further reading
- Eric Posner - University of Chicago Law
- International Criminal Tribunal for the former Yugoslavia (ICTY)
- AI Research Resources
Skill up your team
If your organization is formalizing AI literacy for legal-adjacent work, see our AI courses by job role for structured options tailored to practical workflows.