Major chatbots fail 90% of election-related prompts, Forum AI study finds

A new study found major chatbots failed 90% of election-related accuracy tests, with 35% of foreign policy answers drawing from state-run media. ChatGPT, Gemini, Claude, and Grok were all tested.

Categorized in: AI News, Science and Research
Published on: May 21, 2026

Major Chatbots Fail Accuracy Tests on Elections and Foreign Policy

ChatGPT, Gemini, Claude, and Grok show significant gaps in factual accuracy and source quality when answering questions about news, according to a study by Forum AI released this week. The findings raise questions about whether these widely used tools are reliable sources of information on sensitive topics.

Forum AI tested the chatbots across three dimensions: factual accuracy, bias, and source quality. The researchers aimed to provide an independent assessment that goes beyond the self-evaluations companies typically conduct.

The Numbers

Major chatbots failed on 90% of election-related prompts. On foreign policy questions, 35% of answers relied on state-run media sources. Basic finance and market questions showed a 30% factual error rate.

These gaps matter because researchers, analysts, and other professionals increasingly use chatbots to gather background information and verify facts.

Why Independent Testing Matters

Campbell Brown, CEO of Forum AI, said the study addresses a structural problem: "The model companies are essentially grading their own homework. It's really important that there be companies outside of the model companies that are doing this work and sharing the results."

Most existing benchmarks focus on technical capabilities like coding performance. They don't measure factual accuracy or bias in real-world applications, the areas where these tools are most likely to mislead users.

Political Patterns in Responses

The study found different bias patterns across models. ChatGPT and Gemini produced less biased responses on election questions, tending toward centrist or left-leaning answers. Grok exhibited a more pronounced right-leaning bias.

Brown said some models performed better than others on specific query types, but all have room for improvement.

The Broader Picture

Brown did not call for regulation but predicted demand for independent evaluation will increase. "You're already seeing some states pass laws where they're requiring independent evaluation," she said.

As these tools become embedded in professional workflows, the ability to assess their reliability independently becomes a baseline requirement, not a luxury.
