Major Chatbots Fail Accuracy Tests on Elections and Foreign Policy
ChatGPT, Gemini, Claude, and Grok show significant gaps in factual accuracy and source quality when answering questions about news, according to a study by Forum AI released this week. The findings raise questions about whether these widely used tools are reliable sources of information on sensitive topics.
Forum AI tested the chatbots across three dimensions: factual accuracy, bias, and source quality. The researchers aimed to provide an independent assessment that goes beyond the self-evaluations companies typically conduct.
The Numbers
Major chatbots failed on 90% of election-related prompts. On foreign policy questions, 35% of answers relied on state-run media sources. Basic finance and market questions showed a 30% factual error rate.
These gaps matter because researchers, analysts, and other professionals increasingly use chatbots to gather background information and verify facts.
Why Independent Testing Matters
Campbell Brown, CEO of Forum AI, said the study addresses a structural problem: "The model companies are essentially grading their own homework. It's really important that there be companies outside of the model companies that are doing this work and sharing the results."
Most existing benchmarks focus on technical capabilities like coding performance. They don't measure factual accuracy or bias in real-world applications, the areas where these tools are most likely to mislead users.
Political Patterns in Responses
The study found different bias patterns across models. ChatGPT and Gemini produced less biased responses on election questions, tending centrist or left-leaning. Grok exhibited a more pronounced right-leaning bias.
Brown said some models performed better than others on specific query types, but all have room for improvement.
The Broader Picture
Brown did not call for regulation but predicted demand for independent evaluation will increase. "You're already seeing some states pass laws where they're requiring independent evaluation," she said.
As these tools become embedded in professional workflows, the ability to assess their reliability independently becomes a baseline requirement, not a luxury.