ChatGPT scores 60% better than random guessing on scientific true-or-false questions
Researchers at Washington State University tested ChatGPT's ability to evaluate whether hypotheses from scientific papers were supported by research. The results: the AI answered correctly about 80% of the time, but when adjusted for random chance, it performed only 60% better than guessing.
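The article doesn't spell out how the chance adjustment was computed; a common chance-correction for binary true/false questions (a Cohen's-kappa-style rescaling, assumed here) reproduces the reported figures:

```python
def chance_adjusted_accuracy(raw_accuracy: float, chance_rate: float = 0.5) -> float:
    """Rescale raw accuracy so 0.0 means pure guessing and 1.0 means perfect.

    For true/false questions a random guesser is right half the time
    (chance_rate = 0.5), so the score measures how much of the gap
    between guessing and perfection the model actually closes.
    """
    return (raw_accuracy - chance_rate) / (1.0 - chance_rate)

# 80% raw accuracy on true/false questions -> 60% better than guessing
print(round(chance_adjusted_accuracy(0.80), 2))
```

Under this assumed formula, the 80% raw accuracy and the "60% better than guessing" figure are consistent with each other.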
The study examined 719 hypotheses from business journals published since 2021. Researchers repeated each query 10 times using identical prompts to test consistency.
Inconsistency emerged as a critical weakness
ChatGPT gave different answers to the same question across repeated prompts. In some cases, the AI answered "true" five times and "false" five times for an identical query.
ChatGPT answered correctly across all 10 repetitions for only 73% of statements. The AI struggled most with false statements, correctly identifying them just 16.4% of the time.
"We're not just talking about accuracy, we're talking about inconsistency, because if you ask the same question again and again, you come up with different answers," said Mesut Cicek, an associate professor at WSU's Carson College of Business and lead author of the study.
Fluent language masks weak reasoning
The findings highlight a gap between what these AI tools appear to do and what they actually do. ChatGPT can produce convincing, grammatically correct responses to complex questions, but it often reasons incorrectly while sounding authoritative.
"Current AI tools don't understand the world the way we do; they don't have a 'brain,'" Cicek said. "They just memorize, and they can give you some insight, but they don't understand what they're talking about."
Researchers tested both ChatGPT-3.5 (in 2024) and ChatGPT-4 mini (in 2025). Accuracy improved slightly between versions, but the pattern held: the AI performed only marginally better than chance when adjusted for random guessing.
What this means for your work
The study, published in the Rutgers Business Review, recommends that professionals verify AI results before relying on them for consequential decisions. This applies especially to tasks requiring nuance or complex reasoning.
Managers should train staff on what ChatGPT can and cannot do reliably. Treating AI outputs with skepticism is essential, particularly in scientific and research contexts where accuracy matters.
Cicek's team ran similar tests with other AI tools and found comparable results. Earlier research by the same group found that consumers were less likely to buy products marketed with an AI emphasis, suggesting skepticism extends beyond researchers.
"Always be skeptical," Cicek said. "I'm not against AI. I'm using it. But you need to be very careful."
For professionals in science and research, the takeaway is straightforward: treat AI output as a starting point, not a conclusion. Verify claims independently, especially when stakes are high.