Binghamton researchers cut AI hallucinations by having multiple chatbots "vote" on answers
Researchers at Binghamton University have developed a method designed to filter out false information generated by large language models - a persistent problem when people use AI chatbots for medical diagnosis.
The protocol runs the same question through seven different open-source LLMs, then uses a voting system to verify the answers. Across 10,000 experiments, 76.85% of medical diagnoses were supported by at least four models, and the remaining 23.15% were backed by at least two. No hallucinations appeared.
Ahmed Abdeen Hamed and Luis M. Rocha, researchers in Binghamton's School of Systems Science and Industrial Engineering, developed the approach with funding from New York's Empire AI Consortium. The journal STAR Protocols published their findings.
How the verification works
Each of the seven LLMs received the same plain-language symptom descriptions. Before responding, each model had to use retrieval-augmented generation (RAG), which anchored its output to authoritative medical databases. Each model then returned standardized medical terms along with their official identification numbers.
The models submitted their answers to a vote. Answers backed by multiple models made it through; unsupported claims did not.
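The voting step can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the function name, the model names, the placeholder term IDs, and the minimum-support threshold of two are all assumptions chosen for illustration, loosely following the "at least two models" figure reported above.

```python
from collections import Counter


def vote(model_answers, min_support=2):
    """Keep only the term IDs proposed by at least `min_support` models.

    `model_answers` maps a model name to the set of term IDs it returned.
    Both the data shape and the threshold are illustrative assumptions.
    """
    counts = Counter()
    for terms in model_answers.values():
        # Use a set so each model gets one vote per term.
        counts.update(set(terms))
    return {term: n for term, n in counts.items() if n >= min_support}


# Hypothetical answers from seven models; the IDs are placeholders,
# not real medical codes.
answers = {
    "model_a": {"ID:001", "ID:002"},
    "model_b": {"ID:001"},
    "model_c": {"ID:001", "ID:003"},
    "model_d": {"ID:001", "ID:002"},
    "model_e": {"ID:004"},
    "model_f": {"ID:001"},
    "model_g": {"ID:005"},
}

accepted = vote(answers)
print(accepted)
```

In this sketch, a term that only one model proposes (such as the placeholder "ID:003") never reaches the threshold and is dropped, which mirrors how an unsupported claim fails the vote.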
"The new workflow can verify anything from a biomedical point of view - biological knowledge with disease and genetics, translational knowledge from diseases to treatments, and healthcare applications with symptoms and treatments," Hamed said.
Building confidence through repetition
The protocol can be rerun with a different random selection of LLMs drawn from a pool of more than 100 open-source models. Each additional run increases confidence in the voting outcome.
"When we perform the experiment many, many times, we increase the confidence in the voting," Hamed said.
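One way to picture the repeated runs is the sketch below, which samples a fresh panel of models on every round and records how often each term survives the vote. Everything here is hypothetical: the `ask` callable stands in for whatever queries a model, and the panel size, round count, threshold, and seed are assumed values rather than details from the study.

```python
import random
from collections import Counter


def repeated_vote(pool, ask, question, rounds=100, panel_size=7,
                  min_support=2, seed=0):
    """Return the fraction of rounds in which each term passed the vote.

    `pool` is a list of model names and `ask(model, question)` returns the
    term IDs that model proposes. Both are placeholders for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    passed = Counter()
    for _ in range(rounds):
        # Draw a new random panel without replacement for this round.
        panel = rng.sample(pool, panel_size)
        counts = Counter()
        for model in panel:
            counts.update(set(ask(model, question)))
        # Record every term that met the support threshold this round.
        passed.update(t for t, n in counts.items() if n >= min_support)
    return {term: n / rounds for term, n in passed.items()}


# Stub pool and stub model: every model returns "ID:001", and a single
# model additionally returns an unsupported term.
pool = [f"m{i}" for i in range(20)]


def fake_ask(model, question):
    terms = {"ID:001"}
    if model == "m0":
        terms.add("ID:999")
    return terms


rates = repeated_vote(pool, fake_ask, "persistent cough and fever")
print(rates)
```

A term that passes in nearly every round is one that many different panels agree on, while a term proposed by a single model never gathers enough support no matter how many rounds are run.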
Rocha noted the method could support development of "digital twins" for precision medicine - virtual replicas of physical processes updated in real time to create predictive simulations of human reactions.
Beyond medical diagnosis
While the study focused on biomedical applications, the verification method could address hallucinations in other domains: fabricated legal citations, fake academic references, or historical errors.
"This protocol is a big step toward the democratization of knowledge verification," Hamed said.
Hamed has moved from Binghamton to a research associate professor role at the University of Nebraska-Lincoln. Rocha credited him with developing AI workflows and catalyzing new research directions within the lab.
For professionals working with generative AI and LLMs, the research shows how structured prompt engineering combined with multi-model verification can reduce the risk of relying on information that is confidently stated but false.