Search-Capable AI Agents May Cheat on Benchmark Tests
Researchers at Scale AI have found that AI models equipped with search capabilities can bypass genuine reasoning by pulling answers directly from online sources during benchmark tests. The practice, which the team terms "Search-Time Data Contamination" (STC), raises concerns about the reliability of AI evaluations.
AI models typically train on datasets available up to a certain cutoff date, which limits their knowledge of events or information emerging afterward. To address this, several companies—including Anthropic, Google, OpenAI, and Perplexity—have integrated real-time search functions into their models, enabling access to the latest online data.
Investigating Perplexity’s AI Agents
Scale AI's team focused on Perplexity's AI agents—Sonar Pro, Sonar Reasoning Pro, and Sonar Deep Research—to analyze how often these agents accessed benchmark datasets and correct answers from HuggingFace, a popular repository for AI models and benchmarks.
The study revealed that on three common benchmark tests—Humanity's Last Exam (HLE), SimpleQA, and GPQA—approximately 3% of questions were answered by directly retrieving ground truth labels from HuggingFace during evaluation. This is a clear case of STC, where search retrieval inadvertently provides the answer instead of the model reasoning it out.
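The exact detection pipeline isn't reproduced here, but the core idea can be sketched in a few lines: compare the URLs an agent retrieved for each question against known benchmark dataset pages. The snippet below is a minimal illustration under assumed names; the dataset paths and the retrieval-log format are hypothetical placeholders, not taken from the study.

```python
# Minimal sketch of flagging potential search-time contamination (STC).
# Assumption: we have each agent's retrieval log as a list of URLs per question.
# The dataset paths below are hypothetical placeholders, not the study's list.
from urllib.parse import urlparse

BENCHMARK_DATASET_PATHS = {
    "/datasets/example-org/hle",       # Humanity's Last Exam (placeholder path)
    "/datasets/example-org/simpleqa",  # SimpleQA (placeholder path)
    "/datasets/example-org/gpqa",      # GPQA (placeholder path)
}

def flags_ground_truth_retrieval(retrieved_urls: list[str]) -> bool:
    """True if any retrieved URL points at a HuggingFace page hosting benchmark labels."""
    for url in retrieved_urls:
        parsed = urlparse(url)
        if parsed.netloc in {"huggingface.co", "www.huggingface.co"} and any(
            parsed.path.startswith(path) for path in BENCHMARK_DATASET_PATHS
        ):
            return True
    return False

# Example: a question whose cited sources include a benchmark dataset page is flagged.
sources = [
    "https://huggingface.co/datasets/example-org/gpqa/viewer",
    "https://en.wikipedia.org/wiki/Quantum_chromodynamics",
]
print(flags_ground_truth_retrieval(sources))  # True
```

Aggregating such a flag across every question in a benchmark run gives a contamination rate in the spirit of the roughly 3% figure above.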
Impact of Search-Time Data Contamination
When the researchers blocked access to HuggingFace, the agents’ accuracy on the contaminated subset dropped by about 15%. Moreover, experiments suggest that HuggingFace is not the only source of STC, indicating a broader issue with online data access during testing.
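The blocking experiment can be pictured as a simple domain filter sitting between the search tool and the agent: drop the blocked domain's hits, rerun the same questions, and compare accuracy on the previously flagged subset. The sketch below assumes a generic search-result format and is an illustration of the ablation's logic, not the study's actual harness.

```python
# Illustrative domain-blocklist ablation (assumed setup, not the study's harness):
# remove blocked-domain hits from the search results before the agent sees them.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"huggingface.co", "www.huggingface.co"}

def filter_results(results: list[dict]) -> list[dict]:
    """Drop any search hit whose URL belongs to a blocked domain."""
    return [r for r in results if urlparse(r["url"]).netloc not in BLOCKED_DOMAINS]

hits = [
    {"url": "https://huggingface.co/datasets/example-org/hle", "title": "dataset viewer"},
    {"url": "https://example.org/physics-lecture-notes", "title": "lecture notes"},
]
print(filter_results(hits))  # only the non-HuggingFace hit remains
```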
While 3% might seem minor, it matters on frontier benchmarks like HLE, where a 1% change in score can shift model rankings. More importantly, the findings call into question the validity of any evaluation run while a model has live internet access, undermining trust in the resulting benchmark scores.
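A rough back-of-envelope reading of the reported figures (an illustration only, not a calculation from the paper) shows why even a small contamination rate matters at the top of the leaderboard:

```python
# Back-of-envelope estimate built only from the figures quoted above;
# it is illustrative, not a result reported by the study.
contaminated_fraction = 0.03   # ~3% of questions showed ground-truth retrieval
drop_when_blocked = 0.15       # ~15% accuracy drop on that subset with HuggingFace blocked

# Estimated inflation of the overall score from HuggingFace-based STC alone.
estimated_inflation = contaminated_fraction * drop_when_blocked   # ~0.0045
# Upper bound if every contaminated answer depended entirely on the retrieved label.
upper_bound = contaminated_fraction                               # 0.03

print(f"estimated inflation: ~{estimated_inflation:.2%}")  # ~0.45%
print(f"upper bound:         ~{upper_bound:.0%}")           # ~3%
# On HLE, where a one-point swing can reorder rankings, even the lower figure
# is a sizable share of that margin -- and HuggingFace is not the only source of STC.
```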
Broader Concerns About AI Benchmarks
AI benchmarks have faced criticism for various shortcomings, including poor design, bias, data contamination, and susceptibility to gaming. A recent survey of 283 AI benchmarks by researchers in China highlights problems such as:
- Inflated scores due to data contamination
- Unfair evaluations caused by cultural and linguistic biases
- Little or no assessment of reasoning-process credibility or of adaptability to dynamic environments
This survey calls for new design paradigms to improve benchmark quality and reliability.
For AI researchers and practitioners, these findings serve as a cautionary note to scrutinize benchmark results carefully, especially when models leverage online search during evaluation.
To explore more about AI model evaluation and training techniques, consider visiting Complete AI Training for up-to-date courses and resources.