Generative Artificial Intelligence in Scientific Peer Review
Text produced by generative artificial intelligence (AI) is becoming increasingly difficult to detect, raising serious concerns about the integrity of scientific peer review.
AI Fools Detection Software in Scientific Publishing Test
Researchers from medical institutions in China tested Claude 2.0, a large language model, by asking it to generate peer review reports for 20 cancer biology papers published in eLife. The AI produced full peer reviews, rejection recommendations, citation requests, and rebuttals to unreasonable citation demands.
Popular AI detection tools struggled badly. GPTZero, a widely used detector, misclassified 82.8% of the AI-generated reviews as human-written. ZeroGPT performed only slightly better, still misidentifying 59.7% as human-authored. Despite claims of high accuracy, neither tool reliably recognized AI-generated academic writing.
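To make the reported figures concrete, here is a minimal Python sketch of how such a misclassification rate is computed. The `detect` function is a hypothetical stand-in for a commercial detector's API call, not GPTZero's or ZeroGPT's actual interface.

```python
# Minimal sketch: measuring how often a detector mislabels AI text as human.
# `detect` is a hypothetical placeholder; a real evaluation would call a
# service such as GPTZero or ZeroGPT here instead.

def detect(text: str) -> str:
    """Placeholder detector returning 'ai' or 'human' for a given text."""
    return "human"  # fixed verdict, purely for illustration

def misclassification_rate(ai_reviews: list[str]) -> float:
    """Fraction of AI-written reviews the detector mislabels as human."""
    mislabeled = sum(1 for review in ai_reviews if detect(review) == "human")
    return mislabeled / len(ai_reviews)

reviews = ["AI-generated review text ..."] * 10  # stand-in corpus
print(f"{misclassification_rate(reviews):.1%}")  # 100.0% with the placeholder;
                                                 # GPTZero's reported rate was 82.8%
```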
The AI-generated reviews maintained a professional academic tone and college-level readability, making them difficult for both detection software and human readers to distinguish from genuine peer reviews.
Beyond Detection: AI’s Concerning Capabilities
The more alarming issue is what the AI can do. When asked to reject research papers, over 76% of the AI’s rejection comments received above-average ratings from expert reviewers. This suggests that AI can produce convincing, credible-sounding critiques capable of influencing editorial decisions.
Additionally, the AI created plausible but irrelevant citation requests, some of which experts scored a perfect five. It even justified citing unrelated papers across fields as distant as materials science and medical research, raising concerns about misuse such as artificially inflating citation metrics.
However, the AI struggled to match detailed human reviews: it compared well against broad, superficial feedback but failed to replicate the depth and specificity of expert critiques.
What This Means for Scientific Research
Current safeguards are ill-equipped to handle these challenges. Malicious reviewers could exploit AI to generate convincing but biased or manipulative content, undermining the peer review process and possibly suppressing valid scientific work.
There is also a risk that reviewers might use AI to boost their own citation counts by inserting unjustified references to their papers. On the positive side, the AI showed promise in generating rebuttals to unreasonable citation requests, indicating it could also assist in detecting manipulation.
Academic journals must act quickly. The study recommends mandatory disclosure of AI use by reviewers, similar to disclosure policies for authors. Publishers need clear operational guidelines to handle suspected AI-generated reviews. Additionally, AI providers should consider implementing restrictions to prevent misuse.
Given the study's limited scope of 20 cancer biology papers, further research across disciplines is necessary. Yet the core issue is evident: as AI advances and detection tools lag behind, the scientific community must develop new protections that maintain research integrity without hindering legitimate use of the technology.
Study Summary
- Methodology: Claude 2.0 generated peer reviews, rejection recommendations, citation requests, and rebuttals for 20 cancer biology papers from eLife. Each task was repeated three times to check consistency. Two expert oncology reviewers scored the AI content on a five-point scale. The AI detection tools ZeroGPT and GPTZero were tested on their ability to identify the AI-generated text (a minimal code sketch of this pipeline follows the summary).
- Results: AI detection tools failed significantly, with GPTZero misclassifying 82.8% of AI reviews as human-written. The AI generated convincing rejection comments and plausible citation requests, even for unrelated research fields, but it struggled to match the detail and specificity of human reviewers' critiques.
- Limitations: The study covered only 20 cancer biology papers and focused on Claude 2.0 and two detection tools. Manual expert scoring limited sample size, and results may not generalize across all scientific fields or AI systems.
- Funding and Disclosures: No funding sources or conflicts of interest were declared.
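To illustrate the methodology in code, here is a minimal sketch of the generate-then-detect pipeline, assuming the Anthropic Python SDK. It is not the authors' actual code: the prompt wording and settings are illustrative assumptions, and the detection and scoring steps are indicated in comments because the study used commercial tools and human experts for those stages.

```python
# Minimal sketch of the study's generate-then-detect pipeline, assuming the
# Anthropic Python SDK (pip install anthropic). Not the authors' code; the
# prompt and settings are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_review(manuscript_text: str) -> str:
    """Ask the model to draft a peer review of the given manuscript."""
    response = client.messages.create(
        model="claude-2.0",  # the legacy model evaluated in the study
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": "Write a peer review report for this manuscript:\n\n"
                       + manuscript_text,
        }],
    )
    return response.content[0].text

# In the study's design, each review was generated three times per paper for
# consistency, scored by two expert oncology reviewers on a five-point scale,
# and then submitted to the detectors (GPTZero, ZeroGPT) for classification.
```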
For those interested in further exploring AI’s impact on research workflows and how to responsibly integrate AI tools, consider visiting Complete AI Training for curated AI education resources.
Reference: Zhu, L., Lai, Y., Xie, J., et al. “Evaluating the potential risks of employing large language models in peer review,” Clinical and Translational Discovery, June 27, 2025. DOI: 10.1002/ctd2.70067