Gemini and Claude Evade Detection Tools in AI Writing Test
More than half of all new articles published online are now generated by artificial intelligence. Research firm Ahrefs found that 74.2% of newly created web pages in April 2025 contained AI-generated content, while a separate study by SEO firm Graphite put the figure at approximately 52% of the 65,000 English-language articles it analyzed.
As AI-generated content floods the web, a new study tested which writing platforms are hardest for detection software to identify.
The Test
Open Resource Applications, a company developing free AI tools, instructed 12 AI writing platforms to produce articles between 1,000 and 1,500 words designed to sound human. Each piece was then evaluated by three detection tools: Grammarly, QuillBot, and GPTZero.
Results: Gemini Leads, ChatGPT Lags
Google's Gemini proved hardest to detect, with an average detection rate of just 39% across the three tools. Grammarly flagged only 17% of Gemini's output, while QuillBot detected none of it.
Anthropic's Claude ranked second at 41%, with Grammarly finding no AI signals in its text. Grok, the AI assistant from xAI, came third at 46.33%.
ChatGPT, the most widely used AI platform globally with 800 million to 900 million monthly users, ranked ninth out of twelve. Grammarly detected 50% of its output, while QuillBot and GPTZero flagged 90% to 100% of its text as machine-generated.
A spokesperson for Open Resource Applications attributed Gemini's performance to its reasoning ability. "Tools like GPTZero flag predictability and overall structure, so a model that actually reasons through ideas rather than recycling familiar phrases is going to be a lot harder to catch," the spokesperson said.
ChatGPT's early dominance works against it in detection tests. "Everyone knows what it sounds like. Many models that came after it sounded like ChatGPT first, before they became more unique," the spokesperson said.
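For readers curious what scoring "predictability" can look like in practice, the sketch below is a minimal illustration only, not GPTZero's actual method: it uses an off-the-shelf GPT-2 model from Hugging Face to compute perplexity on a passage, and the cutoff of 30 is an assumed placeholder rather than a calibrated threshold.

```python
# Illustrative sketch: score how "predictable" a passage is to a small
# language model. Lower perplexity = more predictable text, which some
# detectors treat as a signal of machine generation. Model choice and
# threshold here are assumptions, not any vendor's real pipeline.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Encode the text and ask the model to predict each token from its context.
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return float(torch.exp(loss))           # exponentiate to get perplexity

# Hypothetical cutoff: very predictable text gets flagged as likely AI-generated.
sample = "Artificial intelligence is transforming the way content is produced online."
print("likely AI" if perplexity(sample) < 30 else "likely human")
```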
Which Detection Tool Works Best?
GPTZero proved the most difficult to fool, catching an average of 98.8% of AI-generated content across all platforms. Grammarly showed the weakest performance, correctly flagging just 43.5% of generated content on average. Only Claude and Meta AI managed to confuse GPTZero with any consistency.
All three detection tools avoided false positives. None incorrectly flagged human-written text as AI-generated, which adds confidence to their positive detections.
What This Means for Writers
The detection gap between AI platforms is widening. Newer models like Gemini and Claude are becoming harder to flag, while older systems like ChatGPT remain easier to identify.
Independent research suggests AI-generated content growth has plateaued in recent months. Publishers have found that AI articles tend not to perform as well in search results, which may explain why some have slowed their AI publishing efforts.
For writers, it matters which platforms produce easily detectable output. If you're using AI for writing, knowing how these detection tools perform clarifies the practical limits of AI assistance in publishing.
The arms race between generative AI platforms and detection tools continues to accelerate, and each new model generation widens the gap between the best- and worst-performing systems.