AI Detectors Put to the Test: Study Reveals Big Differences in Spotting ChatGPT-Generated Text

A study finds AI detection tools vary widely in accuracy, with Pangram achieving a 99.3% success rate and outperforming every individual human expert. Human judgment remains crucial for catching subtle AI writing cues.

Published on: Jun 04, 2025

Study Reveals Varying Effectiveness of AI Detection Software

Generative AI tools like ChatGPT have created a challenge for educators and writers alike: determining whether a piece of writing is human-authored or AI-generated. To help, several AI detection programs have emerged, each claiming to distinguish human writing from AI-produced content. According to a recent study, however, their effectiveness varies significantly.

Comparing Humans and AI Detectors

The study, led by a computer science researcher at the University of Maryland, compared how well humans and AI detection tools identify AI-generated text. It pitted five commercial and open-source detectors against expert human reviewers across multiple phases of increasing difficulty.

  • Phases of testing: The researchers selected 30 nonfiction articles written by humans, then generated AI versions of similar length and topic using models such as GPT-4o and Anthropic’s Claude, escalating to advanced paraphrasing that mimicked student attempts to evade detection.
  • AI detection tools tested: Pangram and GPTZero (commercial), Binoculars and Fast-DetectGPT (open-source), and RADAR (a Chinese research framework).

The standout performer was Pangram, which achieved a 99.3% success rate, outperforming every human expert individually. It excelled even when AI content had been “humanized,” meaning its phrasing was altered to sound more natural. In contrast, GPTZero and the open-source tools struggled with these harder texts.

How Pangram Stands Out

Most AI detectors rely on metrics like “perplexity” (how surprising each word is) and “burstiness” (variation in surprise across text) to identify AI writing patterns. Human writing tends to have higher variation and creativity, while AI-generated text is often more formulaic.
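
To make those two signals concrete, here is a minimal sketch that scores a passage with a small open language model. It is illustrative only: the model choice (gpt2 via Hugging Face transformers) and the exact metric definitions are assumptions on our part, and real detectors calibrate these scores against large corpora of known human and AI text.

```python
# Minimal sketch: estimate "perplexity" (average surprise per token) and
# "burstiness" (variation in surprise across tokens) with a small causal LM.
# The model (gpt2) and metric definitions are illustrative assumptions.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprise_scores(text: str) -> tuple[float, float]:
    """Return (perplexity, burstiness) for a passage of text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Negative log-likelihood of each actual next token under the model.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_nll = -log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    perplexity = math.exp(token_nll.mean().item())  # low = predictable text
    burstiness = token_nll.std().item()             # low = uniformly "flat" text
    return perplexity, burstiness

ppl, burst = surprise_scores("The committee met on Tuesday to review the findings.")
print(f"perplexity={ppl:.1f}  burstiness={burst:.2f}")
```

Formulaic AI output tends to score low on both numbers, which is exactly why low-perplexity human writing can get caught in the crossfire, as the next paragraph notes.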

However, this approach has drawbacks. Writers still learning the language, or those with a limited vocabulary, may produce low-perplexity text that gets falsely flagged as AI-generated. This is especially relevant for students and English learners.

Pangram’s edge comes from a training technique called “synthetic mirrors.” It pairs each human text with an AI-generated counterpart, then retrains on mistakes by generating new synthetic pairs. This iterative learning helps the software better distinguish subtle differences between human and AI writing, reducing false positives dramatically.
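
Pangram’s actual pipeline is not public, so the following is only a rough sketch of the loop that description implies. The callables `mirror` (generate an AI counterpart of a human text) and `fit` (train a detector on labeled pairs) are hypothetical stand-ins, not Pangram’s API.

```python
from typing import Callable

# Rough sketch of a "synthetic mirrors" training loop as described above.
# `mirror(text)` generates an AI counterpart of a human text; `fit(pairs)`
# trains a detector and returns a predict function mapping a text to
# "human" or "ai". Both are hypothetical stand-ins.

def synthetic_mirrors(
    human_texts: list[str],
    mirror: Callable[[str], str],
    fit: Callable[[list[tuple[str, str]]], Callable[[str], str]],
    rounds: int = 3,
) -> Callable[[str], str]:
    # Pair every human text with an AI-generated counterpart of similar
    # topic and length.
    pairs = [(h, mirror(h)) for h in human_texts]
    predict = fit(pairs)

    for _ in range(rounds):
        # Collect the texts the current detector gets wrong on either side...
        hard = [h for h, a in pairs if predict(h) != "human" or predict(a) != "ai"]
        if not hard:
            break
        # ...generate fresh synthetic mirrors for those hard cases, and retrain.
        pairs += [(h, mirror(h)) for h in hard]
        predict = fit(pairs)

    return predict
```

The design choice worth noting is that each retraining round targets the detector’s own mistakes, which is what pushes down false positives on unusual human writing.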

Human Expertise Remains Valuable

Interestingly, the study showed that people experienced with generative AI—such as writers, teachers, and editors—were quite effective at spotting AI-generated text without formal training. Individually, their accuracy ranged from 59.3% to 97.3%, but together, through majority voting, they misclassified only one article out of 300.
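
The pooling itself is simple majority voting; this toy snippet (the verdicts are invented for illustration) shows how an odd-sized panel can outvote its own members’ individual errors.

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Pool independent "human"/"ai" verdicts; the most common label wins."""
    return Counter(verdicts).most_common(1)[0][0]

# Five reviewers, two of them wrong: the panel still lands on the right call.
print(majority_vote(["ai", "ai", "human", "ai", "human"]))  # -> ai
```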

This success stems from their knowledge of grammar, writing conventions, and exposure to AI-generated patterns. Each expert spotted different clues, suggesting that sharing detection strategies could improve accuracy even further.

Practical Takeaways for Writers and Educators

  • Automated AI detection tools vary widely in performance; some commercial options like Pangram offer near-human accuracy, while many open-source tools struggle, especially with sophisticated AI-generated content.
  • Human judgment remains critical. Familiarity with AI writing styles and careful analysis can catch subtle inconsistencies that software might miss.
  • Beware of false positives, particularly with emerging writers or those still developing language skills. Low perplexity doesn’t always mean AI-generated content.
  • Using AI detectors as one part of a broader evaluation process—rather than relying on them blindly—helps avoid unwarranted suspicion or accusations.

For writers looking to sharpen their understanding of AI tools and their impact on writing, exploring targeted AI training can be beneficial. Resources like Complete AI Training’s courses offer practical insights into AI’s role in content creation and detection.

Ultimately, blending AI detection technology with informed human oversight provides the best defense against misattributed authorship and helps maintain trust in authentic writing.