As the United Nations marks the International Day for Countering Hate Speech on June 18, research reveals that AI systems used by social platforms to detect hate speech deliver wildly inconsistent results, with some models flagging near-constant hate while others miss implicit attacks entirely. With more than two-thirds of internet users encountering hate speech online, the gaps in AI moderation carry real consequences for who gets protected and who gets silenced.
The scale of online hate speech
An Ipsos-UNESCO survey of 8,000 people across 16 countries in 2023 found that over two-thirds of respondents had seen hate speech online. LGBTQI people were the most targeted group, cited by 33 percent of participants, followed by ethnic and racial minorities at 28 percent. Detection and removal practices vary sharply by platform. In the last quarter of 2025, Meta removed 1.3 million hateful posts from Instagram and 1.3 million from Facebook. That is a steep drop from 7.4 million and 5.8 million, respectively, in the same period a year earlier, and it followed the company's decision to reduce proactive AI detection and lean more on user reports. TikTok, by contrast, said it removed 96.3 percent of hate speech in that quarter before anyone reported it.
How AI moderation systems work
Platforms increasingly rely on large language models to filter hate speech at scale. These systems are trained on labeled datasets and apply scoring thresholds to decide whether a post violates policies. A 2025 study from the University of Pennsylvania evaluated seven moderation models, including those from OpenAI, Anthropic, DeepSeek, Mistral, and Google, and found large discrepancies in how they scored the same content across different target groups. The Mistral Moderation Endpoint often assigned scores near 1, labeling most examples as highly hateful regardless of context. OpenAI's endpoint regularly produced scores less than half those of other models. The study authors warned that when two systems flag the same content differently, "it undermines the legitimacy of the moderation process."
These inconsistencies affect the millions of posts removed or left up each day, shaping what speech gets amplified and what gets suppressed. For professionals working with generative AI and LLM systems, the findings highlight how model choice and training data can produce starkly different results from identical inputs.
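To make the thresholding step concrete, here is a minimal sketch of how raw scores become flag decisions and how often two models then disagree on the same posts. The model names, the scores, and the 0.5 cutoff are all invented for illustration. They are not taken from the University of Pennsylvania study or from any vendor's API.

```python
from itertools import combinations

# Hypothetical scores in [0, 1] from three unnamed moderation models for the
# same three posts. The numbers are invented for illustration only.
scores = {
    "model_a": [0.95, 0.97, 0.92],  # scores everything near 1
    "model_b": [0.20, 0.85, 0.10],  # scores most posts low
    "model_c": [0.40, 0.90, 0.55],
}

THRESHOLD = 0.5  # assumed policy cutoff; real platforms tune this per policy


def flag(score_list, threshold=THRESHOLD):
    """Turn raw scores into binary flag decisions."""
    return [s >= threshold for s in score_list]


def disagreement_rate(flags_x, flags_y):
    """Fraction of posts on which two models reach different decisions."""
    differing = sum(x != y for x, y in zip(flags_x, flags_y))
    return differing / len(flags_x)


decisions = {name: flag(vals) for name, vals in scores.items()}
pairwise = {
    (a, b): disagreement_rate(decisions[a], decisions[b])
    for a, b in combinations(sorted(decisions), 2)
}

for pair, rate in pairwise.items():
    print(pair, round(rate, 2))
```

In this toy data, the model that scores everything near 1 flags all three posts, while the low-scoring model flags only one, so the two disagree on two of the three posts even though the inputs are identical.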
AI's blind spots: implicit hate and reclaimed language
Explicit slurs and profanities are the easiest for AI to catch. Far harder is implicit hate speech: messages that carry no offensive words but still demean a group. Arkaitz Zubiaga, an associate professor at Queen Mary University of London and co-lead of the university's Social Data Science lab, described the problem: "One challenging example is the case of implicit hate speech, which is often not detected as such because it contains no mention of slurs. This could be the case of a positive-sounding message such as 'I would love to see how great the world would be if…' followed by a derogatory message disparaging a demographic group. AI systems can struggle to see the hate in those messages if they focus instead on the positive side of the message."
The same models also trip over reclaimed language: slurs that marginalized communities have repurposed. "While these cases should not be flagged as hateful, AI systems have a tendency to do it," Zubiaga said. These misclassifications, in both directions, erode trust in automated moderation and can cause real harm to the people they are meant to protect.
On social media platforms, where volume demands automated triage, such errors mean that subtle attacks on vulnerable groups frequently go unchallenged while members of the targeted communities are unfairly penalized for using reclaimed terms.
Why this matters for IT, development, and research professionals
For engineers deploying content moderation APIs, researchers fine-tuning language models, and product teams defining safety thresholds, the findings show that choosing a moderation model is not a neutral technical decision. The same post can be flagged as severe hate by one system and ignored by another. That variability directly impacts user safety, platform liability, and the fairness of automated systems. Addressing it requires careful evaluation across demographic categories, not just aggregate accuracy scores, and closer attention to how models handle implicit meaning and reclaimed terms. As online hate continues to spread, the professionals building and auditing these systems will determine whether AI helps or hinders the fight.
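As a rough illustration of evaluating across demographic categories rather than by aggregate accuracy alone, the sketch below computes false negative and false positive rates per target group. The group names and records are hypothetical and invented for this example. A real audit would use a labelled benchmark and far more data.

```python
from collections import defaultdict

# Hypothetical labelled records: (target_group, is_hateful, was_flagged).
records = [
    ("group_x", True, True),
    ("group_x", True, True),
    ("group_x", False, False),
    ("group_x", False, False),
    ("group_y", True, False),   # e.g. an implicit attack the model missed
    ("group_y", True, True),
    ("group_y", False, True),   # e.g. a reclaimed term wrongly flagged
    ("group_y", False, False),
]


def per_group_error_rates(rows):
    """Return false negative and false positive rates for each group."""
    counts = defaultdict(lambda: {"fn": 0, "pos": 0, "fp": 0, "neg": 0})
    for group, is_hateful, was_flagged in rows:
        c = counts[group]
        if is_hateful:
            c["pos"] += 1
            c["fn"] += int(not was_flagged)
        else:
            c["neg"] += 1
            c["fp"] += int(was_flagged)
    return {
        group: {
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
        }
        for group, c in counts.items()
    }


# Aggregate accuracy hides the gap that the per-group rates expose.
accuracy = sum(h == f for _, h, f in records) / len(records)
rates = per_group_error_rates(records)

print("overall accuracy:", accuracy)
for group, r in rates.items():
    print(group, r)
```

Here the overall accuracy is 0.75, yet every error falls on one group: half of the attacks on it are missed and half of its benign posts are flagged. A single aggregate score would not reveal that pattern.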