Inside AI Watchdog: Exposing How Generative AI Is Built on Secret Data Sets and Pirated Content

AI Watchdog reveals which materials appear in AI training data sets, drawing on millions of books, videos, and articles. That transparency helps creators see whether their work may have been used without consent.

Categorized in: AI News, Writers
Published on: Sep 11, 2025

AI Watchdog: Shedding Light on Generative AI Training

Generative-AI companies have gained significant control over how people find and consume information. Chatbots now promise answers to virtually any question, and they can create images and videos at remarkable speed. These tools are quickly replacing traditional search engines and human experts as primary sources of knowledge.

However, the data that trains these AI models remains a closely guarded secret. Powerful companies compete fiercely for dominance in AI, keeping their training sources under wraps. This secrecy raises serious questions that affect writers and other creators deeply.

Why the Training Data Matters

AI models are trained on enormous amounts of data, often including copyrighted works like books, music, podcasts, and films—usually without the creators' consent. This has led to multiple lawsuits, and the legal status of such use is still unsettled.

Beyond copyright, the training data may contain misinformation, hateful content, explicit material, or instructions for harmful acts. Knowing what data influences AI responses is crucial, especially for those whose work might be included.

What is AI Watchdog?

AI Watchdog is a search tool designed to reveal what materials appear in various AI training data sets and which companies use them. At launch, it includes:

  • More than 7.5 million books
  • 81 million research articles
  • 15 million YouTube videos
  • Dialogue from tens of thousands of movies and TV shows

Most data sets were created by AI companies or research organizations and shared publicly on AI developer forums. The tool will grow as more data sets are verified and added.

Does Appearing in the Tool Mean My Work Was Used?

If your work shows up in the search, it was likely part of at least one training data set. However, that is not absolute proof that a specific company used it; companies may exclude certain works when training their models.

How Do AI Companies Obtain Content?

  • Some companies license content legally and pay for it.
  • Others acquire books from pirated online libraries or via BitTorrent.
  • Many scrape the web broadly themselves or rely on existing web scrapes such as Common Crawl (a quick way to check a page against Common Crawl follows this list).
  • Search engines such as Bing, Brave, and Google also make full-text articles accessible to AI companies.
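
Common Crawl's index is public, so you can get a rough sense of whether a specific page was captured in one of its snapshots. The Python sketch below queries the Common Crawl CDX index API; the crawl label and the example URL are placeholders (current labels are listed at index.commoncrawl.org), so treat the details as assumptions to verify rather than a definitive check.

```python
# Minimal sketch: check whether a URL appears in a Common Crawl snapshot.
# Assumptions: the crawl label "CC-MAIN-2024-33" is only an example; confirm
# current labels at https://index.commoncrawl.org/ before relying on results.
import json
import urllib.error
import urllib.parse
import urllib.request

CRAWL = "CC-MAIN-2024-33"  # example snapshot label, not necessarily current
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"


def commoncrawl_captures(url: str) -> list[dict]:
    """Return capture records for `url` in the chosen crawl, or [] if none."""
    query = urllib.parse.urlencode({"url": url, "output": "json"})
    try:
        with urllib.request.urlopen(f"{INDEX_URL}?{query}", timeout=30) as resp:
            # The index returns one JSON object per line of the response.
            return [json.loads(line) for line in resp.read().splitlines() if line.strip()]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # the index answers 404 when it has no captures
            return []
        raise


if __name__ == "__main__":
    # "example.com/my-article" is a placeholder; substitute a page you published.
    for record in commoncrawl_captures("example.com/my-article"):
        print(record.get("timestamp"), record.get("url"), record.get("status"))
```

A capture only means the page was archived in that snapshot; whether any particular model was trained on it is a separate question, which is exactly the gap AI Watchdog tries to narrow.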

Protecting Your Work from AI Training

If you’re a writer or creator, chances are your work has already been scraped. But tech firms continue to gather new material, so protection is still possible.
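
One concrete, if limited, step for site owners is checking whether your robots.txt currently allows the crawlers that AI companies and data-set builders operate. The Python sketch below uses the standard library's robots.txt parser; the user-agent names listed are common examples and the site URL is a placeholder, so verify current names in each company's documentation, and remember that robots.txt only deters crawlers that choose to honor it.

```python
# Sketch: see whether a site's robots.txt allows some well-known AI-related
# user-agent tokens to fetch a page. The names below are examples and may
# change; robots.txt is advisory, so this is a deterrent check, not a guarantee.
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai"]


def crawler_access(site: str, page: str = "/") -> dict[str, bool]:
    """Map each user-agent to whether robots.txt permits it to fetch `page`."""
    parser = RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()  # downloads and parses the live robots.txt
    return {agent: parser.can_fetch(agent, f"{site.rstrip('/')}{page}")
            for agent in AI_CRAWLERS}


if __name__ == "__main__":
    # "https://example.com" is a placeholder; substitute your own site.
    for agent, allowed in crawler_access("https://example.com").items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```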

For visual work, adding watermarks or logos can deter use in AI training, since companies generally avoid content that clearly identifies individual creators or rights holders. Getty Images, for instance, sued Stability AI after its Stable Diffusion model generated images containing distorted Getty watermarks.

There are also AI-poisoning tools such as Nightshade and Glaze. These alter images in ways that are imperceptible to humans but disruptive to AI training. Similar poisoning methods exist for music and can cause models trained on the altered files to produce less coherent output.

What to Do If You Believe Your Work Was Used Without Permission

Many lawsuits target AI companies for training models on copyrighted materials without authorization. Some are class actions, offering potential compensation if plaintiffs win. Registered copyrights increase the chances of receiving damages.

If you want to learn more about AI and its implications, consider exploring courses focused on AI tools and ethics at Complete AI Training.