Anthropic Destroyed Millions of Books to Feed Its AI—And the Courts Approved
Anthropic trained its AI by buying physical books, tearing out their pages, scanning them, and discarding the originals. A judge ruled the practice fair use, sparking ethical debates.

Anthropic’s Controversial Method of Training AI with Physical Books
Anthropic, the Google-backed AI startup behind the Claude models, took a strikingly physical approach to gathering training data. Instead of relying solely on digital copies, the company purchased millions of print books, cut the pages from their bindings, scanned them, and then discarded the originals. The “devouring” of books here is not metaphorical; it is literal.
The practice came to light through a recent copyright ruling that largely favored Anthropic and, more broadly, the tech industry’s appetite for data. US District Judge William Alsup held that training large language models on books Anthropic legally purchased qualifies as fair use, even without explicit permission from authors. The same order, however, found that Anthropic’s separate trove of pirated books was not protected, leaving that part of the case to proceed.
How Anthropic’s Approach Works Legally
A key part of Anthropic’s strategy lies in the first-sale doctrine. This legal principle lets the buyer of a physical book resell, lend, or dispose of that particular copy without the copyright holder’s permission, which is why secondhand book sales are legal. Anthropic leveraged it to acquire books in bulk without negotiating licenses.
Stripping the pages from their bindings made high-volume scanning cheaper and simpler. Because Anthropic destroyed each print copy after scanning and used the digital version only internally, the judge treated the digitization as a space-saving format change, a factor that weighed in favor of finding the process legally acceptable.
Ethical and Practical Issues
Despite the legal win, the method raises ethical questions. Destroying millions of physical books for data extraction can be seen as wasteful and disrespectful to authors and publishers. The practice also highlights a broader problem: AI companies are searching aggressively for high-quality data sources, sometimes at the expense of the original content creators.
Anthropic isn’t alone in its hunger for book data. Others, including Meta, have trained on pirated books, drawing ongoing lawsuits from authors. Meanwhile, archivists and organizations like the Internet Archive and Google Books have long digitized books without destroying them, showing that non-destructive alternatives exist.
What This Means for AI Development
- AI companies are pushing legal boundaries to acquire training data.
- Legal loopholes like the first-sale doctrine can be exploited in unexpected ways.
- There’s tension between data acquisition methods and ethical responsibility.
- Destructive scanning methods highlight a shortsighted approach to sourcing data.
As AI models continue to grow in scale and complexity, the demand for diverse and high-quality training data will only increase. This case serves as a reminder that the industry’s hunger for data sometimes leads to questionable practices that may not be sustainable or respectful of content creators.
For IT and development professionals interested in AI training practices and ethical data handling, understanding these legal and operational challenges is crucial. To explore AI models and training methodologies further, visit Complete AI Training for courses and resources.