Internet Archive Caught in Publishers' Fight Over AI Training Data
The Internet Archive, a nonprofit digital library that has preserved over a trillion web pages, is facing restrictions from major news publishers who fear the organization is being used as a back door for AI developers to access copyrighted content.
The New York Times, Washington Post, USA Today, and Reddit have all moved to block or limit the archive's ability to crawl their sites or distribute their content in recent months. The Guardian took a different approach, working with the archive to restrict AI access while preserving the organization's broader mission.
The publishers' actions stem from concerns that AI developers are using the Internet Archive to train models on copyrighted news stories without permission. By blocking the archive, publishers hope to prevent AI companies from circumventing the access restrictions they've placed on their own websites.
Why Publishers Are Acting Now
News organizations have filed a series of lawsuits against AI companies including OpenAI and Perplexity, alleging copyright infringement and unauthorized use of their content to train competing products. The New York Times said in a statement that the issue is "Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us."
USA Today implemented blocking technology in August to "protect our IP, support fair compensation, and reflect a broader industry shift toward paid access for trusted journalism," according to a company spokeswoman.
The Guardian's director of business affairs said the primary concern is "that AI companies are making unauthorised use of content publicly available to Guardian readers." The organization chose to limit rather than fully block the archive, excluding Guardian stories from the archive's API while allowing access to homepage and topic pages.
The Archive's Position
Mark Graham, director of the Internet Archive's Wayback Machine, said the organization is "collateral damage" in a larger fight between publishers and AI companies.
The archive has implemented safeguards to prevent bulk downloading and limit the rate at which material can be accessed, Graham said. The organization has also blocked or prevented bulk downloads of materials from certain websites like the New York Times in response to publisher concerns.
"This is an ongoing effort," Graham said. "It's not a once-and-done kind of thing."
Graham argues that publishers' rationale for blocking the archive is "unfounded" and that the restrictions threaten the organization's core mission to preserve public access to information. As online publishers shut down or modify their sites, the archive often holds the only remaining copies of past web pages and news stories.
Broader Questions About Web Access
Graham sees the publishers' restrictions as part of a larger trend toward closing off what was once an open web. As more content moves behind paywalls or gets blocked, less quality information remains freely available to the public.
This shift has consequences beyond the archive's operations. Graham said it affects "anyone's ability to access quality journalistic material" and compounds the spread of misinformation when freely available information is limited to low-quality sources.
The Internet Archive itself has faced legal battles over copyright issues. The organization lost a lawsuit and an appeal against book publishers two years ago, and reached a settlement with major record labels last year.
The Underlying Problem for AI Companies
Developers of large language models have long operated on the assumption that more training data improves model capabilities. However, they quickly exhausted freely available, high-quality information sources.
Copyright holders across industries, including news publishers, music publishers, book authors, and photo repositories, have accused AI companies of filling that gap by training on copyrighted material without permission. The publishers' lawsuits represent an attempt to establish legal boundaries around that practice.
For writers and content creators, understanding these dynamics matters. The outcome of these disputes will likely shape how AI tools can access and use your work, and what protections exist for original content in an increasingly AI-driven media landscape. AI for Writers resources can help you understand how these tools work and what rights issues matter to your career.