News publishers block Internet Archive over fears AI companies use it to access copyrighted content

Major news publishers including the New York Times, Washington Post, and USA Today are blocking the Internet Archive over fears AI companies use it to access copyrighted content. The archive calls itself "collateral damage" in the fight.

Published on: Apr 14, 2026
Internet Archive Caught in Publishers' Fight Over AI Training Data

The Internet Archive, a nonprofit digital library that has preserved over a trillion web pages, is facing restrictions from major news publishers who fear the organization is being used as a back door for AI developers to access copyrighted content.

The New York Times, Washington Post, USA Today, and Reddit have all moved to block or limit the archive's ability to crawl their sites or distribute their content in recent months. The Guardian took a different approach, working with the archive to restrict AI access while preserving the organization's broader mission.

The publishers' actions stem from concerns that AI developers are using the Internet Archive to train models on copyrighted news stories without permission. By blocking the archive, publishers hope to prevent AI companies from circumventing the access restrictions they've placed on their own websites.

Why Publishers Are Acting Now

News organizations have filed a series of lawsuits against AI companies, including OpenAI and Perplexity, alleging copyright infringement and unauthorized use of their content to train competing products. The New York Times said in a statement that the issue is that "Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us."

USA Today implemented blocking technology in August to "protect our IP, support fair compensation, and reflect a broader industry shift toward paid access for trusted journalism," according to a company spokeswoman.

The Guardian's director of business affairs said the primary concern is "that AI companies are making unauthorised use of content publicly available to Guardian readers." The organization chose to limit rather than fully block the archive, excluding Guardian stories from the archive's API while allowing access to homepage and topic pages.

The Archive's Position

Mark Graham, director of the Internet Archive's Wayback Machine, said the organization is "collateral damage" in a larger fight between publishers and AI companies.

The archive has implemented safeguards to prevent bulk downloading and limit the rate at which material can be accessed, Graham said. In response to publisher concerns, the organization has also blocked bulk downloads of material from certain websites, such as the New York Times.

"This is an ongoing effort," Graham said. "It's not a once-and-done kind of thing."

Graham argues that publishers' rationale for blocking the archive is "unfounded" and that the restrictions threaten the organization's core mission to preserve public access to information. As online publishers shut down or modify their sites, the archive often holds the only remaining copies of past web pages and news stories.

Broader Questions About Web Access

Graham sees the publishers' restrictions as part of a larger trend toward closing off what was once an open web. As more content moves behind paywalls or gets blocked, less quality information remains freely available to the public.

This shift has consequences beyond the archive's operations. Graham said it affects "anyone's ability to access quality journalistic material" and compounds the spread of misinformation when freely available information is limited to low-quality sources.

The Internet Archive itself has faced legal battles over copyright issues. The organization lost a lawsuit and an appeal against book publishers two years ago, and reached a settlement with major record labels last year.

The Underlying Problem for AI Companies

Developers of large language models have long operated on the assumption that more training data improves model capabilities. However, they quickly exhausted freely available, high-quality information sources.

Copyright holders across industries, including news publishers, music publishers, book authors, and photo repositories, have accused AI companies of filling that gap by training on copyrighted material without permission. The publishers' lawsuits represent an attempt to establish legal boundaries around that practice.

For writers and content creators, understanding these dynamics matters. The outcome of these disputes will likely shape how AI tools can access and use your work, and what protections exist for original content in an increasingly AI-driven media landscape.

