News outlets are limiting the Internet Archive’s access to their journalism

TL;DR

Over 340 U.S. local news sites are blocking the Internet Archive’s web crawlers due to concerns over data scraping for AI training. This limits long-term access to journalism archives, impacting researchers and journalists.

More than 340 local news websites across the United States are now blocking the Internet Archive’s web crawlers, according to recent analysis, amid growing concerns from publishers about data scraping for AI training.

Since January 2026, a growing number of news sites have restricted the Internet Archive’s bots from accessing their content. Most are local outlets owned by major publishers such as USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. The trend follows reports by Nieman Lab that large news organizations, including The New York Times and The Guardian, began blocking the archive out of fear that AI companies might scrape its collections for training data. Although no case of AI firms scraping content this way has been confirmed, the number of sites blocking the Internet Archive has continued to rise and now exceeds 340.

Researchers, historians, and journalists rely heavily on the Internet Archive’s Wayback Machine to access historical news content, especially from local media outlets that often lack comprehensive digital archives of their own. Blocking the archive’s crawlers threatens the preservation of primary sources essential for understanding past events and for ongoing journalism. The Internet Archive has responded by highlighting its efforts to prevent abuse, including limiting bulk downloads and monitoring bot activity, but publishers remain concerned about intellectual property rights and potential misuse.

Why It Matters

The blocks threaten the preservation of journalistic history and the accessibility of local news archives, which are used for fact-checking, historical research, and ongoing reporting. The restrictions could create gaps in the digital record, making it harder to trace media coverage over time and potentially weakening the transparency and accountability that archives support.

Furthermore, the move reflects broader tensions between content creators and digital archiving efforts, especially as AI training practices come under scrutiny. The restrictions could influence future policies on web archiving and digital preservation, with long-term implications for the availability of primary source material.

Background

In January 2026, Nieman Lab reported that many news organizations had begun blocking the Internet Archive’s bots, citing concerns over data scraping and intellectual property. The initial focus was on larger national outlets, but recent analysis shows that a majority of the sites now blocking the archive are local news outlets, many owned by the largest media companies in the U.S. These sites use their robots.txt files to disallow bots associated with the Internet Archive, such as Heritrix and Archive-It.
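For readers who want to see what such a block looks like in practice, the sketch below checks a robots.txt file against a list of crawler names using Python’s standard `urllib.robotparser`. It is a minimal illustration, not the method used in the analysis cited above. The sample file, the `blocked_agents` helper, the example URL, and the user-agent tokens `archive.org_bot` and `ia_archiver` are assumptions made for the example; the tokens a given publisher actually lists may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for illustration only.
# The user-agent tokens are assumptions, not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /admin/
"""


def blocked_agents(robots_txt, agents, url="https://example.com/news/story.html"):
    """Return the agents from `agents` that robots_txt forbids from fetching `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    candidates = ["archive.org_bot", "ia_archiver", "Googlebot"]
    print(blocked_agents(SAMPLE_ROBOTS_TXT, candidates))
```

With the sample file, the two archive-related agents are reported as blocked for the story URL, while a crawler covered only by the wildcard rule is not. Note that robots.txt is a voluntary convention: it records a publisher’s stated preference and only has an effect when a crawler chooses to honor it.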

The Internet Archive has maintained that its crawlers are used for research and scholarship, and has implemented measures to prevent misuse. However, the ongoing blocking indicates a growing mistrust between publishers and digital archiving efforts, driven largely by concerns over AI training data and intellectual property rights.

“Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term.”

— Edward McCain, journalism librarian at the University of Missouri

“We are in conversation with many publishers and appreciate the opportunity to address their concerns.”

— Mark Graham, director of the Wayback Machine

“This is the same fight that everybody has been having with the Internet Archive since its inception. AI companies are the catalyst for the latest skirmish.”

— Meredith Broussard, data journalist and NYU professor

What Remains Unclear

It remains unclear how many news organizations will continue to block the Internet Archive or whether any will reverse their restrictions. The extent to which AI companies have scraped content from these sites is also unconfirmed, as no public evidence has emerged. The potential impact on long-term digital preservation is still being evaluated, and negotiations between publishers and the Internet Archive are ongoing.

What’s Next

Next steps include continued dialogue between the Internet Archive and news publishers to address concerns, possible policy adjustments, and monitoring of the impact on digital archives. Further analysis is expected to determine whether additional sites will block access and how this will affect research and historical preservation efforts.

Key Questions

Why are news outlets blocking the Internet Archive?

Many outlets cite concerns over data scraping, intellectual property rights, and the potential use of their content for AI training without permission.

Does this mean AI companies have scraped news content through the Internet Archive?

No. There is no confirmed evidence that AI companies have scraped content this way; the blocking is primarily a precaution or a response to perceived risks.

How does this affect researchers and journalists?

Blocking limits access to historical news content, making it harder for researchers, journalists, and historians to consult primary sources for their work.

Could these restrictions be reversed?

Possibly. Ongoing discussions between the Internet Archive and publishers could lead to policy changes or exceptions, but no commitments have been announced.

What is the broader significance of this trend?

This reflects ongoing tensions between digital preservation efforts and content owners’ rights, with implications for the future of open access and historical record-keeping.

Source: Hacker News