Scrapers briefly cause outage at Internet Archive

The logo of California-based Internet Archive. (Photo by Arnold Gatilao via Wikimedia Commons)

The Internet’s biggest digital library was briefly unavailable over the weekend after someone using an Amazon-owned web service tried to scrape thousands of files from the website in a short amount of time.

On Monday, Internet Archive founder Brewster Kahle said the website was down for about an hour after someone using virtual hosts at Amazon Web Services launched “tens of thousands of requests” to download optical character recognition (OCR) files.

OCR is a technology that allows computers to recognize and extract text from digital images. The Internet Archive is one of the biggest repositories of digital files, including PDFs, electronic books and images that contain text.

On Sunday, someone used 64 virtual hosts at Amazon Web Services to request tens of thousands of downloads in a short span of time, impairing the Internet Archive’s ability to serve other users around the world.

“Even by web standards, tens of thousands of requests per second is a lot,” Kahle wrote.

The archive returned to normal after blocking dozens of IP addresses linked to the activity, but the person or group behind the initial batch of download requests repeated it just a few hours later. The second attempt caused another hour-long outage, Kahle said.

“We are thankful to our engineers who could scramble on a Sunday afternoon on a holiday weekend to work on this,” Kahle wrote.

The Internet Archive did not name those responsible, but a tweet from the organization originally said the scraping requests were linked to an “AI company harvesting Internet Archive texts at an extreme rate.” In a follow-up message, the organization said the culprit “may not have been an AI company, maybe just an eager user.”

The Internet Archive said it supports individuals and groups who want to access and preserve its content, but said users should “start slowly and ramp up.” The organization asked those who want to start large projects to contact it directly, and provided an e-mail address (info@archive.org) for that purpose.

“If you find yourself blocked, please don’t just start again, reach out,” Kahle pleaded. “Please use the Internet Archive, but don’t bring us down in the process.”
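For readers who run bulk downloads, the advice to “start slowly and ramp up” and to stop when blocked can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything published by the Internet Archive: the `RampUpPacer` class, the `polite_fetch_all` function, the rate values and the choice of HTTP status codes treated as a block are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Status codes treated as "blocked or throttled" (an assumption for this sketch).
BLOCK_STATUSES = {403, 429, 503}


@dataclass
class RampUpPacer:
    """Hypothetical pacer: starts slow and raises the request rate gradually."""
    start_rate: float = 0.5   # requests per second at the start (assumed value)
    max_rate: float = 4.0     # ceiling on requests per second (assumed value)
    growth: float = 1.25      # multiplier applied after each clean batch
    batch_size: int = 20      # successful requests needed before speeding up

    def __post_init__(self) -> None:
        self.rate = self.start_rate
        self._ok_in_batch = 0

    def delay(self) -> float:
        """Seconds to wait before the next request."""
        return 1.0 / self.rate

    def record_success(self) -> None:
        """Count a successful request; speed up after a full clean batch."""
        self._ok_in_batch += 1
        if self._ok_in_batch >= self.batch_size:
            self.rate = min(self.max_rate, self.rate * self.growth)
            self._ok_in_batch = 0


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], int],
    pacer: RampUpPacer,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch URLs one at a time and stop at the first sign of a block.

    `fetch` is any callable that returns an HTTP status code for a URL.
    Returns the URLs that were fetched successfully.
    """
    done: List[str] = []
    for url in urls:
        sleep(pacer.delay())
        status = fetch(url)
        if status in BLOCK_STATUSES:
            # Do not retry or restart: stop and contact the site operator.
            break
        if status == 200:
            pacer.record_success()
            done.append(url)
    return done
```

The design choice here is that a block ends the run instead of triggering a retry, which mirrors Kahle’s request to reach out before starting again. In practice `fetch` would wrap a real HTTP client, and the rate ceiling would be whatever the site operator agrees to.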

In addition to its repository of files, the Internet Archive is perhaps best known for hosting the Wayback Machine, which has preserved static web pages since the mid-1990s. The archive is based in the Richmond District of San Francisco.


TheDesk.net offers the latest news, analysis and commentary on the business of streaming media, broadcast TV and radio, advertising, measurement, journalism, tech and policy.