Web-scraping AI bots cause disruption for scientific databases and journals

原始链接: https://www.nature.com/articles/d41586-025-01661-4

Scholarly websites, particularly those hosting academic resources, are facing a surge in bot traffic, driven largely by the demand for data to train generative AI tools. This "wild west" of data scraping is overwhelming servers, causing service outages and financial strain, and could even drive smaller ventures out of existence. Open-access repositories, although they welcome reuse of their content, are also experiencing aggressive bot behaviour that leads to service disruptions. The rise of AI bots accelerated once it became clear that powerful language models could be built with far fewer resources, setting off a race to scrape training data. Publishers such as BMJ and hosting services such as Highwire Press are struggling with a flood of "bad bot" traffic that often exceeds legitimate user activity and degrades service delivery. Organizations are working to block the bots, but limited resources make this a difficult task.

A Hacker News thread discusses the disruption that AI web-scraping bots are causing for scientific databases and journals. Users debate how to prevent what amounts to denial-of-service traffic from these bots, which, unlike traditional search-engine crawlers, often ignore `robots.txt` and overwhelm servers with excessive requests. Suggestions include publishing data as bulk dumps, though some raise concerns about repeated downloads and CDN bandwidth costs, and others note that bots scrape entire sites even when downloads are readily available. Technical measures such as proof-of-work (PoW) challenges and more efficient server stacks (Go, Rust and the like) are proposed, but dismissed by some as increasing energy consumption or amounting to victim blaming. Other users argue for properly optimized websites with caching, while counterarguments point out that systematic crawling evicts useful entries from caches. The underlying problem is framed as one of resource constraints, spanning CPU, memory, database and network capacity, rather than a purely technical one. A few users see AI itself as part of the solution, by making people more productive.
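One of the mitigations discussed in the thread is per-client rate limiting. As a rough illustration only (not something described in the article or the thread), a minimal sketch in Go using the standard `golang.org/x/time/rate` package might look like the following; the handler, the limits (2 requests/second, burst of 10) and the use of the raw remote address as the client key are all illustrative assumptions.

```go
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

// limiterFor returns a per-client token-bucket limiter, creating one on first use.
// Limits are illustrative; a real deployment would also evict idle entries and
// derive the client key more carefully (e.g. strip the port, honour proxy headers).
func limiterFor(client string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[client]
	if !ok {
		l = rate.NewLimiter(2, 10) // ~2 requests/second, burst of 10
		limiters[client] = l
	}
	return l
}

// rateLimit wraps a handler and rejects clients that exceed their request budget.
func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiterFor(r.RemoteAddr).Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", rateLimit(mux))
}
```

As the thread notes, per-IP limits help less when scrapers rotate through large pools of anonymized addresses, which is why heavier-handed ideas such as proof-of-work challenges and bulk data dumps also come up.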

Original article
Photo of a laptop with an error message on the screen in an empty server room.

Some websites have been overwhelmed by the sheer volume of bot traffic. Credit: Marco VDM/Getty

In February, the online image repository DiscoverLife, which contains nearly three million photographs of different species, started to receive millions of hits to its website every day — a much higher volume than normal. At times, this spike in traffic was so high that it slowed the site down to the point that it became unusable. The culprit? Bots.

These automated programs, which attempt to ‘scrape’ large amounts of content from websites, are increasingly becoming a headache for scholarly publishers and researchers who run sites hosting journal papers, databases and other resources.

Much of the bot traffic comes from anonymized IP addresses, and the sudden increase has led many website owners to suspect that these web-scrapers are gathering data to train generative artificial intelligence (AI) tools such as chatbots and image generators.

“It’s the wild west at the moment,” says Andrew Pitts, the chief executive of PSI, a company based in Oxford, UK, that provides a global repository of validated IP addresses for the scholarly communications community. “The biggest issue is the sheer volume of requests” to access a website, “which is causing strain on their systems. It costs money and causes disruption to genuine users.”

Those that run affected sites are working on ways to block the bots and reduce the disruption they cause. But this is no easy task, especially for organizations with limited resources. “These smaller ventures could go extinct if these sorts of issues are not dealt with,” says Michael Orr, a zoologist at the Stuttgart State Museum of Natural History in Germany.

A flood of bots

Internet bots have been around for decades, and some have been useful. For example, Google and other search engines have bots that scan millions of web pages to identify and retrieve content. But the rise of generative AI has led to a deluge of bots, including many ‘bad’ ones that scrape without permission.

This year, the BMJ, a publisher of medical journals based in London, has seen bot traffic to its websites surpass that of real users. The aggressive behaviour of these bots overloaded the publisher’s servers and led to interruptions in services for legitimate customers, says Ian Mulvany, BMJ’s chief technology officer.

Other publishers report similar issues. “We’ve seen a huge increase in what we call ‘bad bot’ traffic,” says Jes Kainth, a service delivery director based in Brighton, UK, at Highwire Press, an Internet hosting service that specializes in scholarly publications. “It’s a big problem.”

The Confederation of Open Access Repositories (COAR) reported in April that more than 90% of 66 members it surveyed had experienced AI bots scraping content from their sites — of which roughly two-thirds had experienced service disruptions as a result. “Repositories are open access, so in a sense, we welcome the reuse of the contents,” says Kathleen Shearer, COAR’s executive director. “But some of these bots are super aggressive, and it’s leading to service outages and significant operational problems.”

Training data

One factor driving the rise in AI bots was a revelation that came with the release of DeepSeek, a Chinese-built large language model (LLM). Before that, most LLMs required a huge amount of computational power to create, explains Rohit Prajapati, a development and operations manager at Highwire Press. But the developers behind DeepSeek showed that an LLM that rivals popular generative-AI tools could be made with many fewer resources, kick-starting an explosion of bots seeking to scrape the data needed to train this type of model.
