Aggressive bots ruined my weekend

Original link: https://herman.bearblog.dev/agressive-bots/

## Bear Blog Outage and the Bot Problem - Summary

On 25 October 2025, Bear Blog suffered a major outage: a reverse proxy failure took custom domains offline. The root cause was not server capacity but a surge in bot traffic, a growing problem spanning AI scrapers, malicious actors, and unchecked automations. These bots are sweeping across the internet, driven by the value of data for training large language models and by scraping tools that are ever easier to deploy.

Existing bot mitigations (WAF rules, rate limiting) absorbed the initial wave of tens of thousands of requests per minute, but the reverse proxy was overwhelmed. Critically, the author's primary monitoring system *failed* to send an alert, which prolonged the downtime.

To prevent a recurrence, several steps have been taken: redundant monitoring with phone/email/SMS alerts, a larger reverse proxy (5x capacity), more aggressive bot mitigation, and an automatic restart mechanism. A public status page ([https://status.bearblog.dev](https://status.bearblog.dev)) has also been launched for better transparency.

The author stresses how hostile the bot-dominated internet has become, and the importance of protecting the online spaces worth keeping.

## Hacker News Discussion: Malicious Bots and Web Scrapers

A recent Hacker News discussion highlighted the growing web-scraper problem and its impact on small sites. The original poster, shaunpud, described a weekend outage caused by relentless bots. The conversation surfaced a troubling trend: scrapers increasingly route through residential proxies, often via apps that offer a "free" service while quietly selling their users' bandwidth, which makes blocking difficult. Several commenters confirmed the practice, naming services such as Bright Data (Luminati), which even sell "protection" to the very sites they target.

Discussed countermeasures included stricter rate limiting, honeypots (for example, zip bombs served to abusive IPs), and leaning on CDNs. Many acknowledged the limits of these measures, especially as bots get better at evading detection and holding attackers legally accountable becomes harder. A core sentiment emerged: the internet is becoming increasingly hostile, and maintaining small, independent web services is becoming unsustainable. Some argued for defending these spaces; others felt the challenge is too great and centralization is inevitable. The discussion also touched on the ethics of scraping and the need for better standards for data access.

## Original Article

On the 25th of October Bear had its first major outage. Specifically, the reverse proxy which handles custom domains went down, causing custom domains to time out.

Unfortunately my monitoring tool failed to notify me, and it being a Saturday, I didn't notice the outage for longer than is reasonable. I apologise to everyone who was affected by it.

First, I want to dissect the root cause, exactly what went wrong, and then provide the steps I've taken to mitigate this in the future.

I wrote about The Great Scrape at the beginning of this year. The vast majority of web traffic is now bots, and it is becoming increasingly hostile to keep resources publicly available on the internet.

There are 3 major kinds of bots currently flooding the internet: AI scrapers, malicious scrapers, and unchecked automations/scrapers.

The first has been discussed at length. Data is worth something now that it is used as fodder to train LLMs, and there is a financial incentive to scrape, so scrape they will. They've depleted all human-created writing on the internet, and are becoming increasingly ravenous for new wells of content. I've seen this compared to the search for low-background-radiation steel, which is, itself, very interesting.

These scrapers, however, are the easiest to deal with since they tend to identify themselves as ChatGPT, Anthropic, XAI, et cetera. They also tend to specify whether they are from user-initiated searches (think all the sites that get scraped when you make a request with ChatGPT), or data mining (data used to train models). On Bear Blog I allow the first kinds, but block the second, since bloggers want discoverability, but usually don't want their writing used to train the next big model.
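
For illustration, here is a minimal sketch of what that kind of user-agent split can look like. The crawler names below are commonly published ones (GPTBot and ClaudeBot for training, ChatGPT-User and Claude-User for user-initiated fetches), not an exhaustive or authoritative list, and this is not Bear's actual code; always check each vendor's current documentation.

```python
# Illustrative only: classify AI crawlers by User-Agent and block the
# training/data-mining ones while allowing user-initiated fetches.
TRAINING_BOTS = {"GPTBot", "ClaudeBot", "CCBot"}          # model-training crawlers
USER_INITIATED_BOTS = {"ChatGPT-User", "Claude-User"}      # fetches triggered by a user's query

def classify_ai_crawler(user_agent: str) -> str:
    """Return 'block', 'allow', or 'unknown' for a given User-Agent string."""
    ua = user_agent.lower()
    if any(name.lower() in ua for name in TRAINING_BOTS):
        return "block"
    if any(name.lower() in ua for name in USER_INITIATED_BOTS):
        return "allow"
    return "unknown"

if __name__ == "__main__":
    print(classify_ai_crawler("Mozilla/5.0; compatible; GPTBot/1.1"))  # block
    print(classify_ai_crawler("ChatGPT-User/1.0"))                     # allow
```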

The next two kinds of scraper are more insidious. The malicious scrapers are bots that systematically scrape and re-scrape websites, sometimes every few minutes, looking for vulnerabilities such as misconfigured WordPress instances, or .env and .aws files, among other things, accidentally left lying around.

It's more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I've blocked close to 2 million malicious requests across several hundred blogs.
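
As a rough sketch of what blocking these probes can look like: requests for secrets or admin endpoints that a site never serves are a strong signal of a vulnerability scanner. The path list and function shape below are illustrative assumptions, not Bear's actual rules.

```python
import re

# Paths commonly probed by vulnerability scanners; extend to taste.
PROBE_PATTERNS = re.compile(
    r"(\.env|\.aws/credentials|wp-login\.php|wp-admin|xmlrpc\.php|\.git/config)",
    re.IGNORECASE,
)

def looks_like_vulnerability_scan(path: str) -> bool:
    """True if the requested path matches a common secret/exploit probe."""
    return bool(PROBE_PATTERNS.search(path))

assert looks_like_vulnerability_scan("/blog/.env")
assert not looks_like_vulnerability_scan("/posts/aggressive-bots/")
```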

What's wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I'm still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.

Now, on to the unchecked automations. Vibe coding has made web-scraping easier than ever. Any script-kiddie can easily build a functional scraper in a single prompt and have it run all day from their home computer, and if the dramatic rise in scraping is anything to go by, many do. Tens of thousands of new scrapers have cropped up over the past few months, accidentally DDoSing website after website in their wake. The average consumer-grade computer is significantly more powerful than a VPS, so these machines can easily cause a lot of damage without their operators noticing.

I've managed to keep all these scrapers at bay using a combination of web application firewall (WAF) rules and rate limiting provided by Cloudflare, as well as some custom code which finds and quarantines bad bots based on their activity.
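
A minimal sketch of the activity-based quarantining idea, assuming an in-memory sliding window per client IP; the threshold, window, and ban duration are made-up numbers for illustration, and the real setup leans on Cloudflare's WAF and rate limiting rather than this exact code.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60            # sliding window length
MAX_REQUESTS_PER_WINDOW = 300  # anything above this looks like a bot
QUARANTINE_SECONDS = 3600      # how long an offender stays blocked

requests_by_ip: dict[str, deque] = defaultdict(deque)
quarantined_until: dict[str, float] = {}

def should_block(ip: str) -> bool:
    """Record one request from `ip` and report whether it should be blocked."""
    now = time.time()
    if quarantined_until.get(ip, 0) > now:
        return True
    window = requests_by_ip[ip]
    window.append(now)
    # Drop timestamps that have fallen out of the window.
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS_PER_WINDOW:
        quarantined_until[ip] = now + QUARANTINE_SECONDS
        return True
    return False
```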

I've played around with serving Zip Bombs, which was quite satisfying, but I stopped for fear of accidentally bombing a legitimate user. Another thing I've played around with is Proof of Work validation, making it expensive for bots to scrape, as well as serving endless junk data to keep the bots busy. Both of these are interesting, but ultimately are just as effective as simply blocking those requests, without the increased complexity.
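
For the curious, a proof-of-work challenge boils down to something like the sketch below: the server hands out a random nonce and a difficulty, and only serves the page once the client finds a counter whose hash has enough leading zero bits. The parameters are illustrative, not what Bear ever shipped.

```python
import hashlib
import os

def make_challenge() -> tuple[str, int]:
    """Return a (nonce, difficulty-in-leading-zero-bits) pair for the client."""
    return os.urandom(16).hex(), 16

def verify(nonce: str, difficulty: int, counter: int) -> bool:
    """Check that sha256(nonce:counter) has at least `difficulty` leading zero bits."""
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).digest()
    leading_zero_bits = len(digest) * 8 - int.from_bytes(digest, "big").bit_length()
    return leading_zero_bits >= difficulty

def solve(nonce: str, difficulty: int) -> int:
    """Brute-force the counter; this is the cost the scraper has to pay."""
    counter = 0
    while not verify(nonce, difficulty, counter):
        counter += 1
    return counter
```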

With that context, here's exactly what went wrong on Saturday.

Previously, the bottleneck for page requests was the web-server itself, since it does the heavy lifting. It automatically scales horizontally by up to a factor of 10, if necessary, but bot requests can scale by significantly more than that, so having strong bot detection and mitigation, as well as serving highly-requested endpoints via a CDN is necessary. This is a solved problem, as outlined in my Great Scrape post, but worth restating.
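
The CDN part amounts to marking hot, mostly-static endpoints as cacheable so the CDN absorbs repeat requests instead of the origin. The header values and path list below are assumptions for illustration, not the actual configuration.

```python
# Endpoints that are requested constantly but rarely change.
CACHEABLE_PREFIXES = ("/feed/", "/sitemap.xml", "/robots.txt")

def cache_headers_for(path: str) -> dict[str, str]:
    """Return response headers that let a shared cache (CDN) serve hot endpoints."""
    if path.startswith(CACHEABLE_PREFIXES):
        # s-maxage applies to shared caches like a CDN; max-age to browsers.
        return {"Cache-Control": "public, max-age=60, s-maxage=600"}
    return {"Cache-Control": "private, no-store"}
```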

On Saturday morning a few hundred blogs were DDoSed, with tens of thousands of pages requested per minute (from the logs it's hard to say whether they were malicious, or just very aggressive scrapers). The above-mentioned mitigations worked as expected; however, the reverse proxy, which sits upstream of most of these mitigations, became saturated with requests and decided it needed to take a little nap.

[Graph: page requests per minute]

The big blue spike is what toppled the server. It's so big it makes the rest of the graph look flat.

This server had been running with zero downtime for 5 years up until this point.

Unfortunately my uptime monitor failed to alert me via the push notifications I'd set up, even though it's the only app I have that not only has notifications enabled (see my post on notifications), but even has critical alerts enabled, so it'll wake me up in the middle of the night if necessary. I still have no idea why this alert didn't come through, and I have ruled out misconfiguration through various tests.

This brings me to how I will prevent this from happening in the future.

  1. Redundancy in monitoring. I now have a second monitoring service running alongside my uptime monitor which will give me a phone call, email, and text message in the event of any downtime.
  2. More aggressive rate-limiting and bot mitigation on the reverse proxy. This already reduces the server load by about half.
  3. I've bumped up the size of the reverse proxy, which can now handle about 5 times the load. This is overkill, but compute is cheap, and certainly worth the stress-mitigation. I'm already bald. I don't need to go balder.
  4. Auto-restart the reverse proxy if bandwidth usage drops to zero for more than 2 minutes (a rough watchdog sketch follows this list).
  5. Added a status page, available at https://status.bearblog.dev for better visibility and transparency. Hopefully those bars stay solid green forever.
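
Here is a rough watchdog sketch for item 4, assuming a Linux host, a systemd-managed service named "reverse-proxy", and an "eth0" interface; all of those names are placeholders rather than the real setup.

```python
import subprocess
import time

CHECK_INTERVAL = 30       # seconds between samples
ZERO_THRESHOLD = 2 * 60   # restart after 2 minutes with no outbound traffic

def outbound_bytes(interface: str = "eth0") -> int:
    """Total bytes transmitted on `interface`, read from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                return int(line.split()[9])
    return 0

def watchdog() -> None:
    last_nonzero = time.time()
    previous = outbound_bytes()
    while True:
        time.sleep(CHECK_INTERVAL)
        current = outbound_bytes()
        if current > previous:
            last_nonzero = time.time()
        previous = current
        if time.time() - last_nonzero > ZERO_THRESHOLD:
            # Placeholder service name; restart whatever fronts the blogs.
            subprocess.run(["systemctl", "restart", "reverse-proxy"], check=False)
            last_nonzero = time.time()
```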

This should be enough to keep everything healthy. If you have any suggestions, or need help with your own bot issues, send me an email.

The public internet is mostly bots, many of whom are bad netizens. It's the most hostile it's ever been, and it is because of this that I feel it's more important than ever to take good care of the spaces that make the internet worth visiting.

The arms race continues...
