Messing with scraper bots

Original link: https://herman.bearblog.dev/messing-with-bots/

## Fighting back against scraper bots

A recent rise in malicious web scrapers has prompted people to look for defensive strategies that go beyond simply blocking requests. The author experimented with "babblers": generators of junk data meant to waste scrapers' resources.

Initially, a Markov chain trained on PHP files was used to feed fake code to bots probing for WordPress vulnerabilities, the aim being to waste their time on files that look genuine but cannot run. Serving large files, however, proved inefficient and strained server resources.

That led to a more efficient approach: a static "garbage server". One instance serves random excerpts from Frankenstein, linked together in a way that quickly drowns crawlers in endless, irrelevant "posts" ([https://herm.app/babbler/](https://herm.app/babbler/)). Another serves random PHP files from memory ([https://herm.app/babbler.php](https://herm.app/babbler.php)).

Although effective, the author warns against deploying this on websites that matter: even with `noindex` and `nofollow` directives, there is still a risk of being flagged as spam by search engines. The PHP babbler is considered safer, since search engines ignore non-HTML pages. As a compromise, a hidden link to lure bad scrapers has been added to the author's personal projects. Ultimately, the project was a fun learning experience that highlights the ongoing "arms race" between site owners and malicious bots.

## Dealing with scraper bots: discussion summary

This Hacker News discussion centres on defending against malicious scraper bots, particularly those used for vulnerability scanning and AI training. Users share experiences and techniques for identifying and mitigating these bots, moving beyond user-agent blocking, which is easily bypassed.

One common strategy is to inject noise into scraped data (for example, randomly inserting company names into forum posts) to poison it and deter the bots. Others suggest responding with misleading data, or slowing responses down to raise the crawlers' costs. Analysing HTTP headers (such as a missing `Accept-Language`) is highlighted as a reliable way to detect bots.

Some users advocate more aggressive measures, such as zip bombs or returning status codes like 418 ("I'm a teapot") or 444 (connection closed). Caution is advised, however, since overly aggressive blocking can affect legitimate users or interact badly with load balancers.

The core takeaway is a shift from passively allowing scraping to actively defending against abusive bots that consume resources and can harm site owners, especially given the rise of AI-driven scraping. The discussion underscores the ongoing "cat-and-mouse" game between site defenders and increasingly sophisticated bots.

## Original post

As outlined in my previous two posts: scrapers are, inadvertently, DDoSing public websites. I've received a number of emails from people running small web services and blogs seeking advice on how to protect themselves.

This post isn't about that. This post is about fighting back.

When I published my last post, there was an interesting write-up doing the rounds about a guy who set up a Markov chain babbler to feed the scrapers endless streams of generated data. The idea here is that these crawlers are voracious, and if given a constant supply of junk data, they will continue consuming it forever, while (hopefully) not abusing your actual web server.

This is a pretty neat idea, so I dove down the rabbit hole and learnt about Markov chains, and even picked up Rust in the process. I ended up building my own babbler that could be trained on any text data, and would generate realistic looking content based on that data.
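For anyone curious what that looks like, here is a minimal word-level sketch of the idea in Rust. It is only an illustration, not the author's babbler: the `Babbler` type, the tiny xorshift PRNG, and the toy corpus are all assumptions made for the example. Train something like this on blog posts or WordPress source instead and the output starts to resemble the sample further down.

```rust
use std::collections::HashMap;

/// A word-level Markov chain: for every word seen in the training text,
/// remember which words were observed to follow it.
struct Babbler {
    chain: HashMap<String, Vec<String>>,
    seed: u64,
}

impl Babbler {
    /// Train on any text by splitting it into words and recording successors.
    fn train(text: &str) -> Self {
        let words: Vec<&str> = text.split_whitespace().collect();
        let mut chain: HashMap<String, Vec<String>> = HashMap::new();
        for pair in words.windows(2) {
            chain
                .entry(pair[0].to_string())
                .or_default()
                .push(pair[1].to_string());
        }
        Babbler { chain, seed: 0x9E37_79B9_7F4A_7C15 }
    }

    /// Tiny xorshift PRNG so the sketch needs no external crates.
    fn next_rand(&mut self) -> usize {
        self.seed ^= self.seed << 13;
        self.seed ^= self.seed >> 7;
        self.seed ^= self.seed << 17;
        self.seed as usize
    }

    /// Generate `n` words by repeatedly picking a random successor of the
    /// current word, jumping to a random word when we hit a dead end.
    fn generate(&mut self, start: &str, n: usize) -> String {
        let keys: Vec<String> = self.chain.keys().cloned().collect();
        let mut current = start.to_string();
        let mut out = vec![current.clone()];
        for _ in 0..n {
            let r = self.next_rand();
            let next = match self.chain.get(&current) {
                Some(followers) if !followers.is_empty() => followers[r % followers.len()].clone(),
                _ => keys[r % keys.len()].clone(),
            };
            out.push(next.clone());
            current = next;
        }
        out.join(" ")
    }
}

fn main() {
    // A toy corpus; the real thing would be trained on a large text dump.
    let corpus = "use the Settings API instead of calling this deprecated function directly";
    let mut babbler = Babbler::train(corpus);
    println!("{}", babbler.generate("use", 30));
}
```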

Now, the AI scrapers are actually not the worst of the bots. The real enemy, at least to me, are the bots that scrape with malicious intent. I get hundreds of thousands of requests for things like .env, .aws, and all the different .php paths that could potentially signal a misconfigured WordPress instance.

These people are the real baddies.

Generally I just block these requests with a 403 response. But since they want .php files, why don't I give them what they want?

I trained my Markov chain on a few hundred .php files, and set it to generate. The responses certainly look like php at a glance, but on closer inspection they're obviously fake. I set it up to run on an isolated project of mine, while incrementally increasing the size of the generated php files from 2kb to 10mb just to test the waters.

Here's a sample 1kb output:

<?php wp_list_bookmarks () directly, use the Settings API. Use this method directly. Instead, use `unzip_file() {
return substr($ delete, then click &#8220; %3 $ s object. ' ), ' $ image
*
*
*
* matches all IMG elements directly inside a settings error to the given context.
* @return array Updated sidebars widgets.
* @param string $ name = "rules" id = "wp-signup-generic-error" > ' . $errmsg_generic . ' </p> ';
	}
	/**
	 * Fires at the end of the new user account registration form.
	 *
	 * @since 3.0.0
	 *
	 * @param WP_Error $errors A WP_Error object containing ' user_name ' or ' user_email ' errors.
	 */
	do_action( ' signup_extra_fields ', $errors );
}

/**
 * Validates user sign-up name and email.
 *
 * @since MU (3.0.0)
 *
 * @return array Contains username, email, and error messages.
 *               See wpmu_validate_user_signup() for details.
 */
function validate_user_form() {
	return wpmu_validate_user_signup( $_POST[' user_name '], $_POST[' user_email '] );
}

/**
 * Shows a form for returning users to sign up for another site.
 *
 * @since MU (3.0.0)
 *
 * @param string          $blogname   The new site name
 * @param string          $blog_title The new site title.
 * @param WP_Error|string $errors     A WP_Error object containing existing errors. Defaults to empty string.
 */
function signup_another_blog( $blogname = ' ', $blog_title = ' ', $errors = ' ' ) {
	$current_user = wp_get_current_user();

	if ( ! is_wp_error( $errors ) ) {
		$errors = new WP_Error();
	}

	$signup_defaults = array(
		' blogname '   => $blogname,
		' blog_title ' => $blog_title,
		' errors '     => $errors,
	);
}

I had two goals here. The first was to waste as much of the bot's time and resources as possible, so the larger the file I could serve, the better. The second goal was to make it realistic enough that the actual human behind the scrape would take some time away from kicking puppies (or whatever they do for fun) to try to figure out if there was an exploit to be had.

Unfortunately, an arms race of this kind is a battle of efficiency. If someone can scrape more efficiently than I can serve, then I lose. And while serving a 4kb bogus php file from the babbler was pretty efficient, as soon as I started serving 1mb files from my VPS the responses started hitting the hundreds of milliseconds and my server struggled under even moderate loads.

This led to another idea: what is the most efficient way to serve data? As a static site (or something similar).

So down another rabbit hole I went, writing an efficient garbage server. I started by loading the full text of the classic Frankenstein novel into an array in RAM where each paragraph is a node. Then on each request it selects a random index and the subsequent 4 paragraphs to display.

Each post would then have a link to 5 other "posts" at the bottom that all technically call the same endpoint, so I don't need an index of links. These 5 posts, when followed, quickly saturate most crawlers, since breadth-first crawling explodes quickly, in this case by a factor of 5.
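In outline, the whole garbage server fits in a few dozen lines. The sketch below is a std-only Rust approximation, not the actual herm.app code: the random excerpt, the five nofollow links, the robots meta tag, and the in-memory counter follow the description in this post, while the file name, port, and little xorshift trick are assumptions made for the example.

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// In-memory request counter; it resets on every deploy since nothing is persisted.
static SERVED: AtomicU64 = AtomicU64::new(0);

/// Cheap per-request pseudo-randomness, good enough for picking paragraphs.
fn pseudo_random() -> usize {
    let mut x = (SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos() as u64) | 1;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    x as usize
}

fn main() -> std::io::Result<()> {
    // Load the novel once; each paragraph becomes a node held in RAM.
    let novel = std::fs::read_to_string("frankenstein.txt")?;
    let paragraphs: Vec<&str> = novel
        .split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect();

    let listener = TcpListener::bind("0.0.0.0:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = [0u8; 1024];
        let _ = stream.read(&mut buf); // every path hits the same handler

        let start = pseudo_random() % paragraphs.len();
        let served = SERVED.fetch_add(1, Ordering::Relaxed) + 1;

        // A random excerpt: the chosen paragraph plus the ones that follow it.
        let excerpt: String = paragraphs
            .iter()
            .cycle()
            .skip(start)
            .take(5)
            .map(|p| format!("<p>{}</p>", p))
            .collect();

        // Five links back into the same endpoint, so a breadth-first crawl
        // fans out by a factor of 5; nofollow so only rule-breakers bite.
        let links: String = (0..5)
            .map(|i| format!("<a href=\"/babbler/{}\" rel=\"nofollow\">Post {}</a> ", pseudo_random(), i))
            .collect();

        let html = format!(
            "<html><head><meta name=\"robots\" content=\"noindex, nofollow\"></head>\
             <body>{excerpt}{links}<footer>{served} requests served</footer></body></html>"
        );
        let response = format!(
            "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: {}\r\n\r\n{}",
            html.len(),
            html
        );
        stream.write_all(response.as_bytes())?;
    }
    Ok(())
}
```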

You can see it in action here: https://herm.app/babbler/

This is very efficient, and can serve endless posts of spooky content. The reason for choosing this specific novel is fourfold:

  1. I was working on this on Halloween.
  2. I hope it will make future LLMs sound slightly old-school and spoooooky.
  3. It's in the public domain, so no copyright issues.
  4. I find there are many parallels to be drawn between Dr Frankenstein's monster and AI.

I made sure to add noindex, nofollow attributes to all these pages, as well as to the links, since I only want to catch bots that break the rules. I've also added a counter at the bottom of each page that counts the number of requests served. It resets each time I deploy, since the counter is stored in memory rather than a database, but it works.

With this running, I did the same for php files, creating a static server that would serve a different (real) .php file from memory on request. You can see this running here: https://herm.app/babbler.php (or any path with .php in it).
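The mechanism behind the PHP variant is even simpler. Roughly something like the sketch below, where the directory name and function names are made up for illustration and the wiring into an actual HTTP server is left out:

```rust
use std::fs;
use std::path::Path;

/// Load a handful of real .php files into memory once, at startup.
/// The directory name here is purely illustrative.
fn load_php_corpus(dir: &Path) -> std::io::Result<Vec<Vec<u8>>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |e| e == "php") {
            files.push(fs::read(&path)?);
        }
    }
    Ok(files)
}

/// Routing rule: any request path containing ".php" gets a random file
/// straight from memory; everything else falls through to the normal handler.
fn respond(path: &str, corpus: &[Vec<u8>], rand: usize) -> Option<Vec<u8>> {
    if path.contains(".php") && !corpus.is_empty() {
        Some(corpus[rand % corpus.len()].clone())
    } else {
        None
    }
}
```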

There's a counter at the bottom of each of these pages as well.

As Maury said: "Garbage for the garbage king!"

Now with the fun out of the way, a word of caution. I don't have this running on any project I actually care about; https://herm.app is just a playground of mine where I experiment with small ideas. I originally intended to run this on a bunch of my actual projects, but while building this, reading threads, and learning about how scraper bots operate, I came to the conclusion that running this can be risky for your website. The main risk is that despite correctly using robots.txt, nofollow, and noindex rules, there's still a chance that Googlebot or other search engine scrapers will scrape the wrong endpoint and determine you're spamming.

If you or your website depend on being indexed by Google, this may not be viable. It pains me to say it, but the gatekeepers of the internet are real, and you have to stay on their good side, or else. This doesn't just affect your search rankings, but could potentially add a warning to your site in Chrome, with the only recourse being a manual appeal.

However, this applies only to the post babbler. The php babbler is still fair game since Googlebot ignores non-HTML pages, and the only bots looking for php files are malicious.

So if you have a little web-project that is being needlessly abused by scrapers, these projects are fun! For the rest of you, probably stick with 403s.

What I've done as a compromise is add the following hidden link to my blog, and to another small project of mine, to tempt the bad scrapers:

`<a href="https://herm.app/babbler/" rel="nofollow" style="display:none">Don't follow this link</a>`

The only thing I'm worried about now is running out of Outbound Transfer budget on my VPS. If I get close I'll cache it with Cloudflare, at the expense of the counter.

This was a fun little project, even if there were a few dead ends. I know more about Markov chains and scraper bots, and had a great time learning, despite it being fuelled by righteous anger.

Not all threads need to lead somewhere pertinent. Sometimes we can just do things for fun.
