Faking a JPEG

原始链接: https://www.ty-penguin.org.uk/~auj/blog/2025/03/25/fake-jpeg/

The author built "Spigot", a web application that generates fake web pages full of gibberish content to trap aggressive web crawlers. Frustrated by crawlers that hide their identity, they noticed a new crawler, "ImageSiftBot", aggressively hunting for images on a site that serves none. To feed it without burning excessive CPU, they devised a way to generate "garbage" JPEGs: use existing JPEGs as templates, keep the structural parts, and fill the pixel data chunks with random bytes. The result isn't perfect (decoders flag errors), but most viewers still display an "image", which is enough to inconvenience crawlers. The method is efficient, producing around 900 images per second, faster than the server's Internet connection. Spigot now includes these garbage JPEGs in about 60% of its pages, keeping crawlers such as ImageSiftBot, Meta's bot, AmazonBot and GPTBot busy and potentially increasing their processing costs. A simple tweak to the pixel data reduces the likelihood of invalid Huffman codes. The code has been released on GitHub.

The Hacker News thread discusses "Faking a JPEG", a project that creates cheap-to-generate, valid-looking JPEGs filled with garbage data to deter web crawlers, particularly those scraping AI training data. The author's aim is to make crawling expensive for the bots while minimising load on their own server. Commenters debate the ethics of serving garbage data: some consider it justified, while others argue for accommodating legitimate crawlers such as archive.org. The discussion touches on robots.txt, "proxyware" masquerading as VPNs, and methods for identifying crawlers. There are suggestions for improving the system, such as generating more believable fake captions to mislead AI, and producing EXIF data that places the images in restricted locations. Others suggest using image formats more efficient than JPEG to overload botnets. Overall, the thread is a mix of technical discussion and ethical considerations around web scraping and data poisoning.

Original article

I've been wittering on about Spigot for a while. It's a small web application which generates a fake hierarchy of web pages, on the fly, using a Markov Chain to make gibberish content for aggressive web crawlers to ingest.

Spigot has been sitting there, doing its thing, for a few months now, serving over a million pages per day. I've not really been keeping track of what it's up to, but every now and then I look at its logs to see what crawlers are hitting it.

Sadly, two of the hardest-hitting crawlers go to extreme lengths to hide their identity, generating random, and unlikely, browser signatures (e.g. Firefox version 134.0, 64 bit, on Windows 98!) and accessing from random addresses. It seems quite likely that this is being done via a botnet - illegally abusing thousands of people's devices. Sigh.

Where I can identify a heavy hitter, I add it to the list on Spigot's front page so I can track crawler behaviour over time.

Anyway... a couple of weeks ago, I noticed a new heavy hitter, "ImageSiftBot". None of Spigot's output contained images, but ImageSiftBot was busily hitting it with thousands of requests per hour, desperately looking for images to ingest. I felt sorry for its thankless quest and started thinking about how I could please it.

My primary aim, for Spigot, is that it should sit there, doing its thing, without eating excessive CPU on my server. Generating images on the fly isn't trivial, in terms of CPU load. If I want to create a bunch of pixels, in a form that a crawler would believe, I pretty much have to supply compressed data. And compressing on the fly is CPU intensive. That's not going to be great for Spigot, and is a complete waste when we're just generating throw-away garbage in any case.

I got to thinking: compression tends to increase the entropy of a bit stream. If a file doesn't look to have random content then it's compressible, and an optimally compressed set of data would be more or less indistinguishable from random data. JPEGs are pretty well compressed. So the compressed data in a JPEG will look random, right?

If I had a template for a JPEG file, which contained the "structured" parts (info on size, colour depth, etc) and tags indicating where highly compressed data goes, I could construct something that looks like a JPEG by just filling out the "compressed" areas with random data. That's a very low-CPU operation. The recipient would see something that looks like a JPEG and would treat the random data as something to decompress.

I read up a bit on the structure of JPEG files and discovered they can be quite complex. But that doesn't matter much. A JPEG file is made up of chunks. Each chunk has a marker and a length (sometimes implicitly zero, sometimes only determined by reading the chunk's content, looking for the next marker). So, parsing a JPEG is relatively simple. And I've got loads of JPEGs. So: what if I scan a bunch of existing files, discarding the "comment" chunks, noting just the length of the "pixel data" chunks and keeping the rest? How big would the result be?

I've currently got 514 JPEGs on my web site, totalling about 150MBytes of data. If I scan all of them, keeping just the "structured" chunks and noting the "pixel" chunk lengths, the resulting data set is under 500KBytes - a drop in the ocean. That gives me 514 realistic templates, of various sizes, colour depths, etc.
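For what it's worth, here's a rough sketch of what such a template scan might look like in Python. This is my own reconstruction from the description above, not the released code (the names Chunk and scan_template are mine), and it assumes well-formed JPEGs:

from dataclasses import dataclass

SOS, COM = 0xDA, 0xFE
STANDALONE = {0x01, *range(0xD0, 0xDA)}    # TEM, RST0-7, SOI, EOI: markers with no length field

@dataclass
class Chunk:
    data: bytes = b""   # verbatim bytes for a structural chunk
    fill: int = 0       # if non-zero: "emit this many random bytes here instead"

def scan_template(path):
    # Walk the marker segments of a real JPEG, keeping structural segments
    # verbatim, dropping comments, and recording only the *length* of each
    # entropy-coded (pixel data) region.
    raw = open(path, "rb").read()
    chunks, i = [], 0
    while i < len(raw) - 1:
        while raw[i + 1] == 0xFF:          # skip optional 0xFF fill bytes before a marker
            i += 1
        marker = raw[i + 1]
        if marker in STANDALONE:           # marker only, no payload
            chunks.append(Chunk(data=raw[i:i + 2]))
            i += 2
            continue
        seg_end = i + 2 + int.from_bytes(raw[i + 2:i + 4], "big")
        if marker != COM:                  # drop comment segments entirely
            chunks.append(Chunk(data=raw[i:seg_end]))
        i = seg_end
        if marker == SOS:                  # entropy-coded pixel data follows the SOS header
            start = i
            while not (raw[i] == 0xFF and raw[i + 1] != 0x00
                       and not 0xD0 <= raw[i + 1] <= 0xD7):
                i += 1                     # skip stuffed 0xFF00 bytes and restart markers
            chunks.append(Chunk(fill=i - start))
    return chunks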

Generating a JPEG would come down to:

template = random.choice(template_list)         # pick one of the pre-scanned templates
for chunk in template.chunks:
    if chunk.type is pixel_data:
        output(random.randbytes(chunk.length))  # pixel data: emit random bytes of the right length
    else:
        output(chunk.data)                      # structural chunks: copy through verbatim

And that's it!

So I knocked together some test code. And discovered it's not quite as simple as that. Real pixel data isn't quite random - it's Huffman coded, and there's a bit of structure there. If I fill out the pixel chunks with purely random data, the decoder notices places where the data is incorrect. I'm sure someone with more brains/time/inclination than me would be able to parse the other chunks in the template to determine exactly what Huffman codes can be inserted into the pixel chunks, still without needing to actually do compression.

But that's where I bowed out. Because... every JPEG viewer I've tried accepts my garbage data and displays an image. Even though the decoder notes issues, it still emits pixel data. And that might just be good enough to inconvenience web crawlers. I bet most of them don't care about errors, so long as they don't result in a broken image. Even if they do care about errors, they still have to grab the data and try to decode it before they can tell it's broken. And that's increasing their costs, which is fine by me.

The image at the top of this page is an example, generated on the fly by the code. Your browser will probably display it, despite it being a faulty JPEG.

Back to efficiency: How quickly can I generate these garbage images? As I said, I'm using templates based on images from my site. I usually optimise images for the web, resulting in JPEGs having a range of sizes, but mostly around 1280x960 pixels and 200-300KBytes. A quick test shows I can generate around 900 such images per second on my web server using this method (in Python). That's around 190MBytes/second and very substantially faster than my web server's connection to the Internet. Nice!

I've wired the above into Spigot and around 60% of Spigot-generated pages will now have a garbage JPEG in them. As with Spigot, I seed the random number generator for an image with a value derived from the URL. So, while the above image was generated on the fly, if you reload it, you'll get the same image.
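A minimal sketch of that seeding trick, assuming a CRC32 of the URL as the seed value (the released code may derive the seed differently):

import random
import zlib

def rng_for_url(url):
    # Derive a deterministic seed from the URL, so reloading the same URL
    # regenerates byte-for-byte the same garbage image.
    return random.Random(zlib.crc32(url.encode("utf-8")))

# e.g. rng = rng_for_url(request_path); rng.randbytes(chunk.fill)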

ImageSiftBot is very happy with this and grabbed around 15,000 garbage images today. I expect it'll ramp its rate up over the next few days as it finds more links. Meta's bot, AmazonBot, and GPTBot are also getting excited!

I need to tidy up the Python class that does this, but will release it in due course. It's under 100 lines of code (but could do with more comments!).

[2025-03-26] Now released on GitHub

[2025-03-28] After thinking a lot about Huffman codes, I've added a bit mask against the generated pixel data. "AND"-ing every generated byte with 0x6D ensures that no strings of three or more 1's appear in the bit stream. This massively reduces the probability (from > 90% to < 4%) of generating a JPEG that has invalid Huffman codes, without requiring a lot more CPU.
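A sketch of that masking step (the function name is mine, not from the released code). 0x6D is 0b01101101, so bits 7, 4 and 1 of every byte are cleared:

import random

PIXEL_MASK = 0x6D   # 0b01101101: bits 7, 4 and 1 are always cleared

def masked_pixel_bytes(length, rng):
    # The surviving bit positions (6, 5, 3, 2, 0) never form a run of three
    # 1s within a byte, and bit 7 is always 0, so a run of 1s can't continue
    # across a byte boundary either.
    return bytes(b & PIXEL_MASK for b in rng.randbytes(length))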

The focus is to make generating the garbage as cheap as possible for me and as expensive as possible for the abusive web crawler. After examining how JPEG uses Huffman codes, it wouldn't be excessively difficult to generate perfectly valid Huffman streams. But it would eat a lot more CPU for very little gain.
