Java Virtual Threads Ate My Memory: A Web Crawler's Tale of Speed vs. Memory

原始链接: https://dariobalinzo.medium.com/virtual-threads-ate-my-memory-a-web-crawlers-tale-of-speed-vs-memory-a92fc75085f6

The author explores using Java Virtual Threads to speed up a simple web crawler originally built with platform threads. Switching to Virtual Threads dramatically improved the URL processing rate, but quickly led to an OutOfMemoryError: Virtual Threads removed the I/O bottleneck, only to overwhelm the system because content was downloaded faster than it could be processed. The fix was to manage concurrency explicitly. The author used a semaphore to cap the number of concurrent tasks, throttling the download rate to prevent excessive memory consumption, and noted that real-world workloads usually arrive continuously rather than as one large initial burst. The experience highlights how Virtual Threads change concurrency management: the implicit resource limits of platform threads no longer act as a safeguard. Developers must actively manage resources and rein in unbounded concurrency to avoid surprises such as memory exhaustion. Virtual Threads are powerful, but they demand careful thought about resource management.

The Hacker News discussion centers on a blog post about Java Virtual Threads and their memory consumption in a web crawler, where the author hit an OutOfMemoryError caused by the crawler's unbounded memory use. Commenters stress that Virtual Threads do not eliminate the need for flow control and backpressure: unlike JavaScript's event loop, Java Virtual Threads require developers to manage concurrency limits explicitly. Several suggest using a semaphore to cap concurrency and prevent resource exhaustion, much like a bounded thread pool does. The discussion also touches on whether an OutOfMemoryError is recoverable in Java, with differing opinions on whether catching one is good practice, as well as the importance of designing a proper scheduler for asynchronous architectures and of treating CPU usage as a finite resource. Streaming/buffered processing is suggested as a way to reduce memory usage, and Virtual Threads are noted to improve performance for I/O-bound tasks. The consensus: Virtual Threads are a powerful tool, but proper resource management and an understanding of concurrency remain essential.

Original article
Dario Balinzo
Photo by Julian Hochgesang on Unsplash

I built a simple web crawler using good old platform threads. It was just a multithreaded crawler, nothing fancy. But then, curiosity struck: “What happens if I use Virtual Threads instead?”

Virtual Threads are one of my favorite recent additions to the Java ecosystem. Switching from platform threads to Virtual Threads dramatically improved the URL processing rate… until the whole thing blew up with an OutOfMemoryError.

Yep. Fast became too fast.

I love exploring Java’s new features. Virtual Threads, Records, Pattern Matching, you name it. But every tool has tradeoffs. In software engineering, solving one problem can introduce another. Our job isn’t just writing code; it’s balancing performance, safety, and maintainability.

This post is about my little hacking session, where I tried to unleash the power of Virtual Threads, only to discover it takes a bit of finesse to avoid turning performance into a memory bomb. I’ll walk you through what happened, how I fixed it, and how you can too.

Before playing with Virtual Threads, I created a very basic crawler using traditional platform threads. The list of URLs to process was predetermined and inserted into an executor queue. Each platform thread would take a new task from the queue, which consisted of: getting a URL, fetching its content from a local server (to eliminate bandwidth and external rate limiting), emulating some processing, and then moving on to the next one.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class PlatformThreadsCrawler {

    // the executor service used to fetch and process data
    private final ExecutorService executorService = Executors.newFixedThreadPool(200);
    private final HttpClient httpClient = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();
    private final AtomicLong processedCount = new AtomicLong(0);

    // submit the jobs to the executor queue and then wait for the results
    public void crawlUrls(List<String> urls) {
        long startTime = System.currentTimeMillis();

        // Submit all download tasks
        CompletableFuture<?>[] futures = new CompletableFuture[urls.size()];
        int index = 0;
        for (String url : urls) {
            futures[index++] = CompletableFuture.runAsync(
                    () -> downloadAndProcess(url),
                    executorService
            );
        }

        // Wait for all to complete
        CompletableFuture.allOf(futures).join();

        long endTime = System.currentTimeMillis();
        System.out.println("\n=== Crawl Complete ===");
        System.out.println("Time taken: " + (endTime - startTime) + "ms");
    }

    private void downloadAndProcess(String url) {
        try {
            // download
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .timeout(Duration.ofSeconds(10))
                    .build();

            HttpResponse<byte[]> response = httpClient.send(request,
                    HttpResponse.BodyHandlers.ofByteArray());

            // if OK, process the data
            if (response.statusCode() == 200) {
                byte[] content = response.body();

                // Simulate some processing work
                int result = process(content);
            } else {
                System.err.println("HTTP " + response.statusCode() + " for: " + url);
            }

        } catch (Exception e) {
            System.err.println("Error downloading " + url + ": " + e.getMessage());
        } finally {
            long count = processedCount.incrementAndGet();
            if (count % 100 == 0) {
                System.out.println("Processed " + count + " URLs");
            }
        }
    }

    private static int process(byte[] content) {
        // emulate some processing logic with a simple word count
        return new String(content, StandardCharsets.UTF_8).split("\\s+").length;
    }

}

I submitted a list of 20K URLs, to be processed by 200 platform threads. For testing, I used a local HTTP server with static routes to simulate pages. I generated the URLs by starting from this list and duplicating it. The goal was to emulate small and medium file downloads (like simple HTML pages and small images).

urls.addAll(List.of(
"http://localhost:8080/data/1kb", // 1KB
"http://localhost:8080/data/10kb", // 10KB
"http://localhost:8080/data/100kb", // 100KB
"http://localhost:8080/data/1mb" // 1MB
));

The Java heap max size was 1GB, to emulate a scenario in which memory available was not “infinite”. I emulated the file processing with a simple word count logic. Here you can see some stats captured with VisualVM:

Since a large portion of thread time is spent blocked receiving network data, Virtual Threads should theoretically allow us to handle many more concurrent operations without the overhead of platform threads. Time to make it faster!

I swapped out my

Executors.newFixedThreadPool(200)

with

Executors.newVirtualThreadPerTaskExecutor()

Note that with Virtual Threads there is no direct built-in mechanism to set a global limit on the total number of virtual threads that can be created (virtual threads are designed to be lightweight and numerous).
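The swap can be sketched as follows. This is a minimal, self-contained harness (the `runDemo` method and its no-op task body are hypothetical stand-ins for the crawler's `downloadAndProcess` calls), showing that the structural change is a single executor substitution:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadsSwap {

    // The only structural change from the platform-thread version:
    // Executors.newVirtualThreadPerTaskExecutor() instead of newFixedThreadPool(200).
    static int runDemo(List<String> urls) {
        AtomicInteger done = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String url : urls) {
                executor.submit(() -> {
                    // downloadAndProcess(url) would go here; note there is no
                    // upper bound on how many of these run concurrently
                    done.incrementAndGet();
                });
            }
        } // ExecutorService is AutoCloseable since Java 19: close() waits for tasks
        return done.get();
    }

    public static void main(String[] args) {
        int completed = runDemo(List.of(
                "http://localhost:8080/data/1kb",
                "http://localhost:8080/data/10kb"));
        System.out.println("completed: " + completed);
    }
}
```

Each submitted task gets its own virtual thread, so nothing throttles how many are alive at once.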

I ran the exact same logic. Boom! Pages were being fetched in what felt like milliseconds. I was thrilled!

Until the JVM gave up with an OutOfMemoryError!

Let’s break it down:

  • Virtual Threads removed the I/O bottleneck.
  • As a result, URLs were fetched in parallel at a much higher rate.
  • But processing (e.g., parsing and consuming the response) didn’t get the same speed boost.
  • Without any back-pressure, the program kept stuffing memory with pending results.

It wasn’t just a faster crawler, it became a hyperactive downloader with no brakes!

(Note that I voluntarily limited memory in both crawler versions to easily compare their memory usage. In this particular scenario Virtual Threads required more memory. This might not be a problem in a real world application depending on heap size and on the kind of business logic.)

So how do we fix this without giving up on Virtual Threads?

Limit concurrency using a semaphore

We can introduce a Semaphore to limit the number of concurrent tasks in flight:

public class ControlledVirtualThreadsCrawler {

    private final ExecutorService executorService =
            Executors.newVirtualThreadPerTaskExecutor();
    private final Semaphore concurrencyLimit = new Semaphore(500);

    private void downloadAndProcess(String url) {
        try {
            concurrencyLimit.acquire(); // blocks when 500 tasks are already in flight
            try {
                // ... existing download and process logic
            } finally {
                concurrencyLimit.release();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Before starting the work, each task acquires a permit and releases it after processing completes. If no permits are available, the virtual thread blocks. This keeps concurrency under control, preventing too many URLs from being downloaded and processed at the same time.
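A small, self-contained sketch can make the bounding effect observable. This demo (my own construction, not from the article) tracks how many no-op tasks are in flight at once and confirms the peak never exceeds the permit count:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class SemaphoreDemo {

    // Runs `tasks` jobs on virtual threads, bounded by `permits`, and reports
    // the highest number of tasks observed in flight at the same time.
    static int maxObservedConcurrency(int tasks, int permits) {
        Semaphore limit = new Semaphore(permits);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxInFlight = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    try {
                        limit.acquire(); // blocks once `permits` tasks are running
                        try {
                            int now = inFlight.incrementAndGet();
                            maxInFlight.accumulateAndGet(now, Math::max);
                            Thread.sleep(5); // stand-in for download + processing
                        } finally {
                            inFlight.decrementAndGet();
                            limit.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        return maxInFlight.get();
    }

    public static void main(String[] args) {
        System.out.println("max in flight: " + maxObservedConcurrency(100, 5));
    }
}
```

Thousands of virtual threads may exist, but the semaphore guarantees only `permits` of them are doing real work at any moment; the rest are parked cheaply.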

Avoid submitting too many tasks at the same time

In our test scenario, we submitted all 20,000 URLs at once, an artificial burst that rarely happens in production. In realistic applications, work arrives continuously over time.

Implementing rate limiting or spreading the arrival of scraping requests over time might prevent overwhelming the crawler.
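One simple way to spread out arrivals, sketched below under my own assumptions (the `pacedSubmit` helper is hypothetical, and a fixed sleep between submissions is the crudest possible rate limiter), is to pace the producer instead of throttling the workers:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class PacedSubmission {

    // Submits one task every `intervalMillis` instead of all at once,
    // and returns the total elapsed submission time in milliseconds.
    static long pacedSubmit(List<Runnable> tasks, long intervalMillis)
            throws InterruptedException {
        long start = System.nanoTime();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (Runnable task : tasks) {
                executor.submit(task);
                Thread.sleep(intervalMillis); // crude pacing: fixed gap between submissions
            }
        } // close() waits for the submitted tasks to finish
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical work items standing in for downloadAndProcess(url) calls
        List<Runnable> work = IntStream.range(0, 10)
                .<Runnable>mapToObj(i -> () -> { /* download + process */ })
                .toList();
        long elapsed = pacedSubmit(work, 5); // ~200 submissions per second
        System.out.println("submitted 10 tasks in " + elapsed + "ms");
    }
}
```

In a real system the pacing would more likely come from the work source itself (a queue consumer, a scheduler, or a token-bucket rate limiter) rather than a sleep loop, but the principle is the same: bound the arrival rate, not just the concurrency.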

This experience taught me that Virtual Threads aren't just faster platform threads; they fundamentally change how we think about concurrency limits and resource management. The traditional patterns and assumptions that worked with platform threads may not apply.

Virtual threads are incredibly powerful, but they require us to be more explicit about resource management. The JVM’s natural resource constraints (like thread limits) that previously acted as implicit backpressure mechanisms are no longer there to save us from ourselves.
