
A couple of years back I wrote a post describing the thundering herd problem. Looking back now, I had only a cursory understanding of it. I knew what the problem was & how it could impact an application. I knew of a possible solution that could be used to address it. But these were individual pieces of information. There was nothing connecting them to give me an end-to-end view, primarily because I had seen neither the problem nor the solution in action. I knew the fundamental building blocks, but that was it.
So in this post I will take another stab at this problem area, but this time I will couple it with an actual implementation. We will reproduce the thundering herd problem & then take a look at a couple of solutions that can be used to fix it. Let's dive in.
Understanding the thundering herd problem
A typical solution that pops up whenever an application experiences increased query load on a database is to place a cache between the web server & the database. The intention is that we first look up the record in the cache; if there is an existing entry, we avoid querying the database & return it. If there is a cache miss, we look up the record in the database & backfill the cache so that it can serve future requests. This works because the cache ends up serving the hot keys which are frequently queried, keeping the traffic on the database low. Sweet & simple.
The issue begins when we have a large number of concurrent requests for the same record. All requests query the cache at the same time & all of them end up with a cache miss. As a result, all of them query the same record from the database, increasing the overall load on it. This means that if your application starts experiencing lookup patterns that comprise multiple such hot keys, your original plan of using a cache to decrease load on the database no longer works.

Reproducing the thundering herd problem
Let's take a look at an application which demonstrates this problem. We have a simple backend application built using Spring Boot that uses a Postgres database to store records & Redis to cache the values. Our application follows the cache-aside pattern described above. I have also added Zipkin tracing to verify the thundering herd problem. Let's take a look at the service code that results in the thundering herd problem (note that I have cleaned up the code to remove tracing instrumentation; you can view the complete code for this demo on this GitHub link).
public Product getProductById(UUID id) throws ProductNotFoundException {
    String cacheKey = PRODUCT_CACHE_KEY_PREFIX + id;
    // Check if product is in cache
    Product productFromCache = redisTemplate.opsForValue().get(cacheKey);
    // If product is in cache, return it
    if (productFromCache != null) {
        return productFromCache;
    }
    // If the product is not in cache, fetch it from DB
    Product product = productRepository.findById(id)
            .orElseThrow(() -> new ProductNotFoundException(id));
    // Backfill the cache
    redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
    return product;
}
The above code will result in the thundering herd problem if we get concurrent requests for the same key. In order to test it, I have used a small Golang program to replicate the concurrent-request scenario (a sketch of an equivalent driver is included after the trace list below). Let's view a demo for this:
In the above demo, you can clearly see that for every request querying the same product id, we see 3 separate traces:
- Lookup from cache
- Lookup from database
- Backfill cache
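As mentioned above, the driver used in the demo is a small Go program. Purely as an illustration (& not the actual driver from the demo), an equivalent concurrent-request driver in Java could look like the sketch below; the endpoint URL, port & product id are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThunderingHerdDriver {
    public static void main(String[] args) throws InterruptedException {
        int concurrency = 50;
        // Placeholder endpoint & product id
        String url = "http://localhost:8080/products/11111111-1111-1111-1111-111111111111";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        CountDownLatch startSignal = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (int i = 0; i < concurrency; i++) {
            pool.submit(() -> {
                // Hold every thread until all requests are ready, then fire them together
                startSignal.await();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(response.statusCode());
                return null;
            });
        }
        startSignal.countDown();
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}
The CountDownLatch releases all requests at effectively the same instant, which is what makes them miss the cache together & hit the database as a herd.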
Fixing the thundering herd problem
Now that we have seen the thundering herd problem in action, let's go through how we can solve it. Our end goal is to prevent all requests from going to the database in case of a cache miss. At the same time, we do want exactly one of the requests to look up the record from the database so that we can serve the response to the users. I have approached the solution in 2 different ways.
Distributed lock
Given that we are using Redis, we can create a distributed lock for each cache key. Only the request that is able to acquire the lock goes to the database to fetch the record, while the other requests retry the cache lookup at regular intervals. Once the request that acquired the lock retrieves the record from the database, it backfills the cache, which the other requests can then read in one of their retry attempts.
A distributed lock on Redis is suitable if you are working across multiple nodes & you want a single database call for a key lookup even when the concurrent requests are spread across those nodes. However, this requires an additional network call to Redis to acquire the lock & comes with the typical network-related challenges. Here is the code for the implementation; you can view the complete code on this GitHub link.
public Product getProductById(UUID id) throws ProductNotFoundException {
    String cacheKey = PRODUCT_CACHE_KEY_PREFIX + id;
    // Check if product is in cache
    Product productFromCache = redisTemplate.opsForValue().get(cacheKey);
    // If product is in cache, return it
    if (productFromCache != null) {
        return productFromCache;
    }
    // If the product is not in the cache, then acquire a lock over
    // the cache key
    String lockKey = cacheKey + ":lock";
    String lockValue = UUID.randomUUID().toString();
    Duration lockTtl = Duration.ofSeconds(10);
    Boolean lockAcquired = stringRedisTemplate.opsForValue()
            .setIfAbsent(lockKey, lockValue, lockTtl);
    if (Boolean.TRUE.equals(lockAcquired)) {
        // Double-check the cache to avoid a race condition where another
        // thread acquired the lock and backfilled the cache between the
        // moment the current thread checked the cache and the moment it
        // acquired the lock
        Product doubleCacheLookup = redisTemplate.opsForValue().get(cacheKey);
        if (doubleCacheLookup != null) {
            return doubleCacheLookup;
        }
        try {
            // Look up the product from the database
            Product product = productRepository.findById(id)
                    .orElseThrow(() -> new ProductNotFoundException(id));
            // Backfill the cache
            redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
            return product;
        } finally {
            releaseLock(lockKey, lockValue);
        }
    } else {
        // If the lock was not acquired, wait for the cache to be populated and
        // then return the product from the cache
        return waitAndRetryFromCache(cacheKey, id);
    }
}
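The two helpers referenced above, releaseLock & waitAndRetryFromCache, live in the linked repository; the repository may implement them differently, but a minimal sketch could look like the following. The lock is released with a compare-and-delete Lua script so that a request whose lock has already expired cannot delete a lock now held by someone else, & the waiting requests simply poll the cache until the backfill shows up (the retry count, poll interval & database fallback below are illustrative choices, not taken from the repository).
// Release the lock only if this request still owns it (compare-and-delete).
// A plain DEL could remove a lock that expired and was re-acquired by another request.
private static final DefaultRedisScript<Long> RELEASE_LOCK_SCRIPT = new DefaultRedisScript<>(
        "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end",
        Long.class);

private void releaseLock(String lockKey, String lockValue) {
    stringRedisTemplate.execute(RELEASE_LOCK_SCRIPT, List.of(lockKey), lockValue);
}

// Poll the cache at a fixed interval, waiting for the lock holder to backfill it
private Product waitAndRetryFromCache(String cacheKey, UUID id) throws ProductNotFoundException {
    int maxRetries = 10;
    long retryIntervalMillis = 100;
    for (int attempt = 0; attempt < maxRetries; attempt++) {
        Product product = redisTemplate.opsForValue().get(cacheKey);
        if (product != null) {
            return product;
        }
        try {
            Thread.sleep(retryIntervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }
    // One possible fallback: query the database directly if the cache was never
    // populated in time (for example if the lock holder crashed before backfilling)
    return productRepository.findById(id)
            .orElseThrow(() -> new ProductNotFoundException(id));
}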
Let's take a look at the demo with the above distributed lock & see how it solves the thundering herd problem. To reiterate, the expected outcome is that only one of the requests ends up querying the database while the other requests fetch the value from the cache.
In the above demo, we can clearly see that the request that was able to acquire the lock ends up with traces for:
- Cache lookup
- Double checking cache to avoid race condition
- Database lookup
- Cache backfill
Meanwhile, we do not see any database lookup traces for any of the other requests. This way we make sure that not all requests end up querying the database & eventually overloading it.
In-process synchronization
We can also use in-process synchronization instead of a Redis lock to solve the thundering herd problem. This way we avoid both the network call to acquire the lock & the one to double-check the cache for the race condition.
In Java we can achieve this by using a CompletableFuture along with a ConcurrentHashMap. When a request encounters a cache miss, it either creates a new CompletableFuture or picks up an existing one for the specific key. Using ConcurrentHashMap we can ensure that only one request is able to create the future while the other requests end up picking the existing instance of the future. We also don't need to double-check the cache as everything is now in-process & our map implementation provides the synchronization guarantee.
public Product getProductById(UUID id) throws ProductNotFoundException {
    String cacheKey = PRODUCT_CACHE_KEY_PREFIX + id;
    // Check if product is in cache
    Product productFromCache = redisTemplate.opsForValue().get(cacheKey);
    // If product is in cache, return it
    if (productFromCache != null) {
        return productFromCache;
    }
    // If not found in the cache, perform a database lookup and backfill the cache.
    // computeIfAbsent guarantees that only one future is created per product id;
    // concurrent requests for the same id share the existing future.
    CompletableFuture<Product> future = ongoingRequests.computeIfAbsent(id,
            productId -> CompletableFuture.supplyAsync(() -> {
                try {
                    // Look up the product from the database
                    Product product = productRepository.findById(id)
                            .orElseThrow(() -> new ProductNotFoundException(id));
                    // Backfill the cache
                    redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
                    return product;
                } finally {
                    // Remove the entry so a later cache miss triggers a fresh lookup
                    ongoingRequests.remove(productId);
                }
            }));
    try {
        return future.get();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for the product lookup", e);
    } catch (ExecutionException e) {
        // Unwrap the original exception thrown inside the future
        if (e.getCause() instanceof ProductNotFoundException) {
            throw (ProductNotFoundException) e.getCause();
        }
        throw new IllegalStateException("Product lookup failed", e);
    }
}
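The ongoingRequests map used above is simply a field on the service, keyed by the product id. A minimal declaration consistent with the snippet above looks like this:
// One in-flight database lookup per product id; the entry is removed once the
// lookup completes so a later cache miss starts a fresh lookup
private final ConcurrentHashMap<UUID, CompletableFuture<Product>> ongoingRequests =
        new ConcurrentHashMap<>();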
One drawback of in-process synchronization is that we miss out on coordination across nodes. So if the concurrent requests are fanned out across multiple nodes, we will still end up with multiple database lookups, one per node. You can choose the approach that best fits your use case. All the code for this implementation can be found on this GitHub link.
Caching is not a magic bullet for scalability & it comes with its own set of challenges. Missing out on such edge cases can impact your application negatively. Hope you got something useful out of this post. Till next time, happy learning.