
A couple of years back I wrote a post describing the thundering herd problem. Looking back now, I had only a cursory understanding of it. I knew what the problem was & how it could impact an application. I knew of a possible solution that could be used to address it. But these were individual pieces of information. There was nothing connecting them to give me an end-to-end view, primarily because I had seen neither the problem nor the solution in action. I knew the fundamental building blocks, but that was it.
So in this post I will take another stab at this problem area, but this time I will couple it with an actual implementation. We will reproduce the thundering herd problem & then take a look at a couple of solutions that can be used to fix it. Let's dive in.
Understanding the thundering herd problem
A typical solution that pops up whenever an application experiences increased query load on a database is to place a cache between the web server & the database. The intention is that we first look up the record in the cache; if there is an existing entry, we avoid querying the database & return it. If there is a cache miss, we look up the record in the database & backfill the cache so that it can serve future requests. This works because the cache ends up serving the hot keys which are frequently queried, keeping the traffic on the database low. Sweet & simple.
The issue begins when we have a large number of concurrent requests for the same record. All requests query the cache at the same time & all of them end up with a cache miss. As a result, all of them query the same record from the database, increasing the overall load on it. This means that if your application starts experiencing lookup patterns that comprise multiple such hot keys, your original plan of using a cache to decrease load on the database no longer works.

Reproducing the thundering herd problem
Let's take a look at an application which demonstrates this problem. We have a simple backend application built using Spring Boot that uses a Postgres database to store records & Redis to cache the values. Our application follows the cache-aside pattern described above. I have also added Zipkin tracing to verify the thundering herd problem. Let's take a look at the service code that results in the thundering herd problem (note that I have cleaned up the code to remove tracing instrumentation; you can view the complete code for this demo on this GitHub link).
public Product getProductById(UUID id) throws ProductNotFoundException {
    String cacheKey = PRODUCT_CACHE_KEY_PREFIX + id;
    // Check if product is in cache
    Product productFromCache = redisTemplate.opsForValue().get(cacheKey);
    // If product is in cache, return it
    if (productFromCache != null) {
        return productFromCache;
    }
    // If the product is not in cache, fetch it from DB
    Product product = productRepository.findById(id)
            .orElseThrow(() -> new ProductNotFoundException(id));
    // Backfill the cache
    redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
    return product;
}
The above code will result in the thundering herd problem if we get concurrent requests for the same key. In order to test it, I have used a small Golang program to replicate the concurrent-request scenario (a sketch of an equivalent driver is included after the trace list below). Let's view a demo for this:
In the above demo, you can clearly see that for every request querying the same product id, we see 3 separate traces:
- Lookup from cache
- Lookup from database
- Backfill cache
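As mentioned above, the driver used in the demo is a small Go program. Purely as an illustration (& not the actual driver from the demo), an equivalent concurrent-request driver in Java could look like the sketch below; the endpoint URL, port & product id are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThunderingHerdDriver {
    public static void main(String[] args) throws InterruptedException {
        int concurrency = 50;
        // Placeholder endpoint & product id
        String url = "http://localhost:8080/products/11111111-1111-1111-1111-111111111111";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        CountDownLatch startSignal = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        for (int i = 0; i < concurrency; i++) {
            pool.submit(() -> {
                // Hold every thread until all requests are ready, then fire them together
                startSignal.await();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(response.statusCode());
                return null;
            });
        }
        startSignal.countDown();
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}
The CountDownLatch releases all requests at effectively the same instant, which is what makes them miss the cache together & hit the database as a herd.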
Fixing the thundering herd problem
Now that we have seen the thundering herd problem in action, let's go through how we can solve it. Our end goal is to prevent all requests from going to the database in case of a cache miss. At the same time, we do want exactly one of the requests to look up the record from the database so that we can serve the response to the users. I have approached the solution in 2 different ways.
Distributed lock
Given that we are using Redis, we can create a distributed lock for each cache key. Only the request that is able to acquire the lock goes to the database to fetch the record, while the other requests retry the cache lookup at regular intervals. Once the request that acquired the lock retrieves the record from the database, it backfills the cache, which the other requests can then read in one of their retry attempts.
A distributed lock on Redis is suitable if you are working across multiple nodes & you want a single database call for a key lookup even when the concurrent requests are spread across those nodes. However, this requires an additional network call to Redis to acquire the lock & comes with the typical network-related challenges. Here is the code for the implementation; you can view the complete code on this GitHub link.
public Product getProductById(UUID id) throws ProductNotFoundException {
    String cacheKey = PRODUCT_CACHE_KEY_PREFIX + id;
    // Check if product is in cache
    Product productFromCache = redisTemplate.opsForValue().get(cacheKey);
    // If product is in cache, return it
    if (productFromCache != null) {
        return productFromCache;
    }
    // If the product is not in the cache, then acquire a lock over
    // the cache key
    String lockKey = cacheKey + ":lock";
    String lockValue = UUID.randomUUID().toString();
    Duration lockTtl = Duration.ofSeconds(10);
    Boolean lockAcquired = stringRedisTemplate.opsForValue()
            .setIfAbsent(lockKey, lockValue, lockTtl);
    if (Boolean.TRUE.equals(lockAcquired)) {
        // Double-check the cache to avoid a race condition where another
        // thread acquired the lock and backfilled the cache between the
        // moment the current thread checked the cache and the moment it
        // acquired the lock
        Product doubleCacheLookup = redisTemplate.opsForValue().get(cacheKey);
        if (doubleCacheLookup != null) {
            return doubleCacheLookup;
        }
        try {
            // Look up the product from the database
            Product product = productRepository.findById(id)
                    .orElseThrow(() -> new ProductNotFoundException(id));
            // Backfill the cache
            redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
            return product;
        } finally {
            releaseLock(lockKey, lockValue);
        }
    } else {
        // If the lock was not acquired, wait for the cache to be populated and
        // then return the product from the cache
        return waitAndRetryFromCache(cacheKey, id);
    }
}
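The two helpers referenced above, releaseLock & waitAndRetryFromCache, live in the linked repository; the repository may implement them differently, but a minimal sketch could look like the following. The lock is released with a compare-and-delete Lua script so that a request whose lock has already expired cannot delete a lock now held by someone else, & the waiting requests simply poll the cache until the backfill shows up (the retry count, poll interval & database fallback below are illustrative choices, not taken from the repository).
// Release the lock only if this request still owns it (compare-and-delete).
// A plain DEL could remove a lock that expired and was re-acquired by another request.
private static final DefaultRedisScript<Long> RELEASE_LOCK_SCRIPT = new DefaultRedisScript<>(
        "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end",
        Long.class);

private void releaseLock(String lockKey, String lockValue) {
    stringRedisTemplate.execute(RELEASE_LOCK_SCRIPT, List.of(lockKey), lockValue);
}

// Poll the cache at a fixed interval, waiting for the lock holder to backfill it
private Product waitAndRetryFromCache(String cacheKey, UUID id) throws ProductNotFoundException {
    int maxRetries = 10;
    long retryIntervalMillis = 100;
    for (int attempt = 0; attempt < maxRetries; attempt++) {
        Product product = redisTemplate.opsForValue().get(cacheKey);
        if (product != null) {
            return product;
        }
        try {
            Thread.sleep(retryIntervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }
    // One possible fallback: query the database directly if the cache was never
    // populated in time (for example if the lock holder crashed before backfilling)
    return productRepository.findById(id)
            .orElseThrow(() -> new ProductNotFoundException(id));
}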
Let's take a look at the demo with the above distributed lock & see how it solves the thundering herd problem. To reiterate, the expected outcome is that only one of the requests ends up querying the database while the other requests fetch the value from the cache.
In the above demo, we can clearly see that the request that was able to acquire the lock ends up with traces for:
- Cache lookup
- Double checking cache to avoid race condition
- Database lookup
- Cache backfill
Meanwhile, we do not see any database lookup traces for any of the other requests. This way we make sure that not all requests end up querying the database & eventually overloading it.
In-process synchronization
We can also use in-process synchronization instead of a Redis lock to solve the thundering herd problem. This way we avoid both the network call to acquire the lock & the one to double-check the cache for the race condition.
In Java we can achieve this by using a CompletableFuture along with a ConcurrentHashMap. When a request encounters a cache miss, it either creates a new CompletableFuture or picks up an existing one for the specific key. Using ConcurrentHashMap we can ensure that only one request is able to create the future while the other requests end up picking the existing instance of the future. We also don't need to double-check the cache as everything is now in-process & our map implementation provides the synchronization guarantee.
public Product getProductById(UUID id) throws ProductNotFoundException {
    String cacheKey = PRODUCT_CACHE_KEY_PREFIX + id;
    // Check if product is in cache
    Product productFromCache = redisTemplate.opsForValue().get(cacheKey);
    // If product is in cache, return it
    if (productFromCache != null) {
        return productFromCache;
    }
    // If not found in the cache, perform a database lookup and backfill the cache.
    // computeIfAbsent guarantees that only one future is created per product id;
    // concurrent requests for the same id share the existing future.
    CompletableFuture<Product> future = ongoingRequests.computeIfAbsent(id,
            productId -> CompletableFuture.supplyAsync(() -> {
                try {
                    // Look up the product from the database
                    Product product = productRepository.findById(id)
                            .orElseThrow(() -> new ProductNotFoundException(id));
                    // Backfill the cache
                    redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
                    return product;
                } finally {
                    // Remove the entry so a later cache miss triggers a fresh lookup
                    ongoingRequests.remove(productId);
                }
            }));
    try {
        return future.get();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting for the product lookup", e);
    } catch (ExecutionException e) {
        // Unwrap the original exception thrown inside the future
        if (e.getCause() instanceof ProductNotFoundException) {
            throw (ProductNotFoundException) e.getCause();
        }
        throw new IllegalStateException("Product lookup failed", e);
    }
}
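The ongoingRequests map used above is simply a field on the service, keyed by the product id. A minimal declaration consistent with the snippet above looks like this:
// One in-flight database lookup per product id; the entry is removed once the
// lookup completes so a later cache miss starts a fresh lookup
private final ConcurrentHashMap<UUID, CompletableFuture<Product>> ongoingRequests =
        new ConcurrentHashMap<>();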
One drawback of in-process synchronization is that we miss out on coordination across nodes. So if the concurrent requests are fanned out across multiple nodes, we will still end up with multiple database lookups, one per node. You can choose the approach that best fits your use case. All the code for this implementation can be found on this GitHub link.
Caching is not a magic bullet for scalability & it comes with its own set of challenges. Missing out on such edge cases can impact your application negatively. Hope you got something useful out of this post. Till next time, happy learning.