Hey all! I'm Jim, and I do system-y things at Bluesky. I'm here to give you some details about what happened on Monday of this week that caused Bluesky to go down intermittently for about half of our users for roughly 8 hours.
First, I'd like to apologize to our users for the interruption in service. This is easily the worst outage we've seen in my time here. It's just not acceptable.
The Problem
The issue actually started earlier that weekend. Here's the Bluesky AppView's requests chart for the days leading up to the really bad day (Monday):
The yellow/green isn't important, but those dips are super nasty! They represent real user-facing downtime. Ouch!
We got a page on Saturday April 4. I took a look, thinking it was likely a transit issue. We have pretty extensive network monitoring, and it all looked clear.
I did, however, notice a spike in log lines like this in our AppView data backend (called the "data plane"):
{
"time": "2026-04-03T22:16:07.944910324Z",
"level": "ERROR",
"msg": "failed to set post cache item",
"uri": "at://did:plc:mhvcx2z27zq2jtb3i7f5beb7/app.bsky.feed.post/3mim4uloar22m",
"error": "dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use"
}

The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.
The Root Cause
It took a long time to find the actual issue due to subpar observability. We generally have excellent monitoring in the data plane, but it does assume that each request to it is small and doesn't do much work.
This particular RPC (GetPostRecord) takes a batch of post URIs and looks them all up in memcached, falling back to Scylla on a cache miss. What I had missed is that we deployed a new internal service last week that sent fewer than three GetPostRecord requests per second, but sometimes with batches of 15-20 thousand URIs at a time. Typically a request would contain somewhere between 1 and 50 post lookups.
Every RPC handler in the data plane does bounded concurrency (i.e. errgroup.SetLimit). However, this endpoint did not! It was the only endpoint in the entire system that was missing it.
That means we'd launch 15-20 thousand goroutines for a single request, slam the daylights out of memcached by dialing a ton of connections, then close them and return them to the OS, since our max idle conn pool size was 1000. The closed connections would build up in the TCP TIME_WAIT state and exhaust all available ephemeral ports. (On Linux, the default ephemeral range is only about 28k ports, and TIME_WAIT holds each one for 60 seconds, so sustaining more than roughly 470 new connections per second to a single destination will eventually run the range dry.)
The Go code looked like this:

func GetPostRecords(uris []string) ([]*Post, error) {
	posts := make([]*Post, len(uris))
	var group errgroup.Group
	// Note: no group.SetLimit call -- concurrency is unbounded!
	for ndx, uri := range uris {
		group.Go(func() error {
			post, err := memcache.GetPost(uri)
			if err != nil {
				return err
			}
			if post != nil {
				posts[ndx] = post
				return nil
			}
			post, err = scylla.GetPost(uri)
			if err != nil {
				return err
			}
			if post != nil {
				posts[ndx] = post
			}
			return nil
		})
	}
	if err := group.Wait(); err != nil {
		return nil, err
	}
	return posts, nil
}

Youch! We saw pretty much right away that we were exhausting ports, but we had no idea what the root cause was. There are lots of places where we use memcached, and I pulled out that particular JSON log line because it identifies the post cache specifically. We also have a user cache, an interaction-counts cache, and many more. We saw error logs from all cache types (which makes sense, since all memcached traffic was impacted), so it was not at all clear right away that GetPostRecord was the problem.
Also note that the new internal service runs in only one of our data centers at present, which is why we only saw that site having issues. That definitely contributed to the confusion since we didn't have metrics per-client in the data plane.
Death Spiral
We didn't find and fix this issue until Wednesday of this week, even though service was stabilized on Monday. So what did we do in the meantime to stop the bleeding?
I spent most of Saturday and Sunday chasing this down without finding the root cause, and service was hobbling along, poorly, but surviving. Then, on Monday, something tripped. It turns out we had put ourselves in a death spiral! That self-reinforcing feedback loop is what caused the massive outages on Monday.
It turns out that whenever we get an error from memcached, we log it. We do a couple million requests a second to our memcached instances, so we were attempting to write millions of log lines per second.
Logging in Go ends in a blocking write(2) syscall. That huge number of blocking syscalls, coupled with our attempt to keep serving millions of requests a second, caused the Go runtime to spawn many more OS threads ("M's" in Go parlance): roughly 10x the healthy baseline (1,500 M's vs. 150).
That larger batch of M's was in turn putting pressure on the garbage collector:
Those massive pauses in stop-the-world GC duration meant requests were stalling.
Couple that with some very aggressively tuned GOGC and GOMEMLIMIT values and memory limits, and our data plane was actually OOM'ing every so often! That's why the service would work for about 30 minutes, go down for a while, come back for a bit, and repeat.
OOMs are obviously bad (we should have zero of them), but they're ordinarily not that big a deal. However, the memcached connection pools were already so saturated that when the data plane restarted, it couldn't establish new memcached connections on the way back up: the old connections were stuck in TIME_WAIT, still holding the ports, which led to even more port exhaustion. Death spiral!
The band-aid fix was insane but did the job. This is how we actually fixed the outage on Monday, before we found the true root cause:
memcachedClient.DialContext = func(ctx context.Context, network, address string) (net.Conn, error) {
	// Bind each new connection to a random loopback source IP so we
	// stop competing for ephemeral ports on 127.0.0.1 alone.
	ip := net.IPv4(127, byte(1+rand.IntN(254)), byte(rand.IntN(256)), byte(1+rand.IntN(254)))
	d := net.Dialer{LocalAddr: &net.TCPAddr{IP: ip}}
	return d.DialContext(ctx, network, address)
}

That got us out of the death spiral because it expands the client IP+port space: a TCP connection is identified by its (source IP, source port, destination IP, destination port) tuple, so spreading the source address across the 127.0.0.0/8 loopback range multiplies the number of usable tuples by orders of magnitude. Crazy, but effective! We removed this once we fixed the true root cause.
Summary
In my recent talk, I mentioned that you should add extensive observability before you have an outage. We do have a lot, but it's never enough! We need to add per-client o11y as well as get better metrics on when clients send small numbers of large requests.
The evidence was all buried in there, but it was hard to know where to look when so much was falling over at once. You need mental discipline, and high granularity in your metrics, to cut through the noise and find the real root cause. It's hard work!
Also, logging too much isn't great. Logging here and there is fine, but I'd rather lean on Prometheus metrics or OpenTelemetry tracing, since they're designed for high-scale systems.
Finally, apologies again for the extensive interruption in service. The team and I take our operations extremely seriously, and this was a really bad day.
EDIT: Also, the status page said this was an issue with a 3rd party provider. It was clearly not, apologies for that miscommunication! At the time I posted that status page update, I was looking at some traceroutes that indicated some pretty substantial packet loss from a cloud provider to our data center, but those were not the root cause of the issue.