Hey all! I'm Jim, and I do system-y things at Bluesky. I'm here to give you some details about what happened on Monday of this week that caused Bluesky to go down intermittently for about half of our users for roughly 8 hours.
First, I'd like to apologize to our users for the interruption in service. This is easily the worst outage we've seen in my time here. It's just not acceptable.
The Problem
The issue actually started earlier that weekend. Here's the Bluesky AppView's requests chart for the days leading up to the really bad day (Monday):
The yellow/green isn't important, but those dips are super nasty! They represent real user-facing downtime. Ouch!
We got a page on Saturday April 4. I took a look, thinking it was likely a transit issue. We have pretty extensive network monitoring, and it all looked clear.
I did, however, notice a spike in log lines like this in our AppView data backend (called the "data plane"):
{
"time": "2026-04-03T22:16:07.944910324Z",
"level": "ERROR",
"msg": "failed to set post cache item",
"uri": "at://did:plc:mhvcx2z27zq2jtb3i7f5beb7/app.bsky.feed.post/3mim4uloar22m",
"error": "dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use"
}

The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.
The Root Cause
It took a long time to find the actual issue due to subpar observability. We generally have excellent monitoring in the data plane, but it does assume that each request to it is small and doesn't do much work.
This particular RPC (GetPostRecord) takes a batch of post URIs and looks them all up in memcached, falling back to Scylla on a cache miss. What I had missed is that we deployed a new internal service last week that sent fewer than three GetPostRecord requests per second, but sometimes with batches of 15-20 thousand URIs at a time. Typically a request would contain somewhere between 1 and 50 post lookups.
Every RPC handler in the data plane does bounded concurrency (i.e. errgroup.SetLimit). However, this endpoint did not! It was the only endpoint in the entire system that was missing it.
That means we'd launch 15-20 thousand goroutines for a single request, slam the daylights out of memcached by dialing a ton of connections, then close them and return them to the OS, since our max idle conn pool size was 1000. The closed connections would build up in the TCP TIME_WAIT state and exhaust all available ephemeral ports. (On Linux, the default ephemeral range is only about 28k ports, and TIME_WAIT holds each one for 60 seconds, so sustaining more than roughly 470 new connections per second to a single destination will eventually run the range dry.)
The Go code looked like this:

func GetPostRecords(uris []string) ([]*Post, error) {
	posts := make([]*Post, len(uris))
	var group errgroup.Group
	// Note: no group.SetLimit call -- concurrency is unbounded!
	for ndx, uri := range uris {
		group.Go(func() error {
			post, err := memcache.GetPost(uri)
			if err != nil {
				return err
			}
			if post != nil {
				posts[ndx] = post
				return nil
			}
			post, err = scylla.GetPost(uri)
			if err != nil {
				return err
			}
			if post != nil {
				posts[ndx] = post
			}
			return nil
		})
	}
	if err := group.Wait(); err != nil {
		return nil, err
	}
	return posts, nil
}

Youch! We saw pretty much right away that we were exhausting ports, but we had no idea what the root cause was. There are lots of places where we use memcached, and I pulled out that particular JSON log line because it identifies the post cache specifically. We also have a user cache, an interaction-counts cache, and many more. We saw error logs from all cache types (which makes sense, since all memcached traffic was impacted), so it was not at all clear right away that GetPostRecord was the problem.
Also note that the new internal service runs in only one of our data centers at present, which is why we only saw that site having issues. That definitely contributed to the confusion since we didn't have metrics per-client in the data plane.
Death Spiral
We didn't find and fix this issue until Wednesday of this week, even though service was stabilized on Monday. So what did we do in the meantime to stop the bleeding?
I spent most of Saturday and Sunday chasing this down without finding the root cause, and service was hobbling along, poorly, but surviving. Then, on Monday, something tripped. It turns out we had put ourselves in a death spiral! That self-reinforcing feedback loop is what caused the massive outages on Monday.
It turns out that whenever we get an error from memcached, we log it. We do a couple million requests a second to our memcached instances, so we were attempting to write millions of log lines per second.
Logging in Go ends in a blocking write(2) syscall. That huge number of blocking syscalls, coupled with our attempt to keep serving millions of requests a second, caused the Go runtime to spawn many more OS threads ("M's" in Go parlance): roughly 10x the healthy baseline (1,500 M's vs. 150).
That larger batch of M's was in turn putting pressure on the garbage collector:
Those massive pauses in stop-the-world GC duration meant requests were stalling.
Couple that with some very aggressively tuned GOGC and GOMEMLIMIT values and memory limits, and our data plane was actually OOM'ing every so often! That's why the service would work for about 30 minutes, go down for a while, come back for a bit, and repeat.
OOMs are obviously bad (we should have zero of them), but they're ordinarily not that big a deal. However, the memcached connection pools were already so saturated that when the data plane restarted, it couldn't establish new memcached connections on the way back up: the old connections were stuck in TIME_WAIT, still holding the ports, which led to even more port exhaustion. Death spiral!
The band-aid fix was insane but did the job. This is how we actually fixed the outage on Monday, before we found the true root cause:
memcachedClient.DialContext = func(ctx context.Context, network, address string) (net.Conn, error) {
	// Bind each new connection to a random loopback source IP so we
	// stop competing for ephemeral ports on 127.0.0.1 alone.
	ip := net.IPv4(127, byte(1+rand.IntN(254)), byte(rand.IntN(256)), byte(1+rand.IntN(254)))
	d := net.Dialer{LocalAddr: &net.TCPAddr{IP: ip}}
	return d.DialContext(ctx, network, address)
}

That got us out of the death spiral because it expands the client IP+port space: a TCP connection is identified by its (source IP, source port, destination IP, destination port) tuple, so spreading the source address across the 127.0.0.0/8 loopback range multiplies the number of usable tuples by orders of magnitude. Crazy, but effective! We removed this once we fixed the true root cause.
Summary
In my recent talk, I mentioned that you should add extensive observability before you have an outage. We do have a lot, but it's never enough! We need to add per-client o11y as well as get better metrics on when clients send small numbers of large requests.
The evidence was all buried in there, but it was hard to know where to look when so much was falling over at once. You need mental discipline, and high granularity in your metrics, to cut through the noise and find the real root cause. It's hard work!
Also, logging too much isn't great. Logging here and there is fine, but I'd rather lean on Prometheus metrics or OpenTelemetry tracing, since they're designed for high-scale systems.
Finally, apologies again for the extensive interruption in service. The team and I take our operations extremely seriously, and this was a really bad day.
EDIT: Also, the status page said this was an issue with a 3rd party provider. It was clearly not, apologies for that miscommunication! At the time I posted that status page update, I was looking at some traceroutes that indicated some pretty substantial packet loss from a cloud provider to our data center, but those were not the root cause of the issue.