What Now? Handling Errors in Large Systems

Original link: https://brooker.co.za/blog/2025/11/20/what-now

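## Error Handling as a System-Level Property

A recent Cloudflare outage, triggered by a Rust `unwrap()` call, set off a discussion about error handling. The key point is not the code itself: *how* errors are handled is a global property of the system, not a local one. Whether crashing a process is acceptable depends on factors such as failure correlation, what the architecture can absorb, and whether it is possible to meaningfully continue. Uncorrelated, isolated failures are usually best handled by crashing, which simplifies system state. Correlated failures (including adversarial behavior) call for robustly rejecting the cause of the error and continuing to run. Architecturally, systems built to tolerate high failure rates (such as serverless functions or Erlang) can handle crashes more gracefully. Business logic matters too: continuing with the last-known-good configuration is often feasible, but skipping database updates can corrupt data. Ultimately, effective error handling requires deliberate up-front design and blast radius reduction, which limits the impact of failures through techniques such as cell-based architectures. This acknowledges the inherent complexity of systems and prioritizes resilience by minimizing the scope of potential problems.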


Original text

More options means more choices.

Cloudflare’s deep postmortem for their November 18 outage triggered a ton of online chatter about error handling, caused by a single line in the postmortem:

.unwrap()

If you’re not familiar with Rust, you need to know about Result, a type (an enum, strictly speaking) that can contain either a successful result or an error. unwrap says basically “return the successful result if there is one, otherwise crash the program”¹. You can think of it like an assert.
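To make that concrete, here is a minimal sketch of the two ways to consume a Result. The helpers `parse_port` and `port_or_default` are made up for illustration; only `Result`, `parse`, and `unwrap` come from Rust's standard library.

```rust
use std::num::ParseIntError;

/// Hypothetical helper: parse a TCP port from text.
fn parse_port(text: &str) -> Result<u16, ParseIntError> {
    text.trim().parse::<u16>()
}

/// Handle the error explicitly: fall back to a default instead of panicking.
fn port_or_default(text: &str, default: u16) -> u16 {
    match parse_port(text) {
        Ok(port) => port,
        Err(_) => default,
    }
}

fn main() {
    // Success case: unwrap returns the value inside Ok.
    let port = parse_port("8080").unwrap();
    println!("parsed port: {}", port);

    // Error case, handled without panicking.
    println!("fallback port: {}", port_or_default("not-a-port", 80));

    // Calling parse_port("not-a-port").unwrap() here would panic instead.
}
```

Both versions are legitimate. Which one is right depends on what the rest of the system does when this process dies, which is the point of the rest of this post.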

There’s a ton of debate about whether asserts are good in production², but most are missing the point. Quite simply, this isn’t a question about a single program. It’s not a local property. Whether asserts are appropriate for a given component is a global property of the system, and the way it handles data.

Let’s play a little error handling game. Click the ✅ if you think crashing the process or server is appropriate, and the ❌ if you don’t. Then you’ll see my vote and justification.

1. One of ten web servers behind a load balancer encounters uncorrectable memory errors, and takes itself out of service.
2. One of ten multi-threaded application servers behind a load balancer encounters a null pointer in business logic while processing a customer request.
3. One database replica receives a logical replication record from the primary that it doesn't know how to process.
4. One web server receives a global configuration file from the control plane that appears malformed.
5. One web server fails to write its log file because of a full disk.


There are three unifying principles behind my answers here.

Are failures correlated? If the decision is a local one that’s highly likely to be uncorrelated between machines, then crashing is the cleanest thing to do. Crashing has the advantage of reducing the complexity of the system, by removing the “working in degraded mode” state. On the other hand, if failures can be correlated (including by adversarial user behavior), it’s best to design the system to reject the cause of the errors and continue.
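As a rough sketch of “reject the cause and continue”, consider a request handler whose input can be bad in the same way on every server at once, for example because one client keeps sending it. Everything here (`Response`, `handle_request`, `serve`, the comma-separated body format) is hypothetical and only illustrates the shape of the idea.

```rust
#[derive(Debug, PartialEq)]
enum Response {
    Ok(u64),
    BadRequest(String),
}

/// Hypothetical handler: sums the comma-separated integers in a request body.
/// Malformed input is turned into an error response, not a panic.
fn handle_request(body: &str) -> Response {
    let mut total: u64 = 0;
    for field in body.split(',') {
        let n = match field.trim().parse::<u64>() {
            Ok(n) => n,
            Err(e) => return Response::BadRequest(format!("bad field {:?}: {}", field, e)),
        };
        total = match total.checked_add(n) {
            Some(t) => t,
            None => return Response::BadRequest(String::from("sum overflows u64")),
        };
    }
    Response::Ok(total)
}

/// Serves every request in order; a bad request affects only its own response.
fn serve(requests: &[&str]) -> Vec<Response> {
    requests.iter().map(|body| handle_request(body)).collect()
}

fn main() {
    for response in serve(&["1,2,3", "1,oops", "4"]) {
        println!("{:?}", response);
    }
}
```

If `handle_request` used unwrap on the parse instead, a single client repeating the bad request could crash every server it reaches, which is exactly the correlated case.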

Can they be handled at a higher layer? This is where you need to understand your architecture. Traditional web service architectures can handle low rates of errors at a higher layer (e.g. by replacing instances or containers as they fail load balancer health checks using AWS Autoscaling), but can’t handle high rates of crashes (because they are limited in how quickly instances or containers can be replaced). Fine-grained architectures, starting with Lambda-style serverless all the way to Erlang’s approach, are designed to handle higher rates of errors, and crashing rather than continuing is appropriate in more cases.
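A toy version of the fine-grained end of that spectrum can be sketched with one thread per unit of work and a supervisor that notices the crash. This is an assumption-heavy illustration: `process` and `supervise` are made-up names, and threads are only a stand-in for real isolation boundaries such as Erlang processes or Lambda invocations, since threads in one process still share memory.

```rust
use std::thread;

/// Hypothetical unit of work: panics on a "poison" input, as an assert or unwrap would.
fn process(job: u32) -> u32 {
    assert!(job != 13, "cannot process job {}", job);
    job * 2
}

/// Runs each job in its own thread, so a panic ends only that job.
/// Returns the results of the jobs that succeeded and the number of crashes.
fn supervise(jobs: &[u32]) -> (Vec<u32>, usize) {
    let mut results = Vec::new();
    let mut crashes = 0;
    for &job in jobs {
        // join() returns Err if the spawned thread panicked.
        match thread::spawn(move || process(job)).join() {
            Ok(value) => results.push(value),
            Err(_) => crashes += 1,
        }
    }
    (results, crashes)
}

fn main() {
    let (results, crashes) = supervise(&[1, 13, 3]);
    println!("results: {:?}, crashes: {}", results, crashes);
}
```

Here the cost of a crash is one lost job, so crashing is cheap. When the unit that crashes is a whole instance that takes minutes to replace, the same panic is far more expensive.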

Is it possible to meaningfully continue? This is where you need to understand your business logic. In most cases with configuration, and some cases with data, it’s possible to continue with the last-known good version. This adds complexity, by introducing the behavior mode of running with that version, but that complexity may be worth the additional resilience. On the other hand, in a database that handles updates via operations (e.g. x = x + 1) or conditional operations (if x == 1 then y = y + x), continuing after skipping some records could cause arbitrary state corruption. In the latter case, the system must be designed (including its operational practices) to ensure the invariant that replicas only get records they understand. These kinds of invariants make the system less resilient, but are needed to avoid state divergence.
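The configuration half of that can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any real system: the `Config` type, the one-line `max_features=<n>` format, the arbitrary limit of 200, and the `ConfigHolder` name are all invented for the example.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Config {
    max_features: usize,
}

/// Hypothetical format: a single line of the form "max_features=<n>".
fn parse_config(text: &str) -> Result<Config, String> {
    let value = text
        .trim()
        .strip_prefix("max_features=")
        .ok_or_else(|| String::from("missing max_features key"))?;
    let max_features = value
        .parse::<usize>()
        .map_err(|e| format!("invalid max_features: {}", e))?;
    // Arbitrary sanity limit, chosen only for this sketch.
    if max_features == 0 || max_features > 200 {
        return Err(format!("max_features out of range: {}", max_features));
    }
    Ok(Config { max_features })
}

/// Holds the last-known-good configuration and counts rejected updates.
struct ConfigHolder {
    current: Config,
    rejected_updates: u64,
}

impl ConfigHolder {
    fn new(initial: Config) -> Self {
        ConfigHolder { current: initial, rejected_updates: 0 }
    }

    /// Applies the update if it validates; otherwise keeps the current
    /// configuration, counts the rejection, and reports the reason.
    fn apply_update(&mut self, text: &str) -> Result<(), String> {
        match parse_config(text) {
            Ok(config) => {
                self.current = config;
                Ok(())
            }
            Err(reason) => {
                self.rejected_updates += 1;
                Err(reason)
            }
        }
    }
}

fn main() {
    let mut holder = ConfigHolder::new(Config { max_features: 60 });
    for update in ["max_features=80", "max_features=9999", "garbage"] {
        match holder.apply_update(update) {
            Ok(()) => println!("applied {:?}", update),
            Err(reason) => println!("rejected {:?}: {}", update, reason),
        }
    }
    println!("running with {:?}", holder.current);
}
```

The `rejected_updates` counter is the extra mode the paragraph above warns about: a server quietly running on an old configuration is a new state that someone has to monitor and alarm on. The same trick does not carry over to the replication case, where skipping a record the replica cannot parse would let its state drift from the primary's.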

The bottom line is that error handling in systems isn’t a local property. The right way to handle errors is a global property of the system, and error handling needs to be built into the system from the beginning.

Getting this right is hard, and that’s where blast radius reduction techniques like cell-based architectures, independent regions, and shuffle sharding come in. Blast radius reduction means that if you do the wrong thing you affect less than all your traffic, ideally a small percentage of traffic. Blast radius reduction is humility in the face of complexity.
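For a flavor of how shuffle sharding limits the damage, here is a small sketch that gives each customer a pseudo-random subset of the nodes. It is one possible way to pick the subset (ranking nodes by a per-customer hash), not a description of any production implementation. The names `score` and `shuffle_shard` are made up, and `DefaultHasher` is used only for brevity: its algorithm is not guaranteed to stay the same across Rust releases, so a real system would want a hash with a stable, specified output.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pseudo-random score for a (customer, node) pair.
fn score(customer: &str, node: usize) -> u64 {
    let mut hasher = DefaultHasher::new();
    customer.hash(&mut hasher);
    node.hash(&mut hasher);
    hasher.finish()
}

/// Picks up to `shard_size` of the nodes 0..node_count for this customer by
/// ranking all nodes by score and keeping the lowest-scoring ones.
/// The returned node ids are sorted in ascending order.
fn shuffle_shard(customer: &str, node_count: usize, shard_size: usize) -> Vec<usize> {
    let mut nodes: Vec<usize> = (0..node_count).collect();
    nodes.sort_by_key(|&node| score(customer, node));
    nodes.truncate(shard_size);
    nodes.sort();
    nodes
}

fn main() {
    for customer in ["customer-a", "customer-b", "customer-c"] {
        println!("{} -> nodes {:?}", customer, shuffle_shard(customer, 8, 2));
    }
}
```

With 8 nodes and shards of 2 there are 28 possible shards, so a customer whose requests crash both of their nodes fully takes out only the customers who happen to share that exact pair.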

Footnotes

¹ Yes, I know a panic isn’t necessarily a crash, but it’s close enough for our purposes here. If you’d like to explain the difference to me, feel free.
² And a ton of debate about whether Rust helped here. I think Rust does two things very well in this case: it makes the unwrap case explicit in the code (the programmer can see that this line has “succeed or die behavior”, entirely locally on this one line of code), and prevents action-at-a-distance behavior (which silently continuing with a NULL pointer could cause). What Rust doesn’t do perfectly here is make this explicit enough. Some suggested that unwrap should be called or_panic, which I like. Others suggested lints like clippy should be more explicit about requiring unwrap to come with some justification, which may be helpful in some code bases. Overall, I’d rather be writing Rust than C here.
