Tangled Mess

Original link: https://www.subbu.org/articles/2025/tangled-mess/

## From Self-Reliance to Interdependence: A Shift in Resilience

Historically, enterprises built resilience by duplicating infrastructure (secondary data centers and replication), driven by unreliable hardware and contractual obligations around disaster recovery. This "disaster recovery tax" made sense when systems were mostly self-contained.

The rise of cloud computing and open-source software changed everything. While they delivered speed and flexibility, they also introduced new interdependencies. Systems now rely on a complex web of cloud services and SaaS providers, making traditional redundancy strategies ineffective. Building a fully redundant system requires *all* dependencies to be equally resilient, which is usually an unrealistic expectation.

Today, focusing on replicating entire sites is often a misallocation of resources. Instead, enterprises should prioritize understanding and using the built-in resilience features cloud providers offer (backups, failover, and so on). Achieving very high availability (five nines or more) requires heavy investment in specialized talent, standardized architectures, and a deeply ingrained engineering culture that prioritizes fault tolerance.

Finally, legal and compliance teams need to move past outdated requirements for secondary sites and focus on modern resilience practices, such as regular "game days" and exercises that embrace failure scenarios. Resilience has a cost, and a clear understanding of business needs is essential for investing in it effectively.

Hacker News discussion (13 points, submitted by gpi, 2 comments):

andrewstuart: What is the point of this? It reads like an apologia for cloud services: "AWS is inevitable, and of course it is bad and unreliable, but this is the present and the future, so don't fantasize that you can build your own redundancy, because you can't." That seems to be the message.

outofmyshed: This is bad advice. Multi-region architecture is hard, and AWS isn't doing much to make it easier. Cross-region data transfer is too profitable for them to do so.

Original Article

Once upon a time, our software was simple. All it took was a database, a user interface, and some glue code between them, running behind a web server farm, to go online. Most of the business-critical data stayed in the database. People rented space in colocation centers and bought servers to run their databases and code. Those colocation centers were not reliable, and your business could go down due to power or cooling failures. The servers were also not reliable, and you could lose data due to disk failures. Nasty cable damage somewhere, and the network could be out for hours or days. To get around such booboos, IT teams rented space in secondary colocation centers, bought servers and expensive replication systems to replicate data to those secondary sites, and took steps to bring up secondary sites in the event of disasters.

Businesses were worried about doing business with other companies whose infrastructure failures could hurt them. So legal people got involved to add contractual language to their agreements. Those clauses required their vendors to implement disaster recovery policies. In some cases, those contracts also required declaring procedures for recovering critical systems at the secondary site, testing them once or twice a year, and sharing the results. Since breaches of such contracts could lead to termination of business agreements, companies hired compliance professionals to draft rules for their tech teams to follow. Tech teams begrudgingly followed those rules to ensure their backups and replication were working and that they could bring up the most critical systems in the secondary colocation centers. Paying such a disaster recovery tax made sense because the infrastructure was unreliable.

It was okay, sort of, until the cloud happened. Cloud providers started offering a variety of infrastructure services to build software. Open source gave us even more flexibility to introduce new software architecture patterns. As engineers, we loved that choice and started binging on it, adopting whatever cloud companies began offering. Of course, this choice gave us a tremendous speed advantage without requiring years of engineering investments to build complex distributed systems. Those services also began offering higher fault tolerance and availability than what most enterprises could fathom. Moving systems from on-prem to the cloud became necessary to take advantage of that flexibility. Many people have built their careers around cloud transformation, and that is still ongoing.

For a while, cloud services were flaky. It fell to customers to keep their businesses running even when a cloud provider’s systems were down. It seemed sensible to work around those problems and build redundancy in a second cloud region. Many people tried that pattern. Some, like Netflix, succeeded, or at least are known to have succeeded; I don’t know if that is still the case today. Many had partial success to the extent of getting some stateless parts of the business running from multiple cloud regions.

Around the same time, the SaaS industry took off. The proliferation of online systems has increased complexity and fueled the hunger for automation across enterprises. This created opportunities for SaaS-based companies to fill that gap and offer a variety of services, from infrastructure to customer service to finance to sales and marketing. Relying on third-party SaaS became a necessity for every enterprise. You can no longer take code to production without depending on a subscription or pay-as-you-go service from another company. The net result of this flexibility and abundance is that almost everything is now interconnected.

We are a tangled mess

We are now a tangled mess. There are no more independent systems. Our systems share their fate with large swaths of the Internet. Almost every business now depends on other companies, mostly consuming their services. Thus, all bets on building redundant secondary sites are off. You not only have to get your part right, but you also need all your dependencies to do the same to be fully redundant. Most companies can't even make their software redundant across multiple locations due to the variety of services they build, their interconnectedness, and their infrastructure needs. After all, building highly available and fault-tolerant systems requires more discipline, talent, and time than most enterprises can afford. Let's not kid ourselves.

Where does this leave us?

First, get unstuck from the old paradigm of redundancy in secondary sites. That is over-simplified thinking. It no longer makes sense for most companies to waste their precious resources on building redundancy across multiple cloud regions. Yes, cloud providers will fail from time to time, as last week's AWS us-east-1 outage showed. Yet they are still incentivized to invest billions of dollars and time in the resilience of their infrastructure, services, and processes. As for yourself, instead of focusing on redundancy, invest in learning to use those cloud services correctly. These days, most cloud services offer knobs to help their customers survive disasters (like automated backups, database failovers, and availability zone failures). Know what those are, and use them meticulously.
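To make those knobs concrete, here is a minimal sketch of provisioning a managed database with automated backups and a standby in another availability zone, using boto3 against AWS RDS. The instance name, size, and credentials are placeholders, not values from any real setup.

```python
# Sketch: turning on built-in resilience knobs for a managed AWS RDS database.
# Assumes boto3 is installed and AWS credentials are configured.
# All identifiers and credentials below are placeholders.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",       # placeholder name
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="dbadmin",               # placeholder credential
    MasterUserPassword="change-me",         # placeholder credential
    MultiAZ=True,                           # synchronous standby in a second availability zone
    BackupRetentionPeriod=7,                # automated backups kept for 7 days
    DeletionProtection=True,                # guard against accidental deletion
)
```

The point is not this particular API call but the habit behind it: for every managed service you depend on, find the equivalent switches and verify, rather than assume, that they are turned on.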

Second, if you truly need and care about five or more nines of absolute (i.e., not brown-out) availability for your business, make sure your business can afford the cost. To achieve such availability, you have to do several things right. You need the right talent who understands how to build highly available, fault-tolerant systems. In most cases, you have to develop that talent in-house because such talent is rare. Then you need to standardize patterns like cells, global caches, replication systems, and eventual consistency for every critical piece of code you create. You will need to invest in paved paths to make it easy to follow those patterns. Implementing those patterns takes time, and you need to get them right. Most importantly, you also need a disciplined engineering culture that prioritizes high availability and fault tolerance in every decision. Your culture needs to embrace constrained choices and sacrifice flexibility in favor of high availability and fault tolerance.
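The cell pattern mentioned above can be sketched in a few lines: every customer is deterministically pinned to one self-contained copy of the stack (a cell), so a failure in one cell only affects the customers routed to it. The cell endpoints below are hypothetical; this is an illustration of the idea, not the author's implementation.

```python
# Sketch of cell-based routing: pin each customer to one isolated "cell"
# (a full, independent deployment) so failures stay contained to that cell.
# The endpoints are hypothetical placeholders.
import hashlib

CELLS = [
    "https://cell-1.example.internal",
    "https://cell-2.example.internal",
    "https://cell-3.example.internal",
]

def cell_for_customer(customer_id: str) -> str:
    """Deterministically map a customer to a cell; the mapping is stable
    as long as the cell list does not change."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

# All traffic for this customer goes to the same cell, so an outage in
# cell-2 leaves customers mapped to the other cells unaffected.
print(cell_for_customer("customer-42"))
```

In practice, the routing layer also needs a way to migrate customers between cells and to roll out changes cell by cell to cap the blast radius of a bad deployment; the sketch only shows the core idea.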

Third, somehow cajole your compliance and legal people to refine or drop dangerous language like "secondary sites." Unless your company's architecture is stuck 20 years in the past, such language no longer makes sense. Refining it can be easier said than done, since some contracts are dated and fixing them may be the least important priority for your legal teams. Don't get me wrong. You still need to invest in game days and similar failure-embracing exercises to build resilience into your culture. But how you practice resilience needs to evolve.
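One lightweight way to run such a game day is to deliberately inject failures and latency into a dependency call path and watch whether the fallback behaves as intended. The sketch below is a hypothetical illustration of that exercise, not tooling described in the article.

```python
# Sketch: a game-day fault injector that wraps a dependency call and
# randomly raises errors or adds latency so fallback paths get exercised.
# The failure rate, delay, and wrapped function are illustrative only.
import random
import time
from functools import wraps

def inject_faults(failure_rate: float = 0.2, max_delay_s: float = 2.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("game-day injected failure")
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.3)
def fetch_inventory(sku: str) -> int:
    # Stand-in for a call to a real downstream service.
    return 42

# During the exercise, callers should degrade gracefully instead of crashing.
try:
    print(fetch_inventory("sku-123"))
except ConnectionError:
    print("fallback: serving a cached inventory count")
```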

We are indeed in a tangled mess. Resilience is costly. Know what you need, figure out the business cost, and do what you can.
