Lessons Learned from Twenty Years of Site Reliability Engineering

Original link: https://sre.google/resources/practices-and-processes/twenty-years-of-sre-lessons-learned/

In summary, these twenty years of Site Reliability Engineering lessons yield several key takeaways for managing and maintaining the reliability and resilience of critical infrastructure:

1. Choose a mitigation path proportional to the severity of the outage; don't escalate the response without careful evaluation.
2. Thoroughly test recovery mechanisms before an emergency occurs; a mitigation that misfires can prolong the downtime.
3. Implement a "big red button" that can quickly revert a harmful change.
4. Perform comprehensive integration testing; unit tests alone are not enough.
5. Communication methods must include backup options; avoid relying solely on Google services for crisis communication.
6. Intentionally design degraded performance modes so services survive and recover gracefully.
7. Recognize that complete datacenter failures do happen, and test disaster resilience and recovery.
8. Automate mitigations that have a proven track record; apply the mitigation first and save root-causing for after user impact has been avoided.
9. Reduce the time between rollouts to limit the damage from a bad rollout; frequent, well-tested deployments mean fewer operational surprises.
10. A homogeneous hardware fleet is efficient to run, but a single device fault can then have severe consequences.
11. Maintain diverse infrastructure, accepting the associated extra cost, to guard against hidden unknowns and minimize system downtime.

By applying these principles, organizations can strengthen the overall stability and integrity of their critical infrastructure.

A summary of the article and the insights shared in the discussion:

  • The thread centers on hypothetical "big red button" scenarios and tips for preventing catastrophic failures, with particular emphasis on catching problems early through a variety of measures.
  • Google SRE uses CDB ("constant database") files to roll back automatically to the last known good configuration when an error appears.
  • Feature flags, backward compatibility, and similar tools help minimize disruptions and reduce the severity of mistakes.
  • One commenter prefers SRE over SWE, arguing that the latter tend to cause outages with ineffective solutions aimed at reducing their own workload; as employers put more pressure on engineers' working hours, error budgets also become less valuable in practice.
  • Another commenter describes problems at large enterprises, such as outages triggered when engineers tried to simulate service errors and latency for chaos engineering. It was also noted that SLIs, SLAs, and error budgets have little practical value unless the employer values engineers' labor, and that for businesses unwilling to compensate workers for on-call time, error budgets are effectively free.
  • On the Google-owned top-level domains mentioned in the article: custom TLDs are common among tech giants. Google holds eleven domain extensions for its own use, including ".app", ".chrome", ".here", and ".search", while Coca-Cola, Boeing, and Marriott International also own custom TLDs.
  • The discussion closes with an update on a recent Singapore banking outage caused by an outsourced vendor applying incorrect chiller settings, which kept services down from 3 p.m. until the next morning. The incident underscored the article's lessons about recovery mechanisms, degraded performance modes, disaster recovery, and mitigations for cooling-system and datacenter failures, and it eroded consumer trust given the city-state's digital transformation push.

Original text

Or, Eleven things we have learned as Site Reliability Engineers at Google

Authors

Adrienne Walcer, Kavita Guliani, Mikel Ward, Sunny Hsiao, and Vrai Stacey

Contributors

Ali Biber, Guy Nadler, Luisa Fearnside, Thomas Holdschick, and Trevor Mattson-Hamilton

Foreword

A lot can happen in twenty years, especially when you're busy growing.

Two decades ago, Google had a pair of small datacenters, each housing a few thousand servers, connected in a ring by a pair of 2.4G network links. We ran our private cloud (though we didn't call it that at the time) using Python scripts such as "Assigner" and "Autoreplacer" and "Babysitter" which operated on config files full of individual server names. We had a small database of the machines (MDB) which helped keep information about individual servers organized and durable. Our small team of engineers used scripts and configs to solve some common problems automatically, and to reduce the manual labor required to manage our little fleet of servers.

Time passed, Google's users came for the search and stayed for the free GB of Gmail, and our fleet and network grew with it. Today, in terms of computing power, we are over 1,000 times as large as we were 20 years ago; in network, over 10,000 times as large, and we spend far less effort per server than we used to while enjoying much better reliability from our service stack. Our tools have evolved from a collection of Python scripts, to integrated ecosystems of services, to a unified platform which offers reliability by default. And our understanding of the problems and failure modes of distributed systems also evolved, as we experienced new classes of outages. We created the Wheel of Misfortune, we wrote Service Best Practices guides, we published Google's Greatest Hits, and today are delighted to present:

Lessons learned from two decades of Site Reliability Engineering

Let's start back in 2016, when YouTube was offering your favorite videos such as "Carpool Karaoke with Adele" and the ever-catchy "Pen-Pineapple-Apple-Pen." YouTube experienced a fifteen-minute global outage, due to a bug in YouTube's distributed memory caching system, disrupting YouTube's ability to serve videos. Here are three lessons we learned from this incident.

1

The riskiness of a mitigation should scale with the severity of the outage

There's a meme where one person posts a picture of a spider seen in their house, and the caption says, "TIME 2 MOVE 2 A NEW HOUSE!". The joke is that the incident (seeing a scary spider) would be responded to with a severe mitigation (abandon your current home and move to a new one). We, here in SRE, have had some interesting experiences in choosing a mitigation with more risk than the outage it's meant to resolve. During the aforementioned YouTube outage, a risky load-shedding process didn't fix the outage... it instead created a cascading failure.

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity. In a best case scenario, a risky mitigation resolves an outage. In a worst case scenario, the risky mitigation misfires and the outage is prolonged by something that was intended to fix it. Additionally, if everything is broken, you can make an informed decision to bypass standard procedures.
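
To make the principle concrete, here is a minimal sketch of gating candidate mitigations by outage severity. The severity levels, mitigation names, and risk scores below are hypothetical illustrations, not Google's incident tooling:

```python
# A hypothetical sketch of gating candidate mitigations by outage severity:
# never pick a mitigation whose blast radius, if it misfires, exceeds the
# outage it is meant to resolve.
from enum import IntEnum

class Severity(IntEnum):
    MINOR = 1      # a handful of users see elevated latency
    MAJOR = 2      # a region is degraded
    CRITICAL = 3   # global outage; users cannot use the product

# Each candidate mitigation carries the severity it could cause if it misfires.
MITIGATIONS = [
    {"name": "roll back the last config push", "risk": Severity.MINOR},
    {"name": "drain traffic from one cluster", "risk": Severity.MAJOR},
    {"name": "global load shedding", "risk": Severity.CRITICAL},
]

def candidate_mitigations(outage_severity: Severity) -> list[dict]:
    """Only offer mitigations no riskier than the outage itself."""
    return [m for m in MITIGATIONS if m["risk"] <= outage_severity]

if __name__ == "__main__":
    # During a regional degradation, global load shedding stays off the table.
    for m in candidate_mitigations(Severity.MAJOR):
        print("candidate:", m["name"])
```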

2

Recovery mechanisms should be fully tested before an emergency

An emergency fire evacuation in a tall city building is a terrible opportunity to use a ladder for the first time. Similarly, an outage is a terrible opportunity to try a risky load-shedding process for the first time. To keep your cool during a high-risk and high-stress situation, it's important to practice recovery mechanisms and mitigations beforehand and verify that:

  • they'll do what you need them to do
  • you know how to do them

Testing recovery mechanisms has a fun side effect of reducing the risk of performing some of these actions. Since this messy outage, we've doubled down on testing.
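
As one illustration of what rehearsing a mitigation beforehand can look like, here is a minimal sketch of a scheduled recovery drill. The staging environment, the load_shed.sh script, and its flags are hypothetical stand-ins for whatever mechanism your runbook relies on:

```python
# A hypothetical recovery drill: exercise the runbook steps against staging on
# a schedule, so gaps are found during a drill rather than during an outage.
import subprocess
import sys

RUNBOOK_STEPS = [
    # Hypothetical commands; substitute your own tooling.
    ["./load_shed.sh", "--env=staging", "--fraction=0.25"],
    ["./load_shed.sh", "--env=staging", "--restore"],
]

def run_drill() -> bool:
    """Run every step and fail loudly if any of them no longer works."""
    for step in RUNBOOK_STEPS:
        try:
            result = subprocess.run(step, capture_output=True, text=True)
        except FileNotFoundError as err:
            print(f"DRILL FAILED: missing tool for {' '.join(step)}: {err}")
            return False
        if result.returncode != 0:
            print(f"DRILL FAILED at {' '.join(step)}: {result.stderr.strip()}")
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if run_drill() else 1)
```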

3

Canary all changes

At one point, we wanted to push a caching configuration change. We were pretty sure that it would not lead to anything bad. But pretty sure is not 100% sure. Turns out, caching was a pretty critical feature for YouTube, and the config change had some unintended consequences that fully hobbled the service for 13 minutes. Had we canaried those global changes with a progressive rollout strategy, this outage could have been curbed before it had global impact. Read more about the canary strategy in this paper, and learn more in this video.
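
For illustration, here is a minimal sketch of the shape of a progressive rollout; the linked paper and video describe the real strategy, while the apply_config and healthy helpers, stage fractions, and soak time below are hypothetical:

```python
# A hypothetical progressive (canary) rollout: push a change to a small slice
# of the fleet first, watch its health, and stop before it goes global.
import time

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of serving tasks
SOAK_SECONDS = 600                         # how long to watch each stage

def apply_config(change_id: str, fraction: float) -> None:
    """Hypothetical: apply the config change to this fraction of the fleet."""
    print(f"applying {change_id} to {fraction:.0%} of tasks")

def healthy() -> bool:
    """Hypothetical: compare error rates and latency of canaried tasks
    against the untouched baseline."""
    return True

def rollout(change_id: str) -> bool:
    for fraction in ROLLOUT_STAGES:
        apply_config(change_id, fraction)
        time.sleep(SOAK_SECONDS)
        if not healthy():
            # Roll back everything touched so far instead of pushing on.
            print(f"unhealthy at {fraction:.0%}; rolling back {change_id}")
            apply_config(f"rollback-{change_id}", fraction)
            return False
    return True
```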

Around the same timeframe, YouTube's slightly younger sibling, Google Calendar, also experienced an outage which serves as the backdrop for the next two lessons.

4

Have a "Big Red Button"

A "Big Red Button" is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state and (ideally) shuts down whatever's happening. "Big Red Buttons" come in many shapes and sizes—and it's important to identify what those big red buttons might be before you submit a potentially risky action. We once narrowly missed a major outage because the engineer who submitted the would-be-triggering change unplugged their desktop computer before the change could propagate. So when planning your major rollouts, consider: What is my big red button? Ensure every service dependency has a "big red button" to exercise in an emergency. See "Generic Mitigations" for more!
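
As a sketch of what a software "big red button" can look like, the snippet below flips a single pre-tested kill switch to revert to the known-safe code path. The flag name and the file-based flag store are hypothetical stand-ins for whatever control surface you actually rely on:

```python
# A hypothetical "big red button": one pre-tested kill switch that disables a
# risky feature immediately, without waiting for a new rollout.
import json
import pathlib

FLAG_FILE = pathlib.Path("/tmp/feature_flags.json")  # stand-in for a real flag store

def press_big_red_button(flag: str) -> None:
    """Flip one flag to fall back to the known-safe behavior. Kept deliberately
    simple so it still works when fancier tooling is part of the outage."""
    flags = json.loads(FLAG_FILE.read_text()) if FLAG_FILE.exists() else {}
    flags[flag] = False
    FLAG_FILE.write_text(json.dumps(flags))
    print(f"{flag} disabled; serving the pre-change behavior")

if __name__ == "__main__":
    press_big_red_button("new_caching_path_enabled")
```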

5

Unit tests alone are not enough - integration testing is also needed

Ahh.... unit tests. They verify that an individual component can perform the way we need it to. Unit tests have intentionally limited scope, and are super helpful, but they also don't fully replicate the runtime environment and productionized demands that might exist. For this reason, we are big advocates of integration testing! We can use integration tests to verify that jobs and tasks can perform a cold start. Will things work the way we want them to? Will components work together the way we want them to? Will these components successfully create the system we want them to? This lesson was learned during a Calendar outage in which our testing didn't follow the same path as real use, resulting in plenty of testing... that didn't help us assess how a change would perform in reality.
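
To show how this differs in shape from a unit test, here is a minimal sketch of an integration test that exercises a cold start end to end. The FakeCalendarService class, its config, and its endpoint are hypothetical stand-ins for the real binary under test:

```python
# A hypothetical integration test: bring the whole service up from nothing and
# check the user-facing path, rather than testing one component in isolation.
import unittest

class FakeCalendarService:
    """Stand-in for the real binary; a real test would launch the actual
    server with its production config on an empty cache."""
    def __init__(self, config_path: str):
        self.config_path = config_path
        self.started = False

    def start(self) -> None:
        self.started = True

    def get(self, path: str) -> int:
        # Return an HTTP-like status code.
        return 200 if self.started else 503

class ColdStartIntegrationTest(unittest.TestCase):
    def test_cold_start_serves_traffic(self):
        service = FakeCalendarService(config_path="prod_config.textproto")
        service.start()
        # Verify the path a user would actually exercise.
        self.assertEqual(service.get("/calendar/events"), 200)

if __name__ == "__main__":
    unittest.main()
```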

Shifting to an incident that happened in February 2017, we find our next two lessons.

First, unavailable OAuth tokens caused millions of users to be logged out of devices and services, and 32,000 OnHub and Google WiFi devices to perform a factory reset. Manual account recovery claims jumped by 10x because of failed logins. It took Google about 12 hours to fully recover from the outage.

6

COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!

Yes, it was a bad time. You want to know what made it worse? Teams were expecting to be able to use Google Hangouts and Google Meet to manage the incident. But when 350M users were logged out of their devices and services... relying on these Google services was, in retrospect, kind of a bad call. Ensure that you have non-dependent backup communication channels, and that you have tested them.

Then, the same 2017 incident led us to better understand graceful degradation:

7

Intentionally degrade performance modes

It's easy to think of availability as either "fully up" or "fully down" ... but being able to offer a continuous minimum functionality with a degraded performance mode helps to offer a more consistent user experience. So we've built degraded performance modes carefully and intentionally—so during rough patches, it might not even be user-visible (it might be happening right now!). Services should degrade gracefully and continue to function under exceptional circumstances.
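
One common way to build such a mode, sketched below with a hypothetical backend call and an in-process cache, is to serve a recent-but-stale cached response when the primary path fails, rather than returning an error:

```python
# A hypothetical degraded-performance mode: when the backend is unavailable,
# serve a stale cached copy instead of failing the request outright.
import time

CACHE: dict[str, tuple[float, str]] = {}   # key -> (timestamp, payload)
STALE_OK_SECONDS = 3600                    # how stale we are willing to serve

def fetch_fresh(key: str) -> str:
    """Hypothetical call to the primary backend; raises during an outage."""
    raise TimeoutError("backend unavailable")

def get(key: str) -> str:
    try:
        payload = fetch_fresh(key)
        CACHE[key] = (time.time(), payload)
        return payload
    except TimeoutError:
        cached = CACHE.get(key)
        if cached and time.time() - cached[0] < STALE_OK_SECONDS:
            return cached[1]   # degraded: possibly stale, but usable
        raise                  # nothing to degrade to; surface the error
```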

This next lesson is a recommendation to ensure that your last-line-of-defense system works as expected in extreme scenarios, such as natural disasters or cyber attacks, that result in loss of productivity or service availability.

8

Test for Disaster resilience

Besides unit testing and integration testing, there are other types of very important testing: disaster resilience and recovery testing. While resilience testing verifies that your service or system could survive in the event of faults, latency, or disruptions, recovery testing verifies that your service can transition back to homeostasis after a full shutdown. Both should be critical pieces of your business continuity strategy—as described in "Weathering the Unexpected". A useful activity can also be sitting your team down and working through how some of these scenarios could theoretically play out—tabletop game style. This can also be a fun opportunity to explore those terrifying "What Ifs", for example, "What if part of your network connectivity gets shut down unexpectedly?".
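
As one concrete, hypothetical example of recovery testing, the sketch below shuts a non-production cluster all the way down and asserts that it returns to a healthy steady state within a deadline; the cluster-control helpers are placeholders:

```python
# A hypothetical recovery test: full shutdown, cold start, and a deadline for
# returning to homeostasis. Missing the deadline is the test's finding.
import time

RECOVERY_DEADLINE_SECONDS = 900
POLL_SECONDS = 30

def shut_down_cluster(name: str) -> None:
    """Hypothetical: stop every task in the named non-production cluster."""

def start_cluster(name: str) -> None:
    """Hypothetical: cold start the cluster from persistent state."""

def is_healthy(name: str) -> bool:
    """Hypothetical: health probes pass and replication has caught up."""
    return True

def recovery_test(name: str = "dr-test-cluster") -> bool:
    shut_down_cluster(name)
    start_cluster(name)
    deadline = time.time() + RECOVERY_DEADLINE_SECONDS
    while time.time() < deadline:
        if is_healthy(name):
            return True          # back to a healthy steady state in time
        time.sleep(POLL_SECONDS)
    return False                 # recovery took too long

if __name__ == "__main__":
    print("recovered within deadline:", recovery_test())
```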

9

Automate your mitigations

In March of 2023, a near-simultaneous failure of multiple networking devices occurred in a few datacenters, resulting in a widespread packet loss. In this 6-day outage, an estimated 70% of services experienced varied levels of impact, depending on the location, service load, and configuration at the time of network failure.

In such instances, you can reduce your mean time to resolution (MTTR) by automating mitigations that were previously done by hand. If there's a clear signal that a particular failure is occurring, then why can't that mitigation be kicked off in an automated way? Sometimes it is better to use an automated mitigation first and save the root-causing for after user impact has been avoided.
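
Here is a minimal sketch of that pattern, assuming a hypothetical packet-loss metric and a hypothetical drain_location helper: when the signal is sustained and unambiguous, the mitigation fires without waiting for a human to triage:

```python
# A hypothetical automated mitigation: sustained packet loss in one location
# triggers a pre-tested traffic drain; root-causing happens afterwards.
PACKET_LOSS_THRESHOLD = 0.05   # 5% sustained loss
SUSTAINED_SAMPLES = 3          # consecutive bad samples before acting

def drain_location(location: str) -> None:
    """Hypothetical: shift traffic away from the affected datacenter."""
    print(f"draining {location}")

def maybe_mitigate(location: str, recent_loss_samples: list[float]) -> bool:
    """Kick off the mitigation as soon as the signal is unambiguous."""
    window = recent_loss_samples[-SUSTAINED_SAMPLES:]
    if len(window) == SUSTAINED_SAMPLES and all(
        sample > PACKET_LOSS_THRESHOLD for sample in window
    ):
        drain_location(location)
        return True
    return False

if __name__ == "__main__":
    maybe_mitigate("dc-east1", [0.06, 0.08, 0.07])   # drains dc-east1
```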

10

Reduce the time between rollouts, to decrease the likelihood of the rollout going wrong

In March of 2022, a widespread outage in the payments system prevented customers from completing transactions, resulting in the Pokémon GO community day being postponed. The cause was the removal of a single database field, which should have been safe as all uses of that field were removed from the code beforehand. Unfortunately, a slow rollout cadence of one part of the system meant that the field was still being used by the live system.

Having long delays between rollouts, especially in complex, multi-component systems, makes it extremely difficult to reason about the safety of a particular change. Frequent rollouts—with the proper testing in place—lead to fewer surprises from this class of failure.
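
As an illustration, the sketch below shows the kind of pre-removal check that can catch this class of failure: before dropping the field, confirm that every component still serving traffic was built after its last reference to the field was deleted. The release inventory and version numbers are hypothetical:

```python
# A hypothetical pre-removal safety check for dropping a database field.

# Hypothetical inventory: component -> version of the build currently serving.
DEPLOYED_VERSIONS = {
    "payments-frontend": 20240310,
    "payments-backend": 20240115,    # slow rollout cadence: months behind
}

# Version in which each component removed its last reference to the field.
FIELD_REFERENCE_REMOVED_IN = {
    "payments-frontend": 20240201,
    "payments-backend": 20240201,
}

def safe_to_drop_field() -> bool:
    """The field may be dropped only if every live build is at or past the
    version that removed its last reference to the field."""
    return all(
        DEPLOYED_VERSIONS[component] >= FIELD_REFERENCE_REMOVED_IN[component]
        for component in DEPLOYED_VERSIONS
    )

if __name__ == "__main__":
    # payments-backend still runs a build that reads the field: not safe yet.
    print("safe to drop field:", safe_to_drop_field())
```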

11

A single global hardware version is a single point of failure

Having only one particular model of device to perform a critical function can make for simpler operations and maintenance. However, it means that if that model turns out to have a problem, that critical function is no longer being performed.

This happened in March 2020 when a networking device that had an undiscovered zero-day bug, encountered a change in traffic patterns that triggered that bug. As the same model and version of the device was being used across the network, a substantial regional outage ensued. What prevented this from being a total outage was the presence of multiple network backbones that allowed high-priority traffic to be routed via a still working alternative.

Latent bugs in critical infrastructure can lurk undetected until a seemingly innocuous event triggers them. Maintaining a diverse infrastructure, while incurring costs of its own, can mean the difference between a troublesome outage and a total one.

So there you have it! Eleven lessons learned, from two decades of Site Reliability Engineering at Google. Why eleven? Well, you see, Google Site Reliability, with our rich history, is still in our prime.
