我们应该摒弃平均 CPU 利用率这一指标。

我们应该摒弃平均 CPU 利用率这一指标。
We should get rid of average CPU utilization

原始链接: https://www.theocharis.dev/blog/why-we-should-get-rid-of-average-cpu-utilization/

一个 Go 应用程序在监控面板显示健康的情况下，仍反复出现“context deadline exceeded”错误。根本原因是 Kubernetes/Docker 中的 **CFS（完全公平调度器）节流**机制，该机制将 CPU 限制为一种时间预算，而非固定的处理能力。当容器化进程出现突发负载时，它可能会在短时间内耗尽 100 毫秒的 CPU 总预算，导致内核在周期剩余时间内暂停其执行。由于这种情况发生在毫秒级，平滑处理后的平均 CPU 指标往往维持在较低水平，从而掩盖了后台严重的延迟瓶颈。对于延迟敏感型工作负载，标准的“平均 CPU”图表具有误导性。为诊断此问题，工程师应监控： * **cgroup `cpu.stat`**：重点关注 `nr_throttled` 和 `throttled_usec`。 * **内核 PSI**：使用 `cpu.pressure` 检测资源饱和情况。 * **Steal Time**：在虚拟化环境中可通过 `top` 命令中的 `%st` 查看。 * **应用层检测**：监控 Goroutine 调度延迟，以识别应用程序何时因缺少 CPU 时间而被阻塞。归根结底，开发者必须超越平均利用率指标，以检测实际的 CPU 匮乏问题，从而确保系统性能优良，而非仅仅停留在“高效”节流的表象上。

关于“我们应该摒弃平均 CPU 利用率”这篇文章的 Hacker News 讨论指出，业界普遍认为聚合指标往往具有误导性，或不足以用于监控系统性能。评论者认为： * **指标的局限性：** 仅依赖 CPU 利用率会忽略性能限流实现不当的问题，也无法反映内存使用的深层复杂性。 * **更好的替代方案：** 许多人建议将重点转向延迟测量，并采用“宽日志”（wide logs）或 OpenTelemetry 追踪来获得更好的可见性，尽管目前这会带来存储成本。 * **基础设施策略：** 一个反复出现的观点是，CPU 限制本身就可能是有害的，往往会导致性能下降，而非解决问题。 * **不同观点：** 虽然有人主张优化代码，但另一些人认为，通过资源分配（或仅仅是取消任意约束）才是解决延迟问题的更务实的方法。总体而言，社区认为这是一个关于误读高级指标的警示案例，并强调在现代系统架构中需要更细粒度的诊断数据。

原文

2026-05-14 · Based on a talk at Cloud Native Aachen Meetup, April 2026

A Go function in our application kept getting cancelled in production.

The function had a tight timeout. The same code ran fine in our development setups, in our CI and CD pipelines, in every integration test we had. In production it would sometimes blow past the timeout and die with context deadline exceeded.

What made it worse was the state machine library^{we used. When its context got cancelled, it wouldn’t recover on its own. It would crash and hang. We couldn’t reproduce it.}

When we talked to users, they reported the CPU utilization looked fine.

It took us weeks to find the cause.

Why average CPU isn’t enough

One of the first things you learn when you work with computers is: when it gets slow, you open the task manager (or the equivalent) and check the CPU. If it’s high, you look at what the process is, and you stop it or you work on it, and you continue.

On Linux servers you have top, htop, whatever, with a lot of other numbers on the screen. At least for me, when I was working with it, I always just looked at the average CPU. Then when you continue with your career, you provision VMs. Maybe on VMware or Hyper-V on-prem. Maybe on AWS, Azure, or Hetzner. You pick a number of vCPUs. If you look closer, it says something like “performance” or “dedicated” as a more expensive option. But at least I myself never really asked why. It was just cheaper without.

The instinct, taught to you by every tool, by every vendor, by every dashboard, fails here.

Every tool shows you average CPU utilization, but none of the tools help you interpret it. None of them tell you that CPU utilization isn’t linear to how much capacity you have. They just show you a percentage. The jump from 80% to 81% CPU utilization adds roughly 20× more wait time than the jump from 10% to 11%^{. So even with 20% “headroom” at 80% utilization, latencies have already started climbing^:}

CPU utilization	Wait for a 10 ms request
10%	~11 ms
80%	~50 ms
95%	~200 ms

M/M/1 queueing model baseline.

Low utilization: a new request waits about one slot for the current job to finish.

Higher utilization: the same request waits three slots. Reality is more complex of course (random arrivals, variable service times). The M/M/1 model captures the details.

Average CPU is the right metric for one question: are our CPUs utilized? That’s a cost question, and the IT department is right to ask it.

It only works if your workloads can wait. For latency-sensitive ones, higher utilization just means longer waits.

But in our case, our CPU utilization wasn’t even high. What we hit was a Linux kernel feature that Docker and Kubernetes use to enforce per-container limits, called the cgroup, and one of its byproducts: throttling.

There was a resource limit on the container, set to 2000m. We read it as “two CPUs.” The kernel read it as a time budget. When the budget runs out, the container is throttled until the next period begins.

None of the tools we or our customer had in front of us would show this. And that’s why it took us weeks to find the reason for the context deadline exceeded. Every graph said everything was fine. Every user said everything was fine.

It was extremely surprising.

How CFS throttling actually works

Let’s assume you’re running a service in a container that processes HTTP messages, and you’ve set resource limits because that’s what every guide recommends^.

Guaranteed QoS pod config: requests = limits = 2000m

kubectl top pod shows 800 millicores. Your Horizontal Pod Autoscaler (HPA) is configured to scale up at 80% utilization. 800m of a 2000m limit is 40 percent, far from the 80 percent target. Everything looks fine. Right?

No, let’s look closer. There are three numbers that determine what happens here:

Your resource limit: 2000m
The kernel’s CFS scheduling period: 100 ms by default.
The host CPUs: 4 cores.

A 100 ms CFS scheduling period. With a 2000m limit, the container gets 200 ms of CPU time per period.

Now, why does it matter how many CPU cores the host has? Because this is where the abstraction leaks. The container can spend that 200 ms across every CPU core available on the node.

Now imagine an HTTP service. A request could come in that is resource intensive and could burst the entire available budget across all 4 cores within 50 ms.

A 50 ms burst across 4 cores exhausts the 200 ms budget.

When a second request arrives, it is throttled and has to wait 50 ms more until the next scheduling period.

The next 50 ms of wall clock are throttled.

Now imagine this repeats. If your load pattern is burst, idle, idle, idle, burst, your p99 latency can go through the roof. Yet every CPU graph still says everything is fine.

Alternating burst-idle-idle-idle-burst pattern: average CPU looks healthy while p99 climbs.

This is what was happening to our Go function. The function ran with a tight timeout and needed to execute within that time frame. Sometimes a different goroutine had already burst through the available budget of that period. The function would starve and die with context deadline exceeded. What made it worse: one of the underlying libraries we used^{would land in a deadlock state if you cancelled the context. But every CPU graph still looked fine.}

This isn’t unique to us. Indeed Engineering published the same pattern from production in 2019:^{one application, 4.1 CPUs allocated, usage well below one core, and yet most 100 ms periods throttled.}

CPU usage stays well below the 4.1 cores allocated. Dave Chiluk, Indeed Engineering (2019)

Same workload: most 100 ms periods are throttled. Dave Chiluk, Indeed Engineering (2019)

How to spot it and react to it

We spent weeks looking in the wrong places. The metric we needed had been there all along, in /sys/fs/cgroup/cpu.stat^{. The kernel writes nr_throttled (the count of throttle events) and throttled_usec (the total microseconds frozen) for every container with a CPU limit set. Run kubectl exec <pod> -- cat /sys/fs/cgroup/cpu.stat. The output looks like this:}

usage_usec 49823715
user_usec 41205893
system_usec 8617822
nr_periods 300
nr_throttled 30
throttled_usec 6142188
nr_bursts 0
burst_usec 0

If those counters are climbing, there’s your answer.

Until the ecosystem catches up, you need to check this yourself. Kube-prometheus ships a CPUThrottlingHigh alert, but most installations have it disabled because it fires too often.^{For containers on dedicated cores, cat /sys/fs/cgroup/cpu.stat is your check. The same directory also exposes cpu.pressure^{: a kernel PSI signal that catches saturation even when you’re under quota. For vCPUs in a hypervisor, the cgroup file doesn’t show hypervisor-level throttling. You see that as steal time (%st in top)^{. Steal time is what happens when your VM’s hypervisor gives your vCPU’s slot to another tenant: your code is ready to run, but the physical CPU is busy with someone else’s.}}}

That’s the immediate workaround. The longer-term answer is application-side starvation detection. The application asks if a millisecond is still a millisecond. If not, it has been starved of CPU. The cause could be CFS throttling, hypervisor steal time on a virtual CPU, a sibling process inside the same container bursting the shared budget away, or anything else above the application. The application alerts that something is wrong. Ideally it also reacts, for example by descheduling background maintenance work to free CPU for foreground requests.

Redpanda’s reactor calls this a “reactor stall”^{, their term for the same starvation. CockroachDB built a feedback controller around Go’s /sched/latencies:seconds histogram, which measures how long goroutines wait between becoming runnable and actually running, treating p99 latency above 1 ms as the trigger to shed background work.}

Go 1.25 made GOMAXPROCS cgroup-aware by default^{. This stops a Go application from using more threads than the cgroup allows, reducing the chance of starvation. It doesn’t help when sibling processes in the same container burn the shared budget. Application-side detection remains the universal answer.}

Summary

To recap: when you set a resource limit on a Docker container, you don’t limit the number of CPUs it has access to. Instead, you give it a time budget per 100 ms scheduling period. For example, a resource-intensive HTTP request can burst through all node cores at once. And if you’re unlucky, the average CPU graph won’t see it.

What to watch instead:

cgroup throttling (nr_throttled and throttled_usec in cpu.stat)
kernel PSI (cpu.pressure) for cgroup-native saturation
hypervisor steal time (%st in top)
application-level starvation signals (in Go, /sched/latencies:seconds)

Use these together to see what the average CPU graph hides.

Now imagine running an application inside a larger organization, like a traditional enterprise. Say, a large manufacturer with an IT department, partly outsourced, responsible for setting up VMs and cloud resources. You ask for more CPUs because the application isn’t working properly, or for the resource limit to be removed because it’s causing issues. They point at the 40% CPU graph and ask why you need more. They also point at the recommended guides^{for auditors and compliance, which require setting resource limits.}

This is why we should get rid of average CPU utilization. Not the limits. Those stay because auditors require them. But we can make working with them easier by exposing more than the average CPU graph. Maybe one signal that says “this container needs more CPU.” The CPU isn’t a production machine you’re optimizing for high utilization. It’s a resource you use to keep your application running smoothly.

If you find yourself in this conversation, send them this article.