How We Found 7 TiB of Memory Just Sitting Around

Original link: https://render.com/blog/how-we-found-7-tib-of-memory-just-sitting-around

## Kubernetes at Scale: A Memory Optimization Journey

Debugging large-scale Kubernetes infrastructure is rarely a single breakthrough; it is usually a series of incremental improvements. This article details one such journey, focusing on unexpectedly high memory usage caused by DaemonSets (specifically Calico and Vector) watching the namespace list in a cluster with a very large number of namespaces.

The team found that both Calico and Vector continuously tracked namespaces, consuming significant memory and hurting apiserver performance and overall cluster efficiency. Optimizing Calico brought initial gains, but Vector also showed high memory usage because its log events referenced namespace labels.

By questioning whether the namespace data was necessary at all, they managed to configure Vector to run without it, initially saving 50% of its memory. Further investigation, prompted by a colleague's observation, uncovered a configuration oversight, a duplicated Kubernetes log source, which yielded another significant reduction.

The end result was a remarkable 7 TiB of memory saved across their infrastructure, improved rollout stability, and a system that is easier to manage. This highlights the importance of persistent investigation, collaborative problem-solving, and questioning assumptions when tackling Kubernetes scaling challenges.

## Hacker News Discussion: Finding Wasted Memory in Render's Clusters

A recent Hacker News discussion centered on the Render.com blog post detailing the 7 TiB of unexpectedly consumed memory found in their Kubernetes clusters. The root cause was traced to excessive memory use in their logging pipeline (Vector), specifically related to tracking Kubernetes namespaces.

Users debated whether this kind of resource allocation is justified. Some noted that similar overprovisioning is common in cloud environments and is often cheaper than hiring dedicated staff or optimizing complex systems. Others pointed to a lack of diligent DevOps practice, citing examples of unnecessary resources left running through neglect, such as multiple idle database servers.

The author confirmed the problem initially went unnoticed because resource growth was gradual and their infrastructure is large. The conversation underscored the value of performance profiling, stubborn investigation, and maintaining high expectations for performance. Many commenters stressed that modern computers are extremely fast and that software inefficiency should be addressed aggressively. Solutions discussed included optimizing Kubernetes namespace tracking and improving profiling tools to identify architectural bottlenecks.

Original article

Debugging infrastructure at scale is rarely about one big aha moment. It’s often the result of many small questions, small changes, and small wins stacked up until something clicks.

Getting ready to dissect what I like to call: the Kubernetes hypercube of bad vibes.
Credits: Hyperkube from gregegan.net, diagram (modified) from Kubernetes community repo

Plenty of teams run Kubernetes clusters bigger than ours. More nodes, more pods, more ingresses, you name it. In most dimensions, someone out there has us beat.

There's one dimension where I suspect we might be near the very top: namespaces. I say that because we keep running into odd behavior in any process that has to keep track of them. In particular, anything that listwatches them ends up using a surprising amount of memory and puts real pressure on the apiserver. This has become one of those scaling quirks you only really notice once you hit a certain threshold. As this memory overhead adds up, efficiency decreases: each byte we have to use for management is a byte we can't put towards user services.

The problem gets significantly worse when a daemonset needs to listwatch namespaces or network policies (netpols, which we define per namespace). Since daemonsets run a pod on every node, each of those pods independently performs a listwatch on the same resources. As a result, memory usage increases with the number of nodes.
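The multiplication effect described above is easy to underestimate, so here is a back-of-envelope sketch. All of the numbers below are hypothetical, chosen only for illustration; the post does not give exact per-object costs or node counts.

```python
# Back-of-envelope: why per-node listwatches of namespaces get expensive.
# All numbers are hypothetical; the post does not give exact figures.

NODES = 1_000              # daemonset pods, one per node
NAMESPACES = 300_000       # order of magnitude of the staging cluster described later
BYTES_PER_OBJECT = 2_048   # assumed in-memory cost of one cached namespace object

per_pod = NAMESPACES * BYTES_PER_OBJECT   # each daemonset pod caches every namespace
fleet_total = NODES * per_pod             # the same cache duplicated on every node

print(f"per pod:    {per_pod / 2**30:.2f} GiB")    # → per pod:    0.57 GiB
print(f"fleet-wide: {fleet_total / 2**40:.2f} TiB")  # → fleet-wide: 0.56 TiB
```

Even with these modest assumptions, a per-pod cache that looks harmless in isolation turns into a fleet-wide cost that scales linearly with node count.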

Even worse, these listwatch calls can put significant load on the apiserver. If many daemonset pods restart at once, such as during a rollout, they can overwhelm the server with requests and cause real disruption.

A few months ago, if you looked at our nodes, the largest memory consumers were often daemonsets: in particular, Calico and Vector, which handle network configuration and log collection respectively.

We had already done some work to reduce Calico’s memory usage, working closely with the project’s maintainers to make it scale more efficiently. That optimization effort was a big win for us, and it gave us useful insight into how memory behaves when namespaces scale up.

Memory profiling results

Time-series graph of memory usage per pod for calico-node instances

To support that work, we set up a staging cluster with several hundred thousand namespaces. We knew that per-namespace network policies (netpols) were the scaling factor that stressed Calico, so we reproduced those conditions to validate our changes.

While running those tests, we noticed something strange. Vector, another daemonset, also started consuming large amounts of memory.

Memory usage per pod graph showing Vector pods

The pattern looked familiar, and we knew we had another problem to dig into. Vector obviously wasn't looking at netpols, but after poking around a bit we found it was listwatching namespaces from every node in order to support referencing namespace labels per pod in the kubernetes logs source.
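To make the mechanism concrete: Vector's Kubernetes source can decorate each log event with metadata about the pod's namespace, including that namespace's labels. The shape below is illustrative only, not Vector's exact event schema, and the label names are invented for the example:

```python
# Illustrative event shape only -- not Vector's exact schema; label names invented.
event = {
    "message": "GET /healthz 200",
    "kubernetes": {
        "pod_name": "web-5d9f7c",
        "pod_namespace": "user-app-1234",
        # Attaching these per-event labels is what required every Vector pod
        # to listwatch all Namespace objects from the apiserver.
        "namespace_labels": {"team": "user", "tier": "free"},
    },
}

# The kind of downstream check the labels enable (hypothetical):
is_user_ns = event["kubernetes"]["namespace_labels"].get("team") == "user"
print(is_user_ns)
```

The key point is that the labels live on the Namespace object, not the Pod, so enriching events with them requires a second, namespace-wide watch in every agent.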

That gave us an idea: what if Vector didn’t need to use namespaces at all? Was that even possible?

As it turns out, yes: namespace labels were in use in our configuration, but only to check whether a pod belonged to a user namespace.
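For readers who want to try something similar, one plausible knob is the namespace-label enrichment option on Vector's `kubernetes_logs` source. The fragment below is a sketch, not the configuration from the post; verify the exact option names against your Vector version's documentation before relying on them.

```toml
# Hypothetical sketch -- check option names against your Vector version's docs.
[sources.k8s_logs]
type = "kubernetes_logs"

# Setting the namespace-labels field to "" asks Vector not to attach
# namespace labels to events; in recent Vector versions this can also
# let the agent avoid watching Namespace objects entirely.
namespace_annotation_fields.namespace_labels = ""
```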