Runc breaks pods when CPU requests aren't multiples of 10

Original link: https://github.com/opencontainers/runc/issues/4982

## Intermittent pod creation failures with a CPU limit of 4096m

In Kubernetes with the systemd cgroup driver, pods with a CPU limit of 4096m intermittently fail to create. The root cause is that containerd's millicore-to-microsecond conversion is non-deterministic (4096m sometimes becomes 409600µs, sometimes 410000µs), while runc consistently computes 410000µs. The mismatch violates a cgroup v1 constraint (a child's quota cannot exceed its parent's), so the kernel rejects the write and the pod fails. The problem appears node-specific because a failed pod leaves the parent cgroup stuck at 409600µs, causing subsequent pod creation attempts on that node to fail as well. The issue surfaced after CPU limits were lowered from 8192m to 4096m and affects fractional-core values. It is not caused by stale cgroups; it is containerd's inconsistent initial calculation during pod sandbox creation. Affected versions: containerd 1.7.27, runc 1.3.2, Kubernetes 1.30.14-eks. The fix requires deterministic, consistent millicore-to-microsecond conversion logic in containerd, aligned with runc.

A recent Hacker News discussion centered on a bug report filed against the container runtime `runc` concerning CPU request handling. The core claim: if a CPU request is not a multiple of 10, `runc` breaks the pod. The report itself sparked debate, however, because it was entirely LLM-generated, drawing criticism for its verbose, unfocused style, ironic given LLMs' summarization abilities. Many commenters expressed frustration with the growing trend of AI-generated content in open-source projects, arguing that it wastes maintainers' time and that "AI slop" is hard to parse. The issue turned out to have already been fixed by a pull request. Further digging suggested the report may have exposed sensitive customer information (an insurance company using AWS) tied to an Accenture AI infrastructure project, raising concerns about careless data handling. The discussion highlighted broader worries about the direction of open-source contributions and the impact of LLMs on project quality and security.

Original post

Description

When using the systemd cgroup driver with a CPU limit of 4096m, pod creation fails intermittently because containerd non-deterministically calculates either 409600 or 410000 microseconds for the parent cgroup, while runc consistently calculates 410000 for child cgroups. When they mismatch, the Linux kernel rejects the child cgroup creation with "invalid argument".

Root Cause

Investigation reveals non-deterministic behavior in containerd when converting 4096m to microseconds:

  1. Containerd (when creating pod sandbox) - INCONSISTENT:

    • Sometimes calculates: 4096m → 409600 microseconds (correct: 4096 / 1000 * 100000)
    • Sometimes calculates: 4096m → 410000 microseconds (rounded: 4.1 * 100000)
    • Sets parent cgroup: cpu.cfs_quota_us to whichever value it calculated
  2. runc (when creating application container) - CONSISTENT:

    • Always calculates: 4096m → 410000 microseconds (appears to round 4.096 to 4.1)
    • Tries to set child cgroup: cpu.cfs_quota_us = 410000
  3. Result:

    • When containerd picks 410000: Parent = 410000, child = 410000 → Success!
    • When containerd picks 409600: Parent = 409600, child = 410000 → Kernel rejects! (child > parent)
    • In cgroup v1, child quotas cannot exceed parent quotas

Why It Appears Node-Specific

The issue seems to only affect "previously used nodes" because:

  • When containerd picks 409600 and the pod fails, the parent cgroup gets stuck
  • The pause container remains alive with the 409600 parent cgroup
  • All subsequent attempts to create the pod on that node fail (child 410000 > parent 409600)
  • Fresh nodes might get lucky and containerd picks 410000 → works fine
  • But those nodes would fail too if containerd had picked 409600 on first attempt

This is not about stale cgroups from old pods - it's about which value containerd randomly picks during pod sandbox creation.

Error Message

failed to create containerd task: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: error during container init:
error setting cgroup config for procHooks process: failed to write "410000":
write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc65bd648_3faf_4778_90e4_a21afb2a6ad0.slice/cri-containerd-149d004f6e52b5665c6209d1f33a7e516049b79456444e3f74af49e62c5c80c8.scope/cpu.cfs_quota_us:
invalid argument: unknown

Evidence from Investigation

Failing node (containerd picked 409600):

# Parent cgroup quota - containerd calculated 409600
$ cat /sys/fs/cgroup/cpu,cpuacct/.../kubepods-burstable-podc65bd648...slice/cpu.cfs_quota_us
409600

# Pod sandbox metadata confirms
$ crictl inspectp b4420139f34f8
"cpu_quota": 409600

# Containerd logs show runc trying to write 410000
$ journalctl -u containerd | grep "410000"
failed to write "410000": write .../cpu.cfs_quota_us: invalid argument

Working node (containerd picked 410000):

# Parent cgroup quota - containerd calculated 410000!
$ cat /sys/fs/cgroup/cpu,cpuacct/.../kubepods-burstable-pod7f7424a2...slice/cpu.cfs_quota_us
410000

# Application container child cgroup also 410000
$ cat /sys/fs/cgroup/cpu,cpuacct/.../cri-containerd-98745b3c2c216...scope/cpu.cfs_quota_us
410000

# They match - no error!

Both nodes running:

  • Same containerd version: 1.7.27
  • Same runc version: 1.3.2
  • Same Kubernetes version: 1.30.14-eks-113cf36
  • Same pod spec with CPU limit: 4096m

Additional Context

  • This issue started occurring after changing CPU limits from 8192m → 4096m
  • The problem is specific to CPU values resulting in fractional cores (4.096)
  • The 400 microsecond difference (410000 - 409600) violates cgroup v1's parent-child quota constraint
  • Critical finding: Same containerd version behaves differently - this is non-deterministic
  • Calculation theory:
    • Correct: 4096 / 1000 * 100000 = 409600
    • Rounded: 4.1 * 100000 = 410000 (rounding 4.096 to 4.1)

Questions for Maintainers

  1. Where in containerd's codebase does the millicore → microsecond conversion happen for pod sandbox creation?
  2. Why would containerd calculate two different values (409600 vs 410000) for the same input (4096m)?
  3. Is there a race condition or different code path that causes this non-determinism?
  4. Should containerd and runc be using shared conversion logic to ensure consistency?

Related Issues

This appears similar to but distinct from:

However, this is a new issue involving non-deterministic behavior in containerd 1.7.27 when calculating CPU quotas for fractional core values with systemd cgroup driver.

Steps to reproduce the issue

  1. Deploy a Kubernetes pod with CPU limit 4096m multiple times on different fresh nodes

    • Observe: Some pods succeed, some fail (non-deterministic)
    • Successful pods: containerd calculated parent cgroup cpu.cfs_quota_us = 410000
    • Failed pods: containerd calculated parent cgroup cpu.cfs_quota_us = 409600
    • runc always tries to write 410000 for child cgroup
  2. On nodes where containerd picked 409600:

    • runc attempts to create application container
    • runc tries to write 410000 to child cgroup's cpu.cfs_quota_us
    • Kernel rejects: child quota (410000) > parent quota (409600)
    • Container creation fails with "invalid argument" error
    • Pod enters CrashLoopBackOff
    • Pause container remains alive with parent cgroup stuck at 409600
  3. All subsequent restart attempts on that node continue to fail

    • Containerd reuses the existing pod sandbox
    • Parent cgroup still has 409600
    • runc still tries 410000
    • Pattern repeats indefinitely
  4. Evicting the pod and forcing it to a different node:

    • May work if containerd picks 410000 on the new node
    • Will fail if containerd picks 409600 on the new node
    • Outcome is non-deterministic

Example pod spec that reproduces the issue:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test-container
    image: nginx:latest
    resources:
      limits:
        cpu: "4096m"
        memory: "16Gi"
      requests:
        cpu: "1024m"
        memory: "8Gi"

How to Verify Which Value Containerd Picked

On a node where the pod was deployed:

# Get pod UID
kubectl get pod <pod-name> -o jsonpath='{.metadata.uid}'

# Check parent cgroup on the node
cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID_with_underscores>.slice/cpu.cfs_quota_us

# 409600 = pod will fail
# 410000 = pod will succeed

Critical Note: The issue is not about "previously used nodes" - it's about which value containerd randomly calculates during initial pod sandbox creation. The appearance of being node-specific is because once a node gets stuck with 409600, it stays stuck.

Describe the results you received and expected

Expected behavior:

Containerd and runc should use consistent, deterministic calculations when converting millicores to microseconds for CPU quotas.

For a CPU limit of 4096m:

  • Both containerd and runc should calculate: 4096 / 1000 * 100000 = 409600 microseconds
  • OR both should calculate: 4.1 * 100000 = 410000 microseconds
  • They must agree - parent and child cgroups must have compatible values
  • Container should create successfully every time, regardless of node
  • Behavior should be deterministic, not random

Actual behavior:

  • Containerd: Non-deterministically calculates either 409600 or 410000 for the same input
    • Sometimes: 409600 microseconds (mathematically correct)
    • Sometimes: 410000 microseconds (rounded)
    • No obvious pattern - same version, same config, different results
  • runc: Consistently calculates 410000 microseconds (always rounds 4.096 to 4.1)
  • When they mismatch (containerd=409600, runc=410000):
    • Child cgroup creation fails with kernel error: "invalid argument"
    • Pod enters CrashLoopBackOff with 199+ restart attempts
    • Parent cgroup gets stuck with 409600, preventing all future attempts
    • Requires manual node cordoning and pod eviction
  • When they match (containerd=410000, runc=410000): container creation succeeds

Impact:

  • Non-deterministic pod scheduling - same pod spec may work or fail randomly
  • Cannot reliably deploy pods with CPU limit 4096m (or other fractional core values)
  • Once a node "loses the lottery" and gets 409600, it's permanently broken for that pod
  • Requires operational workarounds (cordon/drain/evict)
  • Production impact on Amazon EKS clusters

Root Issue:

This is fundamentally a consistency bug - containerd and runc must use the same conversion logic, and that logic must be deterministic.

What version of runc are you using?

runc version 1.3.2
commit: aeabe4e711d903ef0ea86a4155da0f9e00eabd29
spec: 1.2.1
go: go1.24.9
libseccomp: 2.5.2

Additional environment details:

  • containerd version: 1.7.27 (commit: 05044ec0a9a75232cad458027ca83437aae3f4da)
  • Kubernetes version: 1.30.14-eks-113cf36 (Amazon EKS)
  • Cgroup version: v1
  • Cgroup driver: systemd (SystemdCgroup = true in containerd config at /etc/containerd/config.toml)

Host OS information

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2026-06-30"

Platform: Amazon EKS (Elastic Kubernetes Service) managed node

Host kernel information

Linux ip-10-7-66-184.prod-eks.newfront.com 5.10.245-241.976.amzn2.x86_64 #1 SMP Tue Oct 21 22:09:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Kernel version: 5.10.245-241.976.amzn2.x86_64
