Running out of disk space in production


Original link: https://alt-romes.github.io/posts/2026-04-01-running-out-of-disk-space-on-launch.html

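## Kanjideck server launch and troubleshooting

A server was launched on a small Hetzner machine (4GB RAM, 40GB disk, NixOS) to distribute the Kanjideck files, including one large 2.2GB file. Once availability was announced, the server was immediately flooded with traffic and quickly ran out of disk space. The initial problem stemmed from the Plausible Analytics database (8.5GB) and the Nix store (15GB) taking up most of the space. Emergency measures included clearing logs and attempting to garbage-collect the Nix store, but there was too little free space left to do so. In the end the Nix store was moved to a separate volume, which resolved the immediate crisis. However, users then reported incomplete downloads of the large file. This was traced to nginx's buffering configuration; increasing `proxy_max_temp_file_size` allowed the file to be delivered. A later spike in disk usage revealed that nginx was holding on to 14.5GB of deleted temporary files. Disabling nginx buffering (`proxy_buffering off`) and setting `proxy_max_temp_file_size 0` finally stabilized the system. The server went through about 2 hours of downtime, plus a period of partial functionality, during the initial launch, which highlights the importance of careful configuration and calm troubleshooting under pressure.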

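## Running out of disk space: Hacker News summary

A recent Hacker News discussion covered disk space crises in production and how to prevent them. The core problem: unexpected disk exhaustion can take a system down even when the initial capacity looked ample. Several solutions were proposed, all stressing proactive planning. One common strategy is to create a "ballast file": a dummy file (preferably filled with random bytes to avoid sparse-file optimization) that can be deleted quickly to free critical space for system operations such as lock files. Beyond that, users stressed robustly monitoring both *inodes* and disk block usage rather than relying only on percentage-based disk space alerts. Filesystem-specific features such as ZFS reservations and Btrfs compression were also mentioned. For serving static content, the advice was to use a CDN or a reverse proxy with features like `X-Accel-Redirect` rather than writing large files directly to disk. The discussion underlined the value of experience and the need for defense in depth: combining monitoring with a readily available space reserve to avoid costly downtime.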
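
The ballast-file and inode-monitoring ideas from the discussion can be sketched in a few lines of Python. This is a minimal illustration and not something from the original post: the function names `create_ballast` and `fs_usage`, the result keys, and the chunk size are all assumptions chosen for the example.

```python
import os


def create_ballast(path: str, size_bytes: int, chunk: int = 1024 * 1024) -> None:
    """Write size_bytes of random data to path so the space is really allocated."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))  # random bytes: neither sparse nor compressible
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk


def fs_usage(path: str) -> dict:
    """Return block and inode usage (in percent) for the filesystem holding path."""
    st = os.statvfs(path)

    def pct(total: int, free: int) -> float:
        # Some filesystems report 0 total inodes; treat that as "not applicable".
        return 0.0 if total == 0 else 100.0 * (total - free) / total

    return {
        "blocks_pct": pct(st.f_blocks, st.f_bfree),
        "inodes_pct": pct(st.f_files, st.f_ffree),
    }
```

In an emergency, deleting the ballast file (for example with `os.remove`) frees its space at once, and an alert could fire when either percentage from `fs_usage` crosses a chosen threshold.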

Original article

Last night I put up a simple server which allowed customers to download the digital Kanjideck files. This server is hosted on a small Hetzner machine running NixOS, with 4GB of RAM and 40GB of disk space. One of these downloadable files weighs 2.2GB.

The matter at hand boils down to a simple Haskell program which serves static files (with some extra steps regarding authorization) plus an nginx reverse proxy which forwards requests for a certain “virtual host” to the Haskell program.

Not even minutes after I announced that the files were finally available, hundreds of customers visited my server all at once. As the logs started flying off of my screen with all the accesses, I started noticing a particularly interesting message, repeated over and over again:

Mar 31 20:43:03 mogbit kanjideck-fulfillment[2528300]: user error
(Unexpected reply to: MAIL "<...> at kanjideck.com",
Expected reply code: 250, Got this instead: 452 "4.3.1 Insufficient system storage\r\n")

Oh no. No one’s able to access their files and I’m already receiving emails about it. I messaged the users explaining the server was having some issues that I was resolving.

Grafana shows 40GB/40GB disk space used up, and df -h reports 100% usage of /dev/sda. I have to clear up space fast. At this point I’m afraid that I’m not even receiving the user complaints anymore, since my mail could be getting dropped for lack of space.

I rushed to run du -sh on everything I could, as that’s as good as I could manage. The two largest culprits were /var/lib’s Plausible Analytics, with an 8.5GB (clickhouse) database, and the /nix/store with the full server configuration, installation, and executables, at 15GB.

(In hindsight, I should have stopped right here to think carefully about what could possibly be occupying the remaining 20GB. I assumed “the rest of the files”, but looking back the “rest of the files” could hardly total 20GB.)
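
The hindsight check is simple arithmetic: subtract what du can explain from what the disk reports as used. Here is a hedged sketch using the figures quoted in this post; the helper names `tree_size_bytes` and `unaccounted` are hypothetical, and `tree_size_bytes` only approximates du by summing apparent file sizes.

```python
import os
import stat


def tree_size_bytes(root: str) -> int:
    """Sum the apparent sizes of regular files under root (similar to `du --apparent-size`)."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total


def unaccounted(used: float, known: dict) -> float:
    """Space reported as used that the known consumers do not explain."""
    return used - sum(known.values())


# Figures from the post, in GB: a full 40GB disk and the two known culprits.
gap = unaccounted(40.0, {"plausible_clickhouse": 8.5, "nix_store": 15.0})
print(f"{gap:.1f} GB unexplained")  # prints "16.5 GB unexplained"
```

A gap of this size that du cannot locate is a hint that something other than ordinary, visible files is holding the space.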

Delete everything that I can.

First off, the /nix/store may have unnecessary executables and past configurations. This should be a big win. Drop it all with

$ nix-collect-garbage -d
...
removing old generations of profile /nix/var/nix/profiles/system
error: opening lock file '/nix/var/nix/profiles/system.lock': No space left on device

journalctl --vacuum-time=1s

clickhouse-client -q "TRUNCATE TABLE system.query_log"
  Received exception from server (version 24.3.7):
  Code: 243. DB::Exception: Received from localhost:9000. DB::Exception: Cannot reserve 1.00 MiB, not enough space. (NOT_ENOUGH_SPACE)
  (query: TRUNCATE TABLE system.query_log)

Hetzner didn’t have an available cloud instance with more space for me to upgrade to.

Plan B: I could still buy more space as a separate Volume.

The /nix/store is an immutable store and I had heard of people setting up their nix stores on separate drives before. It was also the largest system component at 12GB now. A perfect candidate.

Luckily (rather, due to NixOS) everything went smoothly with this transition. Following the instructions on “Moving the store” in the NixOS Wiki worked flawlessly. The new Volume was labeled nix with mkfs.ext4 -L nix /dev/sdb and the mount migration was first done manually, but at the end of the day we have a final declarative configuration of the system:

  fileSystems."/nix" = {
    device = "/dev/disk/by-label/nix";
    fsType = "ext4";
    neededForBoot = true;
    options = [ "noatime" ];
  };

After rebooting the server, the /nix/store was living in a separate volume and the root drive finally had enough space to reply to the users.

Grafana was no longer red all over and the logs were no longer streaming error messages. The filesystem was still 50% used up and it seemed to increase up to around 60-65% when various users were downloading the large 2.2GB file. But. Working.

This morning I anxiously opened my inbox. There were a handful of complaints about the large 2.2GB file download stopping halfway through and never successfully downloading.

However, users were able to access the download page and download all the other (arguably more important) files. Not bad! Grafana was still green and filesystem usage at about 50%.

The large file bug was important to fix promptly.

Recall from the server architecture that nginx proxies to the program which serves the files.

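The incomplete downloads traced back to nginx's buffering of proxied responses, which is governed by proxy_max_temp_file_size, whose default is
```
Default: proxy_max_temp_file_size 1024m;
```

That is smaller than the 2.2GB file, and increasing it let the large download complete.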

One bug down. And the disk space issue seemed tamed, but… during the day, it briefly spiked up to 100% again! Without last night’s huge pressure I was able to investigate this more soundly.

The lsof +L1 command finds unlinked open files (see man lsof), i.e. files which no longer have any link in the file system but are still referenced by some process and thus can’t be reclaimed. Such files never show up in du -h. I was greeted by 14.5 GB of deleted files held by nginx!

[nix-shell:~]# lsof +L1 | grep nginx
  nginx     4659       nginx mem    REG    0,1   10485760     0    1187 /dev/zero
  nginx     4659       nginx mem    REG    0,1       4096     0    1188 /dev/zero
  nginx     4972       nginx mem    REG    0,1   10485760     0    1187 /dev/zero
  nginx     4972       nginx mem    REG    0,1       4096     0    1188 /dev/zero
  nginx     4972       nginx  19u   REG   8,17  137494528     0 2103873 /tmp/nginx_proxy/6/19/0000000196 (deleted)
  nginx     4972       nginx  21u   REG   8,17  596893696     0 2104973 /tmp/nginx_proxy/1/17/0000000171 (deleted)
  nginx     4972       nginx  24u   REG   8,17  298344448     0 2105098 /tmp/nginx_proxy/3/17/0000000173 (deleted)
  nginx     4972       nginx  25u   REG   8,17 1785765888     0 2105000 /tmp/nginx_proxy/2/17/0000000172 (deleted)
  nginx     4972       nginx  29u   REG   8,17  894984192     0 2105100 /tmp/nginx_proxy/4/17/0000000174 (deleted)
  nginx     4972       nginx  31u   REG   8,17 1489068032     0 2101531 /tmp/nginx_proxy/5/17/0000000175 (deleted)
  nginx     4972       nginx  35u   REG   8,17  745529344     0 2103341 /tmp/nginx_proxy/7/17/0000000177 (deleted)
  nginx     4972       nginx  37u   REG   8,17  965054464     0 2105961 /tmp/nginx_proxy/2/21/0000000212 (deleted)
  nginx     4972       nginx  41u   REG   8,17 1340678144     0 2103603 /tmp/nginx_proxy/0/18/0000000180 (deleted)
  nginx     4972       nginx  43u   REG   8,17 1633722368     0 2101619 /tmp/nginx_proxy/3/18/0000000183 (deleted)
  nginx     4972       nginx  44u   REG   8,17 1927659520     0 2103872 /tmp/nginx_proxy/5/19/0000000195 (deleted)
  nginx     4972       nginx  52u   REG   8,17  795958957     0 2103671 /tmp/nginx_proxy/2/19/0000000192 (deleted)
  nginx     4972       nginx  53u   REG   8,17 1591967405     0 2103702 /tmp/nginx_proxy/3/19/0000000193 (deleted)
  nginx     4972       nginx  58u   REG   8,17 2387853312     0 2103313 /tmp/nginx_proxy/1/21/0000000211 (deleted)

[nix-shell:~]# lsof +L1 | awk '/nginx/ {sum += $7} END {print sum/1024/1024/1024 " GiB"}'
  14.5528 GiB
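
Note that the awk one-liner adds up column 7 of every nginx line, so it also counts the /dev/zero mappings and any file listed under more than one PID. A slightly stricter tally can be sketched in Python. It counts only entries marked "(deleted)" and deduplicates by device and inode. The function name `deleted_file_bytes` is hypothetical, and the column layout is assumed to match the lsof +L1 listing above.

```python
def deleted_file_bytes(lsof_output: str, command: str) -> int:
    """Sum SIZE/OFF for unique deleted files held open by the given command."""
    seen = set()
    total = 0
    for line in lsof_output.splitlines():
        fields = line.split()
        # Expected columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME...
        if len(fields) < 10 or fields[0] != command:
            continue
        if fields[-1] != "(deleted)":
            continue  # skip live files and memory mappings such as /dev/zero
        device, size, node = fields[5], fields[6], fields[8]
        if not size.isdigit() or (device, node) in seen:
            continue  # same inode reported again under another fd or PID
        seen.add((device, node))
        total += int(size)
    return total


# Four lines copied from the listing above.
SAMPLE = """\
nginx     4659       nginx mem    REG    0,1   10485760     0    1187 /dev/zero
nginx     4972       nginx mem    REG    0,1   10485760     0    1187 /dev/zero
nginx     4972       nginx  19u   REG   8,17  137494528     0 2103873 /tmp/nginx_proxy/6/19/0000000196 (deleted)
nginx     4972       nginx  21u   REG   8,17  596893696     0 2104973 /tmp/nginx_proxy/1/17/0000000171 (deleted)
"""

print(deleted_file_bytes(SAMPLE, "nginx"), "bytes")  # prints "734388224 bytes"
```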

proxy_max_temp_file_size? Let’s read the documentation more carefully this time:

When buffering of responses from the proxied server is enabled, […], a part of the response can be saved to a temporary file.

The zero value disables buffering of responses to temporary files.

Nginx is buffering the 2.2GB file my program is serving to temporary files. Oh dear. Let’s fix that:
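
A back-of-the-envelope estimate shows why this fills a small disk. The sketch assumes each concurrent download can spill up to min(response size, proxy_max_temp_file_size) to a temporary file, an upper bound that applies when clients read more slowly than the upstream writes. The function name is hypothetical, 14 is simply the number of deleted temp files in the lsof listing, the 4096m cap is an invented value for illustration, and "2.2GB" is assumed to mean decimal gigabytes.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def worst_case_temp_bytes(concurrent: int, response_bytes: int, max_temp_file_bytes: int) -> int:
    """Upper bound on disk used by proxy temp files across concurrent downloads."""
    per_request = min(response_bytes, max_temp_file_bytes)
    return concurrent * per_request


FILE_BYTES = 2_200_000_000  # assumption: "2.2GB" read as decimal gigabytes

caps = [
    ("default 1024m", 1024 * MIB),
    ("hypothetical 4096m", 4096 * MIB),
    ("0 (disabled)", 0),
]
for label, cap in caps:
    total = worst_case_temp_bytes(14, FILE_BYTES, cap)
    print(f"{label:>18}: {total / GIB:.1f} GiB")
```

Under these assumptions, 14 simultaneous downloads can pin 14 GiB at the default cap, roughly twice that once the cap exceeds the file size, and nothing once the cap is zero.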

    "<...>.kanjideck.com" = base {
      "/" = {
        proxyPass = "http://127.0.0.1:" + toString(ports.kanjideck-fulfillment) + "/";
        extraConfig = ''
          proxy_buffering off;
          proxy_max_temp_file_size 0;
        '';
      };
    };

Grafana immediately cheered up, the server was finally fresh and lean and disk usage dropped to 20% with no more spikes.

The disk usage graph shows the sudden drop to acceptable levels, where it has stayed since.

First, the server couldn’t serve access requests from 20:40 to sometime around 23:00, i.e. for about the first 2 hours immediately after launch.
Secondly, users couldn’t download the large file even though the remaining ones were available.
Both of these bugs turned out to be misconfigurations in the nginx reverse proxy.
It’s difficult to reason under pressure. Experience, which I didn’t have here, would have helped.

Note: this was written fully by me, human.
