Running out of disk space in production

Original link: https://alt-romes.github.io/posts/2026-04-01-running-out-of-disk-space-on-launch.html

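## Kanjideck server launch and troubleshooting

A server was launched on a small Hetzner machine (4GB RAM, 40GB disk, NixOS) to distribute the Kanjideck files, including one large 2.2GB file. As soon as availability was announced, the server was flooded with traffic and quickly ran out of disk space. The initial problem was that the Plausible Analytics database (8.5GB) and the Nix store (15GB) took up most of the disk. Emergency measures included clearing logs and attempting to clean up the Nix store, but there was too little space to work with. Eventually the Nix store was moved to a separate volume, which resolved the immediate crisis.

Users then reported that downloads of the large file were incomplete. This was traced to Nginx's buffering configuration; increasing `proxy_max_temp_file_size` allowed the file to be delivered in full. A later spike in disk usage revealed that Nginx was holding on to 14.5GB of deleted temporary files. Disabling Nginx buffering (`proxy_buffering off`) and setting `proxy_max_temp_file_size 0` finally stabilised the system. The server went through roughly 2 hours of downtime and partial functionality during the initial launch, which highlights the importance of careful configuration and of calm troubleshooting under pressure.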

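## Running out of disk space: Hacker News summary

A recent Hacker News discussion covered disk-space crises in production and how to prevent them. The core problem: unexpected disk exhaustion can bring a system down even when the initial space looked ample.

Several solutions were proposed, all stressing proactive planning. A common strategy is to create a "ballast file": a dummy file (preferably filled with random bytes, to avoid sparse-file optimisations) that can be deleted quickly to free the space critical system operations, such as creating lock files, depend on.

Beyond that, users stressed robust monitoring of *inode* and disk-block usage rather than relying only on percentage-based disk-space alerts. Filesystem-specific features such as ZFS reservations and Btrfs compression were also mentioned. For serving static content, commenters recommended a CDN, or a reverse proxy with features such as `X-Accel-Redirect`, rather than writing large files directly to disk.

The discussion underlined the value of experience and the need for defence in depth: combining monitoring with a readily available space reserve to avoid costly downtime.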

Original text

Last night I put up a simple server which allowed customers to download the digital Kanjideck files. This server is hosted on a small Hetzner machine running NixOS, with 4GB of RAM and 40GB of disk space. One of these downloadable files weighs 2.2GB.

The matter at hand boils down to a simple Haskell program which serves static files (with some extra steps regarding authorization) plus an nginx reverse proxy which proxies requests to a certain “virtual host” to the Haskell program.

Fig 1. Simplified server architecture

Not even minutes after I announced that the files were finally available, hundreds of customers visited my server all at once. As the logs started flying off of my screen with all the accesses, I started noticing a particularly interesting message, repeated over and over again:

Hetzner didn’t have an available cloud instance with more space for me to upgrade to.

Plan B: I could still buy more space as a separate Volume.

The /nix/store is an immutable store and I had heard of people setting up their nix stores on separate drives before. It was also the largest system component at 12GB now. A perfect candidate.

Luckily (rather, due to NixOS) everything went smoothly with this transition. Following the instructions on “Moving the store” in the NixOS Wiki worked flawlessly. The new Volume was labeled `nix` with `mkfs.ext4 -L nix /dev/sdb`, and the mount migration was first done manually, but at the end of the day we have a final declarative configuration of the system:

proxy_max_temp_file_size? Let’s read the documentation more carefully this time:

> When buffering of responses from the proxied server is enabled, […], a part of the response can be saved to a temporary file.
>
> The zero value disables buffering of responses to temporary files.

Nginx is buffering the 2.2GB file my program is serving to temporary files. Oh dear. Let’s fix that:
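
A minimal sketch of what such a fix can look like, not the exact configuration used on this server. The `server_name` and the upstream address `127.0.0.1:8080` are hypothetical placeholders for the virtual host and the Haskell program; only the two directives `proxy_buffering` and `proxy_max_temp_file_size` come from the post itself:

```nginx
# Hypothetical virtual host; server_name and the upstream address are assumptions.
server {
    listen 80;
    server_name files.example.com;

    location / {
        # Forward requests to the static file server (assumed to listen locally).
        proxy_pass http://127.0.0.1:8080;

        # Pass the response to the client as it is received from the upstream,
        # instead of buffering it inside nginx first.
        proxy_buffering off;

        # Never write proxied responses to temporary files on disk.
        proxy_max_temp_file_size 0;
    }
}
```

With buffering off, nginx no longer needs temporary files for these responses at all, so the second directive is mostly a safeguard in case buffering is re-enabled later.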