Zml-smi: universal monitoring tool for GPUs, TPUs and NPUs

原始链接: https://zml.ai/posts/zml-smi/

## zml-smi: universal hardware monitoring

zml-smi is a comprehensive diagnostic and monitoring tool for GPUs, TPUs and NPUs, and a versatile alternative to tools like nvidia-smi and nvtop. It offers real-time insight into the hardware performance and health of NVIDIA, AMD, Google TPU and AWS Trainium devices, with support for more platforms planned as ZML expands.

Key features include: device utilization, temperature and memory usage via the `--top` flag; host-level metrics such as CPU utilization and memory; and details on the processes using each device, including their resource usage. Designed for portability, zml-smi requires only the device driver and GLIBC, and runs in a fully sandboxed fashion. It leverages existing libraries (NVML for NVIDIA, AMD SMI for AMD) and APIs (gRPC for TPU, libnrt for Trainium) to collect detailed metrics, mirroring the data from tools like tpu-info and neuron-top, and can even keep its identification of AMD GPUs up to date through downloaded ID files.

## Zml-smi: a new monitoring tool for GPUs, TPUs and NPUs

A new tool named **zml-smi** (zml.ai) aims to provide universal monitoring for GPUs, TPUs and NPUs. It was announced on Hacker News, where discussion centered on its features and potential drawbacks. While the tool was praised, commenters pointed out that existing tools such as **nvtop** and **all-smi** already offer overlapping functionality, including TPU and broader hardware support. zml-smi's key differentiator is its focus on **sandboxing**, a feature the developer argues cannot be merged cleanly into existing projects.

Users reported mixed results when testing NPU support, especially on Ryzen AI processors; the developer acknowledged this and plans to investigate. Notably, the creator confirmed that **Prometheus-format** output is available and expressed willingness to add **CPU usage** monitoring. Some raised concerns that integrating vendor-specific interception techniques into a larger project such as nvtop could hurt maintainability.

Original article

zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs and NPUs. It provides real-time insights into the performance and health of your hardware.

It is a cross between nvidia-smi and nvtop.

It transparently supports every platform ZML supports: NVIDIA, AMD, Google TPU and AWS Trainium devices. More platforms will follow as ZML continues to expand its hardware support.

Getting started

You can download zml-smi from the official mirror.

$ curl -LO 'https://mirror.zml.ai/zml-smi/zml-smi-v0.2.tar.zst'
$ tar -xf zml-smi-v0.2.tar.zst
$ ./zml-smi/zml-smi

Listing devices

$ zml-smi

Monitoring devices

The --top flag provides real-time monitoring of device performance, including utilization, temperature, and memory usage.

$ zml-smi --top

Completely sandboxed

zml-smi requires no software on the target machine besides the device driver and GLIBC (needed mostly because some vendor shared objects are loaded).

Host

zml-smi displays host-level metrics such as CPU model and utilization, memory usage, and temperature.

Available metrics

Hostname, Kernel, CPU Model, CPU Core Count, Memory Used / Total, Uptime, Load Average (1m / 5m / 15m), Device Count
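On Linux, host metrics like these can be read straight from procfs. A minimal sketch in C of reading the load averages (an illustration, not zml-smi's actual implementation):

```c
#include <stdio.h>

/* Read the 1m / 5m / 15m load averages from /proc/loadavg.
 * Linux-specific; a sketch, not zml-smi's actual code. */
int read_load_avg(double load[3]) {
    FILE *f = fopen("/proc/loadavg", "r");
    if (!f)
        return -1;
    int n = fscanf(f, "%lf %lf %lf", &load[0], &load[1], &load[2]);
    fclose(f);
    return n == 3 ? 0 : -1;
}
```

The other fields follow the same pattern: memory comes from /proc/meminfo, uptime from /proc/uptime, and the CPU model from /proc/cpuinfo.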

Processes

zml-smi also provides insights into the processes utilizing the devices, including their resource usage and command lines. This is available for all platforms.

Available metrics

PID, Device Index, Device Utilization, Device Memory, Process Command Line

NVIDIA

Metrics are gathered through the NVML library, which ships with the driver and is therefore expected to be present on the system.

Available metrics

GPU Utilization, Temperature, Fan Speed, Power Draw, Power Limit, Encoder Utilization, Decoder Utilization, VRAM Used, VRAM Total, Memory Bus Width, Graphics Clock, SM Clock, Memory Clock, Max Graphics Clock, Max Memory Clock, PCIe Link Generation, PCIe Link Width, PCIe TX Throughput, PCIe RX Throughput

AMD

Metrics are provided through the AMD SMI library. zml-smi ships with it in its sandbox.

To support the latest AMD GPUs, zml-smi downloads the amdgpu.ids file at build time from both Mesa and ROCm (7.2.1 at the time of writing) and merges them. This lets zml-smi recognize and report on the newest AMD GPU models even before they appear in an official ROCm release, as is the case for the Ryzen AI Max+ 395 (Strix Halo), for instance.
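Conceptually, the merge is a first-wins union of the two ID tables, keyed by device and revision ID. A simplified C sketch (the struct layout and merge policy here are illustrative; the real amdgpu.ids is a tab-separated text file, and the actual build step may differ):

```c
#include <string.h>

/* One ID-table entry; purely illustrative, not the real file format. */
typedef struct {
    const char *key;   /* "device_id:revision_id" */
    const char *name;  /* marketing name */
} IdEntry;

/* First-wins merge: take every entry from the primary table, then add
 * entries from the secondary table whose key is not already present.
 * Returns the number of entries written to out. */
size_t merge_ids(const IdEntry *a, size_t na,
                 const IdEntry *b, size_t nb,
                 IdEntry *out, size_t cap) {
    size_t n = 0;
    for (size_t i = 0; i < na && n < cap; i++)
        out[n++] = a[i];
    for (size_t j = 0; j < nb && n < cap; j++) {
        int dup = 0;
        for (size_t i = 0; i < na; i++)
            if (strcmp(b[j].key, a[i].key) == 0) { dup = 1; break; }
        if (!dup)
            out[n++] = b[j];
    }
    return n;
}
```

A device listed only in the Mesa table (say, a just-released APU) survives the merge, while entries present in both sources appear once.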

Sandboxing that file turned out to be somewhat tricky. Because libdrm-amdgpu expects to find it at /opt/amdgpu/share/libdrm/amdgpu.ids, we had to get a bit creative. We didn’t want to install anything outside the binary sandbox. Nor did we want to patch that string inside libdrm.

So we created a shared object named zmlxrocm.so that is added to the DT_NEEDED section of libdrm_amdgpu.so.1. Then, fopen64 is renamed to zmlxrocm_fopen64, which is then provided by zmlxrocm.so. Since we now sit between libdrm and fopen64, we can intercept the call to fopen64, compare the path against /opt/amdgpu/share/libdrm/amdgpu.ids and redirect it to the sandboxed copy of the file.
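The redirect logic can be sketched as follows. The sandbox path and the helper split are illustrative, not zml-smi's actual layout; per the description above, references to fopen64 inside libdrm_amdgpu.so.1 are renamed to zmlxrocm_fopen64, so a function like this is what libdrm ends up calling:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

/* Path hardcoded by libdrm-amdgpu, and the sandboxed copy we redirect
 * to (the sandbox path here is hypothetical). */
#define LIBDRM_IDS_PATH  "/opt/amdgpu/share/libdrm/amdgpu.ids"
#define SANDBOX_IDS_PATH "./sandbox/share/libdrm/amdgpu.ids"

/* Rewrite the one path we care about; pass everything else through. */
const char *zmlxrocm_redirect(const char *path) {
    return strcmp(path, LIBDRM_IDS_PATH) == 0 ? SANDBOX_IDS_PATH : path;
}

/* Stand-in for fopen64 inside libdrm_amdgpu.so.1: redirect, then
 * forward to the real C-library fopen64. */
FILE *zmlxrocm_fopen64(const char *path, const char *mode) {
    return fopen64(zmlxrocm_redirect(path), mode);
}
```

Compiled as zmlxrocm.so and listed in libdrm's DT_NEEDED, the shim sits transparently between libdrm and the C library without patching any strings or installing files outside the sandbox.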

Available metrics

GPU Utilization, Memory Usage, Temperature, Fan Speed, Power Draw, Power Limit, VRAM Used, VRAM Total, Graphics Clock, SoC Clock, Memory Clock, Max Graphics Clock, Max Memory Clock, PCIe Bandwidth, PCIe Link Generation, PCIe Link Width

TPU

Metrics are provided via the local gRPC endpoint exposed by the TPU runtime. These are the same metrics consumed by Google's tpu-info tool.

Available metrics

TensorCore Duty Cycle, HBM Used, HBM Total

AWS Trainium

Metrics are provided through a private API found in libnrt.so, which zml-smi embeds in its sandbox. Those are the same metrics provided by the neuron-top utility.

Available metrics

Core Utilization, HBM Used, HBM Total, Tensor Memory, Constant Memory, Model Code, Shared Scratchpad, Nonshared Scratchpad, Runtime Memory, Driver Memory, DMA Rings, Collectives, Notifications
