Monitoring my Minecraft server with OpenTelemetry and Prometheus

Original link: https://www.dash0.com/blog/monitoring-minecraft-with-opentelemetry

The author set up a vanilla Minecraft server for their kids, prioritizing reliability and monitoring. They chose to run it as a Systemd unit on a Linux VM in the cloud. The monitoring setup uses three components: an OpenTelemetry Java Agent for JVM metrics, a Minecraft Prometheus Exporter for game-specific metrics (player count, blocks mined, cake slices eaten), and an OpenTelemetry Collector that aggregates and normalizes the telemetry from both, along with Systemd logs, and sends everything to Dash0. The Java Agent provides insight into the JVM, while the Minecraft Exporter, written in Go, exposes game-specific data to be "pulled"; the OpenTelemetry Collector scrapes this exporter. Logs are collected from Systemd via Journald. Dash0 is used to create alerts for server downtime, restarts, and startup failures based on JVM metrics and log analysis, ensuring problems are caught before the kids complain.

The Hacker News thread discusses this article about monitoring a Minecraft server with OpenTelemetry and Prometheus. The discussion covers various aspects of Minecraft server administration, including performance tuning, server flavors (vanilla vs. PaperMC), and the complexity of running a reliable server. Users debated whether a small server needs detailed telemetry, with some arguing that simple containerization and restart policies are enough, while others defended the value of monitoring for learning and for catching problems proactively. The thread also touches on the trade-off between resilient systems and real-time alerting, and the possibility that telemetry becomes excessive or distracts from the game. The author of the original article, a Dash0 employee, joined the discussion, emphasizing the reusability of the monitoring setup and the benefits of using OpenTelemetry.

Original article

The Minecraft Server repository on GitHub provides step-by-step setup instructions you can follow along with.

I want a Minecraft server for multiplayer, where I can do mischief with the kids. And hopefully not embarrass myself too much by being repeatedly killed by angry chickens or whatever.

Kids these days think Java is not a cool programming language. Little do they know that it powers one of the games they love most: Minecraft.

(Well, technically the “original” Minecraft server is written in Java. Microsoft made things confusing by adding the Bedrock server, which reportedly uses a combination of C, C# and Java, and is different in terms of gameplay from vanilla in subtle ways. Opinions on Reddit differ on why Bedrock needed to exist in the first place.)

There are so many ways to host a Minecraft server when one considers the multiple launchers (ATLauncher, CurseForge, Bukkit, Fabric, Forge and the fifty more you are likely already typing in the comments). But I am a man of simple tastes, and running the “vanilla” Minecraft server as a Systemd unit on a Linux VM in the cloud is exactly my cup of tea.

And if I have learned anything about providing IT infrastructure to my family, it is that their expectations on SLOs and system reliability are up there with NASA’s Moon exploration programs. So, the Minecraft server should work reliably and, if it goes down, I should know well before they do.

Hence: I need monitoring.

Lots and lots of monitoring.

The monitoring setup for my Minecraft server is shown in this diagram:

There are three components that collaborate in collecting the telemetry to send to Dash0:

  • The OpenTelemetry Java Agent runs inside the Java Virtual Machine powering the Minecraft server itself, and it reports runtime telemetry about the JVM to the OpenTelemetry Collector.
  • The Minecraft Exporter for Prometheus collects Prometheus metrics that are specific to Minecraft like player count, how many blocks have been mined, and most importantly: how many cake slices have been eaten.
  • The OpenTelemetry Collector, well, collects more telemetry, like Systemd logs for the Minecraft server and the other components, receives telemetry from the OpenTelemetry Java Agent, scrapes the Minecraft Exporter, adds some normalization to the telemetry (most importantly: resource metadata to neatly categorize which telemetry comes from what), and sends it all off to Dash0.
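As a rough sketch, a Collector configuration wiring these three pieces together could look like the following. This is not the author's actual configuration: the ports, scrape target, unit names, and the Dash0 endpoint are all placeholder assumptions.

```yaml
receivers:
  otlp:                 # telemetry pushed by the OpenTelemetry Java Agent
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:           # scrapes the Minecraft Exporter (port is an assumption)
    config:
      scrape_configs:
        - job_name: minecraft-exporter
          static_configs:
            - targets: ["localhost:9150"]
  journald:             # Systemd logs for the server and the other units
    units:
      - minecraft-server
      - minecraft-exporter
      - otelcol

exporters:
  otlphttp:
    endpoint: https://example.dash0.endpoint   # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      exporters: [otlphttp]
    logs:
      receivers: [otlp, journald]
      exporters: [otlphttp]
```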

The OpenTelemetry Java Agent has an extensive set of automatic instrumentations for distributed tracing for many application protocols like HTTP, databases, messaging queues, etc. But none of them are quite applicable to a Minecraft server: Minecraft clients talk over a TCP socket with a protocol specific to Minecraft, for which there is no distributed tracing instrumentation in the OpenTelemetry Java Agent.

And to be honest, I am not missing it: I don’t see value in creating a span, say, every time I get manhandled by an Enderman. First of all: I don’t need that recorded for posterity. And secondly: that is way too much entropy for the good of the universe.

The Java dashboard in Dash0, showing me metrics about my Minecraft server.

But what we get out of the box with the OpenTelemetry Java Agent is runtime metrics about the Java Virtual Machine itself, especially CPU and memory. In Dash0, it’s one click to import the Java integration and get out-of-the-box visibility into the key JVM metrics.

Collecting JVM metrics will tell us a lot about whether the server is running, or why in some situations it may feel slow, e.g., CPU usage being too high because of garbage collection. But what about the fun stuff, like how many players are connected, or how many blocks are mined?

Besides, there are things I will make sure do not get recorded, like how many times I died (the minecraft_deaths_total counter). Luckily, Dash0’s Spam filters will allow me to bury my shame with just a couple of clicks.

Many, many Prometheus exporters

First of all, there is a lot of software out there that can expose Prometheus metrics about a Minecraft server. A search on GitHub yielded:

  • The Minecraft Prometheus Exporter, which uses the extensibility of Bukkit to add a Prometheus endpoint via an additional JAR file. I wanted to use a vanilla Minecraft server, so anything relying on Bukkit is not an option for me.
  • The minecraft-prometheus-exporter (names in the Prometheus ecosystem tend to be pretty to the point, which leads to name clashes) which uses Fabric, another way to run Minecraft servers with mods. Like Bukkit, Fabric was not an option for me.
  • The minecraft-exporter, written in Python. I truly had no wish to wrangle Python and its package ecosystem, so I gave this a pass.
  • And finally, the one I went with: the Minecraft Prometheus Exporter by Engin Diri, which ticked all the boxes for me: written in Go, easy to download from the GitHub releases page, and has a lot of cool telemetry.

Now, finally armed with a Prometheus exporter that tickles my fancy, let’s have a look at how Prometheus exporters generally work. Because, if you come from the world of OpenTelemetry, it may not be what you expect.

Push vs Pull

Collecting metrics about the Minecraft-specific aspects of a Minecraft server is a bit more convoluted than collecting telemetry about the Java Virtual Machine it runs on. While both the OpenTelemetry Java Agent and the Minecraft Exporter collect metrics, there is a fundamental difference in how they ship the metrics to a backend. Specifically: the OpenTelemetry Java Agent pushes metrics towards a destination (in our case, the OpenTelemetry Collector), while the Minecraft Exporter needs to have its metrics pulled.

The pull model is a distinctive aspect of the Prometheus ecosystem, where you need something to scrape (i.e., pull metrics out of, at regular intervals) your Prometheus endpoint. In our case, scraping too is something that the OpenTelemetry Collector can do with its prometheusreceiver.

Note, the name “receiver” can be misleading here: usually the OpenTelemetry Collector receives telemetry, as in, it is sent telemetry from something else. So components that “add telemetry” into the Collector are generally called receivers, even when they “actively get” telemetry from somewhere else, like in the case of scraping Prometheus endpoints or, as we will see later, collecting Journald logs.

The last piece of telemetry we need for our setup is logs. And those are not just the logs of the Minecraft server itself, which include information about crashes, slowdowns, player activity and so on. But also the logs about the other components of the monitoring setup, namely the OpenTelemetry Collector and the Minecraft Exporter. Since the OpenTelemetry Java Agent runs inside the Minecraft server, that is, inside the same Java Virtual Machine, the logs collected from the Minecraft server include the OpenTelemetry Java Agent ones.

In my setup, I run each of the components as a Systemd unit. The logs generated by Systemd units are collected by Journald, and the OpenTelemetry Collector can get the logs from it.
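For reference, a minimal Systemd unit for the server could look like the sketch below. The paths, user, and memory flag are assumptions, not the author's actual values; the -javaagent flag is what loads the OpenTelemetry Java Agent into the same JVM, and stdout/stderr land in Journald automatically.

```ini
[Unit]
Description=Minecraft server (vanilla)
After=network.target

[Service]
User=minecraft
WorkingDirectory=/opt/minecraft
# -javaagent attaches the OpenTelemetry Java Agent to the server JVM.
ExecStart=/usr/bin/java -Xmx2G \
    -javaagent:/opt/minecraft/opentelemetry-javaagent.jar \
    -jar server.jar nogui
Restart=on-failure

[Install]
WantedBy=multi-user.target
```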

For now, I am going to keep it simple: I want to be notified if the server is down. Ideally, before my son sends me some sternly-worded messages.

Is the server running?

In Dash0, I can check for that with the following PromQL expression:

{
  otel_metric_name="jvm.cpu.time",
  process_command_args=~".*server\\.jar.*",
  service_name="minecraft-server",
  service_namespace="minecraft"
}

The alert fires if there is no CPU usage reported by the JVM running the Minecraft server, which is a pretty good proxy for “the server is not running”.

Notice that this also nicely doubles as a dead man’s switch for the entire setup: for example, if the Minecraft server is running, but the OpenTelemetry Collector is not, I’d still get paged.

Is the server restarting?

Knowing whether no server is running is a good start, but not nearly enough. Specifically, I should check for restarts of the JVM, which Systemd is going to do if the server crashes. This can be accomplished with a rule on logs, which in Dash0 I can query with PromQL using the “magic” dash0.logs metric:

sum by (otel_log_severity_range) (
  {
    otel_metric_name = "dash0.logs",
    service_name = "minecraft-server",
    otel_log_body =~ "^Starting minecraft server.*"
  }[10m]
)

This rule will trigger when the server restarts, and the alert will resolve itself at the next evaluation without restarts. In Dash0 I can set this to be a warning and route those alerts to Slack, so that I get a ping when there is some downtime. The $__threshold symbol is something optional in Dash0 that allows you to specify the severity of an alert. This is an extension of the Prometheus alerting model, where an alert’s severity is modeled only in labels; that means a rule can have only one severity, and one ends up with multiple copies of a rule, differing only in hard-coded threshold and label.
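For contrast, this is what the severity-in-labels duplication looks like in plain Prometheus alerting rules. The rule below is entirely hypothetical (the `...` selector is a placeholder, not the real expression): the same condition must be written twice to get two severities.

```yaml
groups:
  - name: minecraft
    rules:
      - alert: MinecraftServerRestarting
        expr: sum(count_over_time({...}[10m])) > 1   # hypothetical expression
        labels:
          severity: warning
      - alert: MinecraftServerRestartingCritical
        expr: sum(count_over_time({...}[10m])) > 5   # same rule, higher threshold
        labels:
          severity: critical
```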

Is the server crashing?

But there is an even more interesting alert I can raise based on logs, and that is when Systemd fails to start the Minecraft server altogether, which was happening a lot as I was working out the setup:

sum by (otel_log_severity_range) (
  {
    otel_metric_name = "dash0.logs",
    service_name = "minecraft-server",
    otel_log_body =~ "^Failed to start the minecraft server.*"
  }[1m]
)

Besides, I do not need to know PromQL to create this rule: Dash0 has a query builder for counting logs matching specific filters:

Yeah, working out the server configuration was bumpy.

Why use logs instead of metrics?

In the Prometheus ecosystem, the traditional way to know if a server is up is to check the up metric associated with scraping the server itself. I felt like I would get much more bang for my buck checking logs instead, and in Dash0 that is also accomplished via PromQL.

What I do miss, however, is metrics about the status of Systemd units. While there is a systemdreceiver in the OpenTelemetry Collector, when I looked into its code hoping to find a metric reporting on Systemd units, and specifically their status (which would have been perfect for my alert rule), I was surprised to find that the receiver seems to do precisely nothing.

Setting up a Minecraft server and monitoring it with OpenTelemetry, a Prometheus exporter and Dash0 was a really fun project. Having an excuse to dust off my Java and Linux sysadmin muscles was very welcome.

I could have spent more time on the setup, and come up with dashboards. But honestly, that is not the important bit to me: if the server is up, I’ll be busy playing. So, instead of a dashboard, I did this:

This is what time well spent looks like.