An Update on TinyKVM

Original link: https://fwsgonzo.medium.com/an-update-on-tinykvm-7a38518e57e9

## TinyKVM Update: Beyond Pure Compute, Toward Fast Isolation

Since being open-sourced in February, TinyKVM has expanded beyond its original pure-compute focus and now supports running unmodified executables such as Deno, Python WSGI and Lune. This is achieved through minimal system call emulation: currently 50 calls that deliberately "make things up" to avoid direct access to the shared kernel, while allowing sanitized access where needed.

A key innovation is per-request isolation, currently the fastest available. It takes a hybrid approach: quickly rewriting touched pages back to their original state after each request, and falling back to a full VM reset for memory-heavy work. Benchmarks show TinyKVM achieving lower latency than native Deno execution despite the overhead of isolation.

TinyKVM also introduces a novel remote procedure call (RPC) mechanism. By allowing direct function calls between binaries in the same address space (via trapping and register switching), it bypasses traditional scheduler-dependent IPC and reaches call times of roughly 2 microseconds. This enables limited persistence alongside per-request isolation.

Finally, VM snapshots are now supported, storing the entire VM state in a single file (about 135MiB for a basic Deno instance). Future work aims to optimize cold-start times by preloading only the pages that were actually accessed.

## TinyKVM: A Fast, Lightweight Sandbox

This Hacker News discussion centers on TinyKVM, a new project aiming to provide a fast and efficient sandboxing environment. Unlike traditional virtualization with a full guest operating system, TinyKVM runs a single process inside KVM (the Kernel-based Virtual Machine) and handles system calls directly on the host after validation.

Its main advantages are speed, with startup times below 100 microseconds, and minimal resource consumption. This makes it well suited to per-request isolation, especially for server-side JavaScript, where a predictable program state makes debugging easier.

Compared with solutions such as Firecracker (which runs a full Linux guest) or gVisor, TinyKVM prioritizes speed and simplicity. It is not currently designed for GUI applications, though the discussion explored the possibility of adapting it for that purpose while acknowledging the overhead of a large number of system calls. Many commenters also compared it with projects such as Qubes and Guix, noting its more lightweight approach to security and isolation. Several users initially confused the project with hardware KVM switches, highlighting the acronym's double meaning.

## Original Post

Hey all. TinyKVM was open-sourced this February and since then I've been working on some things that are very much outside the scope of the original implementation. Originally it was intended for pure computation (and that is very much still possible, and is the default), but that focus makes it hard to use TinyKVM outside of specialized use-cases. So, I've relented and implemented limited support for running unmodified executables in TinyKVM. Specifically, run-times like Deno, Python WSGI and Lune.

I would like to give a special shout-out to Laurence Rowe, who championed KVM server, which has now become almost a de-facto CLI for TinyKVM servers. It's very much a work in progress, but give it a try if you're interested in these kinds of things.

In order to achieve this I picked the very untraditional route of implementing system call emulation, but as poorly as possible, and with as few system calls as possible. I think there are 50 real system calls today (gVisor has ~200 for comparison), and all of them will to some degree make shit up (for lack of a better term). The goal is to avoid accessing the (shared) Linux kernel whenever possible, but give sanitized access when permitted and appropriate. To give an example of what I mean by this: the only allowed ioctl operations are setting and getting non-blocking mode (FIONBIO) and reading the number of available bytes (FIONREAD). This minimalist system call API is currently able to run quite a few complex run-times unmodified. Programs are surprisingly good at handling failing system calls, or suspicious return values. If you put a jailer on top it should be good enough for production, but I still recommend using TinyKVM in pure compute mode. Something like Jailer + TinyKVM + Deno + per-request isolation.
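
To make the ioctl example concrete, a whitelist handler along these lines is enough: allow the two operations, refuse everything else. This is only a minimal sketch under assumed names (the GuestRegs struct and the fd translation are made up for illustration), not TinyKVM's actual code.

```cpp
// Minimal sketch of a whitelisting ioctl handler in the spirit described
// above; not TinyKVM's actual code.
#include <sys/ioctl.h>  // FIONBIO, FIONREAD
#include <cerrno>
#include <cstdint>

struct GuestRegs { uint64_t rdi, rsi, rax; }; // hypothetical guest register view

static void emulate_ioctl(GuestRegs& regs)
{
    const uint64_t request = regs.rsi; // second syscall argument on x86-64

    switch (request) {
    case FIONBIO:   // set/clear non-blocking mode
    case FIONREAD:  // number of bytes available to read
        // Translate the guest fd (regs.rdi) to a vetted host fd and forward
        // the ioctl there (translation omitted in this sketch).
        regs.rax = 0;
        break;
    default:
        // Everything else is refused; programs cope surprisingly well with ENOTTY.
        regs.rax = static_cast<uint64_t>(-ENOTTY);
        break;
    }
}
```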

Per-Request Isolation

Per-request isolation is apparently not that common. I could not find any other production-level support other than in wasmtime (and previously Lucet). But due to its lack of in-guest JIT support it will not be able to compete with Deno, so I will just focus on the positives: it uses a clever lazy MADV_DONTNEED mechanism which delays the cost. You can go test wasmtime's per-request isolation right now with the hello-wasi-http example.

In TinyKVM there are two reset modes, which together form a hybrid per-request isolation that is capable of maintaining a low memory footprint. Together they make the fastest per-request isolation that exists right now. The main mode directly rewrites all touched pages in a VM fork back to their original contents and leaves pagetables (and TLBs) intact. This mode has turned out to be the fastest, but as it leaves the memory footprint untouched it can only grow memory usage for forked VMs. Forked VMs are tiny to begin with, but for large page rendering work that growth may be a concern, hence there's a second mode. The second mode resets the entire VM, pagetables and everything, and is triggered when a fork uses working memory above a certain limit or an exception occurs during request handling. It's not particularly expensive on its own, but if every VM fork did it all the time the IPIs and coherency chatter would become a bottleneck.
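
A rough sketch of the decision between the two modes might look like the following. The types and the memory-limit plumbing are made up for illustration; only the two-mode behaviour (restore dirty pages vs. full reset) comes from the description above.

```cpp
// Illustrative sketch of hybrid per-request reset; not TinyKVM's actual code.
#include <cstddef>
#include <cstdint>
#include <vector>

struct VMFork {
    std::vector<uint64_t> touched_pages;   // guest page numbers dirtied this request
    size_t working_memory_bytes = 0;
    bool   had_exception = false;

    void restore_page(uint64_t /*page*/) { /* copy original contents back */ }
    void full_reset()                    { /* rebuild memory and pagetables */ }
};

void reset_fork_after_request(VMFork& fork, size_t memory_limit)
{
    if (fork.working_memory_bytes > memory_limit || fork.had_exception) {
        // Slow path: reset the entire VM, pagetables and all. Cheap on its
        // own, but the IPIs and TLB shootdowns add up if every fork does this.
        fork.full_reset();
    } else {
        // Fast path: rewrite only the pages this request touched back to
        // their original contents, leaving pagetables and TLBs intact.
        for (uint64_t page : fork.touched_pages)
            fork.restore_page(page);
        fork.touched_pages.clear();
    }
}
```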

(Figure: TinyKVM is able to beat native performance due to avoiding GC)

So we ran a full Deno page rendering benchmark in TinyKVM and then also the very same (unmodified) benchmark natively. We made GC single-threaded in order to compare fairly; it would normally run asynchronously in another thread, but you'd still have to pay the cost of doing it. What we found was that TinyKVM generally had lower p90+ latency, while native had better p50.

Now, this is incredible. Per-request isolation is very, very expensive: we are resetting an entire KVM VM back to its original state on every request. And we're doing it at a cost very close to native execution that doesn't do it at all. We're doing it with unmodified Deno, a big run-time, and with a full page rendering benchmark, a large piece of compute work that builds real memory pressure.

A new type of remote procedure call

One of the things not on my 2025 bingo card was creating a custom RPC mechanism. And, it’s not that great outside of its specific use-case.

I figured that if you loaded two binaries into the same address space, couldn't you just call a function in the other just fine? Turns out yes, especially if you trap on the far jump (not a real far jump) and then switch a few important registers like the thread-pointer (FSBASE). So, if you have two ABI- or FFI-compatible programs you can essentially freely call functions in the other. Now, this sounds dangerous for sandboxing and kinda useless if you can just use a super-fast IPC like iceoryx2, but... it turns out that super-fast IPC requires the other end to always be scheduled, and crucially also requires the caller not to be adversarial and trample shared memory while you're reading it. If both of those things are true, then go ahead and use fast IPC.

Being able to directly call a remote party without depending on the scheduler turns out to be really, really performant. A simple schbench benchmark will tell you all you need to know about what happens when the scheduler is busy. You can go do it on your own machine; it's commonly a 2-digit number of milliseconds you can expect for p99. So, this new method is in fact the new king of this specific type of RPC. The only remaining part then is: how do you deal with sandbox integrity? Turns out you can just not have the remote part mapped in at all, and then either:

  1. Directly resume the remote VM with your caller VM's address space already mapped in. This means that the remote VM is "higher privileged", sort of.
  2. Map in the remote VM just-in-time on the execution page fault, execute the remote function call, and unmap it on return (or on any exception or timeout). This also means the remote VM is "higher privileged". A rough sketch of this option follows below.
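
Purely as an illustration of option 2, the trap-side logic could look something like this. Everything here is hypothetical naming; the register switching (FSBASE and friends) is glossed over, and the registered-address check refers to the registration scheme described a little further down.

```cpp
// Illustrative sketch of option 2: map the remote VM in on the execution
// page fault, run the function, unmap on return. Not TinyKVM's actual code.
#include <cstdint>
#include <unordered_set>

struct CallerVM { /* paused for the duration of the remote call */ };

struct RemoteVM {
    // Addresses the persistent program has registered as callable
    std::unordered_set<uint64_t> registered_entry_points;

    void map_address_space_of(CallerVM&)   {}
    void unmap_address_space_of(CallerVM&) {}
    int  run_until_return(uint64_t entry)  { (void)entry; return 0; }
};

// Invoked when the caller VM takes an execute fault on unmapped memory.
int on_exec_page_fault(CallerVM& caller, RemoteVM& remote, uint64_t target)
{
    // Only registered addresses may be called; anything else stays a fault.
    if (remote.registered_entry_points.count(target) == 0)
        return -1;

    remote.map_address_space_of(caller);     // zero-copy view of caller memory
    int result = remote.run_until_return(target);
    remote.unmap_address_space_of(caller);   // also on exception or timeout
    return result;
}
```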

So, what is this then? Are these two programs the same tenant? What does this have to do with per-request isolation?

Per-request isolation doesn't have persistence; the entire request VM gets wiped on every request. It would be great if we could maintain some state under certain conditions. One option would be an expensive call out to a remote service (à la Binder on Android). Or, we could solve two problems in one: allow tenants to have a program that is persistent, and give them direct, scheduler-free access to it. That is, the persistent program informs the system which functions are callable, so you can't just randomly jump to remote memory, but you can jump to any registered address directly, which immediately executes the remote function call. Example:

```cpp
static void my_backend(const char*, const char*)
{
    // Zero-copy buffer that the persistent (storage) VM writes into
    alignas(64) char buffer[256];
    // Resume the remote VM; it fills the buffer and pauses again
    sys_storage_resume(buffer, sizeof(buffer));

    const char ctype[] = "text/plain";
    backend_response(200, ctype, sizeof(ctype)-1,
                     buffer, __builtin_strlen(buffer));
}
```

Here’s a simple C++ request handler. Instead of jumping directly to a remote VM function, we choose to directly resume the remote VM, as it is running a complex run-time. Deno in fact.

```js
while (true) {
    // Wait for a Uint8Array buffer from C
    const bufptr = drogon.symbols.wait_for_storage_task_paused();
    // View it as a Uint8Array of length 256
    const arrayBuffer = Deno.UnsafePointerView.getArrayBuffer(bufptr, 256);
    const buffer = new Uint8Array(arrayBuffer);

    const redis_answer = await redisClient.get("test");
    // Copy redis_answer to buffer
    const response = "Hello from Deno storage inside TinyKVM, redis answer: " + redis_answer;
    ...
}
```

What’s omitted is encoding the answer back into the buffer zero-terminated. But maybe you got the gist of it: The buffer is zero-copy and we write directly into it. The only remaining thing to do after writing to the buffer is to go back to waiting.

While everything is zero-copy, you will have to duplicate anything that you want to persist. You can use allocators to allocate for the caller.

This feature currently executes safely on the order of 2 microseconds wherever I’ve benchmarked it.

Concurrent access to the remote is possible with an idea I've had in my head for a while: create N threads in the remote VM and register them for use by callers. For now, however, I am using serialized access to the remote VM. It's also possible to fork it into many copies to avoid serializing access, but that only helps in certain cases like connection pooling to a database. That is of course supported already. If you want a single source of truth then you probably also want to serialize access to the remote.

During remote calls the caller VM has to be paused. There’s no way around it, otherwise it can trample memory used by the remote VM and crash it. While zero-copy IPC exists where both can run at the same time, it’s fundamentally a question of trust and integrity. You simply can’t do that with two separate sandboxes talking to each other.

Anyway. I hope that was an introduction to the concept, at least. It’s not your everyday feature. It likely won’t solve your problems. I just think it’s a really cool idea. And I do use it, of course. For limited persistence with per-request isolation.

VM snapshots

The last topic of this post is VM snapshots. A feature that many have asked about for TinyKVM. Wouldn’t it be nice if you could snapshot a VM, transport it somewhere else, and resume it? Well, now you can. The feature is implemented by backing all of physical memory with a single file, and then adding some VM state on top and a user-provided section at the very end. This combines all state into a single file with holes.
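
To picture the single-file layout, something along these lines, written sparsely so that untouched guest pages become holes in the file. The post doesn't specify the on-disk format; the header fields below are an assumed illustration only.

```cpp
// Assumed illustration of a single-file VM snapshot layout (not the real
// TinyKVM format). Guest memory is written sparsely so zero pages are holes.
#include <cstdint>

struct SnapshotHeader {
    uint64_t magic;               // identifies the file as a snapshot
    uint64_t guest_memory_size;   // logical size, e.g. 2.4 GiB for the Deno example
    uint64_t vm_state_offset;     // registers, MSRs, pagetable roots, etc.
    uint64_t user_section_offset; // user-provided blob at the very end
};

// File layout:
//   [SnapshotHeader][ guest physical memory, sparse ][ VM state ][ user section ]
//
// Resuming then means mapping the memory region (file-backed, copy-on-write
// for forks) and restoring the saved VM state on top of it.
```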

For reference, a Deno JS hello world instance is 192MiB RSS after initializing the first time. If you save that state into a snapshot and resume it, the file is 135MiB on disk (2.4GiB logical), and RSS is 50MiB after starting with 32 VM forks.

```
$ du -h program/deno/deno.mem
135M program/deno/deno.mem
```

The startup time is 0.7ms with everything in page cache. Clearing the page cache is not a simple matter as you have to clear any caching on the disk as well. This typically means you’ll need a custom device. I don’t have all the answers right now, but I suspect it will be around 20ms to load the program from disk with everything cold.

We’re currently working on recording the actually accessed pages of a request and only preloading those. Combined with a full clear of all relevant caches we hope to see that it loads faster than any other alternatives in this space. Fast cold start is of course a crowded space, but you never know what you will find until you try. Because we will be able to know more or less the exact pages that are going to be used by the next request, we might be able to populate just the right pages and avoid loading pages that aren’t going to be used. Typically Linux will load ranges of pages optimistically based on faults. Avoiding that can save some time. We are also hosting just a single process. It’s just Deno, and nothing else. Of course, requests differ, but they should have many things in common.

I think I will end this post here. This is as far as I’ve gotten. Thanks for reading!

-gonzo
