Let's discuss sandbox isolation

Original link: https://www.shayon.dev/post/2026/52/lets-discuss-sandbox-isolation/

## Sandboxing Untrusted Code: A Summary

AI agents, multi-tenant platforms, and reinforcement learning training all create a need to safely execute untrusted code (code you did not write) without compromising the host system or other users. Yet "isolation" is not a single concept; technologies such as Docker containers, microVMs, and WebAssembly provide vastly different levels of security.

All of these approaches aim to restrict access to the kernel, the shared core of the operating system. Standard containers use namespaces as *visibility* walls and filter syscalls with seccomp, but still depend on a shared kernel, leaving them exposed to kernel vulnerabilities. gVisor builds a user-space kernel that dramatically shrinks the attack surface. MicroVMs offer hardware-enforced isolation with a dedicated kernel, while WebAssembly runs code in a memory-safe virtual machine with no kernel access at all.

Which approach to choose depends on the threat model. Docker with seccomp is appropriate for trusted code. gVisor and microVMs are well suited to untrusted environments, balancing security and performance. WASM excels at sandboxing code whose toolchain you control.

Crucially, compute isolation must be paired with network controls to prevent data exfiltration. For local development, OS-level permissions and emerging technologies such as Apple's Containerization framework can provide effective guardrails against runaway AI agents. The trend is toward faster, more secure isolation, with ongoing innovation pushing the boundaries of sandboxing technology.

Hacker News: Let's discuss sandbox isolation (shayon.dev), 12 points, posted by shayonj.

## Original article

There is a lot of energy right now around sandboxing untrusted code. AI agents generating and executing code, multi-tenant platforms running customer scripts, RL training pipelines evaluating model outputs—basically, you have code you did not write, and you need to run it without letting it compromise the host, other tenants, or itself in unexpected ways.

The word “isolation” gets used loosely. A Docker container is “isolated.” A microVM is “isolated.” A WebAssembly module is “isolated.” But these are fundamentally different things, with different boundaries, different attack surfaces, and different failure modes. I wanted to write down my learnings on what each layer actually provides, because I think the distinctions matter and allow you to make informed decisions for the problems you are looking to solve.

The kernel is the shared surface

When any code runs on Linux, it interacts with the hardware through the kernel via system calls. The Linux kernel exposes roughly 340 syscalls, and the kernel implementation is tens of millions of lines of C code. Every syscall is an entry point into that codebase.

Untrusted Code ─( Syscall )─→ Host Kernel ─( Hardware API )─→ Hardware
                              [ 40M LOC C ]

Every isolation technique is answering the same question of how to reduce or eliminate the untrusted code’s access to that massive attack surface.

A useful mental model here is shared state versus dedicated state. Because standard containers share the host kernel, they also share its internal data structures like the TCP/IP stack, the Virtual File System caches, and the memory allocators. A vulnerability in parsing a malformed TCP packet in the kernel affects every container on that host. Stronger isolation models push this complex state up into the sandbox, exposing only simple, low-level interfaces to the host, like raw block I/O or a handful of syscalls.

The approaches differ in where they draw the boundary. Namespaces use the same kernel but restrict visibility. Seccomp uses the same kernel but restricts the allowed syscall set. Projects like gVisor use a completely separate user-space kernel and make minimal host syscalls. MicroVMs provide a dedicated guest kernel and a hardware-enforced boundary. Finally, WebAssembly provides no kernel access at all, relying instead on explicit capability imports. Each step is a qualitatively different boundary, not just a stronger version of the same thing.

Namespaces as visibility walls

Linux namespaces wrap global system resources so that processes appear to have their own isolated instance. There are eight types, and each isolates a specific resource.

| Namespace | What it isolates | What the process sees |
|-----------|------------------|-----------------------|
| PID | Process IDs | Own process tree, starts at PID 1 |
| Mount | Filesystem mount points | Own mount table, can have a different root |
| Network | Network interfaces, routing | Own interfaces, IP addresses, ports |
| User | UID/GID mapping | Can be root inside, nobody outside |
| UTS | Hostname | Own hostname |
| IPC | SysV IPC, POSIX message queues | Own shared memory, semaphores |
| Cgroup | Cgroup root directory | Own cgroup hierarchy |
| Time | System clocks (monotonic, boot) | Own system uptime and clock offsets |

Namespaces are what Docker containers use. When you run a container, it gets its own PID namespace (cannot see host processes), its own mount namespace (own filesystem view), its own network namespace (own interfaces), and so on.
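These namespace memberships are visible from user space: every process exposes one symlink per namespace type under /proc/self/ns, and the inode number in the link target identifies the namespace instance. A small Linux-only sketch to inspect them (the exact set of entries depends on kernel version):

```python
import os

# Each entry under /proc/self/ns is a symlink like "pid:[4026531836]";
# the inode number identifies the namespace instance. Two processes in
# the same namespace see the same inode.
ns_types = sorted(os.listdir("/proc/self/ns"))
ns_ids = {name: os.readlink(f"/proc/self/ns/{name}") for name in ns_types}
```

Comparing these inodes across two processes is a quick way to check whether, say, a container runtime actually put them in separate PID or mount namespaces.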

The critical thing to understand is namespaces are visibility walls, not security boundaries. They prevent a process from seeing things outside its namespace. They do not prevent a process from exploiting the kernel that implements the namespace. The process still makes syscalls to the same host kernel. If there is a bug in the kernel’s handling of any syscall, the namespace boundary does not help.

In January 2024, CVE-2024-21626 showed that a file descriptor leak in runc (the standard container runtime) allowed containers to access the host filesystem. The container’s mount namespace was intact — the escape happened through a leaked fd that runc failed to close before handing control to the container. In 2025, three more runc CVEs (CVE-2025-31133, CVE-2025-52565, CVE-2025-52881) demonstrated mount race conditions that allowed writing to protected host paths from inside containers.

Cgroups: accounting is not security

Cgroups (control groups) limit and account for resource usage: CPU, memory, disk I/O, number of processes. They prevent a container from consuming all available memory or spinning up thousands of processes.

Cgroups are important for stability, but they are not a security boundary. They prevent denial-of-service, not escape. A process constrained by cgroups still makes syscalls to the same kernel with the same attack surface.
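Cgroup membership is equally inspectable from user space. A small Linux-only sketch that reads which cgroups the current process belongs to; the file format differs between cgroup v1 (one line per controller) and v2 (a single `0::` line):

```python
# /proc/self/cgroup lists this process's cgroup membership.
# cgroup v2: a single line "0::/some/path"
# cgroup v1: one line per controller, e.g. "4:memory:/some/path"
with open("/proc/self/cgroup") as f:
    cgroup_lines = f.read().splitlines()

memberships = []
for line in cgroup_lines:
    hierarchy_id, controllers, path = line.split(":", 2)
    memberships.append((hierarchy_id, controllers, path))
```

On cgroup v2 the actual limits (memory.max, pids.max, and so on) live as files under /sys/fs/cgroup at that path, which is where a runtime writes them.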

Seccomp-BPF as a filter

Seccomp-BPF lets you attach a Berkeley Packet Filter program that decides which syscalls a process is allowed to make. You can deny dangerous syscalls such as ptrace (process tracing), mount (filesystem manipulation), init_module (kernel module loading), and perf_event_open (performance monitoring).

Docker applies a default seccomp profile that blocks around 40 to 50 syscalls. This meaningfully reduces the attack surface. But the key limitation is that seccomp is a filter on the same kernel. The syscalls you allow still enter the host kernel’s code paths. If there is a vulnerability in the write implementation, or in the network stack, or in any allowed syscall path, seccomp does not help.

Without Seccomp:
  Untrusted Code ─( ~340 syscalls )─→ Host Kernel

With Seccomp:
  Untrusted Code ─( ~300 syscalls )─→ Host Kernel

The attack surface is smaller. The boundary is the same.
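The original "strict" seccomp mode makes this concrete in a few lines: after prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT), only read, write, _exit, and sigreturn are permitted, and any other syscall kills the process with SIGKILL. A Linux-only sketch (real sandboxes use SECCOMP_MODE_FILTER with a BPF allowlist instead, but the enforcement mechanism is the same):

```python
import ctypes, os, signal

PR_SET_SECCOMP = 22       # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1   # only read, write, _exit, sigreturn allowed

libc = ctypes.CDLL(None, use_errno=True)

pid = os.fork()
if pid == 0:
    # Child: enter strict mode, then attempt a forbidden syscall.
    libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)
    os.write(1, b"write is still allowed\n")
    os.open("/etc/hostname", os.O_RDONLY)  # openat: killed with SIGKILL
    os._exit(0)                            # never reached

_, status = os.waitpid(pid, 0)
child_killed = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
```

Note that the kill happens at the syscall boundary: the filter never inspects what the kernel would have done with the request, which is exactly why an allowed syscall with a buggy kernel implementation remains dangerous.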

Running a container in privileged mode

This is worth calling out because it comes up surprisingly often. Some isolation approaches require Docker’s privileged flag. For example, building a custom sandbox that uses nested PID namespaces inside a container often leads developers to use privileged mode, because mounting a new /proc filesystem for the nested sandbox requires the CAP_SYS_ADMIN capability (unless you also use user namespaces).

If you enable --privileged just to get CAP_SYS_ADMIN for nested process isolation, you have added one layer (nested process visibility) while removing several others (seccomp, all capability restrictions, device isolation). The net effect is arguably weaker isolation than a standard unprivileged container. This is a real trade-off that shows up in production. The ideal solutions are either to grant only the specific capability needed instead of all of them, or to use a different isolation approach entirely that does not require host-level privileges.

gVisor and user-space kernels

gVisor is where the isolation model changes qualitatively. To understand the difference, it helps to look at the attack surface of a standard container.

Standard Container (Docker)
┌───────────────────────┐
│ Untrusted Code        │
└──────────┬────────────┘
           │ ~340 syscalls
   [ Seccomp Filter ]
           │ ~300 allowed syscalls
┌───────────────────────┐
│ Host Kernel (Ring 0)  │ ◄── FULL ATTACK SURFACE
└───────────────────────┘

The code runs as a standard Linux process. Seccomp acts as a strict allowlist filter, reducing the set of permitted system calls. However, any allowed syscall still executes directly against the shared host kernel. Once a syscall is permitted, the kernel code processing that request is the exact same code used by the host and every other container. The failure mode here is that a vulnerability in an allowed syscall lets the code compromise the host kernel, bypassing the namespace boundaries.

Instead of filtering syscalls to the host kernel, gVisor interposes a completely separate kernel implementation called the Sentry between the untrusted code and the host. The Sentry does not access the host filesystem directly; instead, a separate process called the Gofer handles file operations on the Sentry’s behalf, communicating over a restricted protocol. This means even the Sentry’s own file access is mediated.

gVisor
┌───────────────────────┐
│ Untrusted Code        │
└──────────┬────────────┘
           │ ~340 syscalls
┌───────────────────────┐
│ gVisor Sentry (Ring 3)│ ◄── USER-SPACE KERNEL
└──────┬────────┬───────┘
       │        │ 9P / LISAFS
       │        ▼
       │  ┌───────────┐
       │  │   Gofer   │ ◄── FILE I/O PROXY
       │  └─────┬─────┘
       │        │
       ▼        ▼
┌───────────────────────┐
│ Host Kernel (Ring 0)  │ ◄── REDUCED ATTACK SURFACE
└───────────────────────┘
  (~70 host syscalls from Sentry)

The Sentry intercepts the untrusted code’s syscalls and handles them in user-space. It reimplements around 200 Linux syscalls in Go, which is enough to run most applications. When the Sentry actually needs to interact with the host to read a file, it makes its own highly restricted set of roughly 70 host syscalls. This is not just a smaller filter on the same surface; it is a completely different surface. The failure mode changes significantly. An attacker must first find a bug in gVisor’s Go implementation of a syscall to compromise the Sentry process, and then find a way to escape from the Sentry to the host using only those limited host syscalls.

The Sentry intercepts syscalls using one of several mechanisms, such as seccomp traps or KVM, with the default since 2023 being the seccomp-trap approach known as systrap.

What this means in practice is that if someone discovers a bug in the Linux kernel’s I/O implementation, containers using Docker are directly exposed. A gVisor sandbox is not, because those syscalls are handled by the Sentry, and the Sentry does not expose them to the host kernel.

The trade-off is performance. Every syscall goes through user-space interception, which adds overhead. I/O-heavy workloads feel this the most. For short-lived code execution like scripts and tests, it is usually fine, but for sustained high-throughput I/O, it can matter.

Also, by adopting gVisor, you are betting that it is easier to audit and maintain a smaller footprint of code (the Sentry and its limited host interactions) than to secure the entire massive Linux kernel surface against untrusted execution. That bet is not free of risk: gVisor itself has had security vulnerabilities in the Sentry. But the surface area you need to worry about is drastically smaller and written in a memory-safe language.

Defense in depth on top of gVisor

gVisor gives you the user-space kernel boundary. What it does not give you automatically is multi-job isolation within a single gVisor sandbox. If you are running multiple untrusted executions inside one runsc container, you still need to layer additional controls. Here is one pattern for doing that:

  • Per-job PID + mount + IPC namespaces via clone3 — so each execution is isolated from other executions inside the same gVisor sandbox
  • Seccomp-BPF inside the namespace — blocking syscalls like clone3 (preventing nested namespace escape), io_uring (force fallback to epoll), ptrace, kernel module loading
  • Privilege drop — run as nobody (UID 65534) with PR_SET_NO_NEW_PRIVS
  • Ephemeral tmpfs for all writable paths — cleanup is a single umount2 syscall, not a recursive directory walk
  • Read-only root filesystem — the container itself is immutable
  • Capability-based file APIs — use openat2 or similar to confine file writes to the work directory, preventing path traversal via ../../etc/passwd
  • Network egress control — compute isolation means nothing if the sandbox can freely phone home. Options range from disabling networking entirely, to running an allowlist proxy (like Squid) that blocks DNS resolution inside the sandbox and forces all traffic through a domain-level allowlist, to dropping CAP_NET_RAW so the sandbox cannot bypass DNS with raw sockets.
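One of the layers above, the privilege drop, combines switching to an unprivileged UID (which requires starting as root, so it is only noted in a comment here) with PR_SET_NO_NEW_PRIVS, which works unprivileged and is a one-way bit inherited across fork and execve. A minimal sketch:

```python
import ctypes

PR_SET_NO_NEW_PRIVS = 38  # from <linux/prctl.h>
PR_GET_NO_NEW_PRIVS = 39

libc = ctypes.CDLL(None, use_errno=True)

# In a real sandbox you would first setgid()/setuid() to nobody
# (UID 65534), which requires starting as root. no_new_privs itself
# needs no privileges at all:
assert libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0

# The bit is one-way: once set it cannot be cleared, and every child
# inherits it, so no execve of a setuid binary can regain privileges.
nnp_enabled = libc.prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1
```

Setting no_new_privs is also a prerequisite for installing a seccomp filter as an unprivileged process, so the two layers naturally compose.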
gVisor Container (runsc)
 └─ Per-job PID + Mount Namespace
     └─ Seccomp BPF Filter
         └─ Privilege Drop
             └─ Network Egress Control
                 └─ Ephemeral tmpfs
                     └─ Capability-confined File Writes

Each layer catches different attack classes. A namespace escape inside gVisor reaches the Sentry, not the host kernel. A seccomp bypass hits the Sentry’s syscall implementation, which is itself sandboxed. Privilege escalation is blocked by dropping privileges. Persistent state leakage between jobs is prevented by ephemeral tmpfs with atomic unmount cleanup.
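The capability-confined file-writes layer can be approximated in user space. On Linux 5.6+ the kernel-enforced version is openat2() with RESOLVE_BENEATH; the sketch below is a portable, illustrative approximation (safe_open is a hypothetical helper, and unlike openat2 it is not immune to symlink races):

```python
import os

def safe_open(workdir, user_path, flags=os.O_RDONLY):
    """Open user_path relative to workdir, refusing escapes.

    User-space approximation of openat2(RESOLVE_BENEATH): resolve the
    requested path fully, then reject it if it lands outside workdir.
    """
    base = os.path.realpath(workdir)
    target = os.path.realpath(os.path.join(base, user_path))
    # Reject anything whose fully resolved path is outside the sandbox,
    # e.g. "../../etc/passwd" or a symlink pointing out of workdir.
    if os.path.commonpath([base, target]) != base:
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return os.open(target, flags)
```

The check happens on the resolved path, so both literal `..` traversal and symlinks that point outside the work directory are refused.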

A note on forking

A practical detail that matters: the process that creates child sandboxes must itself be fork-safe. If you are running an async runtime, forking from a multithreaded process is inherently unsafe, because child processes inherit locked mutexes and can corrupt state. The solution is a fork-server pattern: fork a single-threaded launcher process before starting the async runtime, then have the async runtime communicate with the launcher over a Unix socket. The launcher creates the children, entirely avoiding the multithreaded fork problem.

Startup
  fork() → Launcher (Single-threaded, Poll Loop)
               ├─ clone3(NEWPID | NEWNS | NEWIPC)
               └─ Child (Mount, Privdrop, Seccomp, Execve)

Main Server (Async Runtime)
  └─ AF_UNIX SEQPACKET ─→ Launcher
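The fork-server pattern above can be sketched in a few lines: fork the launcher over an AF_UNIX SOCK_SEQPACKET socketpair before any threads exist, then drive it with messages. (Illustrative only: a real launcher would clone3() into fresh namespaces, mount, drop privileges, apply seccomp, and execve; here it just forks a trivial child.)

```python
import os, socket

# Create the control channel, then fork the launcher BEFORE the main
# process starts any threads or an async runtime.
parent_sock, launcher_sock = socket.socketpair(
    socket.AF_UNIX, socket.SOCK_SEQPACKET)

pid = os.fork()
if pid == 0:
    # --- launcher: single-threaded poll loop ---
    parent_sock.close()
    while True:
        msg = launcher_sock.recv(4096)
        if not msg:          # parent closed its end: shut down
            break
        # Real sandbox: clone3(NEWPID|NEWNS|NEWIPC), mount, privdrop,
        # seccomp, execve. Sketch: fork a trivial child instead.
        child = os.fork()
        if child == 0:
            os._exit(0)
        os.waitpid(child, 0)
        launcher_sock.send(b"done:%d" % child)
    os._exit(0)

# --- main process (safe to start threads/asyncio from here on) ---
launcher_sock.close()
parent_sock.send(b"spawn")
reply = parent_sock.recv(4096)
parent_sock.close()          # launcher's recv returns b"" and it exits
os.waitpid(pid, 0)
```

SOCK_SEQPACKET keeps message boundaries intact, so each request and reply is a single datagram-like unit without needing a framing protocol on top.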

MicroVMs for hardware boundaries

MicroVMs use hardware virtualization backed by the CPU’s extensions to run each workload in its own virtual machine with its own kernel.

MicroVM Architecture
┌───────────────────────┐
│ Untrusted Code        │
└──────────┬────────────┘
           │ Syscalls
┌───────────────────────┐
│ Guest Kernel (Ring 0) │ ◄── DEDICATED KERNEL
└──────────┬────────────┘
           │ VirtIO / MMIO
┌───────────────────────┐
│ KVM Hypervisor (Host) │ ◄── HARDWARE BOUNDARY
└──────────┬────────────┘
           │ Secure API
┌───────────────────────┐
│ VMM (User-Space)      │ ◄── DEVICE EMULATION
└───────────────────────┘

Code runs in a completely separate, hardware-backed environment with its own guest kernel. It is important to separate the concepts here. The hypervisor is the capability built into the Linux kernel that manages the CPU’s hardware virtualization extensions. The Virtual Machine Monitor is a user-space process that configures the VM, allocates memory, and emulates minimal hardware devices. The microVM itself is a VM that has been stripped of legacy PC cruft so it boots in milliseconds and uses minimal memory.

Escaping the guest kernel requires finding a vulnerability in the Virtual Machine Monitor’s device emulation or the CPU’s virtualization features, which are rare and highly prized.

The guest runs in a separate virtual address space enforced by the CPU hardware. A bug in the guest kernel cannot access host memory because the hardware prevents it. The host kernel only sees the user-space process. The attack surface is the hypervisor and the Virtual Machine Monitor, both of which are orders of magnitude smaller than the full kernel surface that containers share.
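You can probe this boundary from user space: /dev/kvm is the device a VMM opens to program the hypervisor, and its first ioctl, KVM_GET_API_VERSION, has returned 12 on every stable kernel since 2.6.22. A sketch that degrades gracefully when virtualization is unavailable:

```python
import fcntl, os

KVM_GET_API_VERSION = 0xAE00  # _IO(KVMIO, 0x00), from <linux/kvm.h>

def kvm_api_version():
    """Return the KVM API version, or None if /dev/kvm is unavailable."""
    try:
        fd = os.open("/dev/kvm", os.O_RDWR)
    except OSError:
        return None  # no hardware virtualization, or no permission
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION)
    finally:
        os.close(fd)
```

Everything a VMM like Firecracker does, creating VMs, vCPUs, and guest memory regions, flows through further ioctls on this same file descriptor, which is why the host-facing surface stays so small.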

You generally see two different approaches to Virtual Machine Monitor design depending on the workload. The first is strict minimalism, seen in projects like Firecracker. Built specifically for running thousands of tiny, short-lived functions on a single server, it intentionally leaves out complex features like hot-plugging CPUs or passing through physical GPUs. The goal is simply the smallest possible attack surface and memory footprint.

The second approach offers broader feature support, seen in projects like Cloud Hypervisor or QEMU microvm. Built for heavier and more dynamic workloads, it supports hot-plugging memory and CPUs, which is useful for dynamic build runners that need to scale up during compilation. It also supports GPU passthrough, which is essential for AI workloads, while still maintaining the fast boot times of a microVM.

Trade-off

The trade-off versus gVisor is that microVMs have higher per-instance overhead but stronger, hardware-enforced isolation. For CI systems and sandbox platforms where you create thousands of short-lived environments, the boot time and memory overhead add up. For long-lived, high-security workloads, the hardware boundary is worth it.

Snapshotting is a feature worth noting. You can capture a running VM’s state including CPU registers, memory, and devices, and restore it later. This enables warm pools where you boot a VM once, install dependencies, snapshot it, and restore clones in milliseconds instead of booting fresh each time. This is how some platforms achieve incredibly fast cold starts even with full VM isolation.

WebAssembly with no kernel at all

WebAssembly takes a fundamentally different approach. Instead of running native code and filtering its kernel access, WASM runs code in a memory-safe virtual machine that has no syscall interface at all. All interaction with the host happens through explicitly imported host functions.

WebAssembly (WASM)
┌───────────────────────┐
│ Untrusted Code        │
└──────────┬────────────┘
           │ Function Calls
┌───────────────────────┐
│ WASM Runtime (Host)   │ ◄── MEMORY-SAFE VM
└──────────┬────────────┘
           │ Imported Host Functions
┌───────────────────────┐
│ Allowed Host APIs     │ ◄── EXPLICIT CAPABILITIES
└───────────────────────┘

Code runs in a strict sandbox where the only allowed operations are calling functions provided by the host. If the host doesn’t provide a file reading function, the WASM module simply cannot read files. The failure mode here requires a vulnerability in the WASM runtime itself, like an out-of-bounds memory read that bypasses the linear memory checks.

There is no syscall surface to attack because the code never makes syscalls. Memory safety is enforced by the runtime. The linear memory is bounds-checked, the call stack is inaccessible, and control flow is type-checked. Modern runtimes add guard pages and memory zeroing between instances.
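The capability-import model is easy to illustrate, though only as an analogy: the pure-Python sketch below shows the interface shape (a guest that can reach nothing except what the host explicitly passes in), while real enforcement comes from WASM module validation, since in Python the guest could simply import os:

```python
# Illustration only: Python cannot enforce this boundary. In WASM,
# validation guarantees the module can ONLY reach functions listed in
# its import section; there is no ambient authority to fall back on.

def guest_main(imports):
    # Stand-in for a compiled module: everything it can do arrives
    # through the imports dict, its only window to the outside world.
    imports["log"]("guest started")
    read_file = imports.get("read_file")
    if read_file is None:
        return "no filesystem capability granted"
    return read_file("/etc/hostname")

log_lines = []
result = guest_main({"log": log_lines.append})  # host grants logging only
```

Because the host decides the import set per instance, the same guest can run with full file access in one deployment and none in another, without recompiling it.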

The performance characteristics are attractive with incredibly fast cold starts and minimal memory overhead. But the practical limitation is language support. You cannot run arbitrary Python scripts in WASM today without compiling the Python interpreter itself to WASM along with all its C extensions. For sandboxing arbitrary code in arbitrary languages, WASM is not yet viable. For sandboxing code you control the toolchain for, it is excellent. I am, however, quite curious if there is a future for WASM in general-purpose sandboxing. Browsers have spent decades solving a similar problem of executing untrusted code safely, and porting those architectural learnings to backend infrastructure feels like a natural evolution.

The spectrum

Putting it all together, the landscape spans from fast and weak isolation to slower and highly secure isolation.

                     Isolation strength →
                     Attack surface     ↓

Namespaces      Seccomp       gVisor         MicroVM         WASM
   │               │             │              │              │
   │  visibility   │  syscall    │  separate    │  hardware    │  no kernel
   │  walls only   │  filter on  │  kernel in   │  boundary    │  access at
   │               │  same       │  user-space  │  via KVM     │  all
   │               │  kernel     │              │              │
   ▼               ▼             ▼              ▼              ▼
  Fast            Fast         Moderate       Slower        Fastest
  Weakest         Weak         Strong         Strongest     Strong*
                                                           (*limited scope)

For running trusted code that you wrote and reviewed, Docker with a seccomp profile is probably fine. The isolation is against accidental interference, not adversarial escape.

For running untrusted code in a multi-tenant environment, like short-lived scripts, AI-generated code, or customer-provided functions, you need a real boundary. gVisor gives you a user-space kernel boundary with good compatibility, while a microVM gives you a hardware boundary with the strongest guarantees. Either is defensible depending on your threat model and performance requirements.

For reinforcement learning training pipelines where AI-generated code is evaluated in sandboxes across potentially untrusted workers, the threat model is both the code and the worker. You need isolation in both directions, which pushes toward microVMs or gVisor with defense-in-depth layering.

What I’ve learned is that the common mistake is treating isolation as binary. It’s easy to assume that if you use Docker, you are isolated. The reality is that standard Docker gives you namespace isolation, which is just visibility walls on a shared kernel. Whether that is sufficient depends entirely on what you are protecting against.

It is also worth remembering that compute isolation is only half the problem. You can put code inside a gVisor sandbox or a Firecracker microVM with a hardware boundary, and none of it matters if the sandbox has unrestricted network egress for your “agentic workload”. An attacker who cannot escape the kernel can still exfiltrate every secret it can read over an outbound HTTP connection. Network policy, whether a stripped network namespace with no external route, a proxy-based domain allowlist, or explicit capability grants for specific destinations, is the other half of the isolation story and is easy to overlook. In practice this ranges from disabling network access entirely to running a proxy that performs redaction, injects credentials, or simply allowlists a specific set of domains.

Local sandboxing on developer machines

Everything above is about server-side multi-tenant isolation, where the threat is adversarial code escaping a sandbox to compromise a shared host. There is a related but different problem on developer machines: AI coding agents that execute commands locally on your laptop. The threat model shifts. There is no multi-tenancy. The concern is not kernel exploitation but rather preventing an agent from reading your ~/.ssh keys, exfiltrating secrets over the network, or writing to paths outside the project. (Or, if you are running Clawdbot locally, then everything is fair game.)

The approaches here use OS-level permission scoping rather than kernel boundary isolation.

Cursor uses Apple’s Seatbelt (sandbox-exec) on macOS and Landlock plus seccomp on Linux. It generates a dynamic policy at runtime based on the workspace: the agent can read and write the open workspace and /tmp, read the broader filesystem, but cannot write elsewhere or make network requests without explicit approval. This reduced agent interruptions by roughly 40% compared to requiring approval for every command, because the agent runs freely within the fence and only asks when it needs to step outside.

OpenAI’s Codex CLI takes a similar approach with explicit modes: read-only, workspace-write (the default), and danger-full-access. Network access is disabled by default. Claude Code and Gemini CLI both support sandboxing but ship with it off by default.

The common pattern across all of these seems to be filesystem and network ACLs enforced by the OS, not a separate kernel or hardware boundary. A determined attacker who already has code execution on your machine could potentially bypass Seatbelt or Landlock restrictions through privilege escalation. But that is not the threat model. The threat is an AI agent that is mostly helpful but occasionally careless or confused, and you want guardrails that catch the common failure modes: reading credentials it should not see, making network calls it should not make, writing to paths outside the project.

Apple’s new Containerization framework (announced at WWDC 2025) is interesting here. Unlike Docker on Mac, which runs all containers inside a single shared Linux VM, Apple gives each container its own lightweight VM via the Virtualization framework on Apple Silicon. Each container gets its own kernel, its own ext4 filesystem, and its own IP address. It is essentially the microVM model applied to local development, with OCI image compatibility. It is still early, but it collapses the gap between “local development containers” and “properly isolated sandboxes” in a way that Docker Desktop never did.

Parting notes

The landscape is moving in a clear direction. There is a lot of exciting new tech out there, with people constantly pushing the limits of cold starts toward faster, securely isolated workloads, using everything from Python decorators to other novel approaches that make microVMs feel like containers. I am excited to see what comes next in this space. It is definitely an area to watch.
