You run a program. It reads and writes addresses as if a giant, continuous slab of memory had been waiting there all along. It didn’t. Linux builds that illusion on the fly, one page at a time. This is a walk through what your process actually owns, what happens on the first touch of a byte, how protections and huge pages fit in, how to see the truth from /proc, and why modern kernels do a little extra dance to defend against Meltdown.
Note: This tour targets Linux on x86‑64; other architectures differ in details (page sizes, cache rules), but the ideas carry over.
Intro
The picture below is a quick introduction. It is a simple map you can keep in mind as you read.
Physical RAM is the real memory. It is a bunch of frames scattered around. The virtual view is the clean line your program sees. It does not match the real layout. The page table is a list. It tells which spot on the virtual line points to which frame in RAM. Disk is extra space the system can use when RAM is full.
Here is how it plays out. When you read or write, the CPU looks in the page table. If there is an entry it goes to that frame. If there is no entry you get a page fault. The system then fills a frame and adds the entry, or it stops you with an error. We will explain faults later.
When RAM is tight the system makes room. It moves pages you have not used in a while to disk, or drops file pages it can load again. If you touch one of those later it brings it back.

Tiny explainers appear throughout so anyone can follow along, regardless of background.
Tiny explainer: /proc
/proc is a virtual filesystem the kernel builds in memory. It exposes process and kernel state as files. You can read them with normal tools like cat.
The floor plan you never see
Inside the kernel, your process owns one object that represents its whole address space. Think of it as a floor plan. Each room on that plan is a virtual memory area (VMA): a contiguous range with the same permissions (read, write, execute) and the same kind of backing (anonymous memory or a file).
Tiny explainer: VMA
A VMA is a contiguous virtual address range with one set of rights and one kind of backing.
Under the plan sit the page tables that the hardware reads to translate your virtual addresses to real page frames.
Tiny explainer: page tables and PTE
Page tables are lookup structures the CPU walks to translate addresses. A page table entry (PTE) maps one virtual page to one physical page and holds bits like present and writable.
All threads in the process share the same plan. When the scheduler runs you, the CPU is pointed at your page tables, so pointer dereferences don’t need a syscall once a mapping exists; the hardware does the translation on its own.
You change the plan in three ways: mmap draws a room, mprotect changes the sign on its door (R/W/X), and munmap tears it down.
Tiny explainer: mmap
mmap reserves a virtual range with given permissions and a backing source.
Tiny explainer: mprotect
mprotect changes the permissions on an existing range.
Tiny explainer: munmap
munmap removes a mapping from the address space.
Everything else (creating pages, reading file data, swapping) happens lazily when you touch memory.
Tiny explainer: page
Hardware manages memory in fixed chunks called pages. On many x86‑64 machines a base page is 4 KiB. Bigger pages exist at 2 MiB and 1 GiB.
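To make the three calls concrete, here is a minimal sketch: draw a room with mmap, change the sign on its door with mprotect, tear it down with munmap. Nothing below is exotic; it compiles as-is on Linux.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;                         /* one base page */

    /* Draw a room: private anonymous memory, read+write. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(p, "hello", 6);                     /* first touch happens here, not at mmap */

    /* Change the sign on the door: read only from now on. */
    if (mprotect(p, len, PROT_READ) != 0) { perror("mprotect"); return 1; }
    printf("%s\n", p);                         /* reads fine; a write now would SIGSEGV */

    munmap(p, len);                            /* tear the room down */
    return 0;
}
```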
A quick glance at your own house
Run: cat /proc/self/maps | sed -n '1,80p'
You’ll see your main binary’s segments (code, data, bss), the heap, anonymous mappings (allocators use these for big chunks), shared libraries, and thread stacks near the top.
You’ll typically also see two small regions:
- [vdso]: a tiny shared object the kernel maps in so a few calls like gettimeofday can run without a kernel trap.
- [vvar]: read‑only data those helpers use.
Tiny explainer: vdso and vvar
vdso is code the kernel maps into your process to make some syscalls fast. vvar holds data that code reads.
They’re why asking the time is fast.
mmap, without the fog
When you call mmap, you’re not “allocating memory” so much as drawing a promise on the floor plan. You say: give me a range of addresses with these rights, backed by this file plus offset or by anonymous memory. Linux picks an address, makes sure it doesn’t collide, adjusts VMAs so each remains uniform, and records the promise.
Tiny explainer: ASLR
Address Space Layout Randomization places mappings at randomized locations to make exploits harder.
Tiny explainer: anonymous vs file mapping
Anonymous memory is not tied to a file and starts as zeros. File mappings mirror file contents.
It does not allocate pages yet. That comes later at first touch.
Two gotchas come up over and over:
- File mappings: offset must be page aligned, or mmap returns EINVAL.
- Mapping past the end of the file is allowed, but touching beyond the true end raises SIGBUS. The VMA exists; the data does not.
Tiny explainer: MAP_PRIVATE and MAP_SHARED
MAP_SHARED means writes go back to the file and are visible to others that share it. MAP_PRIVATE means you see the file but writes go to private copy‑on‑write pages.
Anonymous mappings start life as zeros. File mappings mirror the file. If the file ends mid page, the tail of that last page reads as zeros but still belongs to the file.
MAP_FIXED means “exactly here,” and it overwrites anything already mapped at that address. Prefer MAP_FIXED_NOREPLACE to fail instead of clobbering. Without either flag, your addr is just a hint.
Tiny explainer: MAP_FIXED_NOREPLACE
Ask for an exact address and fail if something is already there. Safer than overwriting.
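Here is a small file mapping sketch that brushes against both gotchas. It assumes a non-empty file passed as the first argument; an offset of 1 instead of 0 would earn you EINVAL, and touching a page wholly beyond EOF would earn you SIGBUS.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    long pagesz = sysconf(_SC_PAGESIZE);

    /* offset 0 is trivially page aligned; offset 1 would fail with EINVAL. */
    char *p = mmap(NULL, pagesz, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* The tail of the EOF page reads as zeros; a page wholly past EOF is SIGBUS. */
    printf("first byte: 0x%02x\n", (unsigned char)p[0]);

    munmap(p, pagesz);
    close(fd);
    return 0;
}
```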
The first touch
Imagine *p = 42; to a fresh mapping. The CPU tries to translate the address. It finds no entry so it raises a page fault that includes the address and an error code.
Tiny explainer: page fault
A page fault is the CPU asking the kernel to handle a missing or illegal translation for an address.
The kernel’s handler runs on your behalf and asks three questions in this order:
- Is the address inside any VMA? If not, you are poking a hole in the plan → SIGSEGV.
- Do the rights allow this access? A write to a read only page or execute from non exec memory → SIGSEGV.
- If it is valid but missing, make it real. For an anonymous mapping the kernel allocates a zero filled physical page, wires a page table entry with your requested permissions, and returns to your instruction. For a file mapping it first checks the page cache. If the data is not in RAM it reads from storage, then installs the translation and retries your instruction. Your store lands. You keep going.
Tiny explainer: page cache
The page cache is the kernel’s cache of file data in RAM. File mappings read and write through it.
Tiny explainer: zero page
Some reads from fresh anonymous memory can be satisfied by a shared read only page of zeros. A private page is created on the first write.
People count these faults:
- A minor fault means the data was already in RAM and only the translation was missing.
- A major fault means the kernel had to wait for I/O, which is expensive. A sketch below counts the minor kind from userspace.
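A minimal fault counting sketch using getrusage. Note that THP or mTHP may coalesce faults and lower the number you see:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t len = 64 * (size_t)pagesz;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long before = minor_faults();
    for (size_t i = 0; i < len; i += pagesz)
        p[i] = 1;                       /* first touch: roughly one fault per page */
    long after = minor_faults();

    printf("minor faults while touching 64 pages: %ld\n", after - before);
    munmap(p, len);
    return 0;
}
```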
 
Tiny explainer: stack guard
User stacks have a guard page. Touching just below the current stack can grow it. Touching far below looks like a bug and gets a SIGSEGV.
This same lazy first touch explains how memory is shared after fork() and how MAP_PRIVATE works. The next section shows that path.
Copy on write with fork() and MAP_PRIVATE
Why this is here. We just talked about first touch. The same rule explains why pages do not copy on fork and why MAP_PRIVATE does not change the file.
fork does not duplicate pages. The child points at the same physical pages as the parent. The kernel flips those pages to read only for both. The first write hits a copy on write fault. The kernel allocates a new page, copies the bytes, updates the writer’s page table entry to the new page with write permission, and returns. Reads still share the original page. That is why RSS stays flat after fork until you write.
Tiny explainer: RSS
Resident Set Size is how many pages of this process are currently in RAM.
Tiny explainer: copy on write
Share the same page for reads. Make a private copy only when a write happens.
MAP_PRIVATE uses the same idea. You read file data through the page cache. When you write, the kernel gives you a private page. The file stays unchanged.
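A sketch to watch CoW happen, assuming /proc/self/statm is available (its second field is resident pages). The child's resident count jumps only once it writes:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read resident pages from /proc/self/statm (second field). */
static long resident_pages(void) {
    long size, resident;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f || fscanf(f, "%ld %ld", &size, &resident) != 2) exit(1);
    fclose(f);
    return resident;
}

int main(void) {
    size_t len = 32 * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 1, len);                 /* parent touches everything */

    if (fork() == 0) {                 /* child shares the pages, marked read only */
        printf("child before write: %ld pages\n", resident_pages());
        memset(p, 2, len);             /* CoW: every write copies a page */
        printf("child after write:  %ld pages\n", resident_pages());
        _exit(0);
    }
    wait(NULL);
    return 0;
}
```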
Things you will also run into:
- fork then execve. The child replaces its whole address space soon after. That avoids most CoW work.
- vfork. The child runs in the parent’s address space until it calls exec or _exit. The parent waits. Do not touch memory in the child.
- clone with CLONE_VM. This makes a thread. One address space. No copy.
- MAP_SHARED. Writes go to the shared page and to the file or shmem. No CoW.
- MADV_DONTFORK. Leave this mapping out of the child.
- MADV_WIPEONFORK. The child sees zeros for this mapping.
- Transparent huge pages. Breaking CoW on a huge page may split it first. Small extra cost.
 
Changing rights, and the little pause you feel
Why you care. JITs and loaders flip a region from writable to executable after codegen, which is W^X. That flip is not free.
Tiny explainer: W^X
Write xor Execute is a policy. A page is never writable and executable at the same time.
mprotect(addr, len, prot) changes permissions. Internally the kernel may split VMAs so each remains uniform, edits the page table entries for the range, and then does one more necessary thing: it invalidates old translations from the CPU’s small cache of address translations, which is the TLB. That invalidation is the small pause you sometimes feel when a JIT flips RW to RX or back.
Tiny explainer: TLB
The Translation Lookaside Buffer caches recent translations so the CPU does not walk page tables every time.
Most systems enforce W^X. A page should not be writable and executable at the same time. JITs keep to that by flipping after codegen or by keeping two virtual mappings of the same memory so no single mapping is both.
Remember there are two layers of permission checks:
- Filesystem or mount policy, like noexec
- Page permissions, like PROT_EXEC
Either layer can block execution.
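Here is the classic W^X flip as a sketch. It is x86‑64 only because the machine code for "return 42" is hardcoded, and it assumes no policy (a noexec mount, an SELinux rule) blocks anonymous executable mappings:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* mov eax, 42 ; ret  --  a tiny function returning 42 */
    unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    memcpy(p, code, sizeof code);      /* write while W, never while X */

    /* The W^X flip: drop write, gain execute. This edits PTEs and
       invalidates stale TLB entries -- the small pause JITs feel. */
    if (mprotect(p, 4096, PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect");            /* EACCES here can mean policy said no */
        return 1;
    }

    int (*fn)(void) = (int (*)(void))p;
    printf("jitted function returned %d\n", fn());

    munmap(p, 4096);
    return 0;
}
```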
Seeing what’s really mapped
For everyday questions the friendly view is enough.
- /proc/<pid>/maps shows shapes: addresses, rights, file names.
- /proc/<pid>/smaps and smaps_rollup add per region accounting: how much is resident, which is RSS, private vs shared, and whether huge pages were used, like AnonHugePages and FilePmdMapped.
When you need truth at the per page level Linux exposes sharper tools.
/proc/<pid>/pagemap has one 64 bit entry per virtual page. It tells you whether a page is present, swapped, soft dirty, exclusively mapped with caveats for huge pages, whether it is write protected via userfaultfd, or part of a guard region. It can also reveal the page frame number which is PFN but modern kernels hide PFNs from unprivileged users. You need the right capability or root.
Tiny explainer: PFN
Page Frame Number is the physical page index used inside the kernel.
Tiny explainer: userfaultfd
A file descriptor that lets a userspace thread handle faults and write protect events for a range.
/proc/kpagecount is indexed by PFN and tells you how many mappings point at a given physical page.
/proc/kpageflags is also indexed by PFN and tells you what kind of page it is and what is happening to it like anonymous or file backed, part of a transparent huge page, in the LRU, dirty, under writeback, a page table page, or the shared zero page.
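A sketch of reading your own pagemap entry for one address. The bit positions (63 present, 62 swapped, 0-54 PFN) are from the kernel's pagemap documentation; expect the PFN to read as zero without privileges:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long pagesz = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    p[0] = 1;                                    /* first touch makes the page real */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* One 64 bit entry per virtual page. */
    off_t off = ((uintptr_t)p / pagesz) * sizeof(uint64_t);
    uint64_t entry;
    if (pread(fd, &entry, sizeof entry, off) != sizeof entry) {
        perror("pread"); return 1;
    }

    printf("present: %d  swapped: %d  pfn: %llu (reads 0 unless privileged)\n",
           (int)(entry >> 63 & 1),               /* bit 63: present */
           (int)(entry >> 62 & 1),               /* bit 62: swapped */
           (unsigned long long)(entry & ((1ULL << 55) - 1)));  /* bits 0-54: PFN */

    close(fd);
    return 0;
}
```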
Common wrinkles
- Sparse files. To tell hole vs data, combine mincore(), which says resident or not, with lseek(..., SEEK_DATA/SEEK_HOLE) on the backing file. Both are sketched after the explainers below.
- Shared memory and swap. Shared and shmem pages may be non present at the PTE level while still logically allocated. Expect swap entries and non present PTEs.
- Privileges. Modern kernels restrict PFN and some flag visibility to privileged users for security.
 
Tiny explainer: mincore
mincore tells you which pages of a mapping are in RAM.
Tiny explainer: SEEK_DATA and SEEK_HOLE
File offsets that let you skip to the next data chunk or the next hole in a sparse file.
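A sketch that combines the two, assuming a sparse file as input: lseek walks the file's data runs, and mincore reports which mapped pages are resident right now:

```c
#define _GNU_SOURCE                     /* for SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s sparse-file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    /* Which ranges of the file contain data at all? */
    off_t pos = 0;
    while (pos < st.st_size) {
        off_t data = lseek(fd, pos, SEEK_DATA);   /* next byte that is data */
        if (data < 0) break;                      /* ENXIO: nothing further */
        off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of that data run */
        printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
        pos = hole;
    }

    /* Which of the mapped pages are resident in RAM right now? */
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t n = (st.st_size + pagesz - 1) / pagesz;
    if (n > 0) {
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        unsigned char *vec = malloc(n);
        if (p != MAP_FAILED && vec && mincore(p, st.st_size, vec) == 0)
            for (size_t i = 0; i < n; i++)
                if (vec[i] & 1) printf("page %zu resident\n", i);
    }
    close(fd);
    return 0;
}
```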
Tiny explainer: soft dirty vs written
Soft dirty marks pages dirtied by userland, but it can get lost across swaps or VMA merges. Newer kernels offer an ioctl named PAGEMAP_SCAN that scans a range for pages written since the last write protect and can, in the same step, write protect them again. It pairs with userfaultfd write protect to give fast, race free userspace dirty tracking for snapshotting and live migration.
When your page suddenly gets bigger
Your CPU would rather cover more ground with fewer entries in its TLB. Linux can help by backing hot memory with bigger pages.
Tiny explainer: THP
Transparent Huge Pages automatically try to use larger pages for performance when safe.
Transparent Huge Pages do this automatically for anonymous memory and shmem or tmpfs. A fault can be satisfied with a 2 MiB page instead of 512 small ones. A background thread named khugepaged can also collapse adjacent base pages into a huge page when it is safe.
Tiny explainer: khugepaged
A kernel thread that scans and merges adjacent small pages into huge pages when conditions are right.
Modern kernels add multi size THP which is mTHP on some architectures. Groups of base pages like 16 KiB or 64 KiB reduce fault count and TLB pressure without always jumping to 2 MiB. They are still PTE mapped but behave as larger folios inside the VM.
Tiny explainer: mTHP
Multi size THP allows variable order large folios so you get some of the benefit without a full 2 MiB page.
You can ask for THP in a region with madvise(..., MADV_HUGEPAGE) or opt out with MADV_NOHUGEPAGE. System wide behavior lives under /sys/kernel/mm/transparent_hugepage/ with per size controls. enabled can be always, madvise, never, or inherit. Shmem or tmpfs have their own knobs like a huge= mount option with always, advise, within_size, never.
How to tell if it worked. In /proc/self/smaps the lines for a region include AnonHugePages for anonymous THP and FilePmdMapped for file or shmem huge mappings. System wide /proc/meminfo has AnonHugePages, ShmemPmdMapped, and ShmemHugePages. /proc/vmstat keeps a diary of THP events allocated on fault, fell back, split, swapped as a whole, and so on.
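A sketch of opting a region in, assuming enabled is always or madvise. It pauses so you can grep AnonHugePages in smaps for the printed address:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 8UL << 20;                    /* 8 MiB: room for 2 MiB pages */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    if (madvise(p, len, MADV_HUGEPAGE) != 0)   /* opt in; does nothing under enabled=never */
        perror("madvise");

    memset(p, 1, len);                         /* first touch: faults may now be 2 MiB each */

    printf("check AnonHugePages in /proc/%d/smaps near %p, then press enter\n",
           getpid(), (void *)p);
    getchar();
    munmap(p, len);
    return 0;
}
```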
Controls plain map:
- Top level: /sys/kernel/mm/transparent_hugepage/enabled, which is always or madvise or never
- Defrag effort: /sys/kernel/mm/transparent_hugepage/defrag tunes how hard the kernel tries on the fault path vs deferring to khugepaged
- Shmem or tmpfs: huge=always|within_size|advise|never, plus shmem specific knobs
One trade off: assembling a huge page may require compaction, which moves other pages to free a contiguous chunk, and that can add a small pause. If first touch latency matters more than steady state speed, the defrag knob lets you temper how hard the kernel tries, pushing work to khugepaged instead of doing it inline.
Tiny explainer: THP vs hugetlbfs
THP is automatic and pageable. Explicit huge pages from MAP_HUGETLB or hugetlbfs are quota managed and not swappable.
Dirty‑tracking in userspace, without racing the kernel
Imagine you want to copy only the pages an application modified since your last snapshot.
- Give yourself the ability to catch write protect faults with userfaultfd in write protect mode. The registration is sketched below.
- Use PAGEMAP_SCAN over your range with the category written since last write protect. Ask the kernel to write protect matching pages and to return compact ranges of what it found.
- Copy those ranges. When the app later writes to one of them, userfaultfd wakes your thread. Log the dirtied page, clear write protect, and let it proceed.
This avoids walking every PTE and avoids the classic race where a page is dirtied while you were looking. It is also fast because scan plus write protect happens as one atomic operation inside the kernel.
Tiny explainer: PAGEMAP_SCAN
An ioctl that scans a virtual range for pages with properties like written since last protect and can also apply write protect in the same step.
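A sketch of step one, the registration. It assumes a kernel with userfaultfd write protect support (5.7 or newer) and permission to create a userfaultfd; the monitor thread that reads fault events, and the PAGEMAP_SCAN call itself, are left out:

```c
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    size_t len = 16 * 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 0, len);   /* populate first; WP of never-touched pages needs a newer feature */

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) { perror("userfaultfd"); return 1; }

    struct uffdio_api api = { .api = UFFD_API,
                              .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP };
    if (ioctl(uffd, UFFDIO_API, &api) != 0) { perror("UFFDIO_API"); return 1; }

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)p, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) != 0) { perror("UFFDIO_REGISTER"); return 1; }

    /* Write protect the whole range; later writes queue events on uffd. */
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)p, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) != 0) { perror("UFFDIO_WRITEPROTECT"); return 1; }

    printf("range registered; a monitor thread would now read events from uffd\n");
    return 0;
}
```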
The TLB, and why mprotect costs a little
The Translation Lookaside Buffer remembers recent translations so the CPU does not walk page tables on every access. If Linux changes a mapping or its permissions it must make sure stale entries are not used.
On x86 there are two broad ways to do it.
- Precise invalidation. Invalidate one page at a time with INVLPG. Good for small changes. A single invalidation on a huge page mapping drops the whole 2 MiB entry.
- Broader flushes. Drop many or all entries, for example by reloading the page table root register. Fewer instructions now, more misses later while refilling.
 
Which is better depends on how big a change you made, whether you are changing small or huge pages, and the microarchitecture.
Tiny explainer: PCID
Process Context Identifiers tag TLB entries so switching page tables does not flush everything.
Tiny explainer: INVPCID
Allows targeted invalidation of TLB entries for a given tag without switching to it.
There is also a debug knob on some x86 builds named tlb_single_page_flush_ceiling that nudges when the kernel switches from per page invalidations to a broad flush.
Tiny explainer: INVLPG
A privileged instruction that invalidates TLB entries for the page containing a given address in the current address space tag.
Meltdown, and why the kernel sometimes switches maps on entry
Early 2018 brought Meltdown. Speculative execution plus a cache side channel could leak data across the user and kernel boundary. Even if a user mode load from a kernel address would fault, the CPU might speculatively execute it and touch data that leaves a measurable cache trace.
Linux’s defense on x86‑64 is Page Table Isolation which is PTI. Keep two views and switch between them on entry and exit.
Tiny explainer: CR3
CR3 holds the current page table root and on x86 switching it changes the active address space.
Tiny explainer: PTI
PTI keeps a reduced userspace view without normal kernel data mapped and a full kernel view used while in the kernel.
Cost. More page table switches, different TLB sharing behavior, and a small memory bump for extra top level tables and the per CPU entry area. With PCID Linux keeps separate TLB tags for the two views to reduce flushes. Some systems allow opting out with nopti when acceptable. Default is on.
Tiny explainer: what Meltdown reads
Permissions never turn off. The architectural access still faults. The leak is in transient speculation which leaves a timing trace.
How the kernel changes mappings safely
When Linux edits page tables the order is deliberate.
- Handle cache rules first on architectures that need it.
 - Modify page tables by adding, removing, or changing PTEs.
 - Invalidate the TLB so the CPU forgets stale translations.
 
Under the hood are functions that match the granularity of the change like flush an address space, flush a range, or flush a single page.
There is a parallel story for kernel only mappings made with vmap and vmalloc. Before I/O the kernel flushes the vmap range so the physical page sees the latest bytes. After I/O it invalidates the vmap range so speculative reads do not go stale.
Tiny explainer: vmap and vmalloc
APIs that create kernel virtual mappings to non contiguous physical pages for use inside the kernel.
On x86 you rarely think about the instruction cache because it is coherent with data stores. On others, copying code into executable memory requires an explicit instruction cache flush before running it. The VM has hooks like copy_to_user_page and flush_icache_range where architectures do this housekeeping.
Tiny explainer: icache flush
Some CPUs need an instruction cache sync after writing new code so execution sees the new bytes.
A tiny x86 aside: stacks and calls, without the haze
In 64 bit mode registers wear an R. RIP is the instruction pointer, RSP is the stack, RBP is the frame. The stack grows down. push decrements RSP and stores. pop loads then increments. CALL pushes the return address and jumps. RET pops it into RIP.
On Linux the System V AMD64 ABI passes the first arguments in registers which are RDI, RSI, RDX, RCX, R8, R9 and returns values in RAX. Large objects go by pointer. Your stack must be readable and writable.
Tiny explainer: System V AMD64 ABI
The calling convention for 64 bit Unix like systems on x86‑64 that defines where arguments and return values go.
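A tiny sketch to see the convention in action. The assembly in the comments is roughly what gcc -O1 emits on Linux x86‑64; your compiler's exact output may differ:

```c
#include <stdio.h>

/* System V AMD64: a arrives in RDI, b in RSI, c in RDX; the result
   leaves in RAX.  Roughly: lea rax,[rdi+rsi] ; add rax,rdx ; ret */
long sum3(long a, long b, long c) {
    return a + b + c;
}

int main(void) {
    /* CALL pushes the return address (RSP -= 8) and jumps to sum3;
       RET pops that address back into RIP. */
    printf("%ld\n", sum3(1, 2, 3));
    return 0;
}
```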
User code runs in ring 3. The kernel runs in ring 0. Crossings like syscalls, interrupts, and exceptions go through CPU defined gates. In 64 bit mode Linux uses a flat segmentation model and relies on paging for isolation.
Tiny explainer: rings
Rings are CPU privilege levels. Ring 3 is user mode. Ring 0 is kernel mode.
Tiny note for ARM64 readers
The ideas like stack growth and user vs kernel separation are similar. Register names, calling conventions, and syscall entry differ.
When things go sideways (and what that usually means)
- mmap → EINVAL: often a misaligned file offset, which must be page aligned, or an impossible flag combo.
- mmap → ENOMEM: you may be out of virtual space or VMA count, or you hit strict overcommit.
- Store to a file mapping → SIGBUS: you walked past EOF. The VMA existed; the data did not.
- mprotect(PROT_EXEC) → EACCES: could be a noexec mount or a W^X policy.
- Big malloc creates a new line in maps: your allocator used mmap for that size.
- RSS balloons after fork(): copy on write did its job and you wrote to lots of shared pages.
- Accidentally clobbered a mapping: you probably used MAP_FIXED. Prefer MAP_FIXED_NOREPLACE to fail instead of overwrite.
When it is mysterious, look. Start friendly with smaps_rollup for the big picture and maps for shapes. Drop to pagemap and the kpage* files only when you truly need per page truth and expect to need privileges.
A small checklist to keep nearby
- Need memory now. mmap anonymous with PROT_READ|PROT_WRITE and MAP_PRIVATE|MAP_ANONYMOUS.
- Generating code. Keep W^X. Write bytes, then mprotect(PROT_READ|PROT_EXEC).
- Mapping a file. offset must be page aligned. Touching beyond real EOF is SIGBUS.
- Lots of major faults. Nudge the kernel with MADV_WILLNEED or touch earlier. Watch page cache and storage.
- Where did memory go. Start with /proc/<pid>/smaps_rollup, then /proc/<pid>/maps.
- Forking big processes. Expect CoW. RSS grows as you write. Consider exec in the child for heavy work.
- Latency sensitive. Consider THP or mTHP where it helps. mlock hot sets. Watch your TLB behavior.
Feedback is extremely welcome! You can reach out to me on X @0xkato