Low-latency, high-throughput garbage collection Wenyu Zhao, Stephen M. Blackburn, and Kathryn S. McKinley, PLDI'22
Here is a counter-intuitive result: a stop-the-world garbage collector can deliver better tail latencies than industrial-strength concurrent garbage collectors. This paper is more of the how-the-sausage-is-made variety than the we-only-eat-vegan-in-the-ivory-tower variety. To push the cooking metaphor too far: this paper can be thought of as a sausage recipe, with instructions for combining many clever ideas into a tasty dish (named LXR). Here I describe some of the key ingredients.
LXR leverages the common trick of implementing a fast path which supports common use cases, paired with a slow path for completeness. The fast path uses reference counting (as opposed to tracing, i.e., traversing the object graph to find all live objects).
Temporal coarsening amortizes the cost of tracking reference counts across many pointer field modifications. Coalescing is how temporal coarsening is applied to pointer fields. For example, let’s say that the pointer field named `ptr` is updated at various time steps as follows:
1: ptr = a;
2: // GC stops the world here
3: ptr = b;
4: ptr = c;
5: ptr = d;
6: ptr = b;
7: // GC stops the world here
Without temporal coarsening, reference counts would be updated like so:
1: ptr = a; a->ref++;
2: // GC stops the world here
3: ptr = b; a->ref--; b->ref++;
4: ptr = c; b->ref--; c->ref++;
5: ptr = d; c->ref--; d->ref++;
6: ptr = b; d->ref--; b->ref++;
7: // GC stops the world here
The key observation which leads to temporal coarsening is that the reference count modifications at steps 3, 4, and 5 are unnecessary because reference counts only cause side effects (freeing objects) during GC pauses. If the system knew a priori that GC pauses would occur only at time steps 2 and 7, then much of the reference count manipulation could be removed:
1: ptr = a; a->ref++;
2: // GC stops the world here
3: ptr = b;
4: ptr = c;
5: ptr = d;
6: ptr = b; a->ref--; b->ref++;
7: // GC stops the world here
The time between GC pauses is called a reference counting (RC) epoch. The first time a specific pointer field is updated during an RC epoch, the previous value of the field (`a` in the running example) and the address of the field (`&ptr`) are appended to two buffers. When the epoch ends, the contents of these buffers are examined and reference counts are incremented or decremented as appropriate (`b->ref++`, `a->ref--`).
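To make the epoch-end step concrete, here is a minimal sketch in C. The buffer of old field values is called `decbuf` (that name appears later in the discussion); `modbuf`, the buffer of modified field addresses, and the `reclaim` hook are names I am assuming for illustration:

```c
#include <stddef.h>

typedef struct Object { unsigned ref; /* payload elided */ } Object;
void reclaim(Object *o);               /* assumed reclamation hook */

extern Object  *decbuf[];              /* old values of logged fields */
extern Object **modbuf[];              /* addresses of logged fields (assumed name) */
extern size_t decbuf_len, modbuf_len;

void process_rc_epoch_buffers(void) {
    /* Increment the *final* target of every logged field (b->ref++). */
    for (size_t i = 0; i < modbuf_len; i++) {
        Object *cur = *modbuf[i];
        if (cur) cur->ref++;
    }
    /* Decrement every *old* target recorded at logging time (a->ref--).
       Running increments first means an object that merely moved between
       fields never transiently hits zero. */
    for (size_t i = 0; i < decbuf_len; i++) {
        if (--decbuf[i]->ref == 0)
            reclaim(decbuf[i]);
    }
    decbuf_len = modbuf_len = 0;       /* reset for the next epoch */
}
```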
The code which tracks which fields have been updated and appends entries to the buffers is inlined into the application threads, right alongside the business logic, and is called the write barrier.
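A minimal sketch of such a barrier, using the `isUnlogged`/`attemptToLog` names from the discussion below (the buffer helpers and the per-field log bit they imply are assumptions, not LXR's literal API):

```c
#include <stdbool.h>

typedef struct Object Object;

bool isUnlogged(Object **field);      /* cheap, non-atomic read of the log bit */
bool attemptToLog(Object **field);    /* CAS the log bit; true if we won the race */
void decbuf_push(Object *old_target); /* old target: decrement at epoch end */
void modbuf_push(Object **field);     /* field address: increment its final value */

void write_barrier(Object **field, Object *new_value) {
    if (isUnlogged(field)) {          /* fast path: false after the first */
        if (attemptToLog(field)) {    /* logged write in this epoch       */
            decbuf_push(*field);      /* remember the overwritten value   */
            modbuf_push(field);       /* remember where the write happened */
        }
    }
    *field = new_value;               /* the actual pointer store */
}
```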
The `isUnlogged`/`attemptToLog` pair is an example of double-checked locking. The unspoken assumptions here are that the branch implied by `if (isUnlogged(field))` is highly predictable, and that `isUnlogged` is cheaper than an `atomic_dec` followed by an `atomic_inc`.
Deferral is how temporal coarsening is applied to pointers on the stack. At the start of an RC epoch, all application stacks are scanned for pointers (to find the root-reachable set). Note that this is not full tracing (objects indirectly accessible from a stack are not part of the root-reachable set). The reference count of each object in the root-reachable set is incremented. Pointers to these objects are also buffered so that reference counts can be decremented at the end of the epoch.
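A sketch of deferral under the same assumptions (the stack-scanning helper and `rootbuf` are stand-ins for the real machinery):

```c
#include <stddef.h>

typedef struct Object { unsigned ref; } Object;
void reclaim(Object *o);

/* Assumed helper: writes the pointers found in all application stacks
   into out[] and returns how many there were. */
size_t scan_stack_roots(Object **out, size_t cap);

#define ROOTBUF_CAP (1u << 20)
static Object *rootbuf[ROOTBUF_CAP];
static size_t rootbuf_len;

void begin_rc_epoch(void) {
    rootbuf_len = scan_stack_roots(rootbuf, ROOTBUF_CAP);
    for (size_t i = 0; i < rootbuf_len; i++)
        rootbuf[i]->ref++;            /* root-reachable objects stay live */
}

void end_rc_epoch(void) {
    for (size_t i = 0; i < rootbuf_len; i++)
        if (--rootbuf[i]->ref == 0)
            reclaim(rootbuf[i]);      /* no longer referenced anywhere */
    rootbuf_len = 0;
}
```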
LXR does utilize some concurrency. During each GC pause, LXR synchronously applies the reference count increments; it then restarts the application threads and handles the decrements (and the freeing of dead objects) asynchronously, off the critical path.
A classic problem with reference counting is handling cycles in the object graph. LXR solves this by periodically running a tracing garbage collection algorithm called Snapshot-At-The-Beginning (SATB). At first glance, it seems somewhat distasteful to implement two separate garbage collection algorithms. It turns out that SATB can be implemented as an incremental addition to the reference counting infrastructure.
In between RC epochs, LXR determines if it should start a new SATB collection. If so, LXR records the root-reachable set (note that LXR already has to find this set of objects to implement deferral, so no complexity is added here). The purpose of SATB is to find all objects which were dead at the moment in time when SATB tracing started. SATB concurrently traverses the object graph (starting from objects which were root accessible) to find all objects which are reachable from the root-reachable snapshot it took. The key question is: what happens if a pointer field is updated by an application thread in the middle of SATB traversing the object graph? The answer comes from the `decbuf` buffer written by the write barrier. When SATB finishes its traversal, it iterates through all objects in `decbuf` to find additional objects which were live at the start of the SATB trace.
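Putting that together, a sketch of the SATB trace (the marking helpers are assumed names; the LXR-specific ingredient is that `decbuf` doubles as SATB's log of overwritten pointers):

```c
#include <stddef.h>

typedef struct Object Object;

void mark_and_push(Object *o);   /* mark o live, queue it for scanning */
void drain_mark_stack(void);     /* transitively mark queued objects   */
void sweep_unmarked(void);       /* reclaim everything left unmarked   */

extern Object *root_snapshot[];  /* roots recorded at the epoch boundary */
extern size_t root_snapshot_len;
extern Object *decbuf[];         /* old targets logged by the write barrier */
extern size_t decbuf_len;

void satb_collect(void) {
    /* Seed marking with the snapshot of root-reachable objects (which LXR
       already gathers for deferral). */
    for (size_t i = 0; i < root_snapshot_len; i++)
        mark_and_push(root_snapshot[i]);
    drain_mark_stack();

    /* Any pointer overwritten while the trace ran had its old target pushed
       into decbuf; those targets were reachable when the trace began, so
       mark from them too. */
    for (size_t i = 0; i < decbuf_len; i++)
        mark_and_push(decbuf[i]);
    drain_mark_stack();

    sweep_unmarked();            /* unmarked == dead at the snapshot */
}
```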
Reference counts are stored in separate side metadata rather than intrusively inside objects. LXR allocates 2 bits of memory to hold each reference count. Temporal coarsening makes this plausible: it does not matter if the reference count of an object would exceed 3 in the middle of an RC epoch; all that matters is the value of a reference count at the boundaries between epochs. If a reference count reaches 3 at an epoch boundary, then LXR gives up on reference counting for that object and calls the reference count “stuck”. In this case, SATB tracing will eventually find and free the object when it is no longer needed.
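A sketch of those saturating 2-bit counters (the side-table accessors are assumed names):

```c
#include <stdint.h>

typedef struct Object Object;
void reclaim(Object *o);

enum { RC_STUCK = 3 };        /* the maximum value 2 bits can hold */

/* Assumed accessors for the 2-bit-per-object side table. */
uint8_t rc_get(Object *o);
void    rc_set(Object *o, uint8_t count);

void rc_inc(Object *o) {
    uint8_t c = rc_get(o);
    if (c < RC_STUCK) rc_set(o, c + 1);   /* saturate at 3: now "stuck" */
}

void rc_dec(Object *o) {
    uint8_t c = rc_get(o);
    if (c == RC_STUCK) return;            /* stuck counts never move again;
                                             SATB will reclaim the object */
    /* Every decrement matches a prior increment, so c > 0 here. */
    rc_set(o, c - 1);
    if (c == 1) reclaim(o);               /* count just hit zero */
}
```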
So how does it perform? Pretty dang good: in the paper's evaluation, LXR delivers better tail latency and better throughput than the industrial-strength concurrent collectors it is compared against.
Now for some open problems (while I restrain myself from making jokes about actual pointers).
There is a situation I like to call “cache miss city”. This happens when some bookkeeping code with very little compute to do makes frequent random memory accesses: IPC falls precipitously, and your wicked fast 5GHz 8-wide superscalar machine slows to a crawl. I suspect the code that increments and decrements reference counts falls into this category. Even with 2-bit-wide counters, I suspect each increment/decrement will cause L1/L2/TLB misses.
This sort of thing happens all the time and seems like it is ripe for a creative solution.
Here is an idea: aggregation inside the memory hierarchy. For example, processors could support an associative atomic add instruction. On a cache miss there would be no need to read the cache line from another level of the memory hierarchy; instead, all bits of the cache line (in a cache that is private to the core) would be set to zero. When that cache line is later evicted, rather than overwriting the contents at the next level of the hierarchy, the contents of the evicted line would be added to the data at the destination.