Low-latency, high-throughput garbage collection Wenyu Zhao, Stephen M. Blackburn, and Kathryn S. McKinley, PLDI'22
Here is a counter-intuitive result: a stop-the-world garbage collector can deliver better tail latencies than industrial-strength concurrent garbage collectors. This paper is more of the how-the-sausage-is-made variety than the we-only-eat-vegan-in-the-ivory-tower variety. To push the cooking metaphor too far: this paper can be thought of as a sausage recipe, with instructions for combining many clever ideas into a tasty dish (named LXR). Here I describe some of the key ingredients.
LXR leverages the common trick of implementing a fast path which supports common use cases, paired with a slow path for completeness. The fast path uses reference counting (as opposed to tracing, i.e., traversing the object graph to find all live objects).
Temporal coarsening amortizes the cost of tracking reference counts across many pointer field modifications. Coalescing is how temporal coarsening is applied to pointer fields. For example, let’s say that the pointer field named `ptr` is updated at various time steps as follows:
1: ptr = a;
2: // GC stops the world here
3: ptr = b;
4: ptr = c;
5: ptr = d;
6: ptr = b;
7: // GC stops the world here
Without temporal coarsening, reference counts would be updated like so:
1: ptr = a; a->ref++;
2: // GC stops the world here
3: ptr = b; a->ref--; b->ref++;
4: ptr = c; b->ref--; c->ref++;
5: ptr = d; c->ref--; d->ref++;
6: ptr = b; d->ref--; b->ref++;
7: // GC stops the world here
The key observation which leads to temporal coarsening is that the reference count modifications at steps 3, 4, and 5 are unnecessary because reference counts only cause side effects (freeing objects) during GC pauses. If the system knew a priori that GC pauses would occur only at time steps 2 and 7, then much of the reference count manipulation could be removed:
1: ptr = a; a->ref++;
2: // GC stops the world here
3: ptr = b;
4: ptr = c;
5: ptr = d;
6: ptr = b; a->ref--; b->ref++;
7: // GC stops the world here
The time between GC pauses is called a reference counting (RC) epoch. The first time a specific pointer field is updated during an RC epoch, the previous value of the field (`a` in the running example) and the address of the field (`&ptr`) are appended to two buffers. When the epoch ends, the contents of these buffers are examined and reference counts are incremented or decremented as appropriate (`b->ref++`, `a->ref--`).
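To make the epoch-end step concrete, here is a minimal sketch in C. The buffer of old field values is called `decbuf` (that name appears later in the discussion); `modbuf`, the buffer of modified field addresses, and the `reclaim` hook are names I am assuming for illustration:

```c
#include <stddef.h>

typedef struct Object { unsigned ref; /* payload elided */ } Object;
void reclaim(Object *o);               /* assumed reclamation hook */

extern Object  *decbuf[];              /* old values of logged fields */
extern Object **modbuf[];              /* addresses of logged fields (assumed name) */
extern size_t decbuf_len, modbuf_len;

void process_rc_epoch_buffers(void) {
    /* Increment the *final* target of every logged field (b->ref++). */
    for (size_t i = 0; i < modbuf_len; i++) {
        Object *cur = *modbuf[i];
        if (cur) cur->ref++;
    }
    /* Decrement every *old* target recorded at logging time (a->ref--).
       Running increments first means an object that merely moved between
       fields never transiently hits zero. */
    for (size_t i = 0; i < decbuf_len; i++) {
        if (--decbuf[i]->ref == 0)
            reclaim(decbuf[i]);
    }
    decbuf_len = modbuf_len = 0;       /* reset for the next epoch */
}
```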
The code which tracks which fields have been updated and appends entries to the buffers is inlined into the application threads, right alongside the business logic, and is called the write barrier.
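A minimal sketch of such a barrier, using the `isUnlogged`/`attemptToLog` names from the discussion below (the buffer helpers and the per-field log bit they imply are assumptions, not LXR's literal API):

```c
#include <stdbool.h>

typedef struct Object Object;

bool isUnlogged(Object **field);      /* cheap, non-atomic read of the log bit */
bool attemptToLog(Object **field);    /* CAS the log bit; true if we won the race */
void decbuf_push(Object *old_target); /* old target: decrement at epoch end */
void modbuf_push(Object **field);     /* field address: increment its final value */

void write_barrier(Object **field, Object *new_value) {
    if (isUnlogged(field)) {          /* fast path: false after the first */
        if (attemptToLog(field)) {    /* logged write in this epoch       */
            decbuf_push(*field);      /* remember the overwritten value   */
            modbuf_push(field);       /* remember where the write happened */
        }
    }
    *field = new_value;               /* the actual pointer store */
}
```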
The `isUnlogged`/`attemptToLog` pair is an example of double-checked locking. The unspoken assumptions here are that the branch implied by `if (isUnlogged(field))` is highly predictable, and that `isUnlogged` is cheaper than an `atomic_dec` followed by an `atomic_inc`.
Deferral is how temporal coarsening is applied to pointers on the stack. At the start of an RC epoch, all application stacks are scanned for pointers (to find the root-reachable set). Note that this is not full tracing (objects indirectly accessible from a stack are not part of the root-reachable set). The reference count of each object in the root-reachable set is incremented. Pointers to these objects are also buffered so that reference counts can be decremented at the end of the epoch.
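A sketch of deferral under the same assumptions (the stack-scanning helper and `rootbuf` are stand-ins for the real machinery):

```c
#include <stddef.h>

typedef struct Object { unsigned ref; } Object;
void reclaim(Object *o);

/* Assumed helper: writes the pointers found in all application stacks
   into out[] and returns how many there were. */
size_t scan_stack_roots(Object **out, size_t cap);

#define ROOTBUF_CAP (1u << 20)
static Object *rootbuf[ROOTBUF_CAP];
static size_t rootbuf_len;

void begin_rc_epoch(void) {
    rootbuf_len = scan_stack_roots(rootbuf, ROOTBUF_CAP);
    for (size_t i = 0; i < rootbuf_len; i++)
        rootbuf[i]->ref++;            /* root-reachable objects stay live */
}

void end_rc_epoch(void) {
    for (size_t i = 0; i < rootbuf_len; i++)
        if (--rootbuf[i]->ref == 0)
            reclaim(rootbuf[i]);      /* no longer referenced anywhere */
    rootbuf_len = 0;
}
```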
LXR does utilize some concurrency. During each GC pause, LXR synchronously applies the reference count increments; it then restarts the application threads and handles the decrements (and the freeing of dead objects) asynchronously, off the critical path.
A classic problem with reference counting is handling cycles in the object graph. LXR solves this by periodically running a tracing garbage collection algorithm called Snapshot-At-The-Beginning (SATB). At first glance, it seems somewhat distasteful to implement two separate garbage collection algorithms. It turns out that SATB can be implemented as an incremental addition to the reference counting infrastructure.
In between RC epochs, LXR determines if it should start a new SATB collection. If so, LXR records the root-reachable set (note that LXR already has to find this set of objects to implement deferral, so no complexity is added here). The purpose of SATB is to find all objects which were dead at the moment in time when SATB tracing started. SATB concurrently traverses the object graph (starting from objects which were root accessible) to find all objects which are reachable from the root-reachable snapshot it took. The key question is: what happens if a pointer field is updated by an application thread in the middle of SATB traversing the object graph? The answer comes from the `decbuf` buffer written by the write barrier. When SATB finishes its traversal, it iterates through all objects in `decbuf` to find additional objects which were live at the start of the SATB trace.
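Putting that together, a sketch of the SATB trace (the marking helpers are assumed names; the LXR-specific ingredient is that `decbuf` doubles as SATB's log of overwritten pointers):

```c
#include <stddef.h>

typedef struct Object Object;

void mark_and_push(Object *o);   /* mark o live, queue it for scanning */
void drain_mark_stack(void);     /* transitively mark queued objects   */
void sweep_unmarked(void);       /* reclaim everything left unmarked   */

extern Object *root_snapshot[];  /* roots recorded at the epoch boundary */
extern size_t root_snapshot_len;
extern Object *decbuf[];         /* old targets logged by the write barrier */
extern size_t decbuf_len;

void satb_collect(void) {
    /* Seed marking with the snapshot of root-reachable objects (which LXR
       already gathers for deferral). */
    for (size_t i = 0; i < root_snapshot_len; i++)
        mark_and_push(root_snapshot[i]);
    drain_mark_stack();

    /* Any pointer overwritten while the trace ran had its old target pushed
       into decbuf; those targets were reachable when the trace began, so
       mark from them too. */
    for (size_t i = 0; i < decbuf_len; i++)
        mark_and_push(decbuf[i]);
    drain_mark_stack();

    sweep_unmarked();            /* unmarked == dead at the snapshot */
}
```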
Reference counts are stored in separate side metadata rather than intrusively inside objects. LXR allocates 2 bits of memory to hold each reference count. Temporal coarsening makes this plausible: it does not matter if the reference count of an object would exceed 3 in the middle of an RC epoch; all that matters is the value of a reference count at the boundaries between epochs. If a reference count reaches 3 at an epoch boundary, then LXR gives up on reference counting for that object and calls the reference count “stuck”. In this case, SATB tracing will eventually find and free the object when it is no longer needed.
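A sketch of those saturating 2-bit counters (the side-table accessors are assumed names):

```c
#include <stdint.h>

typedef struct Object Object;
void reclaim(Object *o);

enum { RC_STUCK = 3 };        /* the maximum value 2 bits can hold */

/* Assumed accessors for the 2-bit-per-object side table. */
uint8_t rc_get(Object *o);
void    rc_set(Object *o, uint8_t count);

void rc_inc(Object *o) {
    uint8_t c = rc_get(o);
    if (c < RC_STUCK) rc_set(o, c + 1);   /* saturate at 3: now "stuck" */
}

void rc_dec(Object *o) {
    uint8_t c = rc_get(o);
    if (c == RC_STUCK) return;            /* stuck counts never move again;
                                             SATB will reclaim the object */
    /* Every decrement matches a prior increment, so c > 0 here. */
    rc_set(o, c - 1);
    if (c == 1) reclaim(o);               /* count just hit zero */
}
```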
So how does it perform? Pretty dang good: in the paper's evaluation, LXR delivers better tail latency and better throughput than the industrial-strength concurrent collectors it is compared against.
Now for some open problems (while I restrain myself from making jokes about actual pointers).
There is a situation I like to call “cache miss city”. This happens when some bookkeeping code with very little compute to do makes frequent random memory accesses: IPC falls precipitously, and your wicked fast 5GHz 8-wide superscalar machine slows to a crawl. I suspect the code that increments and decrements reference counts falls into this category. Even with 2-bit-wide counters, I suspect each increment/decrement will cause L1/L2/TLB misses.
This sort of thing happens all the time and seems like it is ripe for a creative solution.
Here is an idea: aggregation inside the memory hierarchy. For example, processors could support an associative atomic add instruction. On a cache miss there would be no need to read the cache line from another level of the memory hierarchy; instead, all bits of the cache line (in a cache that is private to the core) would be set to zero. When that cache line is later evicted, rather than overwriting the contents at the next level of the hierarchy, the contents of the evicted line would be added to the data at the destination.