Rendering Crispy Text on the GPU

Original link: https://osor.io/text

This post presents a real-time glyph rendering approach that addresses problems common to traditional methods, such as aliasing, large textures, and lack of flexibility. The core idea is to rasterize the glyph curves at runtime directly on the GPU instead of relying on pre-baked textures. The approach has several advantages:

  • High quality: rendering the vector representation directly gives good glyph quality at any resolution, with subpixel anti-aliasing for smooth results.
  • Low storage and memory footprint: the method reads glyph curve data rather than rasterizing glyphs into textures, taking far less space.
  • Flexibility: working with curves allows dynamic editing, animation, and potentially rendering any vector image.
  • Simplicity: the pipeline is streamlined, with less complexity than an intermediate baking step.

The method uses temporal accumulation to refine glyph quality over time, rasterizing results into an atlas and keeping glyphs resident for reuse. Z-order packing manages the atlas space efficiently. Subpixel anti-aliasing is implemented to address fringing on displays with non-standard subpixel layouts. The author stresses the importance of high-quality text rendering in user interfaces and hopes this will spark further innovation in the area.

The accompanying Hacker News thread revolves around this blog post about rendering crisp text on the GPU. The author was surprised by how popular the article became, and commenters dug into many aspects of font rendering. Subpixel rendering was a major point of contention: some argued it is essential for readability on standard-resolution displays, while others considered it obsolete given the prevalence of high-DPI ("Retina") displays, pointing to the artifacts and complexity it introduces. Several users mentioned they still don't have Retina displays and like how subpixel rendering looks on older monitors. The discussion also covered alternatives such as SDF (signed distance field) and MSDF (multi-channel signed distance field) text rendering, their limitations, and possible fixes. Commenters compared existing implementations, such as Valve's SDF approach and Pathfinder's GPU compute shaders, against traditional TrueType rasterization. Performance questions came up as well, such as the cost of generating SDFs and the benefits of caching, along with the patent issues around the Slug rendering library. Finally, some users noted that OLED screens suffer from fringing due to their non-standard subpixel structures.

Original article

It’s not the first time I’ve fallen down the rabbit-hole of rendering text in real time. Every time I’ve looked into it, an inkling of dissatisfaction has always remained: aliasing, large textures, slow build times, minification/magnification, smooth movement, etc.

Last time I landed on a solution using Multi-Channel Signed Distance Fields (SDFs) which was working well. However there were still a few things that bothered me and I just needed a little excuse to jump back into the topic.

That excuse came in the form of getting a new monitor of all things. One of those new OLEDs that look so nice, but that have fringing issues because of their non-standard subpixel structure. I got involved in a GitHub issue discussing this and of course posted my unrequested take on how to go about it. This was the last straw I needed to go try and implement glyph rendering again, this time with subpixel anti-aliasing.

Just to start things off, here is a test with a bunch of fonts, trying to test the most common styles, rounded, sharp, very thin lines, etc.

This one is higher resolution; it’s recommended to open it in a new tab and view it at native 100% zoom if possible.

And a cheeky menu to show it in movement, along with a console and the previously shown demo text.

An important disclaimer about showing images and videos in this post is that artifacts might show due to minification/magnification, pixel alignment, and even subpixel structure.

There were a few things that were bothering me about using SDFs, the main ones being:

Quality

Certain fonts would struggle to render nicely, especially ones with thin features or lots of detail. SDFs represent “blobby” glyphs nicely, and even simple sharp glyphs if you go the multi-channel route. But at some point you need to increase resolution to get rid of the artifacts.

Here you can see an example of Miama switching between SDF and the new method, note how the thin features were often getting lost, as well as how the “f” was struggling due to its size.

Atlas size

The SDFs are generated offline then stored to an atlas. Even though SDFs require way less resolution for good output quality, you still need something. Especially for fonts with a lot of glyphs this was adding up. I even tried some fonts for Japanese and Chinese which I couldn’t realistically bake to a single atlas due to how big they would have been.

Here you can see the atlas for Miama, which even as a font for only Latin languages without that many special characters comes to a resolution of $4096\times1152$ with each glyph taking a $64\times64$ region.

Having multiple fonts available at runtime was adding a significant memory cost and getting them in and out was some significant streaming bandwidth. And the more fonts the bigger the issue.

Flexibility

In general, I found it fiddly to get around issues like minification or to implement new ideas like the subpixel anti-aliasing that kickstarted all this. For a while I also wanted to work with potentially any vector image, which would have required baking, so it couldn’t be generated or edited at runtime.

Simplicity

Working with intermediate steps that transform the source data is a raw increase in complexity of the whole system. Even if some of that complexity is hidden by some library that takes the glyphs and bakes them, it’s still there.

A solution that more directly takes the raw input in the form of the bezier curves that the glyph creator made would be conceptually simpler. Over time I’ve come to appreciate solutions that have fewer moving parts and where the flow from source data to the desired result is as simple and understandable as possible.

The idea is fairly simple: instead of baking anything to textures, grab the curves that define the currently visible glyphs themselves, send them to the GPU, and somehow rasterize them. In a way you can see this as moving the necessary rasterization step that previously was offline to be done at runtime.

This would take much less storage compared to the per-glyph cost of a cell in an atlas, it would allow glyphs to look good at any resolution since we’re rendering the vector representation directly, and it would play nice with things like subpixel anti-aliasing, where instead of computing coverage for a single pixel, we’d do it for each of the subpixel elements.

As a very short summary, the solution consists of loading the glyph curve data directly, rasterizing it at runtime into an atlas, and sampling said atlas as required to render the visible glyphs.

The sauce here is keeping glyphs in the atlas as long as they keep being used in subsequent frames. This allows accumulating and refining the rasterization results, to the extent of getting very high quality subpixel anti-aliasing.

I’ll give an overview of the whole pipeline here in execution order, from loading the raw font until the glyphs end up on the screen.

Processing the Quadratic Bezier Curves

I’m using FreeType in an offline tool as an intermediary way to load any of the font formats they support. Then I traverse the curves of each glyph and store them in my asset format that will get passed to the GPU.

The glyphs may contain either lines, quadratic beziers (3 points) or cubic beziers (4 points). To allow for a simpler shader I convert all of these to quadratic beziers.

Transforming a line into a quadratic bezier is fairly obvious: just create a new control point exactly in the middle of the two existing ones:

// Given the two points for the line
p0 := /*...*/;
p1 := /*...*/;

// Create a new control point in the middle
m := lerp(p0, p1, 0.5);

// And create a quadratic bezier with those
new_curve(p0,m,p1);

Transforming a cubic bezier curve to a quadratic one implies lowering its order, which is necessarily a lossy process. In this case I’m choosing to always split the cubic bezier into two quadratics, which works well in all the fonts I’ve tried:

// Given these cubic bezier points
p0 := /*...*/;
p1 := /*...*/;
p2 := /*...*/;
p3 := /*...*/;

// Calculate these extra control points
c0 := lerp(p0, p1, 0.75);
c1 := lerp(p3, p2, 0.75);
m  := lerp(c0, c1, 0.5);

// And create two quadratic bezier curves
new_curve(p0,c0,m);
new_curve(m,c1,p3);

Here you have a desmos graph where you can move the points around and see the input cubic bezier and the resulting two quadratic ones.

There are much more interesting ways to do this split that would reduce the error further, but this works fairly well for the majority of cubic beziers found in the fonts I’ve tried. It’s also possible to use offline tools to do a higher quality transformation into a format that only has quadratic beziers, like TrueType (.ttf), which would avoid this conversion altogether.

Here are some of the points after being loaded: the blue points are the ones that define the beginning and end of each bezier curve (the “on” points) and the red ones are the middle point of each bezier, defining how it curves (the “off” points).

Calculating Coverage

Here I’m not doing anything particularly interesting or different from what you might find elsewhere. A ray is shot horizontally, left-to-right on a per-pixel basis, testing against the curves for intersections and accumulating a winding number to see whether the point is considered outside (zero) or inside (non-zero). At the end of the day it’s “just” solving a quadratic equation.

My favorite explanation of the math behind this, with some extra neat diagrams, is in the read-me of this GitHub repository by GreenLightning explaining his GPU Font Rendering approach. It would also be a crime not to link to Sebastian Lague’s Rendering Text video where he covers the principles behind glyph rasterization and his adventures making his solution better. If you’re interested in the source code as well, both of these links can sort you out.

Something worth mentioning is that there can be issues in this step due to inaccuracies on the intersection computation, as the links above already mention. Since I knew I would be accumulating hundreds of samples over time I chose not to do anything explicitly about that at this stage and this has proven to be the right decision so far.

Most of these inaccuracies happen when the samples are at a very specific height and these can still happen in my implementation. That said, maybe one or two samples out of a few hundred can have incorrect coverage in the worst case, but after averaging these are not visible.

At the time of writing I’m accumulating up to 512 samples per-glyph if it stays on screen. If a single sample goes wrong, that means that the pixel is outputting $1/512=0.00195$ or $511/512=0.99804$ instead of $0$ and $1$ respectively which is imperceptible in practice. Furthermore, you could have a threshold where you clamp to the extremes if the coverage is close, making these $0.002$ and $0.998$ be evaluated as $0$ and $1$ respectively.
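
That optional clamp could look something like the following C++-style sketch; the threshold value is illustrative rather than anything the post prescribes.

// Snap coverage values within a small threshold of the extremes to exactly
// 0 or 1 so a stray bad sample can't leak through.
float snap_coverage(float coverage, float threshold)
{
    if (coverage <= threshold)        return 0.0f;
    if (coverage >= 1.0f - threshold) return 1.0f;
    return coverage;
}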

For completeness, here’s the code to compute the coverage. It iterates over a bitset to access the relevant curves of the glyph and computes a winding number, which is then transformed into a coverage value. For a reference on how to compute the winding number I refer you again to GreenLightning’s repository, which explains it wonderfully and provides sample code.

u32 words[GLYPH_CURVE_WORD_COUNT] = /* . . . */; // Bitset marking which curves are relevant for this texel

uint4 addend = 0;
for (int tick_offset = 0; tick_offset < parameters.tick_increment; ++tick_offset)
{
    float2 subpixel_offset = quasirandom_float2(parameters.tick + tick_offset);
    float2 pixel_offset_r = lerp(per_frame.subpixel_layout.r_min, per_frame.subpixel_layout.r_max, subpixel_offset);
    float2 pixel_offset_g = lerp(per_frame.subpixel_layout.g_min, per_frame.subpixel_layout.g_max, subpixel_offset);
    float2 pixel_offset_b = lerp(per_frame.subpixel_layout.b_min, per_frame.subpixel_layout.b_max, subpixel_offset);

    float2 uv_r = (local_texel_coordinates_subpixel + pixel_offset_r) / parameters.size_in_pixels;
    float2 uv_g = (local_texel_coordinates_subpixel + pixel_offset_g) / parameters.size_in_pixels;
    float2 uv_b = (local_texel_coordinates_subpixel + pixel_offset_b) / parameters.size_in_pixels;

    float2 em_r = lerp(glyph.bbox_em_top_left, glyph.bbox_em_bottom_right, uv_r);
    float2 em_g = lerp(glyph.bbox_em_top_left, glyph.bbox_em_bottom_right, uv_g);
    float2 em_b = lerp(glyph.bbox_em_top_left, glyph.bbox_em_bottom_right, uv_b);

    float3 winding_number = 0;
    for (int word_index = 0; word_index < GLYPH_CURVE_WORD_COUNT; ++word_index)
    {
        u32 remaining_bits = words[word_index];
        while (remaining_bits)
        {
            int bit_index = firstbitlow(remaining_bits);
            int local_curve_index = (word_index * 32) + bit_index;
            remaining_bits ^= (1u << bit_index);
            int global_curve_index = glyph.curve_offset + local_curve_index;
            int first_point_index = global_curve_index * 2;
            {
                float2 p0 = point_buffer[first_point_index];
                float2 p1 = point_buffer[first_point_index + 1];
                float2 p2 = point_buffer[first_point_index + 2];
                winding_number.r += compute_winding_number(p0, p1, p2, em_r);
                winding_number.g += compute_winding_number(p0, p1, p2, em_g);
                winding_number.b += compute_winding_number(p0, p1, p2, em_b);
            }
        }
    }
    float3 coverage = saturate(winding_number);
    addend += uint4(coverage, 1);
}

This addend simply gets added to the previous value for that texel in the atlas, which will be explained later.

For the quasirandom_float2 I’m using the fantastic $R_2$ sequence presented by Martin Roberts. In this shadertoy you can see how it distributes the sample points to provide some very good coverage over time.
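
Neither quasirandom_float2 nor compute_winding_number is shown in the post, both being deferred to the references above. For concreteness, here is a minimal C++-style sketch of the two: the constants come from the $R_2$ sequence, while the winding math follows the standard quadratic-root approach. The crossing-sign convention depends on the font's contour orientation, and the endpoint edge cases discussed earlier are left unhandled.

#include <cmath>

struct float2 { float x, y; };

// R2 low-discrepancy sequence (Martin Roberts): well distributed 2D offsets
// in [0,1)^2, indexed by n.
float2 quasirandom_float2(unsigned n)
{
    const float a1 = 0.7548776662466927f; // 1/g, g being the plastic constant
    const float a2 = 0.5698402909980532f; // 1/g^2
    float2 p;
    p.x = fmodf(0.5f + a1 * (float)n, 1.0f);
    p.y = fmodf(0.5f + a2 * (float)n, 1.0f);
    return p;
}

// Winding-number contribution of one quadratic bezier (p0, p1, p2) for a ray
// shot towards +x from `origin`.
float compute_winding_number(float2 p0, float2 p1, float2 p2, float2 origin)
{
    // Work relative to the sample point, so the ray is y == 0, x >= 0.
    p0.x -= origin.x; p0.y -= origin.y;
    p1.x -= origin.x; p1.y -= origin.y;
    p2.x -= origin.x; p2.y -= origin.y;

    // y(t) = a*t^2 - 2*b*t + c
    float a = p0.y - 2.0f * p1.y + p2.y;
    float b = p0.y - p1.y;
    float c = p0.y;

    float roots[2] = { -1.0f, -1.0f };
    if (fabsf(a) < 1e-6f)
    {
        if (fabsf(b) > 1e-6f) roots[0] = c / (2.0f * b); // (nearly) a line in y
    }
    else
    {
        float discriminant = b * b - a * c;
        if (discriminant >= 0.0f)
        {
            float s = sqrtf(discriminant);
            roots[0] = (b - s) / a;
            roots[1] = (b + s) / a;
        }
    }

    float winding = 0.0f;
    for (float t : roots)
    {
        // Half-open range so shared endpoints between adjacent curves
        // aren't counted twice.
        if (t < 0.0f || t >= 1.0f) continue;
        float x = (1.0f - t) * (1.0f - t) * p0.x
                + 2.0f * t * (1.0f - t) * p1.x
                + t * t * p2.x;
        if (x < 0.0f) continue;        // only crossings to the right count
        float dy = 2.0f * (a * t - b); // crossing direction gives the sign
        winding += (dy > 0.0f) ? 1.0f : -1.0f;
    }
    return winding;
}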

Accelerating Curve Access

A good optimization to make here is to split the glyph into a number of horizontal bands and store which curves of the glyph touch each band. The rasterization code is tracing only horizontally, so with this we can massively reduce the set of curves that each texel has to test against. To do this I have a bunch of bits per-band per-glyph that represent which of the glyph’s local curves are present in the band.
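
As a hedged sketch of how those per-band bitsets could be built offline (the structure and function names here are illustrative, not the author's), one can use the fact that a quadratic bezier never leaves the convex hull of its control points:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical structures: three control points per quadratic curve, in em space.
struct Point { float x, y; };
struct Curve { Point p0, p1, p2; };

// One bitset per horizontal band (words_per_band corresponds to
// GLYPH_CURVE_WORD_COUNT in the shader), marking which of the glyph's curves
// overlap that band vertically.
std::vector<uint32_t> build_band_bitsets(const std::vector<Curve> &curves,
                                         float glyph_top, float glyph_bottom,
                                         int band_count, int words_per_band)
{
    std::vector<uint32_t> bits(band_count * words_per_band, 0);
    float band_height = (glyph_bottom - glyph_top) / band_count;
    for (int curve_index = 0; curve_index < (int)curves.size(); ++curve_index)
    {
        const Curve &c = curves[curve_index];
        // The control points bound the curve, so their vertical extent
        // bounds the curve's vertical extent too.
        float y_min = std::min({ c.p0.y, c.p1.y, c.p2.y });
        float y_max = std::max({ c.p0.y, c.p1.y, c.p2.y });
        int first_band = std::clamp((int)((y_min - glyph_top) / band_height), 0, band_count - 1);
        int last_band  = std::clamp((int)((y_max - glyph_top) / band_height), 0, band_count - 1);
        for (int band = first_band; band <= last_band; ++band)
            bits[band * words_per_band + (curve_index / 32)] |= 1u << (curve_index % 32);
    }
    return bits;
}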

Here is a visualization of which curves are on the different bands, highlighted in yellow. You can imagine how a ray traced from left to right of the glyph can just intersect the relevant curves.

You get some great wins by having each texel loop over the curves relevant for that band. However, this can be made faster by accessing bands uniformly per-wave, meaning that all the code that handles iterating over curves can be scalarized, and so are the curve reads (meaning they can happen once per-wave and not once per-thread on the wave). That would look something like this:

int this_thread_band_index = clamp(int(floor(uv_y * BAND_COUNT)), 0, BAND_COUNT-1);
int min_band_index = WaveActiveMin(this_thread_band_index);
int max_band_index = WaveActiveMax(this_thread_band_index);
for (int band_index = min_band_index; band_index <= max_band_index; ++band_index)
{
    /* . . . Add the curves for this band to be intersected against . . . */
}

And since I’m rasterizing this in compute into an atlas, I can decide which texel each thread is writing to, so I reorganize the threads to be packed horizontally, in row-major order, so the range of bands that each wave touches is minimized compared to other indexing methods like “classic” quads or Morton codes. Here is an example of how the threads are distributed. Using a $9\times11$ glyph and 16-thread waves for simplicity:

To distribute the threads like this would be as simple as:

int2 total_texel_size = parameters.texel_bottom_right - parameters.texel_top_left;
int2 local_texel_coordinates_raw = int2(thread_id % total_texel_size.x, thread_id / total_texel_size.x);
if (any(local_texel_coordinates_raw > total_texel_size))
{
    return;
}

Atlas Packing

I started by rasterizing to the screen directly; however, computing high quality anti-aliasing every frame as the glyphs were being output to the final target was a significant cost.

Thinking about how to get around this, it also became obvious that most rendered text stays on screen for many frames, with the same size and position; even as you’re reading this you’re probably not scaling the text or smoothly scrolling.

Besides this, the same glyph will often appear more than once on screen at the exact same size (just look at how many “e”s there are in this sentence alone). So why bother rendering it multiple times? (Subpixel positioning is a thing and we’ll go back to that later)

So I grabbed the two most well-worn tools in the graphics tool belt, atlases and temporal accumulation.

The idea here is to have an atlas that packs the glyphs reasonably well. If a glyph we want is not in the atlas, we allocate a chunk of it and start rasterizing into it; if a glyph we want is already there, we just use it. At some point in the frame we go over all the glyphs in the atlas and decide whether to keep each one (and maybe refine it with more samples) or, if it’s not being used, free that space.

The atlas will keep in-use glyphs resident all the time, so if text on the screen hasn’t changed for a while, we have nothing to compute there: all the glyphs are ready and we just slap them onto the screen later. There is a cost to adding new glyphs, but we can spread this cost over many frames as we’ll discuss later.

Some notes about this: the inputs to the atlas do have to take a couple of things into account that might not be immediately obvious. At the time of writing, if we equate this atlas to a hash-map, the “key” is the following:

Glyph_Key :: struct
{
    font : Font;
    glyph_index : int;

    // u24.8 fixed point
    quantized_size_in_pixels_x : u32;
    quantized_size_in_pixels_y : u32;

    // u0.8 fixed point
    quantized_subpixel_offset_x : u8;
    quantized_subpixel_offset_y : u8;
}

The font, the index of the glyph inside the font, and the size are somewhat expected. We also need the subpixel offset though, which is the fractional part of the pixel position (as in frac(pixel_position)). You might want to place the glyph at any position on the screen, not necessarily aligned with the pixel grid, or you might want to smoothly move text (e.g. scrolling). If we didn’t take this into account, then all the anti-aliasing we’re doing would only be valid for a single subpixel position.

Note the usage of fixed point too. This helps collapse nearby fractional positions and sizes into the same values; using floating point directly would often generate different values bit-wise, even if mathematically they should have been the same. Using 8 bits for the fractional part offers more than enough resolution for smooth positions and sizes: a change smaller than one of these $1/256$ increments within a pixel would rarely be visible on the 8-bit-per-component render targets or monitor outputs the result ends up displayed on.
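
A minimal sketch of that quantization, matching the Glyph_Key fields above; the exact rounding the author uses isn't shown in the post, so treat the details as assumptions:

#include <cmath>
#include <cstdint>

// u24.8 fixed point: 24 integer bits, 8 fractional bits.
uint32_t quantize_size_u24_8(float size_in_pixels)
{
    return (uint32_t)(size_in_pixels * 256.0f + 0.5f);
}

// u0.8 fixed point: just the fractional part of the pixel position,
// i.e. frac(pixel_position), in 1/256 increments.
uint8_t quantize_subpixel_offset_u0_8(float pixel_position)
{
    float fractional = pixel_position - std::floor(pixel_position);
    return (uint8_t)(fractional * 256.0f);
}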

That said, you could decide that this is a trade-off you’re willing to make and say that all of your glyphs should be positioned on a pixel boundary. In my experience, slowly moving text looks awful this way since you see it jump from integer pixel boundary to pixel boundary. I wanted to use this as my solution for all text so it’s not something I went for.

Here you can see a comparison between subpixel positioning, alignment to the pixel grid, and alignment to a half-resolution pixel grid to simulate seeing this on a monitor that’s half the resolution of the one you’re using.

Zooming into the 1-pixel aligned word makes the stepping even more obvious.

Whereas if we let the glyphs fall on subpixel positions, the movement is dramatically smoother.

That said, it’s still possible to optimize for cases where you know you will have a lot of static text. For example, if you’re doing a text editor and want to use a monospaced font, you can force the spacing between characters to be rounded to pixel boundaries. This way every glyph will have the same subpixel offset and will always hit the atlas cache for the same glyph.

If you also align the line breaks to the output pixel grid you get even better reuse, since the same glyphs of a monospaced font on different lines will also hit the same entry in the atlas. See how only new glyphs in the block of text allocate a new entry.

Z-Order

A great way I found to place the glyphs somewhat nicely packed at runtime was to use Z-Order Packing and a bitset for free cells within the atlas.

Z-Order curves (via Morton codes) allow you to think of the cells as a long 1D array; allocating a contiguous slice of this 1D array will give you a square in the resulting 2D atlas as long as you’re allocating a power-of-two number of cells.

A free bit in the bitset represents a free cell, in this case a $16\times16$ texel cell.

When a glyph wants to find a spot, it rounds its size up to the next power of two, so a glyph that needs $25\times29$ texels will end up allocating a chunk that’s $32\times32$. This requires 4 $16\times16$ cells, so it’ll look for 4 contiguous free bits and set them, then return the 2D location of the first free cell using Morton codes to go from 1D to 2D.

Note that these contiguous bits also have to be aligned to the number of bits; that is, if looking for 4 free bits, those could start at index 0, 4, 8, 12, etc. If the free bits went from bit 3 to bit 6, those 4 cells wouldn’t form a contiguous square.

The code would look something like this:

size := /*...*/

max_size_dimension := max(size.x, size.y);
aligned_size := max(BASE_SLOT_SIZE, align_to_next_power_of_2(max(max_size_dimension, 0)));
slot_size := aligned_size / BASE_SLOT_SIZE;
bits_needed := slot_size * slot_size;
assert(is_power_of_2(bits_needed));

index := find_free_contiguous_bits_aligned(bitset, bits_needed);

base_slot_coordinates := decode_morton2_16(xx index);
top_left_texel_coordinates := base_slot_coordinates * BASE_SLOT_SIZE;
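
Neither decode_morton2_16 nor find_free_contiguous_bits_aligned is shown in the post. Here is a hedged C++-style sketch of what they could look like; the linear scan is for clarity and the author's actual implementation may well differ:

#include <cstdint>
#include <vector>

// De-interleave the even bits of a 32-bit Morton code into a 16-bit value.
uint16_t compact_bits_1(uint32_t v)
{
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return (uint16_t)v;
}

// Morton (Z-order) index -> 2D cell coordinates.
struct Cell { uint16_t x, y; };
Cell decode_morton2_16(uint32_t index)
{
    return Cell{ compact_bits_1(index), compact_bits_1(index >> 1) };
}

// Find `count` (a power of two) contiguous free bits starting at an index
// aligned to `count`, mark them as used, and return the first index
// (or -1 if the atlas is full).
int find_free_contiguous_bits_aligned(std::vector<uint32_t> &bitset, int count)
{
    int total_bits = (int)bitset.size() * 32;
    for (int first = 0; first + count <= total_bits; first += count)
    {
        bool all_free = true;
        for (int i = 0; i < count && all_free; ++i)
            all_free = !(bitset[(first + i) / 32] & (1u << ((first + i) % 32)));
        if (!all_free)
            continue;
        for (int i = 0; i < count; ++i)
            bitset[(first + i) / 32] |= 1u << ((first + i) % 32);
        return first;
    }
    return -1;
}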

And here there’s a visualization of the order the glyphs go in, as well as what happens when some of them get removed and those free cells get reused for future glyphs if they fit.

Transposing Z-Order

The eagle-eyed among you who have worked with Z-Order in 2D before might have noticed that this is packing in a transposed Z-Order (so… a mirrored N-Order?).

This is because most long and thin glyphs in the Latin alphabet are vertical, and transposing the Z-Order allows allocating two cells together to form a vertical rectangular section. This makes glyphs for things like “l”, “j”, “i” or “1” take half the space.

That said, in cases where most long and thin glyphs are horizontal, for example in scripts like Arabic, the standard Z-Order is better suited.

To do this, the code above would be modified to not just use the maximum of the two dimensions when calculating bits_needed.

aligned_size := ixy(max(cast(s32)BASE_SLOT_SIZE, align_to_next_power_of_2(max(size.x, 0))),
                    max(cast(s32)BASE_SLOT_SIZE, align_to_next_power_of_2(max(size.y, 0))));
slot_size := aligned_size / BASE_SLOT_SIZE;
slot_size.y = max(slot_size.x, slot_size.y);
slot_size.x = max(slot_size.x, slot_size.y / 2);
bits_needed := slot_size.x * slot_size.y;
assert(is_power_of_2(bits_needed));

And transposing the final coordinates is simply swapping the result.

base_slot_coordinates := decode_morton2_16(xx index);
base_slot_coordinates.x, base_slot_coordinates.y = base_slot_coordinates.y, base_slot_coordinates.x;
top_left_texel_coordinates := base_slot_coordinates * BASE_SLOT_SIZE;

Here you can see the same demo but allocating glyphs that are double the height.

Temporal Accumulation

Glyphs staying in the atlas allows us to keep throwing samples at them and refine the results further. This way the final result can have very high quality anti-aliasing without having to cast a significant number of samples when the glyph first appears.

Let’s look at the intro video slowed down and with a fully black background to better visualize the glyph output. This is also using the Nacelle typeface in its ultra-light variant, to better show thin features.

Even in this slowed-down case it’s hard to see the glyphs visibly refining as you’re reading the text since the results are already fairly high quality. The trick here is that every glyph that first appears gets 8 samples-per-pixel on that first frame, then 4 samples next frame, then 2 and finally 1 every frame afterwards until it reaches a total of 512 samples.
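
As a small sketch of that schedule (the post only states the 8/4/2/1 progression and the 512-sample cap, so the bookkeeping names here are hypothetical):

#include <algorithm>

// 8 samples per pixel on the frame a glyph first appears, then 4, then 2,
// then 1 every frame afterwards, stopping at 512 accumulated samples.
int samples_to_add_this_frame(int frames_resident, int accumulated_samples)
{
    const int max_samples = 512;
    if (accumulated_samples >= max_samples)
        return 0;
    int samples = 1;
    if      (frames_resident == 0) samples = 8;
    else if (frames_resident == 1) samples = 4;
    else if (frames_resident == 2) samples = 2;
    return std::min(samples, max_samples - accumulated_samples);
}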

This guarantees pretty good quality when a glyph first shows up, which is important for smoothly moving or resizing glyphs, since they do the equivalent of getting initialized every frame.

Another factor that makes this look better is subpixel anti-aliasing, which will be touched upon in a later section.

When disabling this and just doing a single sample per pixel every frame, with no subpixel anti-aliasing, the slowed-down results are as follows.

It’s more obvious how samples keep getting added. It’s also very interesting how the glyphs appear to shift in position; that’s because the initial samples are not at the center of the pixel. That could be fixed by placing the initial samples to optimize for this case, but that’d defeat part of the point of this visualization.

Even in this case, with a single sample and the shifting text, it’s still not as dramatically visible as I would have imagined, showcasing how well the refinement idea and temporal accumulation work in principle.

Zooming in on a word in particular demonstrates how on the first frame the glyph has no anti-aliasing at all and the results are either black or white, then it keeps refining and shifting position until getting to a better final result with a few dozen samples.

And for completeness, with all the quality optimizations on, starting with 8 samples and with subpixel anti-aliasing that word looks like this.

This system is also easily tunable to achieve the required levels of quality and performance. Some of the knobs to twist would be:

  • How many samples/rays to add every frame.
  • Whether or not to increase samples on the first few frames of a glyph.
  • Having a cap of “total samples” allowed per-frame to keep cost bounded.
  • Time-slice the update of existing glyphs, that is, adding samples every few frames instead of every frame.

Another note is that the cost of casting a ray scales linearly with the number of curves it has to intersect for a given glyph. So for more precise cost-gating it might be worth using that as a metric instead, meaning you’d allow a certain number of curve intersections per frame.
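
As a hedged illustration of that idea, with everything here being an assumption since the post doesn't show an implementation, a per-frame budget could be expressed directly in curve intersections:

#include <vector>

// A hypothetical per-glyph refinement request.
struct Glyph_Update
{
    int texel_count; // texels the glyph occupies in the atlas
    int curve_count; // curves each of those texels will test against
};

// Spend a per-frame budget expressed in curve intersections rather than rays:
// glyphs that don't fit this frame simply get refined on a later one.
void schedule_refinement(const std::vector<Glyph_Update> &pending, long long budget)
{
    for (const Glyph_Update &glyph : pending)
    {
        long long cost = (long long)glyph.texel_count * glyph.curve_count;
        if (cost > budget)
            continue; // out of budget, try again next frame
        budget -= cost;
        /* ... dispatch one more accumulation pass for this glyph ... */
    }
}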

It’s worth mentioning that performance hasn’t been a concern in my experience with this system so far. The full screen of text from the intro peaks at about 0.1 milliseconds on my 9070 at 4K. And that cost quickly tapers down to zero once glyphs have reached the max number of samples (set at the time of writing to 512, but easily lowered).

Overall this system works shockingly well. Most text presented to users stays on screen completely static, which lets it converge to high quality. Even as it shows up, the speed at which we look at words and read them is orders of magnitude slower than the time it takes a glyph to look very good. In general, I’ve found it imperceptible that the text is converging over time while at the same time it always looks nicely anti-aliased.

Subpixel Anti-Aliasing and Fringing

The gist of subpixel anti-aliasing is to start thinking of the individual red, green and blue subpixel elements that form your monitor’s pixels as individual sample points, or rather, sample areas. Roughly, you can consider the subpixel elements to be the actual “pixels” you want to render into.

In a traditional RGB LCD layout like the following, your horizontal resolution effectively triples. In traditional 4K you’d go from $3840\times2160$ to $11520\times2160$.

Image from Subpixel Zoo

Getting all this effective resolution is great! And since the light is getting mixed from neighboring pixels, there’s no reason to get bad color fringing.

As I’ve already hinted at though, the monitor I’m using is far from these 3 vertical stripes of red, green and blue, and looks like this instead.

Image from RTings Review of the OLED G9

Which causes problematic fringing. And this is far from being the worst case out there, with monitors having wild arrangements like some of the ones you can see in Subpixel Zoo. A notorious recent one is LG WOLED having a red-white-blue-green structure, so it has an extra white-only subpixel and has the green and blue ones swapped from the standard order.

To show a more direct comparison on my current monitor: a default red-green-blue subpixel structure made of equal vertical rectangles would look like this, with very visible green fringing on top and magenta at the bottom.

Whereas if I set the subpixel structure in the solution presented in this article to match the one on my monitor, it looks like this, where even with subpixel anti-aliasing on there’s next to no fringing while keeping a very smooth result.

The big payoff! Finally rendering good looking text with subpixel anti-aliasing and no color fringing.

To achieve this I’ve set up a little editor where I can play with the positions of the subpixel elements; the inner white square is the pixel, and each of the colored quads represents where I’m sampling the results of each subpixel element. Note that they go out of bounds of the pixel, which I’ll touch on in the next section.
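
For reference, the kind of data such an editor would be driving, matching the per_frame.subpixel_layout fields read by the coverage shader earlier, could look like this C++-style sketch. The classic vertical-stripe values are a guess, and as noted the author gets better results when these areas overlap and bleed outside the pixel:

struct float2 { float x, y; };

// One min/max sample rectangle per subpixel element, in pixel-relative
// coordinates, as consumed by the coverage shader's lerp(r_min, r_max, ...).
struct Subpixel_Layout
{
    float2 r_min, r_max;
    float2 g_min, g_max;
    float2 b_min, b_max;
};

// A plain vertical-stripe RGB layout: red occupies the left third of the
// pixel, green the middle third, blue the right third.
Subpixel_Layout classic_rgb_stripes()
{
    Subpixel_Layout layout;
    layout.r_min = { 0.0f,        0.0f }; layout.r_max = { 1.0f / 3.0f, 1.0f };
    layout.g_min = { 1.0f / 3.0f, 0.0f }; layout.g_max = { 2.0f / 3.0f, 1.0f };
    layout.b_min = { 2.0f / 3.0f, 0.0f }; layout.b_max = { 1.0f,        1.0f };
    return layout;
}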

If you zoom in you can see how most of those pixels we’re sending to the monitor are not white; in fact there are very few that are $RGB(1,1,1)$.

But when they’re outputting on the monitor, light from all the subpixels blends in such a way that the result is a smooth white output. Getting the desired anti-aliasing effect and better representing the intended shape of the glyph.

Note that a lot of these features are only one to one-and-a-half pixels wide. They also often fall in-between pixel cells since I’m not doing any hinting. This is picked on purpose as a hard example for the renderer to handle and to show the effectiveness of good subpixel anti-aliasing.

Overlapping Subpixels

As I was trying to match my subpixel structure I found that overlapping the subpixel elements gave more accurate results. This intuitively makes sense, since light naturally mixes and diffuses slightly from the subpixel elements, so the sampled area for a given subpixel will be larger than the subpixel physically is; it almost behaves like a tiny point light.

So naturally you might expect a setup like this.

However, letting the subpixel elements overlap each other gives better results. Here you can also see two examples of a “classic” LCD subpixel arrangement. If you’re seeing this on a screen with that arrangement, it’s probably the best quality anti-aliasing you’ll see in this whole article, because all the other captures have been done with my monitor’s subpixel arrangement.

Note that the areas should also bleed outside the pixel itself, because it is surrounded by (normally) identical pixels with identical subpixel elements. Light is not only mixing with the light from a single pixel, but also with the neighboring subpixels.

As I was writing this article I found the Easy Scalable Text Rendering article by Evan Wallace which suggests needing to blur horizontally after rendering with subpixel anti-aliasing. Interestingly this is effectively the same thing as considering the subpixel elements themselves to be bigger and overlapping.

I really wish it were possible to have access to the arbitrary subpixel structures of monitors, perhaps exposed via the common display protocols. This would enhance subpixel anti-aliasing in general and text specifically, even on monitors that have “standard” orders, since you could be more fine-grained for the specific hardware.

This would also give display manufacturers the freedom to try an otherwise better subpixel structure without fearing issues with text rendering. Samsung changed their subpixel structure on QD-OLED from the G8 to the G9 to try to minimize issues like this. And still, on LG’s WOLED and Samsung’s QD-OLED, fringing is commonly cited as one of the most notorious problems of monitors that use them.

It’s just software; we can fix this. Manufacturers shouldn’t be forced to change hardware to account for the failures of software.

Good user interfaces and especially great text are a soft spot of mine. They have the potential to carry the perceived quality of a product to a degree that’s sometimes underrated. A prime example of this is the fantastic work that Atlus consistently puts out in the Persona series or, more recently, Metaphor: ReFantazio. I also have to mention Nier: Automata as a personal favorite.

And it makes sense! Games will often present you with text that’s meant to grab your attention. When a text box, a menu, a title, an announcement or anything in-between shows up in a game, there’s an implied focus point put on it. It looking sub-par can impact the experience as much as a badly rendered 3D scene would. So it follows that this aspect of the presentation should get its fair share of love as well.

I hope you’ve found this useful! I’d love to see more attempts to make glyph rendering in real time better, and I hope this comes across as a good motivator for more people to go tackle it.

As always, if you have any comments or there’s any questions please reach out! You can find me in most places as some variation of “osor_io” or “osor-io” as well as with the links at the bottom of the page.

Cheers! 🍻
