Introduction to the experience of rendering Arabic typography and its technical debt

Original link: https://lr0.org/blog/p/arabic/

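Arabic is inherently a cursive script, which means that letters change shape according to their neighbours and their position (initial, medial, final, or isolated). Unlike Latin letters, Arabic letters have no "default" block form; the positional variations are themselves part of the letter. The script is also used by many languages, including Persian, Urdu, and Sindhi, each of which adds its own characters and stylistic requirements (such as Nastaʿlīq).

As a result, an Arabic font has to behave like a complex program rather than a static collection of glyphs. Modern digital rendering relies on a "shaping engine", which takes Unicode codepoints as input and carries out the logic of joining, stacking, and shaping letters at render time.

Historically, early software tried to sidestep this complexity by encoding specific shapes as separate characters. These "fossilised" encodings still survive in legacy systems and often cause search failures and rendering errors in modern applications. When software ignores the shaping rules, the familiar broken output appears: letters shown disconnected and in reverse order. Ultimately, rendering Arabic correctly requires sophisticated software support that treats text not as a static image but as a dynamic presentation driven by structured writing rules.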

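Hacker News discussion: "Introduction to the experience of rendering Arabic typography and its technical debt" (lr0.org), posted by bookofjoe, 11 points, 1 comment.

adam_rida: Very interesting. Arabic is a good reminder that text rendering has mostly been solved for the scripts that shaped the defaults. The hard part is that layout, glyph shaping, bidirectional (bidi) behaviour, font fallback, search, and the editor model are all intertwined. When the assumptions are wrong at every layer, you cannot cleanly fix any one of them.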

Original text

To understand why every machine since Gutenberg has wrestled this script and mostly lost, you need one structural fact: Arabic is cursive always. There is no print-versus-handwriting distinction, no block letters. The letters connect in stone inscriptions, in manuscripts, in metal, on screens. Each letter therefore changes shape depending on its neighbours (an isolated form, an initial, a medial, a final), and six letters refuse to connect forward at all, which breaks words into joined clusters and gives the script its rhythm. The shapes are not costumes over some underlying "real" letter. The positional variation is the letter.

And the alphabet is bigger than Arabic the language. Persian extends it with four letters Arabic does not have (پ pe, چ che, ژ zhe, گ gaf) and uses two of the existing letters in subtly different forms (ی for the final yāʾ, ک for kaf). Urdu adds an aspirated do-chashmī he (ھ), a retroflex set (ٹ ڈ ڑ), and a hanging ye barree (ے), and writes most of its everyday text in Nastaʿlīq, which a Naskh-shaped font will produce as a phonetically correct but visually unrecognisable approximation. Sindhi has more again. Pashto, Kurdish, Uyghur, Kashmiri, and Punjabi each take the alphabet, add what their phonology requires, and ship. Any font that calls itself "Arabic" without consulting the Persian and Urdu communities will produce, for hundreds of millions of readers in Iran and South Asia, text that is technically rendered but functionally wrong: the kaf has the wrong terminal, the heh fuses where it shouldn't, the digits are from the wrong belt. The Noto Sans Arabic family ships separate sub-fonts to cover these (NotoNaskhArabic, NotoNastaliqUrdu, NotoSansArabicUI), and OS font fallback chains usually get it right. Usually.

stored codepoint | isolated | initial | medial | final
U+0639 ʿAYN | ع | عـ | ـعـ | ـع
U+0647 HEH | ه | هـ | ـهـ | ـه

One codepoint, four shapes, chosen at render time by the shaping engine. The medial heh and the isolated heh are, to an untrained eye, different letters; I have watched students of Arabic meet the medial heh in week three and file a complaint with management. A Latin font that ships 26 lowercase shapes needs no opinion about any of this. An Arabic font is wrong unless it has opinions about all of it.

The arrangement we eventually settled on, after decades of wrong answers, is this: the encoding stores the abstract letter, and the font supplies the shapes. Unicode gives you one codepoint for ʿayn; the font carries the four positional glyphs; a shaping engine applies the OpenType features (isol, init, medi, fina, plus rlig for the ligatures the script requires, plus mark and mkmk for stacking the vowel signs) at render time. An Arabic font is a small program. The text you store is its input, not its output. The word is performed fresh every time you look at it, like music from a score.

The cleanest way to feel this is to assemble a word one letter at a time and watch every prior letter renegotiate its shape as the next one arrives:

[Interactive demo: click letters to add them to the word, in the order they appear in the codepoint stream; the result is rendered by the shaping engine.]

Try م, then ح, then م, then د: build the name Muḥammad. The first م drops into its initial form the moment you add the ح, the ح goes medial when the next م arrives, and so on through to the د, one of the six letters that never connect forward: it joins the second م in its final form and closes the cluster, so any letter added after it would have to start a new one. Four codepoints in storage, one continuous stroke on screen. None of it happens without a shaping engine; a PDF generator that lacks one will render the same four codepoints as four disconnected isolated forms.
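
The renegotiation described above can be approximated in a few lines of Python. This is a toy model, not a shaping engine: the function name `positional_forms` is made up for illustration, and it assumes the input holds only basic Arabic letters (no vowel marks, no hamza, no ligatures), so that every letter is able to join the one before it.

```python
# Toy model of positional-form selection; NOT a real shaping engine.
# Assumption: only basic Arabic letters, each able to join the previous letter.

# The six letters that never connect forward: alef, dal, thal, reh, zain, waw.
NO_FORWARD_JOIN = set("\u0627\u062F\u0630\u0631\u0632\u0648")


def positional_forms(word: str) -> list[str]:
    """Return the OpenType feature tag (isol/init/medi/fina) for each letter."""
    forms = []
    for i, letter in enumerate(word):
        # A letter joins backward only if its predecessor connects forward.
        joins_prev = i > 0 and word[i - 1] not in NO_FORWARD_JOIN
        # A letter joins forward only if it is able to and something follows.
        joins_next = i < len(word) - 1 and letter not in NO_FORWARD_JOIN
        if joins_prev and joins_next:
            forms.append("medi")
        elif joins_next:
            forms.append("init")
        elif joins_prev:
            forms.append("fina")
        else:
            forms.append("isol")
    return forms


muhammad = "\u0645\u062D\u0645\u062F"  # meem, hah, meem, dal
for n in range(1, len(muhammad) + 1):
    # Earlier letters change form as each new letter arrives.
    print(n, positional_forms(muhammad[:n]))
```

With all four letters in place the result is `['init', 'medi', 'medi', 'fina']`. A real engine takes joining behaviour from the Unicode character database and the font's own tables, and also handles marks and ligatures, which this sketch ignores.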

The wrong answers are still in the standard, fossilised, and they make excellent souvenirs. Before shaping engines existed, the 8-bit code pages of the DOS and early Windows era encoded the shapes themselves: a separate character for initial ʿayn, medial ʿayn, and so on. Unicode, which promised round-trip compatibility with the encodings that came before it, had to swallow those sets whole, and they live on at U+FB50 through U+FDFF and U+FE70 through U+FEFF under the names Arabic Presentation Forms-A and Arabic Presentation Forms-B: several hundred codepoints that no new document should ever contain and that PDF text extractors merrily emit to this day, which is one of the reasons searching an Arabic PDF so often fails in silence. The haystack is encoded as shapes and your needle is encoded as letters. My favourite resident of these blocks, and one of my favourite characters in all of Unicode, is U+FDFD, ﷽: a four-word invocation, bismillāh ar-raḥmān ar-raḥīm, as a single codepoint. A monument from the era when rendering was baked into the encoding because nobody trusted the renderer to do anything, preserved forever, like a fly in amber that recites.

This bites because the two encodings render identically and compare differently. The customer search bug I mentioned at the top of this article was, specifically, this:

And if you want to know what the world looks like when software skips all of this, the shaping engine, the bidi algorithm, the whole apparatus, you do not have to imagine it, because an enormous amount of software still skips all of it:

مرحبا بالعالم، هذا نص عربي

The text says "hello, world, this is Arabic text." Tick the box for the version every Arabic reader has met in the wild: on a shop sign, a boarding pass, a watermark, an old film title. Every letter drops into its isolated form and the line is laid out left to right, backwards. This is what a program produces when it draws characters one at a time and never consults a shaping engine: old Photoshop did it, matplotlib still does it out of the box, many of the PDF generators on npm do it, receipt printers do it with conviction. The standard Python workaround, arabic_reshaper plus python-bidi, fixes it by pre-baking the shaped forms into the string using that fossil block from the paragraph above.
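
To make "pre-baking" concrete, here is a standard-library-only sketch of the idea. It is not the code of arabic_reshaper or python-bidi: the function `prebake` and its helpers are hypothetical, and it assumes basic Arabic letters only (no vowel marks, hamza, or lam-alef ligatures) and a renderer that draws codepoints strictly left to right.

```python
import unicodedata

# The six letters that never connect forward: alef, dal, thal, reh, zain, waw.
NO_FORWARD_JOIN = set("\u0627\u062F\u0630\u0631\u0632\u0648")

# (joins previous letter, joins next letter) -> word used in Unicode names.
FORM_NAMES = {
    (False, False): "ISOLATED",
    (False, True): "INITIAL",
    (True, True): "MEDIAL",
    (True, False): "FINAL",
}


def is_letter(ch: str) -> bool:
    return unicodedata.name(ch, "").startswith("ARABIC LETTER")


def prebake(text: str) -> str:
    """Swap each letter for its presentation form, then reverse to visual order."""
    shaped = []
    for i, ch in enumerate(text):
        if not is_letter(ch):
            shaped.append(ch)  # spaces and punctuation pass through unchanged
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        joins_prev = prev != "" and is_letter(prev) and prev not in NO_FORWARD_JOIN
        joins_next = nxt != "" and is_letter(nxt) and ch not in NO_FORWARD_JOIN
        form = FORM_NAMES[(joins_prev, joins_next)]
        # e.g. "ARABIC LETTER MEEM" -> "ARABIC LETTER MEEM INITIAL FORM"
        shaped.append(unicodedata.lookup(f"{unicodedata.name(ch)} {form} FORM"))
    # A naive left-to-right renderer needs the last letter first.
    return "".join(reversed(shaped))


baked = prebake("\u0645\u062D\u0645\u062F")  # meem, hah, meem, dal
print([f"U+{ord(c):04X}" for c in baked])
```

The output string consists entirely of presentation-form codepoints in reversed order, which is why text pre-baked this way draws correctly in a naive renderer but no longer matches a search for the original letters. For real use, the two libraries named above cover the ligatures, marks, and mixed-direction text that this sketch leaves out.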
