Confusables.txt and NFKC disagree on 31 characters

Original link: https://paultendo.github.io/posts/unicode-confusables-nfkc-conflict/

## Confusables and NFKC normalization: summary

Unicode's confusables.txt aims to prevent homoglyph attacks by mapping visually similar characters (such as Cyrillic 'а' to Latin 'a'). However, it is designed for *detection*, not normalization: the recommended practice is to *reject* identifiers containing confusable characters, not to remap them.

A complication arises when NFKC normalization (which converts characters to a canonical form — e.g. fullwidth 'Ｈ' to ASCII 'H') is also applied. NFKC and confusables.txt sometimes map the same character to *different* Latin letters; exactly 31 characters conflict this way. For example, the archaic long s (ſ) is mapped to 'f' by confusables.txt but to 's' by NFKC. If you normalize with NFKC first, the ſ→f confusable check can never fire.

**Best practice:** if you use confusables for security checks, filter your map to exclude characters that NFKC already handles; this yields a cleaner, more effective check. If you do *not* use NFKC, the full confusables list is valid as-is.

The divergence is not a bug but a consequence of the standards having different goals: visual similarity versus semantic equivalence. Understanding it helps in building robust, reproducible safeguards, such as the 613-entry NFKC-aware map used in the namespace-guard library.

Hacker News: "Confusables.txt and NFKC disagree on 31 characters" (paultendo.github.io), 6 points, submitted by pimterry.

Original article

If you’ve ever built a login system, you’ve probably dealt with homoglyph attacks: someone registers аdmin with a Cyrillic “а” (U+0430) instead of Latin “a” (U+0061). The characters are visually identical, and if your system accepts Unicode identifiers, you have an impersonation vector.

The Unicode Consortium maintains an official defence against this: confusables.txt, part of Unicode Technical Standard #39 (Security Mechanisms). It’s a flat file mapping ~6,565 characters to their visual equivalents. Cyrillic а → a, Greek ο → o, Cherokee Ꭺ → A, and thousands more.

It’s worth noting that confusables.txt is designed for detection, not normalization. TR39 itself says skeleton mappings are “not suitable for display to users” and “should definitely not be used as a normalization of identifiers.” The correct use is to check whether a submitted identifier contains characters that visually mimic Latin letters, and if so, reject it — not to silently remap those characters and let it through.

Here’s the wrinkle. If your application also runs NFKC normalization (which it should — ENS, GitHub, and Unicode IDNA all require it), then 31 entries in confusables.txt map the same character to a different target than NFKC. If you’re building a confusable map for use after NFKC normalization, those entries are unreachable. NFKC has already transformed the character before your confusable check sees it.

What NFKC normalization does

NFKC (Normalization Form Compatibility Composition) is Unicode’s way of collapsing “compatibility variants” to their canonical form. Fullwidth letters → ASCII, superscripts → normal digits, ligatures → component letters, mathematical styled characters → plain characters:

```
Ｈｅｌｌｏ → Hello     (fullwidth → ASCII)
ﬁnance    → finance   (fi ligature → fi)
𝐇ello     → Hello     (mathematical bold → plain)
```

This is the right first step for slug/handle validation. You want Ｈｅｌｌｏ to become hello, not to be rejected as containing confusable characters. NFKC handles hundreds of these compatibility forms automatically.
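
The examples above can be reproduced directly with `String.prototype.normalize`:

```typescript
// NFKC collapses compatibility variants to their canonical characters:
console.log("\uFF28ello".normalize("NFKC"));    // "Hello"   (fullwidth Ｈ, U+FF28)
console.log("\uFB01nance".normalize("NFKC"));   // "finance" (ﬁ ligature, U+FB01)
console.log("\u{1D407}ello".normalize("NFKC")); // "Hello"   (mathematical bold 𝐇, U+1D407)
```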

NFKC and confusables serve different purposes. NFKC is for normalization: producing a canonical form for storage and comparison. Confusables detection is for security: flagging characters that could fool a human reader. They answer different questions about the same input, and in a well-designed system they’re applied separately rather than chained together to produce a single output.

The conflict

Here’s what nobody seems to talk about: confusables.txt and NFKC sometimes map the same character to different Latin letters.

The classic example is the Long S (ſ, U+017F). This is the archaic letterform you see in 18th-century printing, where “Congress” was printed as “Congreſs.”

  • confusables.txt maps ſ → f (visually, ſ does look like f)
  • NFKC normalization maps ſ → s (linguistically, ſ is s)

Both are defensible mappings, but they answer different questions. TR39 asks “what does this look like?” NFKC asks “what does this mean?”

Why does this matter? If you normalize with NFKC first (converting ſ to s), then check the confusable map, the ſ→f entry never fires. NFKC already handled the character. Without NFKC, the confusable entry is correct as visual detection: ſ genuinely looks like f, and flagging it is the right call for security. But if you’re building a filtered confusable map for use downstream of NFKC (as namespace-guard does), these entries are dead code and should be removed to keep the map clean.

The full list: 31 entries

This isn’t a single edge case. I found 31 characters where confusables.txt and NFKC disagree:

The Long S

| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|------|------|-----------|--------------|--------------|
| ſ | Latin Small Letter Long S | U+017F | f | s |

TR39 sees the visual resemblance to f. But linguistically (and in NFKC), ſ is an archaic form of s. The NFKC mapping is unambiguously correct for any application that cares about meaning rather than just shape.

Capital I → l (16 variants)

confusables.txt maps capital I (and all its styled variants) to lowercase l. This is the classic Il1 ambiguity: in many sans-serif fonts, uppercase I, lowercase l, and digit 1 are nearly indistinguishable.

NFKC normalizes styled variants back to plain I (U+0049), a different character from the confusable target l (U+006C):

| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|------|------|-----------|--------------|--------------|
| ℐ | Script Capital I | U+2110 | l | I |
| ℑ | Fraktur Capital I | U+2111 | l | I |
| Ⅰ | Roman Numeral One | U+2160 | l | I |
| Ｉ | Fullwidth Latin Capital I | U+FF29 | l | I |
| 𝐈 | Mathematical Bold Capital I | U+1D408 | l | I |
| 𝐼 | Mathematical Italic Capital I | U+1D43C | l | I |
| 𝑰 | Mathematical Bold Italic Capital I | U+1D470 | l | I |
| 𝓘 | Mathematical Script Capital I (Bold) | U+1D4D8 | l | I |
| 𝕀 | Mathematical Double-Struck Capital I | U+1D540 | l | I |
| 𝕴 | Mathematical Fraktur Capital I (Bold) | U+1D574 | l | I |
| 𝖨 | Mathematical Sans-Serif Capital I | U+1D5A8 | l | I |
| 𝗜 | Mathematical Sans-Serif Bold Capital I | U+1D5DC | l | I |
| 𝘐 | Mathematical Sans-Serif Italic Capital I | U+1D610 | l | I |
| 𝙄 | Mathematical Sans-Serif Bold Italic Capital I | U+1D644 | l | I |
| 𝙸 | Mathematical Monospace Capital I | U+1D678 | l | I |
| 𜳞 | Outlined Latin Capital Letter I | U+1CCDE | l | I |

TR39 says all of these look like “l”. It’s right: they often do, especially in sans-serif fonts. NFKC normalizes them all to plain I (U+0049). If your system runs NFKC before confusable detection, the confusable entry for these characters is unreachable. NFKC has already transformed them to plain I, which won’t match the original source character in your confusable map.
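
A quick check for a couple of rows above, using codepoints from the table:

```typescript
// Styled capital I variants collapse to plain I (U+0049) under NFKC,
// never to the TR39 target l (U+006C):
console.log("\u{1D408}".normalize("NFKC"));      // "I" (Mathematical Bold Capital I)
console.log("\u2160".normalize("NFKC"));         // "I" (Roman Numeral One)
console.log("\u2160".normalize("NFKC") === "l"); // false
```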

Digit 0 → O (7 variants)

Same pattern with digit zero. confusables.txt maps styled zeros to the letter O (visually similar), but NFKC collapses them to the digit “0”:

| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|------|------|-----------|--------------|--------------|
| 𝟎 | Mathematical Bold Digit Zero | U+1D7CE | O | 0 |
| 𝟘 | Mathematical Double-Struck Digit Zero | U+1D7D8 | O | 0 |
| 𝟢 | Mathematical Sans-Serif Digit Zero | U+1D7E2 | O | 0 |
| 𝟬 | Mathematical Sans-Serif Bold Digit Zero | U+1D7EC | O | 0 |
| 𝟶 | Mathematical Monospace Digit Zero | U+1D7F6 | O | 0 |
| 🯰 | Segmented Digit Zero | U+1FBF0 | O | 0 |
| 𜳰 | Outlined Digit Zero | U+1CCF0 | O | 0 |

NFKC correctly preserves the digit identity. Note that ASCII 0 (U+0030) itself has a confusable entry mapping to O, so the visual confusion between zero and O is caught regardless of whether NFKC runs first.

Digit 1 → l (7 variants)

And the same again with digit one, where confusables.txt sees “l” but NFKC correctly maps to “1”:

| Char | Name | Codepoint | TR39 maps to | NFKC maps to |
|------|------|-----------|--------------|--------------|
| 𝟏 | Mathematical Bold Digit One | U+1D7CF | l | 1 |
| 𝟙 | Mathematical Double-Struck Digit One | U+1D7D9 | l | 1 |
| 𝟣 | Mathematical Sans-Serif Digit One | U+1D7E3 | l | 1 |
| 𝟭 | Mathematical Sans-Serif Bold Digit One | U+1D7ED | l | 1 |
| 𝟷 | Mathematical Monospace Digit One | U+1D7F7 | l | 1 |
| 🯱 | Segmented Digit One | U+1FBF1 | l | 1 |
| 𜳱 | Outlined Digit One | U+1CCF1 | l | 1 |
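
The digit rows behave the same way; checking one zero and one one from the tables:

```typescript
// Styled digits survive NFKC as plain ASCII digits, not letters:
console.log("\u{1D7CE}".normalize("NFKC")); // "0" (Mathematical Bold Digit Zero), not "O"
console.log("\u{1D7D9}".normalize("NFKC")); // "1" (Mathematical Double-Struck Digit One), not "l"
```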

Why this happens

This isn’t a bug in either standard. TR39 and NFKC have different purposes, and they were designed independently:

confusables.txt answers: “What does this character visually resemble?” It’s designed for the skeleton algorithm, which compares two strings for visual similarity. Mathematical Bold I (𝐈) looks like lowercase l in most fonts. That’s a legitimate visual observation.

NFKC normalization answers: “What is the canonical form of this character?” Mathematical Bold I is semantically the letter I rendered in a bold mathematical style. NFKC strips the styling, yielding plain I.

These are orthogonal concerns. Confusability is about what humans see. NFKC is about what machines should store. Neither mapping is wrong; they answer different questions. But if you use both (which you should), it’s worth knowing where they diverge, especially if you’re building a filtered confusable map for use after NFKC.

The practical impact

If you build a confusable detection system and also run NFKC normalization, you need to know about these 31 entries:

If you run NFKC first, then check confusables: The 31 entries are unreachable. NFKC has already transformed the character before your confusable check sees it. They’re dead code in your detection map, not a security hole, but worth filtering out to keep the map clean.

If you check confusables without NFKC: These entries produce correct visual detection results. That’s what confusables.txt is designed for. ſ does look like f, styled zeros do look like O, and styled ones do look like l. The confusable map is doing its job. For zeros and ones specifically, ASCII 0 and 1 themselves have confusable entries mapping to O and l, so the visual confusion is caught regardless of whether NFKC runs first.

If you use confusables for remapping (don’t do this): The problems compound. teſt becomes teft instead of test. account10 with a mathematical 1 and 0 becomes accountlO. As TR39 states, confusable mappings should not be used as normalization.

What to do about it

The approach depends on how you use confusables:

Filter your confusable map to exclude any character that NFKC already handles. This keeps your map clean and ensures every entry represents a character your system will actually encounter:

```typescript
// rawTr39Entries: [sourceCp, confusableTarget] pairs parsed upstream from
// confusables.txt; `entries` collects the filtered map.
for (const [sourceCp, confusableTarget] of rawTr39Entries) {
  const sourceChar = String.fromCodePoint(sourceCp);
  const nfkcResult = sourceChar.normalize("NFKC").toLowerCase();

  // NFKC already maps to a Latin letter/digit - skip this entry
  // (either same target = redundant, or different target = conflict)
  if (/^[a-z0-9]$/.test(nfkcResult)) continue;

  // NFKC produces a valid slug fragment - skip (already handled)
  if (/^[a-z0-9-]+$/.test(nfkcResult)) continue;

  // NFKC doesn't resolve to ASCII - keep this confusable entry
  entries.push({ source: sourceCp, target: confusableTarget });
}
```

This takes you from ~6,565 raw TR39 entries to ~613 that are meaningful after NFKC. Every remaining entry is a character that survives NFKC unchanged and visually mimics a Latin letter.

In namespace-guard, this is how it works in practice: NFKC is applied first during normalization when storing and comparing slugs. The confusable map then runs on the normalized input as a completely separate validation step — a blocklist. If any character in the normalized slug matches the map, the slug is rejected. No remapping, no skeleton, no merged output. Just: “does this string contain a character that looks like a Latin letter but isn’t one? If yes, reject.”

If you run confusables without NFKC

The full confusables.txt map works as designed. These 31 entries encode correct visual judgments: ſ does look like f, styled zeros do look like O, styled ones do look like l. No filtering needed.

Making it reproducible

Rather than hand-curating a confusable map (which becomes stale when Unicode ships new versions), I wrote a generator script that:

  1. Downloads confusables.txt from unicode.org
  2. Extracts all single-character → Latin letter/digit mappings
  3. Filters out NFKC-redundant and NFKC-conflicting entries
  4. Adds supplemental mappings for known gaps (e.g., Latin small capitals that confusables.txt misses)
  5. Outputs a TypeScript object literal, grouped by Unicode block
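
Step 2 can be sketched as a line parser for the confusables.txt format (`source ; target ; type # comment`). The field handling here is an assumption based on the published data format, not the actual generator code:

```typescript
function parseLine(line: string): { source: number; target: string } | null {
  const body = line.split("#")[0].trim(); // strip trailing comment
  if (!body) return null;                 // blank or comment-only line
  const fields = body.split(";").map((f) => f.trim());
  if (fields.length < 2) return null;
  const [src, tgt] = fields;
  const srcCps = src.split(/\s+/);
  const tgtCps = tgt.split(/\s+/);
  if (srcCps.length !== 1 || tgtCps.length !== 1) return null; // single-char mappings only
  const target = String.fromCodePoint(parseInt(tgtCps[0], 16));
  if (!/^[A-Za-z0-9]$/.test(target)) return null; // Latin letter/digit targets only
  return { source: parseInt(srcCps[0], 16), target };
}

console.log(parseLine("0430 ;\t0061 ;\tMA\t# ( \u0430 \u2192 a )"));
// → { source: 1072, target: "a" }
```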

The script prints stats to stderr so you can verify the filtering:

```
Filtered to 605 entries from TR39
  Skipped 31 NFKC-conflict entries (NFKC maps to different Latin char)
  Skipped 766 NFKC-handled entries (NFKC produces valid slug fragment)
Added 8 supplemental entries (Latin small capitals)
Total: 613 entries
```

When a new Unicode version ships, re-run the script and you get an updated map automatically filtered against the current runtime’s NFKC implementation. The exact counts depend on two things: the version of confusables.txt you download, and your runtime’s Unicode data tables (what String.prototype.normalize uses). The numbers in this post are from the current Unicode 16.0 data.

The broader lesson

Unicode is not one monolithic standard. It’s a collection of semi-independent specifications maintained by different working groups. UAX #15 (normalization) and UTS #39 (security) were designed for different use cases and don’t explicitly account for each other.

The 31 divergent entries aren’t a bug in either standard. confusables.txt mappings are visual judgments. NFKC mappings are semantic equivalences. Both are correct in their own context. If you build a confusable map for use after NFKC, knowing where they diverge lets you filter your map down to entries that will actually fire.


The NFKC-aware confusable map (613 entries, ~2.5 KB gzipped) ships as part of namespace-guard, a zero-dependency TypeScript library for slug/handle validation. The generator script is at scripts/generate-confusables.ts.

Thanks to ficiek, v4ss42, nemec, LousyBeggar, carrottread, medforddad, Herb_Derb, and DontBuyAwards on r/programming for feedback that shaped this revision.
