Bytes as Braille

Original link: https://www.engrenage.ch/i18n/scripts/bytes_as_braille/

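This project tackles the problem of displaying mixed byte and Unicode strings in a readable format, especially when some of the bytes are intentionally undecodable. The conventional byte representation (such as `\xc0`) is cluttered, while simply decoding and catching the error loses information. The author wrote a function that uses Braille characters as visual stand-ins for undecodable bytes. The Braille cells are not arranged in standard Unicode order but reordered by byte value, which gives a compact, pattern-rich representation and makes it easy to tell decodable strings from raw bytes. The solution is now available on GitHub, including an input function and byte color-coding for better visibility. The approach offers a more efficient and more visually informative way to inspect binary data, particularly when working with many languages and data formats.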


Original text

UPDATE: Now on GitHub with more stuff, including a neat input function

Since moving to Python 3, I have been working with bytestrings that sometimes decode as ASCII or UTF strings. The environment they live in is mixed: some strings are expected to decode, while others are expected not to.

Displaying those strings as-is is not convenient, because quite a few Unicode symbols will not render correctly. I work with all sorts of human languages, so this happens often. I therefore wrote a very short function that tries to decode my bytestring and returns “bytes” if it does not decode. This was not convenient either: it made it impossible to distinguish undecodable bytestrings from one another, and information was lost (I couldn’t get the original bytestring back). Using Python's default mechanism to display those bytes was also cumbersome, because each byte is displayed as 4 characters (such as \xc0), so the display quickly becomes messy and hard to read, even more so when ASCII-decodable bytes are mixed in as plain characters.
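Both drawbacks can be reproduced in a few lines. This is a minimal sketch, not the original function: the name `decode_or_placeholder` and the choice of UTF-8 as the codec are assumptions made for illustration.

```python
def decode_or_placeholder(raw: bytes) -> str:
    """Return the decoded text, or the literal string 'bytes' on failure.

    Hypothetical reconstruction of the first attempt; UTF-8 is assumed.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return "bytes"


a = bytes([0xC0, 0xFF, 0x00])  # 0xC0 is never valid in UTF-8
b = bytes([0xFE, 0x80])        # 0xFE is never valid in UTF-8

print(decode_or_placeholder(b"hello"))
# Two different bytestrings collapse to the same placeholder:
print(decode_or_placeholder(a) == decode_or_placeholder(b))
# Python's default display spends 4 characters on each undecodable byte:
print(repr(a))  # b'\xc0\xff\x00'
```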

It struck me that Braille symbols were a pretty neat workaround for this problem, although their order is not “logical”. Actually it is, but on historical grounds: the first set comprises the 6-dot symbols (U+2800 … U+283F), followed by the 8-dot symbols (U+2840 … U+28FF), and that order is, well, rather inconvenient for an average user like me.

The traditional cell numbering is this:

[Figure: Braille 8-dot cell numbering]
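For reference, the traditional numbering maps onto Unicode code points in a simple way: dot n sets bit n-1 of the offset from U+2800. The helper name `cell_from_dots` below is made up for this sketch.

```python
# Traditional 8-dot cell layout:
#   1 4
#   2 5
#   3 6
#   7 8
BRAILLE_BASE = 0x2800  # first code point of the Unicode Braille Patterns block


def cell_from_dots(*dots: int) -> str:
    """Build a Braille character from traditional dot numbers (1 to 8)."""
    offset = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        offset |= 1 << (dot - 1)  # dot n is bit n-1 of the offset
    return chr(BRAILLE_BASE + offset)


print(cell_from_dots(3, 4, 5, 6))          # the number sign, U+283C
print(hex(ord(cell_from_dots(7))))         # 0x2840, the first 8-dot symbol
print(hex(ord(cell_from_dots(*range(1, 9)))))  # 0x28ff, all dots raised
```

Because dots 7 and 8 occupy the two highest bits, every cell that uses them lands at U+2840 or above, which is why the 6-dot symbols come first in the block.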

It is also worth noting the following facts:

Braille is in a way a precursor of Unicode, as it uses the ⠼ symbol as a prefix to say that what follows is not a letter but a number; however, unlike in Unicode, this prefix can apply to a whole series of symbols
There is not one Braille: every country or language has its own Braille dialect
Braille users make heavy use of “compression”, defining aliases and shorthand, often per document, in order to make reading faster

The first thing I did was renumber the dots as follows (for a big-endian representation):

[Figure: big-endian numbering]

Then came quite a bit of work re-ordering the cells based not on their Unicode number, but their new byte value. After updating my decode function, I’m now very happy with the result:

[Figure: Braille for bytes]
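The re-ordering step can be sketched as a 256-entry lookup table. The exact dot layout used by the script is only shown in the figure, so the layout below is an assumption: the left column holds the high nibble, the right column the low nibble, with the most significant bit at the top of each column. The names `DOT_FOR_BIT`, `byte_to_cell`, `BRAILLE_FOR_BYTE` and `show` are likewise hypothetical, as is the use of UTF-8.

```python
# ASSUMED layout (not taken from the original script):
#   bit7 bit3
#   bit6 bit2
#   bit5 bit1
#   bit4 bit0
# DOT_FOR_BIT[bit] is the traditional dot number at that position.
DOT_FOR_BIT = {7: 1, 6: 2, 5: 3, 4: 7, 3: 4, 2: 5, 1: 6, 0: 8}


def byte_to_cell(value: int) -> str:
    """Map one byte value to a Braille character under the assumed layout."""
    offset = 0
    for bit, dot in DOT_FOR_BIT.items():
        if (value >> bit) & 1:
            offset |= 1 << (dot - 1)  # Unicode: dot n is bit n-1
    return chr(0x2800 + offset)


# Cells indexed by byte value rather than by Unicode order.
BRAILLE_FOR_BYTE = [byte_to_cell(v) for v in range(256)]


def show(raw: bytes) -> str:
    """Return decoded text if possible, otherwise one Braille cell per byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return "".join(BRAILLE_FOR_BYTE[b] for b in raw)


print(show(b"abc"))
print(show(bytes([0xC0, 0x01, 0xFF])))
```

Since the layout is a permutation of the eight dots, the table is a bijection: each byte value gets exactly one cell, so no information is lost.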

Of course, this can be decoded:

# Write the decoded bytes to a binary file
f = open('/tmp/sample.bin', 'wb')
f.write(braille_as_bytes('⠉⢤⢌⢕⢂⣍⢉⣮⣀⠭⡄⢏⢯⠤⢍⡔⢤⡕⡔⠽⠞⡚⣞⡺⠁⣇⣨⡈⣾⢁⠺⠜⢝⠐⣑⠚⠬⡈⢱⢙⠰⣢⣴⢌⠩⢇⢨⢢⣂⡢⣁⣚⣅⡖⠴⡡⠤⠦⠜⠽⠘⡴⡷⣴⠬⣞⢃⠚⠔⡹⣂⠡⣇⡅⡤⡁'))
f.close()
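The implementation of braille_as_bytes is not shown here. A hypothetical inverse, called `cells_to_bytes` to avoid confusion with the real function, might simply invert the lookup table. It relies on an assumed dot layout (left column high nibble, most significant bit at the top), so it will not necessarily decode the sample above to the same bytes as the real script.

```python
# ASSUMED layout: DOTS[bit] is the traditional dot number for byte bit `bit`.
DOTS = (8, 6, 5, 4, 7, 3, 2, 1)


def byte_to_cell(value: int) -> str:
    """Map one byte value to a Braille character under the assumed layout."""
    offset = sum(1 << (DOTS[bit] - 1) for bit in range(8) if (value >> bit) & 1)
    return chr(0x2800 + offset)


# Inverse table: Braille character -> byte value.
BYTE_FOR_CELL = {byte_to_cell(v): v for v in range(256)}


def cells_to_bytes(cells: str) -> bytes:
    """Turn a string of Braille cells back into the bytes it stands for.

    Raises KeyError if the string contains a non-Braille character.
    """
    return bytes(BYTE_FOR_CELL[c] for c in cells)


original = bytes(range(256))
encoded = "".join(byte_to_cell(b) for b in original)
print(cells_to_bytes(encoded) == original)  # the round trip is lossless
```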

Following suggestions on #python, the output can be colored at will, so specific bytes can be made very visible.
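One way to do such coloring is with ANSI escape sequences, assuming a terminal that supports them. This is only a sketch: the function name `highlight` is made up, and for brevity the cells are taken in plain Unicode order rather than from the re-ordered table.

```python
RED = "\x1b[31m"   # ANSI SGR sequence: red foreground
RESET = "\x1b[0m"  # ANSI SGR sequence: reset attributes


def highlight(raw: bytes, targets) -> str:
    """Render bytes as Braille cells, coloring the byte values in `targets`.

    Placeholder mapping: byte value b is shown as chr(0x2800 + b), i.e.
    Unicode order, NOT the re-ordered table described in the article.
    """
    out = []
    for b in raw:
        cell = chr(0x2800 + b)
        out.append(f"{RED}{cell}{RESET}" if b in targets else cell)
    return "".join(out)


# Make every 0xFF byte stand out in a blob.
print(highlight(bytes([0x00, 0xFF, 0x10, 0xFF]), {0xFF}))
```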

In addition to being more compact, this makes it much easier to see patterns in blobs. If you like this way of displaying bytes and would like to skip the tedious symbol re-ordering, simply get the script on GitHub.
