Spending 3 months investigating a 7-year-old bug and fixing it in 1 line of code

Original link: https://lemmy.world/post/16763534

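The user shares their experience developing a music accessory for the original iPad. The device connected over USB and provided MIDI and audio input and output, aimed at musicians using GarageBand. It was built on existing hardware and needed only minor changes, a microcontroller and adjusted USB descriptors, to work with the iPad. After the product shipped, however, users reported dropped MIDI notes. The lost notes mainly affected voices with infinite sustain, such as a pipe organ: the iPad assumed a key was still being held, and users had to force quit the app. The author was tasked with finding and fixing the problem. Through extensive testing and analysis, the root cause was identified as a timing issue in the device firmware that caused MIDI messages to be lost. Optimizing a modulo operation in the firmware's audio-handling code fixed the problem. Despite the fix, distributing the updated firmware remained difficult because the device could not be upgraded in the field.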

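Developers' personal experiences reveal a large gap between actual performance and base salary within companies. Observations include top developers being paid below average because of negotiation difficulties and differences in skills (for example, communication and salary negotiation versus coding ability). Conversely, underperforming developers still managed to secure higher salaries despite inconsistent results. These accounts underline the need for individuals to advocate for themselves and build a varied skill set to maximize career opportunities. One engineer ran into a puzzling coding problem involving duplicate matrices and mismatched hashes that led to unexpected output errors. Despite considerable effort to analyze and isolate the issue, progress was slow because of limited knowledge of the particular tech stack and available tools. Eventually, after many iterations, a solution emerged: handling matrix dimensions correctly during deduplication restored correct behavior. Historically, undocumented SQL constraints were a major obstacle. Undocumented limits in user authentication led to unexplained compatibility problems. Other cases involved bug reports and converting to alternative data types to work around problems. Although these seemingly unrelated incidents posed distinct challenges, they ultimately reinforce the importance of clear communication and adaptability across technical projects.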
Original text

I originally told the story over on the other site, but I thought I’d share it here. With a bonus!

I was working on a hardware accessory for the OG iPad. The accessory connected to the iPad over USB and provided MIDI in/out and audio in/out appropriate for a musician trying to lay down some tracks in GarageBand.

It was a winner of a product because at its core, it was based on a USB product we had already been making for PCs for almost a decade. All we needed was a little microcontroller to put the iPad into USB host mode (this was in the 30-pin connector days), and then allow it to connect to what was basically a finished product.

This product was so old in fact that nobody knew how to compile the source code. When it came time to get it working, someone had to edit the binaries to change the USB descriptors to reflect the new product name and that it drew <10mA from the iPad’s USB port (the original device was port-powered, but the iPad would get angry if you requested more than 10mA even if you were self-powered). This was especially silly because the original product had a 4-character name, but the new product had a 7-character name. We couldn’t make room for the extra bytes, so we had to truncate the name to fit it into the binary without breaking anything.

Anyway, product ships and we notice a problem. Every once in a while, a MIDI message is missed. For those of you not familiar, MIDI is used to transmit musical notes that can be later turned into audio by whatever processor/voice you want. A typical message contains the note (A, B, F-sharp, etc), a velocity (how hard you hit the key), and whether it’s a key on or key off. So pressing and releasing a piano key generate two separate messages.
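
For anyone who hasn't seen the bytes, here is a minimal Python sketch of the kind of message described above. The status values 0x90 (note-on) and 0x80 (note-off) are from the MIDI 1.0 spec; the function names are made up for illustration and have nothing to do with the actual firmware.

```python
NOTE_OFF = 0x80  # status nibble for "key released" (MIDI 1.0)
NOTE_ON = 0x90   # status nibble for "key pressed" (MIDI 1.0)


def note_message(on, note, velocity, channel=0):
    """Build a 3-byte MIDI channel message: status, note number, velocity."""
    status = (NOTE_ON if on else NOTE_OFF) | (channel & 0x0F)
    return bytes([status, note & 0x7F, velocity & 0x7F])


def describe(msg):
    """Decode a 3-byte note message into (kind, channel, note, velocity)."""
    kind = "on" if msg[0] & 0xF0 == NOTE_ON else "off"
    return kind, msg[0] & 0x0F, msg[1], msg[2]


# Pressing and releasing middle C (note 60) produces two separate messages.
press = note_message(True, 60, 100)
release = note_message(False, 60, 0)
```

Losing `release` while `press` gets through is exactly the situation that leaves a note hanging.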

Missing the occasional note message wouldn’t typically be a big deal except for instrument voices with infinite sustain like a pipe organ. If you had the pipe organ voice selected when using our device, it’s possible that it would receive a key on, but not a key off. This would result in the iPad assuming that you were holding the key down indefinitely.

There isn’t an official spec for what to do if you receive another key-on of the same note without a key-off in between, but Apple handled this in the worst way possible. The iPad would only consider the key released if the number of key-ons and key-offs matched. So the only way to release this pipe organ key was to hope for it to skip a subsequent key-on message for the same key and then finally receive the key-off. The odds of this happening are approximately 0%, so most users had to resort to force quitting the app.
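
A toy model of the counting behavior described above, to show why pressing the key again doesn't help. This is only an illustration of the rule as stated, not Apple's code, and the class and method names are invented.

```python
from collections import Counter


class CountingSynth:
    """Toy model: a key sounds while key-ons outnumber key-offs for that note."""

    def __init__(self):
        self.held = Counter()

    def key_on(self, note):
        self.held[note] += 1

    def key_off(self, note):
        if self.held[note] > 0:
            self.held[note] -= 1

    def sounding(self, note):
        return self.held[note] > 0


synth = CountingSynth()
synth.key_on(60)   # key-on arrives
                   # ...the matching key-off is dropped on the way
synth.key_on(60)   # user presses the key again to try to unstick it
synth.key_off(60)  # this key-off arrives, but the counts still differ
stuck = synth.sounding(60)
```

In this model the note only stops once some later key-on for the same note is the one that gets dropped, which brings the counts back into balance.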

Rumors flooded the customer message boards about what could cause this behavior. Maybe it was the new iOS update? Maybe you had to close all your other apps? There were a ton of harebrained theories floating around, but nobody had any definitive explanation.

Well I was new to the company and fresh out of college, so I was tasked with figuring this one out.

First step was finding a way to generate the bug. I wrote a Python script that would hammer scales into our product and just listened for a key to get stuck. I can still recall the cacophony of what amounted to an elephant on cocaine slamming on a keyboard for hours on end.
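
The original script isn't shown, so the following is only a guess at its general shape. `hammer_scale`, its parameters, and the `send` callback are all hypothetical; a real run would pass in whatever function writes bytes to the MIDI output port, whereas here the messages are just collected in a list.

```python
import time

C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI note numbers, C4..C5


def hammer_scale(send, notes=C_MAJOR, chord_size=1, hold_s=0.0):
    """Play each scale step as a stack of `chord_size` notes via send(bytes)."""
    sent = 0
    for i in range(len(notes)):
        # Stack every other scale note on top of the current step.
        chord = [notes[(i + 2 * k) % len(notes)] for k in range(chord_size)]
        for n in chord:
            send(bytes([0x90, n, 100]))  # note-on, channel 0
            sent += 1
        time.sleep(hold_s)
        for n in chord:
            send(bytes([0x80, n, 0]))    # note-off, channel 0
            sent += 1
    return sent


# Stand-in for a real MIDI output port: just record what would be sent.
log = []
count = hammer_scale(log.append, chord_size=3)
```

Wrapping the call in an endless loop and listening for a note that never stops is the whole test.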

Eventually, I could reproduce the bug about every 10 minutes. One thing I noticed is that it only happened if multiple keys were pressed simultaneously. Pressing one key at a time would never produce the issue.

Using a fancy cable that is only available to Apple hardware developers, I was able to interrogate the USB traffic going between our product and the iPad. After a loooot of hunting (the USB debugger could only sample a small portion, so I had to hit the trigger right when I heard the stuck note), I was able to show that the offending note-off event was never making it to the iPad. So Apple was not to blame; our firmware was randomly not passing MIDI messages along.

Next step was getting the source to compile. I don’t remember a lot of the details, but it depended on “hex3bin” which I assume was some neckbeard’s version of hex2bin that was “better” for some reason. I also ended up needing to find a Perl script that was buried deep in some university website. I assume that these tools were widely available when the firmware was written 7 years prior, but they took some digging. I still don’t know anything about Perl, but I got it to run.

With firmware compiling, I was able to insert instructions to blink certain LEDs (the device had a few debug LEDs inside that weren’t visible to the user) at certain points in the firmware. There was no live debugger available for the simple 8-bit processor on this thing, so that’s all I had.

What it came down to was a timing issue. The processor needed to handle audio traffic as well as MIDI traffic. It would pause whatever it was doing while handling the audio packets. The MIDI traffic was buffered, so if a key-on or key-off came in while the audio was being handled, it would be addressed immediately after the audio was done.

But it was only single buffered. So if a second MIDI message came in while audio was being handled, the second note would overwrite the first, and that first note would be forever lost. There is a limit to how fast MIDI notes can come in over USB, and the shortest gap between two notes was just barely less than the time it took to process the audio. So if the first note came in just after the processor cut to handling audio, the next note could potentially come in just before the processor cut back.
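
The race can be modeled in a few lines of Python. This is a simplified simulation, not the firmware: the class, the function, and the arrival times (1 µs and 89 µs) are invented for illustration, while the 90 µs and 20 µs handler durations are the figures quoted in this post.

```python
class SingleSlotBuffer:
    """One-message mailbox: a new write replaces whatever is still waiting."""

    def __init__(self):
        self.slot = None
        self.lost = []

    def write(self, msg):
        if self.slot is not None:
            self.lost.append(self.slot)  # previous message is overwritten
        self.slot = msg

    def read(self):
        msg, self.slot = self.slot, None
        return msg


def deliver(arrivals, audio_start_us, audio_len_us):
    """Return (forwarded, lost) for (time_us, msg) arrivals.

    While the audio handler runs, arrivals sit unread in the single slot.
    Outside that window each message is forwarded as soon as it arrives.
    """
    buf = SingleSlotBuffer()
    forwarded = []
    audio_end = audio_start_us + audio_len_us
    for t, msg in sorted(arrivals):
        busy = audio_start_us <= t < audio_end
        if not busy and buf.slot is not None:
            forwarded.append(buf.read())  # audio finished: drain the slot first
        buf.write(msg)
        if not busy:
            forwarded.append(buf.read())
    if buf.slot is not None:
        forwarded.append(buf.read())      # audio finished after the last arrival
    return forwarded, buf.lost


arrivals = [(1, "off C"), (89, "on E")]
slow = deliver(arrivals, audio_start_us=0, audio_len_us=90)
fast = deliver(arrivals, audio_start_us=0, audio_len_us=20)
```

With the slow handler both messages land inside the audio window and the first is overwritten; with the fast handler the window closes before the second message arrives, so nothing is lost.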

Now for the solution. Knowing very little about USB audio processing, but having cut my teeth in college on 8-bit 8051 processors, I knew what kind of functions tended to be slow. I did a Ctrl+F for “%” and found a 16-bit modulo right in the audio processing code.

This 16-bit modulo was just a final check that the correct number of bytes or bits were being sent (expecting remainder zero), so the denominator was going to be the same every time. The way it was written, the compiler assumed that the denominator could be different every time, so in the background it included an entire function for handling 16-bit modulos on an 8-bit processor.

I googled “optimize modulo,” and quickly learned that given a fixed denominator, any 16-bit modulo can be rewritten as three 8-bit modulos.
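
The actual line isn't reproduced here, so the following Python sketch only shows one standard identity that matches the description: split the 16-bit value into two bytes and reduce each one separately. The function name is hypothetical. Python integers never overflow, so this only checks the arithmetic; on a real 8-bit CPU the intermediate sum also has to fit in one byte, which is guaranteed when m * (m - 1) <= 255, i.e. for divisors up to 16.

```python
def mod16_via_8bit(x, m):
    """Compute x % m for 0 <= x < 65536 using three byte-sized modulos."""
    hi, lo = x >> 8, x & 0xFF  # x == 256 * hi + lo
    k = 256 % m                # constant once m is fixed; can be precomputed
    return ((hi % m) * k + (lo % m)) % m


# Example with a fixed divisor of 6: 1000 == 3 * 256 + 232
remainder = mod16_via_8bit(1000, 6)
```

The three runtime modulos are `hi % m`, `lo % m`, and the final `% m`; `256 % m` is known ahead of time when the divisor never changes.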

I tried implementing this single-line change, and the audio processor quickly dropped from 90us per packet to like 20us per packet. This 100% fixed the bug.

Unfortunately, there was no way to field-upgrade the firmware, so that was still a headache for customer service.

As to why this bug never showed up in the preceding 7 years that the USB version of the product was being sold, it was likely because most users only used the device as an audio recorder or MIDI recorder. With only MIDI enabled, no audio is processed, and the bug wouldn’t happen. The iPad however enabled every feature all the time. So the bug was always there. It’s just that nobody noticed it. Edit: also, many MIDI apps don’t do what Apple does and require matching key on/key off events. So if a key gets stuck, pressing it again will unstick it.

So three months of listening to Satan banging his fists on a pipe organ led to a single line change to fix a seven-year-old bug.

TL;DR: 16-bit modulo on an 8-bit processor is slow and caused packets to get dropped.

The bonus is at 4:40 in this video https://youtu.be/DBfojDxpZLY?si=oCUlFY0YrruiUeQq
