The ear does not do a Fourier transform.
How the cochlea computes (2024)

原始链接: https://www.dissonances.blog/p/the-ear-does-not-do-a-fourier-transform

## How the cochlea processes sound

The cochlea, the fluid-filled structure of the inner ear, excels at analyzing sound. It receives vibrations from the eardrum through tiny bones, and these vibrations travel along the basilar membrane. Crucially, the membrane's structure separates frequencies: high frequencies resonate near the base, low frequencies near the apex. Hair cells along the membrane then convert these vibrations into electrical signals via a distinctive "trapdoor" mechanism involving ion channels. The cochlea does *not*, however, perform a standard Fourier transform. Instead, it uses filters that trade off precision in time (when a sound occurred) against precision in frequency. This filtering strategy resembles a hybrid of the wavelet and Gabor transforms and appears optimized for representing natural sounds efficiently. Research suggests that different classes of sound (speech, animal vocalizations, environmental noise) are processed with different balances of temporal and frequency resolution, possibly because human speech evolved to occupy a distinct acoustic niche. Ultimately, the cochlea's design suggests that it prioritizes ecologically relevant representations for effective auditory perception.

## The ear does not perform *a* Fourier transform

This article challenges the common claim that the ear performs a classical Fourier transform on sound. It argues instead that the ear applies a time-localized frequency transform, closer to a wavelet (specifically, somewhere between a wavelet transform and a Gabor transform). This matters because sounds are typically localized in time, which calls for a different approach than the analysis of infinite, continuous signals. The article also proposes that human speech evolved to occupy a distinct "niche" in the space of frequencies and envelope durations, which may have shaped particular features of the ear, and speculates that this reflects a tradeoff between occupying unused acoustic space, optimizing the brain's processing speed, and evolutionary constraints. While the title is somewhat attention-grabbing, the core argument is that the ear's processing goes beyond simple Fourier analysis: it analyzes sound in both the time and frequency domains with nuanced precision, in a sophisticated system involving perceptual modeling and real-time source separation that far exceeds basic frequency decomposition.

Original text

Let’s talk about how the cochlea computes!

The tympanic membrane (eardrum) is vibrated by changes in air pressure (sound waves). Bones in the middle ear amplify and send these vibrations to the fluid-filled, snail-shaped cochlea. Vibrations travel through the fluid to the basilar membrane, which remarkably performs frequency separation: the stiffer, lighter base resonates with high frequency components of the signal, and the more flexible, heavier apex resonates with lower frequencies. Between the two ends, the resonant frequencies decrease logarithmically in space.
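The logarithmic place-frequency relationship can be sketched with Greenwood's function, a standard empirical fit for the human cochlea (the parameter values below are commonly cited approximations, not taken from this article):

```python
import numpy as np

def greenwood(x):
    """Characteristic frequency (Hz) at relative position x along the
    basilar membrane, where x = 0 is the apex and x = 1 is the base.
    Parameter values are commonly cited fits for the human cochlea."""
    A, a, k = 165.4, 2.1, 0.88
    return A * (10 ** (a * x) - k)

# Frequencies sweep roughly 20 Hz at the apex to ~20 kHz at the base,
# spaced logarithmically in between.
for x in np.linspace(0.0, 1.0, 5):
    print(f"position {x:.2f}: ~{greenwood(x):8.1f} Hz")
```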

The hair cells on different parts of the basilar membrane wiggle back and forth at the frequency corresponding to their position on the membrane. But how do wiggling hair cells translate to electrical signals? This mechanoelectrical transduction process feels like it could be from a Dr. Seuss world: springs connected to the ends of hair cells open and close ion channels at the frequency of the vibration, which then cause neurotransmitter release. Bruno calls them “trapdoors”. Here’s a visualization:

It’s clear that the hardware of the ear is well-equipped for frequency analysis. Nerve fibers serve as filters to extract temporal and frequency information about a signal. Below are examples of filters (not necessarily of the ear) shown in the time domain. On the left are filters that are more localized in time, i.e. when a filter is applied to a signal, it is clear when in the signal the corresponding frequency occurred. On the right are filters that have less temporal specificity, but are more uniformly distributed across frequencies than those on the left.
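The two filter families can be mimicked with Gaussian-windowed sinusoids (a sketch, not the ear's actual filters): a Gabor-style atom keeps the same envelope width at every center frequency, while a wavelet-style atom keeps the same number of cycles, so its envelope narrows as frequency rises.

```python
import numpy as np

t = np.linspace(-0.05, 0.05, 2001)  # 100 ms around t = 0, in seconds

def gabor_atom(t, f, sigma=0.01):
    """Sinusoid in a Gaussian envelope of fixed width sigma (seconds):
    the same temporal extent regardless of center frequency f (Hz)."""
    return np.exp(-t**2 / (2 * sigma**2)) * np.cos(2 * np.pi * f * t)

def wavelet_atom(t, f, cycles=5):
    """Sinusoid in a Gaussian envelope spanning a fixed number of
    cycles: the envelope narrows as the center frequency rises."""
    sigma = cycles / (2 * np.pi * f)
    return np.exp(-t**2 / (2 * sigma**2)) * np.cos(2 * np.pi * f * t)

# The wavelet atom's effective duration shrinks with frequency:
for f in (100, 1000):
    print(f"{f} Hz wavelet atom: sigma = {5 / (2 * np.pi * f) * 1e3:.2f} ms")
```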

Wouldn’t it be convenient if the cochlea were doing a Fourier transform, which would fit cleanly into how we often analyze signals in engineering? But no 🙅🏻‍♀️! A Fourier transform has no explicit temporal precision, and resembles something closer to the waveforms on the right; this is not what the filters in the cochlea look like.

We can visualize different filtering schemes, or tiling of the time-frequency domain, in the following figure. In the leftmost box, where each rectangle represents a filter, a signal could be represented at a high temporal resolution (similar to left filters above), but without information about its constituent frequencies. On the other end of the spectrum, the Fourier transform performs precise frequency decomposition, but we cannot tell when in the signal that frequency occurred (similar to right filters). What the cochlea is actually doing is somewhere between a wavelet and Gabor. At high frequencies, frequency resolution is sacrificed for temporal resolution, and vice versa at low frequencies.

In each large box, each rectangle represents a filter. The human ear does not perform a Fourier transform, but rather employs filters that are somewhere between a wavelet and Gabor. From Olshausen & O’Connor 2002.
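One common quantitative handle on "between a wavelet and Gabor" is the equivalent rectangular bandwidth (ERB) of human auditory filters; the Glasberg & Moore (1990) fit below is a standard approximation, not taken from this article. Its bandwidth is nearly constant at low frequencies (Gabor-like) and grows roughly in proportion to center frequency at high frequencies (wavelet-like, i.e. roughly constant Q):

```python
def erb(f_hz):
    """Glasberg & Moore (1990) fit for the equivalent rectangular
    bandwidth (Hz) of the human auditory filter centered at f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

for f in (100, 500, 1000, 4000, 8000):
    q = f / erb(f)  # a constant Q would mean purely wavelet-like tiling
    print(f"{f:5d} Hz: ERB = {erb(f):6.1f} Hz, Q = {q:4.2f}")
```

Q rises from about 3 at 100 Hz toward a plateau near 9 at high frequencies: neither the constant bandwidth of a Gabor tiling nor the strictly constant Q of a wavelet tiling.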

Why would this type of frequency-temporal precision tradeoff be a good representation? One theory, explored in Lewicki 2002, is that these filters are a strategy to reduce the redundancy in the representation of natural sounds. Lewicki performed independent component analysis (ICA) to produce filters maximizing statistical independence, comparing environmental sounds, animal vocalizations, and human speech. The tradeoffs look different for each one, and you can kind of map them to somewhere in the above cartoon.
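A minimal sketch of the ICA step, using synthetic data in place of Lewicki's sound corpora (the patch length, component count, and Laplacian "sources" here are illustrative assumptions, not his setup):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-in for a corpus of short sound clips: sparse (non-Gaussian)
# underlying causes passed through a random linear mixing, mimicking
# the statistical structure ICA is designed to undo.
sources = rng.laplace(size=(5000, 16))
mixing = rng.standard_normal((64, 16))
patches = sources @ mixing.T  # 5000 synthetic 64-sample "clips"

ica = FastICA(n_components=16, random_state=0)
ica.fit(patches)
filters = ica.components_  # rows are learned time-domain filters
print(filters.shape)       # (16, 64)
```

On real sound ensembles, the learned rows take on localized, oscillatory shapes whose time-frequency tradeoffs differ by sound class, which is the comparison Lewicki made.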

It appears that human speech occupies a distinct time-frequency space. Some speculate that speech evolved to fill a time-frequency space that wasn’t yet occupied by other existing sounds.

To drive home a theory we have been hinting at since the outset: forming ecologically relevant representations makes sense, because behavior depends on the environment. For audition, as for other sensory modalities, this appears to be what we are doing. This is a bit of a teaser for efficient coding, which we will get to soon.

We’ve talked about some incredible mechanisms that occur at the beginning of the sensory coding process, but it’s truly just the tiny tip of the iceberg. We also glossed over how these computations occur. The next lecture will zoom into the biophysics of computation in neurons.
