Authors: Moussa Koulako Bala Doumbouya, Dan Jurafsky, Christopher D. Manning
Paper: https://arxiv.org/abs/2506.11035
About once a year, an interesting architecture appears that changes some fundamental building block. Last year it was KAN, which changed the parameterization of the neuron activation function (though a year on, the outcome is unclear: plenty of follow-up work has appeared, but KANs haven't displaced anything anywhere yet). The current work does the same, replacing the similarity function, the classic scalar product of transformers (or cosine similarity, which is roughly the same thing), with a more sophisticated asymmetric function named after Amos Tversky. Jurafsky and Manning are co-authors (KAN had Tegmark as a co-author), so these aren't exactly random people.
Modern deep learning architectures, from CNNs to transformers, are built on a fundamental but often overlooked assumption: similarity between concepts can be measured geometrically using functions like dot product or cosine similarity. While this approach is computationally convenient, cognitive psychology has long known that this geometric model poorly reflects human similarity judgments. As Amos Tversky noted in his landmark 1977 work, human perception of similarity is often asymmetric — we say that a son resembles his father more than the father resembles the son. This asymmetry violates the metric properties inherent in geometric models.
Tversky proposed an alternative: a feature matching model where similarity is a function of common and distinctive features. Despite its psychological plausibility, this model relied on discrete set operations, making it incompatible with the differentiable, gradient-based optimization that underlies modern deep learning. The authors of this paper have elegantly bridged this gap.
The key innovation is a differentiable parameterization of Tversky similarity. The authors propose a dual representation where objects are simultaneously both vectors (as usual, in ℝ^d) and feature sets (this is new). A feature (from a given finite set Ω) is considered "present" in an object if the scalar product of the object vector and the feature vector is positive. This construction allows the traditionally discrete intersection and difference operations to be reformulated as differentiable functions.
The Tversky similarity function is defined as:
S(a, b) = θf(A ∩ B) − αf(A − B) − βf(B − A), (1)
where A and B are the feature sets of objects a and b, and {θ, α, β} are learnable parameters. In this formula, the first term accounts for common features, the second for distinctive features of object a, and the third for distinctive features of object b.
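To make the dual representation concrete, here is a tiny sketch of how the feature set of an object is determined; the toy sizes and variable names are mine, not the paper's:

```python
import torch

torch.manual_seed(0)
d, k = 8, 16                  # toy sizes: embedding dim and number of features in Ω
omega = torch.randn(k, d)     # feature bank Ω, one vector per feature
a = torch.randn(d)            # an object vector

scores = omega @ a            # scalar product of the object with every feature vector
present = scores > 0          # feature set A: the features whose scalar product is positive
print(present.int().tolist())
```

The hard 0/1 mask itself isn't differentiable, so the measures described next work with the positive scalar products themselves (their sum is exactly the salience defined below) rather than with the mask.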
The following functions are defined for features:
Salience or prominence of features of object A is the sum of positive scalar products for features present in the object. A less salient object (e.g., son) is more similar to a more salient object (father) than vice versa.
Intersection (common features) of objects A and B is defined through a function Ψ that aggregates features present in both objects. They tried min, max, product, mean, gmean, and softmin as Ψ.
Difference (features present in the first object but absent in the second) is defined in two ways. The first, ignorematch, only considers features present in A but not in B. The other, subtractmatch, also considers features present in both objects but more pronounced in A.
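Putting the pieces together, here is a minimal sketch of a differentiable Tversky similarity that uses the product reduction for the intersection and ignorematch for the differences; the function name and the hard masks are my simplification, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def tversky_similarity(a, b, omega, theta=1.0, alpha=0.5, beta=0.3):
    """Sketch of formula (1): S(a, b) = θ·f(A ∩ B) − α·f(A − B) − β·f(B − A)."""
    sa = F.relu(omega @ a)            # how strongly each feature of Ω is present in a (0 if absent)
    sb = F.relu(omega @ b)            # same for b

    common = (sa * sb).sum()          # f(A ∩ B) with the product reduction as Ψ
    a_not_b = (sa * (sb == 0)).sum()  # f(A − B), ignorematch: features of a that b lacks entirely
    b_not_a = (sb * (sa == 0)).sum()  # f(B − A)

    return theta * common - alpha * a_not_b - beta * b_not_a

torch.manual_seed(0)
omega = torch.randn(16, 8)            # feature bank Ω (learnable in the real model)
a, b = torch.randn(8), torch.randn(8)
print(tversky_similarity(a, b, omega), tversky_similarity(b, a, omega))  # asymmetric when α ≠ β
```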
Next, Tversky neural networks are defined based on two new building blocks:
Tversky Similarity Layer, analogous to metric similarity functions like dot product or cosine similarity. It defines the similarity of objects a ∈ ℝ^d and b ∈ ℝ^d via function (1) above with learnable {θ, α, β} and returns a scalar.
Tversky Projection Layer, analogous to a fully connected layer. It is a nonlinear projection of a vector a ∈ ℝ^d that computes the similarity of the input a to each of p prototypes, each in ℝ^d, so the output is a vector in ℝ^p. Essentially, it projects the input onto a set of learned "prototypical" vectors Π, but instead of the plain scalar product it uses the more expressive nonlinear Tversky similarity from the Tversky Similarity Layer.
In total, the learnable parameters in Tversky neural networks include:
prototype vectors, Π
feature vectors, Ω
weights α, β, θ
Π and Ω can be shared between different layers.
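For concreteness, here is a sketch of what a Tversky projection layer could look like as a drop-in replacement for a linear layer; the shapes, initialization, and the product/ignorematch choices are mine, and the paper's parameterization may differ in details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TverskyProjection(nn.Module):
    """Maps x ∈ ℝ^d to p Tversky similarities against learned prototypes Π (a sketch)."""
    def __init__(self, d, p, k):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(p, d) / d**0.5)  # Π: p prototype vectors
        self.features = nn.Parameter(torch.randn(k, d) / d**0.5)    # Ω: k feature vectors
        self.theta = nn.Parameter(torch.tensor(1.0))
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, x):                                  # x: (batch, d)
        sx = F.relu(x @ self.features.T)                   # (batch, k) feature strengths of inputs
        sp = F.relu(self.prototypes @ self.features.T)     # (p, k) feature strengths of prototypes

        common = sx.unsqueeze(1) * sp.unsqueeze(0)         # product reduction over shared features
        x_only = sx.unsqueeze(1) * (sp.unsqueeze(0) == 0)  # ignorematch: input features the prototype lacks
        p_only = sp.unsqueeze(0) * (sx.unsqueeze(1) == 0)  # prototype features the input lacks

        return (self.theta * common.sum(-1)
                - self.alpha * x_only.sum(-1)
                - self.beta * p_only.sum(-1))              # (batch, p)

layer = TverskyProjection(d=64, p=10, k=32)
print(layer(torch.randn(4, 64)).shape)                     # torch.Size([4, 10])
```

Sharing Π or Ω between layers would then simply mean passing the same parameter tensors to several such modules.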
This new layer is inherently more powerful. The authors constructively show that a single Tversky projection layer can model the nonlinear XOR function, which is impossible for a single linear layer, demonstrating its increased expressive capacity.
However, there's a nuance here. The XOR example, where the input is a vector of two numbers between which XOR is computed, uses 11 learnable parameters. Well okay, a single classical neuron can't do XOR, but a simple network with one hidden layer of two neurons plus one output neuron already can, and that's only 6 learnable weights (two per neuron; biases aren't even needed, see the check below). Even with biases it's 9 weights, still fewer than 11. So it's a questionable advantage; I too can pack all of this into one neuron of a new type and declare it more powerful.
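A quick check of that count, with the six weights picked by hand rather than learned:

```python
import torch

# XOR with a bias-free 2-2-1 ReLU network: 4 + 2 = 6 weights total
W1 = torch.tensor([[1., -1.], [-1., 1.]])   # hidden layer: h1 = relu(x1 - x2), h2 = relu(x2 - x1)
w2 = torch.tensor([1., 1.])                 # output: h1 + h2 = |x1 - x2|, which is XOR on {0,1}²

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.relu(x @ W1.T) @ w2
print(y.tolist())                           # [0.0, 1.0, 1.0, 0.0]
```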
And the prototype and feature vectors in Tversky networks still have to be learned; in the experiments, not every initialization succeeded. The hyperparameters (how many prototype and feature vectors to use) also need to be chosen somehow.
Replacing standard linear layers with Tversky projection layers leads to notable improvements in various domains:
Image Recognition: When adapting frozen ResNet-50 for classification tasks, using a Tversky projection layer instead of linear at the output (=Tversky-ResNet-50) led to accuracy improvements from 36.0% to 44.9% (NABirds) and from 57.4% to 62.3% (MNIST). On unfrozen networks, it's not as noticeable; when training from scratch, it's slightly more noticeable.
Language Modeling: When training GPT-2 small from scratch on the Penn Treebank dataset with shared (tied) prototypes, replacing linear layers with Tversky layers (=TverskyGPT-2) simultaneously reduced perplexity by 7.5% and reduced parameter count by 34.8%.
Prototype and feature vectors were initialized randomly everywhere. Product and ignorematch were used as the reductions for intersections and differences, respectively.
Another useful property of the model is interpretability. Most modern XAI methods, such as LIME or Grad-CAM, are post-hoc, meaning they try to explain model decision-making from the outside, after it has been trained as a black box. Unlike them, the Tversky framework is interpretable by design. Its fundamental operations are based on psychologically intuitive concepts of common and distinctive features, providing a built-in, transparent language for explaining model reasoning.
The authors present a new visualization technique in the input data space that allows visualizing learned prototypes and features directly in the input space. Using MNIST as an example, they show that the Tversky layer learns prototypes and features that correspond to recognizable, human-interpretable strokes and curves of handwritten digits. In contrast, the baseline linear layer learns opaque, uninterpretable textural patterns. This enables principled explanation of model decisions in terms of common and distinctive features.
The authors even discovered that trained models consistently learn parameters where α > β. This means that distinctive features of the input ("son") are given greater weight than distinctive features of the prototype ("father"). This is direct confirmation of Tversky's original salience hypothesis and shows that the model not only works well but also learns in a psychologically consistent manner.
Cool work overall. It begs for extension to transformers and attention mechanisms. In the paper, they only applied it to projection blocks.