通过揭开变压器背后的数学原理来了解变压器的工作原理

通过揭开变压器背后的数学原理来了解变压器的工作原理
Understand how transformers work by demystifying the math behind them

原始链接: https://osanseviero.github.io/hackerllama/blog/posts/random_transformer/

1.位置编码的目的是为神经网络提供空间感知。如果没有它，模型就无法区分“the”、“cat”、“sat”和“on”，因为它们具有相似的模式。位置编码提供了一种映射单词在句子中的相对位置的方法，尽管由于自然语言的变化，句子之间的确切顺序可能会有所不同。 2. 自注意力仅关注输入本身，而编码器-解码器注意力同时关注输入本身和前一层的输出。换句话说，自注意力计算所有输入对之间的相似度分数，而编码器-解码器注意力通过使用早期的隐藏状态来计算当前输入和先前输入的中间表示之间的注意力权重。两种方法都涉及将查询向量与关键向量相乘，然后对这些值应用注意力权重。 3. 如果注意力维度太小，可能会导致表征能力不足，因为模型将难以区分不同单词的含义。另一方面，注意力维度太大会导致计算成本很高，因为维度越大需要更高的矩阵分解。 4. 简要描述前馈层的结构，涉及将数据传递到多个层，这些层主要由卷积层和 ReLU 或 sigmoid 等激活函数交替组成。每层在将输出馈送到后续层之前都会应用非线性变换。 5. 由于两个原因，解码器比编码器慢。首先，解码器接受编码器生成的潜在状态，因此应该考虑整个历史记录而不仅仅是当前帧。其次，解码器必须计算跨模式交互，同时对输入帧和目标帧进行建模。这使得解码器的计算成本昂贵。 6. 残余连接允许直接访问输入特征，同时跳过前面的特征提取阶段。它们还减少了深层网络中由多层引起的梯度消失问题。此外，层归一化可以实现更快的收敛，有助于稳定训练损失并提高模型性能，特别是在处理具有挑战性的输入等情况下。关于第三个问题，从解码器输出到概率是通过应用softmax函数来获得每个可能输出对应的概率分布

听起来很有趣！感谢您分享这个例子，它与我们所听到的关于变压器如何改善包括电子产品在内的各个领域的说法完全一致。关于您之前关于寻找资源以更好地理解注意力机制的问题，Allen Zhu 题为“META：用于问答的多头变压器”的论文提供了针对 QA 任务的变压器架构的详细分析。此外，他的配套博客文章“通过示例理解注意力机制”提供了对注意力机制的实用见解，分别探索了图像和文字中的视觉和文本注意力。这两个资源为掌握 Transformer 中注意力机制的机制提供了一些宝贵的经验教训，并且对于获得更全面的理解特别有用。

原文

In this blog post, we’ll do an end-to-end example of the math within a transformer model. The goal is to get a good understanding of how the model works. To make this manageable, we’ll do lots of simplification. As we’ll be doing quite a bit of the math by hand, we’ll reduce the dimensions of the model. For example, rather than using embeddings of 512 values, we’ll use embeddings of 4 values. This will make the math easier to follow! We’ll use random vectors and matrices, but you can use your own values if you want to follow along.

As you’ll see, the math is not that complicated. The complexity comes from the number of steps and the number of parameters. I recommend you to read the The Illustrated Transformer blog before reading this blog post (or reading in parallel). It’s a great blog post that explains the transformer model in a very intuitive (and illustrative!) way and I don’t intend to explain what it’s already explained there. My goal is to explain the “how” of the transformer model, not the “what”. If you want to dive even deeper, check out the famous original paper: Attention is all you need.

Prerequisites

A basic understanding of linear algebra is required - we’ll mostly do simple matrix multiplications, so no need to be an expert. Apart from that, basic understanding of Machine Learning and Deep Learning will be useful.

What is covered here?

An end-to-end example of the math within a transformer model during inference
An explanation of attention mechanisms
An explanation of residual connections and layer normalization
Some code to scale it up!

Without further ado, let’s get started! Our goal will be to use the transformer model as a translation tool, so we’ll pass an input to the model expecting it to generate the translation. For example, we could pass “Hello World” in English and expect “Hola Mundo” in Spanish.

Let’s take a look at the diagram of the transformer beast (don’t be intimidatd by it, you’ll soon understand it!):

Transformer model from the original “attention is all you need” paper

The original transformer model has two parts: encoder and decoder. The encoder focus is in “understanding” or “capturing the meaning” of the input text, while the decoder focus is in generating the output text. We’ll first focus on the encoder part.

Encoder

The whole goal of the encoder is to generate a rich embedding representation of the input text. This embedding will capture semantic information about the input, and will then be passed to the decoder to generate the output text. The encoder is composed of a stack of N layers. Before we jump into the layers, we need to see how to pass the words (or tokens) into the model.

Embeddings are a somewhat overused term. We’ll first create an embedding that will be the input to the encoder. The encoder also outputs an embedding (also called hidden states sometimes). The decoder will also receive an embedding! 😅 The whole point of an embedding is to represent a token as a vector.

0. Tokenization

ML models can process numbers, not text. soo we need to turn our input text into numbers. That’s what tokenization does! This is the process of splitting the input text into tokens, each with an associated ID. For example, we could split the text “Hello World” into two tokens: “Hello” and “World”. We could also split it into characters: “H”, “e”, “l”, “l”, “o”, ” “,”W”, “o”, “r”, “l”, “d”. The choice of tokenization is up to us and depends on the data we’re working with.

Word-based tokenization (splitting the text into words) will require a very large vocabulary (all possible tokens). It will also represent words like “dog” and “dogs” or “run” and “running” as different tokens. Character-based vocabulary will require a smaller vocabulary, but will provide less meaning (in can be useful for languages such as Chinese where each character carries more information).

The field has moved towards subword tokenization. This is a middle ground between word-based and character-based tokenization. We’ll split the words into subwords. For example, we could split “tokenization” into “token” and “ization”. How do we decide how to split the words? This is part of training a tokenizer through a statistical process that tries to identify which subwords are the best to pick given a dataset. It’s a deterministic process (unlike training a ML model).

For this blog post, let’s go with word tokenization for simplicity. Our goal will be to translate “Hello World” from English to Spanish. Given an example “Hello World”, we’ll split into tokens: “Hello” and “World”. Each token has an associated ID defined in the model’s vocabulary. For example, “Hello” could be token 1 and “World” could be token 2.

1. Embedding the text

Although we could pass the token IDs to the model (e.g. 1 and 2), these numbers don’t carry any meaning. We need to turn them into vectors (list of numbers). This is what embedding does! The token embeddings map a token ID to a fixed-size vector with some semantic meaning of the tokens**. These brings some interesting properties: similar tokens will have a similar embedding (in other words, calculating the cosine similarity between two embeddings will give us a good idea of how similar the tokens are).

Note that the mapping from a token to an embedding is learned. Although we could use a pre-trained embedding such as word2vec or GloVe, transformers models learn these embeddings as part of their training. This is a big advantage as the model can learn the best representation of the tokens for the task at hand. For example, the model could learn that “dog” and “dogs” should have similar embeddings.

All embeddings in a single model have the same size. The original transformer used a size of 512, but let’s do 4 for our example so we can keep the maths manageable. I’ll assign some random values to each token (as mentioned, this mapping is usually learned by the model).

Hello -> [1,2,3,4]

World -> [2,3,4,5]

After releasing this blog post, multiple persons raised questions about the embeddings above. I was a bit lazy and just wrote down some numbers that will make for some nice math below. In practice, these numbers would be learned by the model. I’ve updated the blog post to make this clearer. Thanks to everyone who raised this question!

We can estimate how similar these vectors are using cosine similarity, which would be too high for the vectors above. In practice, a vector would likely look something like [-0.071, 0.344, -0.12, 0.026, …, -0.008].

We can represent our input as a single matrix

\[ E = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 3 & 4 & 5 \end{bmatrix} \]

Although we could manage the two embeddings as separate vectors, it’s easier to manage them as a single matrix. This is because we’ll be doing matrix multiplications as we move forward!

2 Positional encoding

The embedding above has no information about the position of the word in the sentence, so we need to feed some positional information. The way we do this is by adding a positional encoding to the embedding.

There are different choices on how to obtain these - we could use a learned embedding or a fixed vector. The original paper uses a fixed vector as they see almost no difference between the two approaches (see section 3.5 of the original paper). We’ll use a fixed vector as well. Sine and cosine functions have a wave-like pattern, and they repeat over time. By using these functions, each position in the sentence gets a unique yet consistent positional encoding. Given they repeat over time, it can help the model more easily learn patterns like proximity and distance between elements. These are the functions they use in the paper (section 3.5):

\[ PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

\[ PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

The idea is to interpolate between sine and cosine for each value in the embedding (even indices will use sine, odd indices will use cosine). Let’s calculate them for our example!

For “Hello”

i = 0 (even): PE(0,0) = sin(0 / 10000^(0 / 4)) = sin(0) = 0
i = 1 (odd): PE(0,1) = cos(0 / 10000^(2*1 / 4)) = cos(0) = 1
i = 2 (even): PE(0,2) = sin(0 / 10000^(2*2 / 4)) = sin(0) = 0
i = 3 (odd): PE(0,3) = cos(0 / 10000^(2*3 / 4)) = cos(0) = 1

For “World”

i = 0 (even): PE(1,0) = sin(1 / 10000^(0 / 4)) = sin(1 / 10000^0) = sin(1) ≈ 0.84
i = 1 (odd): PE(1,1) = cos(1 / 10000^(2*1 / 4)) = cos(1 / 10000^0.5) ≈ cos(0.01) ≈ 0.99
i = 2 (even): PE(1,2) = sin(1 / 10000^(2*2 / 4)) = sin(1 / 10000^1) ≈ 0
i = 3 (odd): PE(1,3) = cos(1 / 10000^(2*3 / 4)) = cos(1 / 10000^1.5) ≈ 1

So concluding

“Hello” -> [0, 1, 0, 1]
“World” -> [0.84, 0.99, 0, 1]

Note that these encodings have the same dimension as the original embedding.

While we use sine and cosine as the original paper, there are other ways to do this. BERT, a very popular transformer, use trainable positional embeddings.

3. Add positional encoding and embedding

We now add the positional encoding to the embedding. This is done by adding the two vectors together.

“Hello” = [1,2,3,4] + [0, 1, 0, 1] = [1, 3, 3, 5] “World” = [2,3,4,5] + [0.84, 0.99, 0, 1] = [2.84, 3.99, 4, 6]

So our new matrix, which will be the input to the encoder, is:

\[ E = \begin{bmatrix} 1 & 3 & 3 & 5 \\ 2.84 & 3.99 & 4 & 6 \end{bmatrix} \]

If you look at the original paper’s image, what we just did is the bottom left part of the image (the embedding + positional encoding).

4. Self-attention

4.1 Matrices Definition

We’ll now introduce the concept of multi-head attention. Attention is a mechanism that allows the model to focus on certain parts of the input. Multi-head attention is a way to allow the model to jointly attend to information from different representation subspaces. This is done by using multiple attention heads. Each attention head will have its own K, V, and Q matrices.

Let’s use 2 attention heads for our example. We’ll use random values for these matrices. Each matrix will be a 4x3 matrix. With this, each matrix will transform the 4-dimensional embeddings into 3-dimensional keys, values, and queries. This reduces the dimensionality for attention mechanism, which helps in managing the computational complexity. Note that using a too small attention size will hurt the performance of the model. Let’s use the following values (just random values):

For the first head

\[ \begin{align*} WK1 &= \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad WV1 &= \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad WQ1 &= \begin{bmatrix} 0 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \end{align*} \]

For the second head

\[ \begin{align*} WK2 &= \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad WV2 &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}, \quad WQ2 &= \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{bmatrix} \end{align*} \]

4.2 Keys, queries, and values calculation

We now need to multiply our input embeddings with the weight matrices to obtain the keys, queries, and values.

Key calculation

\[ \begin{align*} E \times WK1 &= \begin{bmatrix} 1 & 3 & 3 & 5 \\ 2.84 & 3.99 & 4 & 6 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \\ &= \begin{bmatrix} (1 \times 1) + (3 \times 0) + (3 \times 1) + (5 \times 0) & (1 \times 0) + (3 \times 1) + (3 \times 0) + (5 \times 1) & (1 \times 1) + (3 \times 0) + (3 \times 1) + (5 \times 0) \\ (2.84 \times 1) + (3.99 \times 0) + (4 \times 1) + (6 \times 0) & (2.84 \times 0) + (4 \times 1) + (4 \times 0) + (6 \times 1) & (2.84 \times 1) + (4 \times 0) + (4 \times 1) + (6 \times 0) \end{bmatrix} \\ &= \begin{bmatrix} 4 & 8 & 4 \\ 6.84 & 9.99 & 6.84 \end{bmatrix} \end{align*} \]

Ok, I actually do not want to do the math by hand for all of these - it gets a bit repetitive plus it breaks the site. So let’s cheat and use NumPy to do the calculations for us.

We first define the matrices

import numpy as np

WK1 = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]])
WV1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 1], [0, 1, 0]])
WQ1 = np.array([[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])

WK2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0], [0, 1, 0]])
WV2 = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]])
WQ2 = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

4.3.1 Dot product of query with each key

4.3.2 Divide by square root of dimension of key vector

4.3.3 Apply softmax function

4.3.4 Multiply value matrix by attention weights

4.3.5 Heads’ attention output

5.3.1 Mean and variance

5.3.2 Normalize

1. Embedding the text

2. Positional encoding

3. Add positional encoding and embedding

4. Self-attention

5. Residual connection and layer normalization

6. Encoder-decoder attention

7. Residual connection and layer normalization

8. Feed-forward layer

9. Encapsulating everything: The Random Decoder

1. Linear layer

2. Softmax

3. The Random Encoder-Decoder Transformer