Tag: transformer

Published 2024-09-30. Updated 2024-11-05. 6 min read (about 854 characters).

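To show in detail how vectors are computed when a Transformer performs machine translation, we start from the English input sentence "I went to Northeastern University" and its Chinese reference translation "我去东北大学", and outline the steps from word embedding and attention computation through to the decoder's multi-head attention and final generation. To keep the arithmetic easy to follow by hand, the vector dimension and the number of vectors are scaled down drastically.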

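1. Word Embedding

Suppose the model uses an embedding dimension of $ d_{\text{model}} = 4 $ (a simplification; 512 or 1024 is typical in practice). The embedding layer maps each word in the sentence to a vector of length 4. The simplified embedding vectors below are illustrative; in a real model they come from pretraining or are learned during training.

English sentence: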

“I”: $ \mathbf{E}_{I} = [0.1, 0.3, 0.2, 0.4] $
“went”: $ \mathbf{E}_{went} = [0.5, 0.6, 0.7, 0.8] $
“to”: $ \mathbf{E}_{to} = [0.9, 0.1, 0.3, 0.5] $
“Northeastern”: $ \mathbf{E}_{Northeastern} = [0.2, 0.4, 0.6, 0.8] $
“University”: $ \mathbf{E}_{University} = [0.3, 0.2, 0.1, 0.5] $

Stacking these vectors row by row gives a matrix that represents the whole sentence:
\[
\mathbf{E}_{\text{en}} = \begin{bmatrix}
0.1 & 0.3 & 0.2 & 0.4 \\
0.5 & 0.6 & 0.7 & 0.8 \\
0.9 & 0.1 & 0.3 & 0.5 \\
0.2 & 0.4 & 0.6 & 0.8 \\
0.3 & 0.2 & 0.1 & 0.5
\end{bmatrix}
\]
Dimensions: $5 \times 4$ (5 words, each with a 4-dimensional embedding).

Chinese sentence:

"我": $ \mathbf{E}_{\text{我}} = [0.3, 0.1, 0.2, 0.5] $
"去": $ \mathbf{E}_{\text{去}} = [0.6, 0.3, 0.4, 0.7] $
"东北大学": $ \mathbf{E}_{\text{东北大学}} = [0.8, 0.5, 0.6, 0.9] $

The embedding matrix of the Chinese sentence is:
\[
\mathbf{E}_{\text{zh}} = \begin{bmatrix}
0.3 & 0.1 & 0.2 & 0.5 \\
0.6 & 0.3 & 0.4 & 0.7 \\
0.8 & 0.5 & 0.6 & 0.9
\end{bmatrix}
\]
Dimensions: $3 \times 4$ (3 words, each with a 4-dimensional embedding).
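The embedding step amounts to a table lookup followed by stacking rows. Below is a minimal NumPy sketch of that idea, using the toy vectors from this section. The names `embedding_table` and `embed` are made up for this illustration and do not come from any library; a real model stores these rows as learned parameters.

```python
import numpy as np

# Toy embedding table copied from the example in this section (d_model = 4).
# ASSUMPTION: a plain dict stands in for a learned embedding layer.
embedding_table = {
    "I": [0.1, 0.3, 0.2, 0.4],
    "went": [0.5, 0.6, 0.7, 0.8],
    "to": [0.9, 0.1, 0.3, 0.5],
    "Northeastern": [0.2, 0.4, 0.6, 0.8],
    "University": [0.3, 0.2, 0.1, 0.5],
    "我": [0.3, 0.1, 0.2, 0.5],
    "去": [0.6, 0.3, 0.4, 0.7],
    "东北大学": [0.8, 0.5, 0.6, 0.9],
}


def embed(tokens):
    """Stack each token's vector into a (len(tokens), d_model) matrix."""
    return np.array([embedding_table[token] for token in tokens])


E_en = embed(["I", "went", "to", "Northeastern", "University"])
E_zh = embed(["我", "去", "东北大学"])

print(E_en.shape, E_zh.shape)  # (5, 4) (3, 4)
```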

2. Positional Encoding

To inject word-order information, a positional encoding is added to the embeddings. Suppose the positional encoding matrices are as follows (illustrative values):

For the 5 English words:
\[
\mathbf{P}_{\text{en}} = \begin{bmatrix}
0.1 & 0.2 & 0.3 & 0.4 \\
0.2 & 0.3 & 0.4 & 0.5 \\
0.3 & 0.4 & 0.5 & 0.6 \\
0.4 & 0.5 & 0.6 & 0.7 \\
0.5 & 0.6 & 0.7 & 0.8
\end{bmatrix}
\]
For the 3 Chinese words:
\[
\mathbf{P}_{\text{zh}} = \begin{bmatrix}
0.1 & 0.2 & 0.3 & 0.4 \\
0.2 & 0.3 & 0.4 & 0.5 \\
0.3 & 0.4 & 0.5 & 0.6
\end{bmatrix}
\]

With positional encoding applied, the final input is the element-wise sum of the word embeddings and the positional encodings:

English embeddings + positional encoding:

\[
\mathbf{E}_{\text{en}} + \mathbf{P}_{\text{en}} = \begin{bmatrix}
0.1+0.1 & 0.3+0.2 & 0.2+0.3 & 0.4+0.4 \\
0.5+0.2 & 0.6+0.3 & 0.7+0.4 & 0.8+0.5 \\
0.9+0.3 & 0.1+0.4 & 0.3+0.5 & 0.5+0.6 \\
0.2+0.4 & 0.4+0.5 & 0.6+0.6 & 0.8+0.7 \\
0.3+0.5 & 0.2+0.6 & 0.1+0.7 & 0.5+0.8
\end{bmatrix} = \begin{bmatrix}
0.2 & 0.5 & 0.5 & 0.8 \\
0.7 & 0.9 & 1.1 & 1.3 \\
1.2 & 0.5 & 0.8 & 1.1 \\
0.6 & 0.9 & 1.2 & 1.5 \\
0.8 & 0.8 & 0.8 & 1.3
\end{bmatrix}
\]

Chinese embeddings + positional encoding:

\[
\mathbf{E}_{\text{zh}} + \mathbf{P}_{\text{zh}} = \begin{bmatrix}
0.3+0.1 & 0.1+0.2 & 0.2+0.3 & 0.5+0.4 \\
0.6+0.2 & 0.3+0.3 & 0.4+0.4 & 0.7+0.5 \\
0.8+0.3 & 0.5+0.4 & 0.6+0.5 & 0.9+0.6
\end{bmatrix} = \begin{bmatrix}
0.4 & 0.3 & 0.5 & 0.9 \\
0.8 & 0.6 & 0.8 & 1.2 \\
1.1 & 0.9 & 1.1 & 1.5
\end{bmatrix}
\]
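The element-wise sums can be checked with a few lines of NumPy. This is only a sanity-check sketch over the toy matrices from this section; the variable names `X_en` and `X_zh` are chosen here for illustration.

```python
import numpy as np

# Toy embeddings and positional encodings copied from this section.
E_en = np.array([
    [0.1, 0.3, 0.2, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 0.1, 0.3, 0.5],
    [0.2, 0.4, 0.6, 0.8],
    [0.3, 0.2, 0.1, 0.5],
])
P_en = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.3, 0.4, 0.5],
    [0.3, 0.4, 0.5, 0.6],
    [0.4, 0.5, 0.6, 0.7],
    [0.5, 0.6, 0.7, 0.8],
])
E_zh = np.array([
    [0.3, 0.1, 0.2, 0.5],
    [0.6, 0.3, 0.4, 0.7],
    [0.8, 0.5, 0.6, 0.9],
])
P_zh = P_en[:3]  # the first three rows match the Chinese positional matrix

# Same-shape arrays add element by element.
X_en = E_en + P_en
X_zh = E_zh + P_zh

print(np.round(X_en, 1))
print(np.round(X_zh, 1))
```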

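3. Multi-Head Self-Attention in the Encoder

In the encoder, each word's input vector passes through three separate linear transformations to produce its Query (Q), Key (K), and Value (V) vectors. Assume each of these vectors also has dimension 4 (this walkthrough uses a single head for simplicity).

For example, for the word "went", the matrices $ W_Q $, $ W_K $, $ W_V $ produce Q, K, V from its position-encoded input $ \mathbf{x}_{\text{went}} = \mathbf{E}_{\text{went}} + \mathbf{P}_{\text{went}} $:

\[
Q_{\text{went}} = W_Q \cdot \mathbf{x}_{\text{went}}, \quad K_{\text{went}} = W_K \cdot \mathbf{x}_{\text{went}}, \quad V_{\text{went}} = W_V \cdot \mathbf{x}_{\text{went}}
\]

Suppose $ W_Q $, $ W_K $, $ W_V $ each have dimensions $ 4 \times 4 $, for example:
\[
W_Q = \begin{bmatrix}
0.2 & 0.1 & 0.3 & 0.5 \\
0.6 & 0.4 & 0.1 & 0.7 \\
0.3 & 0.8 & 0.2 & 0.4 \\
0.5 & 0.9 & 0.6 & 0.1
\end{bmatrix}
\]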

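Then, using the position-encoded vector of "went", $[0.7, 0.9, 1.1, 1.3]$ (its embedding plus its positional encoding):
\[
Q_{\text{went}} = W_Q \cdot \begin{bmatrix} 0.7 \\ 0.9 \\ 1.1 \\ 1.3 \end{bmatrix} = \begin{bmatrix} 0.2 & 0.1 & 0.3 & 0.5 \\ 0.6 & 0.4 & 0.1 & 0.7 \\ 0.3 & 0.8 & 0.2 & 0.4 \\ 0.5 & 0.9 & 0.6 & 0.1 \end{bmatrix} \cdot \begin{bmatrix} 0.7 \\ 0.9 \\ 1.1 \\ 1.3 \end{bmatrix} = \begin{bmatrix} 1.21 \\ 1.80 \\ 1.67 \\ 1.95 \end{bmatrix}
\]
For instance, the first entry is $0.2 \times 0.7 + 0.1 \times 0.9 + 0.3 \times 1.1 + 0.5 \times 1.3 = 1.21$.

The Q, K, and V vectors of all the other words are computed in the same way.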
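Computing Q for every word at once is a single matrix product. The sketch below assumes the position-encoded inputs are stored one word per row, so applying $ W_Q $ to each row becomes `X @ W_Q.T`; the row-per-word layout is a choice made for this illustration.

```python
import numpy as np

# Position-encoded English inputs (embedding + positional encoding), one row per word.
X = np.array([
    [0.2, 0.5, 0.5, 0.8],  # I
    [0.7, 0.9, 1.1, 1.3],  # went
    [1.2, 0.5, 0.8, 1.1],  # to
    [0.6, 0.9, 1.2, 1.5],  # Northeastern
    [0.8, 0.8, 0.8, 1.3],  # University
])

# Query projection matrix from the example in this section.
W_Q = np.array([
    [0.2, 0.1, 0.3, 0.5],
    [0.6, 0.4, 0.1, 0.7],
    [0.3, 0.8, 0.2, 0.4],
    [0.5, 0.9, 0.6, 0.1],
])

# Row t of Q equals W_Q applied to the column vector X[t].
Q = X @ W_Q.T

print(Q.shape)          # (5, 4)
print(np.round(Q[1], 2))  # query vector for "went"
```

K and V follow the same pattern once $ W_K $ and $ W_V $ are given.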

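4. Attention Computation

For each word, take the dot product of its Q with the K of every word in the sentence (including itself), scale by $\sqrt{d_k}$, and apply softmax to obtain the attention weights. These weights are then used to form a weighted sum of the V vectors.

For "went", the scaled score against each word $j$ is:
\[
\text{score}_{\text{went}, j} = \frac{Q_{\text{went}} \cdot K_j}{\sqrt{d_k}}
\]
and the attention output is the softmax-weighted sum of the values:
\[
\text{Attention}_{\text{went}} = \sum_j \text{softmax}_j\left(\text{score}_{\text{went}, j}\right) V_j
\]

The attention for every other word is computed step by step in the same way.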
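To make the attention step concrete, here is a minimal single-head sketch of standard scaled dot-product attention (scores, softmax, weighted sum of V) over the toy English inputs. Only $ W_Q $ is given in the text; `W_K` and `W_V` below are random stand-ins and purely an assumption of this example, so the printed numbers carry no meaning beyond showing the mechanics.

```python
import numpy as np

# Position-encoded English inputs, one row per word.
X = np.array([
    [0.2, 0.5, 0.5, 0.8],  # I
    [0.7, 0.9, 1.1, 1.3],  # went
    [1.2, 0.5, 0.8, 1.1],  # to
    [0.6, 0.9, 1.2, 1.5],  # Northeastern
    [0.8, 0.8, 0.8, 1.3],  # University
])

W_Q = np.array([
    [0.2, 0.1, 0.3, 0.5],
    [0.6, 0.4, 0.1, 0.7],
    [0.3, 0.8, 0.2, 0.4],
    [0.5, 0.9, 0.6, 0.1],
])

# ASSUMPTION: the text never specifies W_K and W_V, so random stand-ins are used.
rng = np.random.default_rng(0)
W_K = rng.uniform(0.0, 1.0, size=(4, 4))
W_V = rng.uniform(0.0, 1.0, size=(4, 4))


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T
d_k = K.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)  # (5, 5): every query against every key
weights = softmax(scores)        # each row sums to 1
output = weights @ V             # (5, 4): weighted sum of value vectors

print(np.round(weights[1], 3))   # how much "went" attends to each word
print(np.round(output[1], 3))    # attention output for "went"
```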

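The whole pipeline repeats these matrix multiplications and attention-weighted sums many times, and the words the decoder finally generates are based on these computations.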

Published 2024-09-30. Updated 2024-09-30. 5 min read (about 824 characters).

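How positional encoding is computed in the Transformer

In the Transformer, the positional encoding (PE) is generated with sine and cosine functions. It adds position information to every input position so that the model can make use of word order. The formulas from the original paper, Attention Is All You Need, are:

\[
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
\]
\[
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
\]
where:

$ pos $ is the position of the word in the sequence.
$ i $ indexes pairs of embedding dimensions, so dimensions $ 2i $ and $ 2i+1 $ share one frequency.
$ d_{\text{model}} $ is the embedding dimension.
Even-indexed dimensions use the sine function, and odd-indexed dimensions use the cosine function.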
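As a rough illustration of how the two formulas map onto a single entry of the encoding, the sketch below evaluates one value at a time. The function name `positional_encoding_value` and its parameter `k` (the raw dimension index, from which the pair index is derived) are made up for this example.

```python
import math


def positional_encoding_value(pos, k, d_model):
    """Return PE(pos, k) for position `pos` and dimension index `k`."""
    i = k // 2  # dimensions 2i and 2i+1 share the same frequency
    angle = pos / (10000 ** (2 * i / d_model))
    # Even dimension index: sine. Odd dimension index: cosine.
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)


# Encoding vector for position 1 with d_model = 4.
vector = [positional_encoding_value(1, k, 4) for k in range(4)]
print(vector)
```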

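Steps

1. Fix the dimension and the sequence positions: suppose the embedding dimension is $ d_{\text{model}} = 4 $ (a simplification; 512 is typical in practice) and the input sentence has length 5 (for example, an English sentence of 5 words). We need one vector per position, with each dimension filled in by the sine or cosine formula above.
2. Evaluate the formula: compute the sine or cosine value for every position $ pos $ and every dimension.

Example: let $ d_{\text{model}} = 4 $ and the sequence length be 5.

For each position $ pos = 0, 1, 2, 3, 4 $, compute its encoding in every dimension.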

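The formula in detail

For embedding dimension $ d_{\text{model}} = 4 $:

Even dimension indices $ 2i $: use the sine function $ \sin $
Odd dimension indices $ 2i+1 $: use the cosine function $ \cos $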
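With $ d_{\text{model}} = 4 $ there are only two index pairs, $ i = 0 $ and $ i = 1 $, so the denominator in the formula takes just two values. As a quick worked step:

\[
i = 0:\; 10000^{\frac{2 \cdot 0}{4}} = 10000^{0} = 1, \qquad i = 1:\; 10000^{\frac{2 \cdot 1}{4}} = 10000^{\frac{1}{2}} = 100
\]

Dimensions 0 and 1 therefore use the angle $ pos $, while dimensions 2 and 3 use the angle $ pos / 100 $.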

We now compute the positional encoding for each position $ pos = 0, 1, 2, 3, 4 $ in turn.

3. Computation

For $ pos = 0 $:

\[
PE_{(0, 0)} = \sin\left(\frac{0}{10000^{\frac{0}{4}}}\right) = \sin(0) = 0
\]
\[
PE_{(0, 1)} = \cos\left(\frac{0}{10000^{\frac{0}{4}}}\right) = \cos(0) = 1
\]
\[
PE_{(0, 2)} = \sin\left(\frac{0}{10000^{\frac{2}{4}}}\right) = \sin(0) = 0
\]
\[
PE_{(0, 3)} = \cos\left(\frac{0}{10000^{\frac{2}{4}}}\right) = \cos(0) = 1
\]

So the positional encoding vector for $ pos = 0 $ is:
\[
[0, 1, 0, 1]
\]

For $ pos = 1 $:

\[
PE_{(1, 0)} = \sin\left(\frac{1}{10000^{\frac{0}{4}}}\right) = \sin(1)
\]
\[
PE_{(1, 1)} = \cos\left(\frac{1}{10000^{\frac{0}{4}}}\right) = \cos(1)
\]
\[
PE_{(1, 2)} = \sin\left(\frac{1}{10000^{\frac{2}{4}}}\right) = \sin\left(\frac{1}{100}\right)
\]
\[
PE_{(1, 3)} = \cos\left(\frac{1}{10000^{\frac{2}{4}}}\right) = \cos\left(\frac{1}{100}\right)
\]

Numerical values for $ pos = 1 $ (angles in radians):
\[
\sin(1) \approx 0.8415, \quad \cos(1) \approx 0.5403
\]
\[
\sin\left(\frac{1}{100}\right) \approx 0.01, \quad \cos\left(\frac{1}{100}\right) \approx 0.99995
\]

So the positional encoding vector for $ pos = 1 $ is:
\[
[0.8415, 0.5403, 0.01, 0.99995]
\]

For $ pos = 2 $:

\[
PE_{(2, 0)} = \sin\left(\frac{2}{10000^{\frac{0}{4}}}\right) = \sin(2)
\]
\[
PE_{(2, 1)} = \cos\left(\frac{2}{10000^{\frac{0}{4}}}\right) = \cos(2)
\]
\[
PE_{(2, 2)} = \sin\left(\frac{2}{10000^{\frac{2}{4}}}\right) = \sin\left(\frac{2}{100}\right)
\]
\[
PE_{(2, 3)} = \cos\left(\frac{2}{10000^{\frac{2}{4}}}\right) = \cos\left(\frac{2}{100}\right)
\]

Numerical values for $ pos = 2 $:
\[
\sin(2) \approx 0.9093, \quad \cos(2) \approx -0.4161
\]
\[
\sin\left(\frac{2}{100}\right) \approx 0.02, \quad \cos\left(\frac{2}{100}\right) \approx 0.9998
\]

So the positional encoding vector for $ pos = 2 $ is:
\[
[0.9093, -0.4161, 0.02, 0.9998]
\]

For $ pos = 3 $:

\[
PE_{(3, 0)} = \sin\left(\frac{3}{10000^{\frac{0}{4}}}\right) = \sin(3)
\]
\[
PE_{(3, 1)} = \cos\left(\frac{3}{10000^{\frac{0}{4}}}\right) = \cos(3)
\]
\[
PE_{(3, 2)} = \sin\left(\frac{3}{10000^{\frac{2}{4}}}\right) = \sin\left(\frac{3}{100}\right)
\]
\[
PE_{(3, 3)} = \cos\left(\frac{3}{10000^{\frac{2}{4}}}\right) = \cos\left(\frac{3}{100}\right)
\]

Numerical values for $ pos = 3 $:
\[
\sin(3) \approx 0.1411, \quad \cos(3) \approx -0.9900
\]
\[
\sin\left(\frac{3}{100}\right) \approx 0.03, \quad \cos\left(\frac{3}{100}\right) \approx 0.99955
\]

So the positional encoding vector for $ pos = 3 $ is:
\[
[0.1411, -0.9900, 0.03, 0.99955]
\]

For $ pos = 4 $:

\[
PE_{(4, 0)} = \sin\left(\frac{4}{10000^{\frac{0}{4}}}\right) = \sin(4)
\]
\[
PE_{(4, 1)} = \cos\left(\frac{4}{10000^{\frac{0}{4}}}\right) = \cos(4)
\]
\[
PE_{(4, 2)} = \sin\left(\frac{4}{10000^{\frac{2}{4}}}\right) = \sin\left(\frac{4}{100}\right)
\]
\[
PE_{(4, 3)} = \cos\left(\frac{4}{10000^{\frac{2}{4}}}\right) = \cos\left(\frac{4}{100}\right)
\]

Numerical values for $ pos = 4 $:
\[
\sin(4) \approx -0.7568, \quad \cos(4) \approx -0.6536
\]
\[
\sin\left(\frac{4}{100}\right) \approx 0.04, \quad \cos\left(\frac{4}{100}\right) \approx 0.9992
\]

So the positional encoding vector for $ pos = 4 $ is:
\[
[-0.7568, -0.6536, 0.04, 0.9992]
\]

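4. Summary

Finally, for a sentence of 5 words, the positional encoding matrix is:
\[
PE = \begin{bmatrix}
0 & 1 & 0 & 1 \\
0.8415 & 0.5403 & 0.01 & 0.99995 \\
0.9093 & -0.4161 & 0.02 & 0.9998 \\
0.1411 & -0.9900 & 0.03 & 0.99955 \\
-0.7568 & -0.6536 & 0.04 & 0.9992
\end{bmatrix}
\]

This positional encoding matrix is added to the word embedding matrix, and the result, now carrying position information, is used as the input to the Transformer.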
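The full matrix can also be produced in one pass with NumPy. The sketch below is one possible vectorized version, assuming an even `d_model`; the function name `sinusoidal_positional_encoding` is chosen for this example, and `E_en` is a toy $5 \times 4$ embedding matrix used only to show the final addition.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a (seq_len, d_model) sinusoidal encoding; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    pair_index = np.arange(d_model // 2)[None, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, 2 * pair_index / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimension indices
    pe[:, 1::2] = np.cos(angles)  # odd dimension indices
    return pe


PE = sinusoidal_positional_encoding(seq_len=5, d_model=4)
print(np.round(PE, 4))

# ASSUMPTION: a toy embedding matrix for a 5-word sentence, for illustration only.
E_en = np.array([
    [0.1, 0.3, 0.2, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 0.1, 0.3, 0.5],
    [0.2, 0.4, 0.6, 0.8],
    [0.3, 0.2, 0.1, 0.5],
])
X = E_en + PE  # position-aware input to the first Transformer layer
print(X.shape)  # (5, 4)
```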

Published 2024-09-29. Updated 2024-09-30. 9 min read (about 1398 characters).

transformer

A simple example helps clarify the roles that the query vector (q), key vector (k), and value vector (v) play in the attention mechanism.