This document explains the mechanics of Multi-Head Attention in Transformers, using the analogy of converting a general biography into a contextual CV tailored for a specific job application.
Result: a (10, 512) Input Matrix (X), i.e. 10 tokens, each embedded as a 512-dimensional vector.
| Token | Vector (Size 512) |
|---|---|
| "I" | [0.1, 0.8, ...] |
| "led" | [0.5, 0.2, ...] |
| "team" | [0.9, 0.1, ...] |
Each head has its own lens via its Wq, Wk, Wv matrices. With d_model = 512 and h = 8, each head works in a subspace of d_k = d_v = 512 / 8 = 64 dimensions.
Q: the question a token asks. K: the description it can be matched against. V: the content it contributes.
Example: "led" might pay 70% attention to "team".
Use the softmax weights to compute a weighted sum of V:
Z = Softmax(Scores) × V → (10, 64)
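Putting the projection, scoring, softmax, and weighted-sum steps together for a single head, a short sketch (random stand-in matrices, used here only to verify the shapes above) looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, h = 512, 8
d_k = d_model // h                      # 64 dimensions per head

X = rng.standard_normal((10, d_model))  # stand-in for the (10, 512) Input Matrix

# This head's private "lens": its own projection matrices.
Wq = rng.standard_normal((d_model, d_k))
Wk = rng.standard_normal((d_model, d_k))
Wv = rng.standard_normal((d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv        # each (10, 64)

# Scaled dot-product scores, softmax over each row, then a weighted sum of V.
scores = Q @ K.T / np.sqrt(d_k)                         # (10, 10)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # e.g. "led" -> 70% on "team"
Z = weights @ V                                         # (10, 64)
print(Z.shape)
```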
"led" now means "leadership in the context of a team".
Repeat steps 2-4 across 8 heads:
This results in the final Contextual CV: tailored, enriched embeddings per token.
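A minimal sketch of the full multi-head pass, assuming 8 heads of 64 dimensions each; the random matrices stand in for the Wq, Wk, Wv, and Wo a trained model would learn:

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h
X = np.random.default_rng(2).standard_normal((10, d_model))

def attention_head(X, seed):
    """One head with its own (randomly initialised) Wq, Wk, Wv."""
    r = np.random.default_rng(seed)
    Wq, Wk, Wv = (r.standard_normal((d_model, d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                           # (10, 64)

Z_heads = [attention_head(X, seed) for seed in range(h)]   # 8 heads, each (10, 64)
concat = np.concatenate(Z_heads, axis=-1)                  # (10, 512)

Wo = np.random.default_rng(99).standard_normal((d_model, d_model))
contextual_cv = concat @ Wo                                # (10, 512) enriched embeddings
print(contextual_cv.shape)
```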
Let's make this more concrete. Imagine our input is "She is a great leader" and we have 2 attention heads. Each head has its own set of Wq, Wk, Wv matrices, allowing them to learn different types of relationships.
The first head (Head 1) might learn to link pronouns to their roles. When processing the word "She", its Query (Q1) seeks out words defining her role. Through the dot product with all Key (K1) vectors, it finds the highest alignment with "leader".
Attention Weights for "She" (Head 1):
The resulting context vector (Z1) for "She" is thus heavily influenced by the Value (V1) vector of "leader". The model understands "She" in the context of being a leader.
The second head (Head 2), with its different Wq, Wk, Wv matrices, might learn to focus on descriptive attributes. When processing "She", its Query (Q2) looks for adjectives or qualities. It discovers the highest alignment with the word "great".
Attention Weights for "She" (Head 2):
The context vector (Z2) for "She" from this head is primarily a representation of the Value (V2) of "great". The model understands "She" in the context of being great.
The model now has two different contextual understandings of "She": Z1, which captures "She" as a leader, and Z2, which captures "She" as great.
These two vectors are concatenated: Concat(Z1, Z2). This combined vector is then passed through a final linear layer (Wo) to produce the final, enriched output vector for "She". This final vector simultaneously understands that "She" is both a "leader" and "great", capturing a much richer meaning than a single attention head could alone.
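The toy sketch below mirrors this two-head example. The attention weights are hand-picked to mimic the behaviour described above (Head 1 locking onto "leader", Head 2 onto "great"); in a real model they would emerge from the learned Wq and Wk matrices, and the tiny 4-dimensional values are chosen purely for readability.

```python
import numpy as np

rng = np.random.default_rng(3)
tokens = ["She", "is", "a", "great", "leader"]
d_v = 4                                        # tiny value dimension, for readability

V1 = rng.standard_normal((len(tokens), d_v))   # Head 1's Value vectors
V2 = rng.standard_normal((len(tokens), d_v))   # Head 2's Value vectors

# Hand-picked attention rows for "She" (a trained model learns these via Q·K).
attn_she_head1 = np.array([0.05, 0.05, 0.05, 0.05, 0.80])  # mostly "leader"
attn_she_head2 = np.array([0.05, 0.05, 0.05, 0.80, 0.05])  # mostly "great"

Z1_she = attn_she_head1 @ V1                   # "She" in the context of "leader"
Z2_she = attn_she_head2 @ V2                   # "She" in the context of "great"

combined = np.concatenate([Z1_she, Z2_she])    # Concat(Z1, Z2), size 8
Wo = rng.standard_normal((2 * d_v, 2 * d_v))
final_she = combined @ Wo                      # enriched output vector for "She"
print(final_she.shape)                         # (8,)
```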
You might have seen models like BERT using a dimension of 768. This number isn't arbitrary; it's the result of combining the outputs of all attention heads. The core idea is to divide the model's total representational power into smaller, specialized subspaces (the heads) and then merge their findings.
Let's break it down with a standard example:
d_v = d_model / h = 768 / 12 = 64
Concatenated Vector = [Z1, Z2, ..., Z12]
Total Size = h * d_v = 12 * 64 = 768
So we arrive back at a 768-dimensional vector for each token. This vector, which aggregates the "perspectives" of all 12 heads, is passed through one final linear projection layer (Wo) to produce the output of the multi-head attention block: also 768-dimensional, ready for the next layer of the Transformer.
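A quick shape check of this arithmetic, assuming the standard BERT-base configuration (d_model = 768, h = 12) and an arbitrary sequence length:

```python
import numpy as np

d_model, h = 768, 12
d_v = d_model // h                     # 64 dimensions per head
seq_len = 10                           # any sequence length works

# Twelve per-head outputs ("reports"), each (seq_len, 64).
Z_heads = [np.zeros((seq_len, d_v)) for _ in range(h)]
concat = np.concatenate(Z_heads, axis=-1)
assert concat.shape == (seq_len, h * d_v) == (seq_len, 768)

# The final projection Wo keeps the 768 width for the next layer.
Wo = np.zeros((d_model, d_model))
output = concat @ Wo
assert output.shape == (seq_len, 768)
```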
A fair question is why concatenation is meaningful at all. Simply sticking vectors together doesn't automatically create a meaningful representation; the magic isn't in the concatenation itself, but in the final linear layer (Wo) that processes the combined vector.
Imagine you are a CEO making a critical decision and you consult a panel of 12 specialists: a financial analyst, a legal expert, a marketing guru, a technical lead, and so on. Each attention head is like one of these specialists. Because each head has its own unique Wq, Wk, and Wv matrices, it learns to focus on a specific type of relationship in the text.
Each 64-dimensional Z vector is a concise "report" from one specialist.
When you concatenate these vectors, you are not blending or averaging them. You are placing all 12 specialist reports side-by-side. The 768-dimensional vector is this "master table" of information.
Concat_Vector = [Report_from_Head1, Report_from_Head2, ..., Report_from_Head12]
The information from the "Syntactician" (Head 1) is preserved in dimensions 1-64. The information from the "Descriptor Analyst" (Head 2) is preserved in dimensions 65-128. The perspectives remain distinct and un-merged at this stage.
This is the crucial step that creates the unified meaning. The concatenated 768-dim vector is passed through one final, learnable linear layer, represented by the weight matrix Wo.
This Wo matrix acts like the CEO. During training, it learns how to read the "master table" of concatenated reports and synthesize them into a single, coherent final output. It learns the complex interactions between the specialists' reports, figuring out the optimal way to mix, weigh, and combine the different perspectives. For instance, it might learn to lean on the Syntactician's report when resolving who a pronoun refers to, and on the Descriptor Analyst's report when deciding how that person is being described.
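A small sketch of both points: concatenation leaves each specialist's report untouched in its own 64-dimension slice, and Wo is where every output dimension becomes a weighted mix of all twelve reports at once (random matrices stand in for the learned ones):

```python
import numpy as np

rng = np.random.default_rng(4)
h, d_v = 12, 64
reports = [rng.standard_normal(d_v) for _ in range(h)]  # one "report" per head

master_table = np.concatenate(reports)                  # 768-dim master table
# Head 1's report sits untouched in dimensions 0-63, Head 2's in 64-127, and so on.
assert np.array_equal(master_table[:d_v], reports[0])
assert np.array_equal(master_table[d_v:2 * d_v], reports[1])

# Wo is where the mixing happens: every output dimension is a weighted
# combination of all 768 inputs, i.e. of every specialist's report at once.
Wo = rng.standard_normal((h * d_v, h * d_v))
synthesised = master_table @ Wo                         # (768,) unified representation
print(synthesised.shape)
```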
In summary, the concatenation is meaningful because it preserves the distinct perspectives from each specialized head. The subsequent linear layer (Wo) then intelligently synthesizes these diverse perspectives into a unified, high-dimensional representation that captures a rich and multi-faceted understanding of each token.