Mastering Tensor Dimensions in Transformers
Prerequisites
You need to understand which tensor shapes are valid for matrix multiplication; it is highly recommended that you go through this space beforehand to gain some understanding of the topic.
Setup
Most generative AI models consist of only a decoder stack. In this blogpost we will go through a simple text generation model, as depicted in the picture below.
First, let's take an input example that we will use as a reference.
The sentence Hello world ! can be divided into 3 tokens: Hello, world and !.
It is also essential to know that there are 2 auxiliary tokens that will be attached to the sentence, the <bos> and <eos> tokens, which represent respectively the beginning-of-sentence and the end-of-sentence tokens.
This will allow the input to be shifted right
If we now tokenize the input, we get a tensor that looks like this: [12, 15496, 2159, 5145]
This will be passed to the model in a batch, adding an extra dimension to the tensor and making it look like this: [[12, 15496, 2159, 5145]]
For the sake of simplicity, we will only focus on the tensor dimensionalities, meaning that the previous input will now be represented as (1, 4), with 1 being the batch size and 4 being the sentence length.
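As a rough sketch of this setup (the token ids above are tokenizer-dependent and are reused here purely for illustration):

```python
import torch

# Token ids from the example above (tokenizer-dependent, shown for illustration only)
input_ids = torch.tensor([[12, 15496, 2159, 5145]])  # the extra brackets add the batch dimension

print(input_ids.shape)  # torch.Size([1, 4]) -> (batch_size, sequence_length)
```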
Embedding Layer
The input tensor will go through each layer, affecting the tensor values, and in some layers even the shapes are affected. For starters, once the tensor reaches the embedding layer it becomes of shape (1, 4, 768), with 768 being the embedding dimension. The embedding layer is one of the most crucial layers in the architecture because :
- the embedding dimension will later propagate across the neural network and will be used extensively in the attention layer
- the embedding layer serves to transform tokens into vectors representing these words (for example king -> 8848 and man -> 9584, two ids between which we can't find any correlation, but in a high-dimensional space we can find similarities between their respective vectors), see the picture below
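Here is a minimal sketch of this layer, assuming a placeholder vocab size of 50257:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50257, 768       # vocab_size is an assumed placeholder
embedding = nn.Embedding(vocab_size, embed_dim)

input_ids = torch.tensor([[12, 15496, 2159, 5145]])   # (1, 4)
hidden = embedding(input_ids)                          # (1, 4, 768)
print(hidden.shape)  # torch.Size([1, 4, 768])
```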
Positional Encoding
This layer does not affect the tensor dimensions, but in this step we inject positional values into the input. The reason behind this is that the input will go through parallel computations later in the architecture, and it is essential to incorporate some information about each token's position in the tensor.
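Here is a minimal sketch using the sinusoidal encoding from the original paper (a learned positional embedding is another common choice); note how the shape stays (1, 4, 768):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, embed_dim: int) -> torch.Tensor:
    """Builds the (seq_len, embed_dim) sinusoidal positional encoding table."""
    position = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, embed_dim, 2) * (-torch.log(torch.tensor(10000.0)) / embed_dim))
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

hidden = torch.randn(1, 4, 768)                           # embedding output
hidden = hidden + sinusoidal_positional_encoding(4, 768)  # broadcast over the batch dim
print(hidden.shape)  # torch.Size([1, 4, 768]) -> unchanged
```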
Decoder Layer
A generative model can have multiple consecutive decoder layers, each one of them having :
- a masked multi-head attention layer
- an add-and-normalize step
- a feed-forward network
Masked Multi-Head Attention
The multi-head attention layer is what allows the model to attend to different parts of the input data, weighing them and representing each token in the context of the entire sentence. The masked multi-head attention variant allows each token to attend to itself and to the previous tokens only, by masking the future tokens out of the attention weights.
The data first goes through 3 parallel linear layers having embedding_dim as their input and output dims, meaning that we would have nn.Linear(768, 768)
and the output will have the same shape as the input.
Each of these outputs is now called Query, Key and Value, and each of them has a tensor shape of (1, 4, 768).
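A minimal sketch of these three projections (the variable names are my own):

```python
import torch
import torch.nn as nn

embed_dim = 768
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)

hidden = torch.randn(1, 4, embed_dim)          # (batch, seq_len, embed_dim)
query, key, value = q_proj(hidden), k_proj(hidden), v_proj(hidden)
print(query.shape, key.shape, value.shape)     # each is torch.Size([1, 4, 768])
```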
Then we split the embedding_dim based on the following equation : embedding_dim = number_heads x head_size, i.e. 768 = 8 x 96, where 8 represents the number of heads and 96 is the head_size.
The tensor now becomes of shape (1, 4, 8, 96).
After that we transpose the dimensions that have 4 and 8 to ensure that the matrix multiplication is done along the sequence_length and the head_size, making the shape of each of these tensors (1, 8, 4, 96).
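A minimal sketch of the split and transpose (the same operation is applied to the key and the value):

```python
import torch

batch_size, seq_len, embed_dim = 1, 4, 768
num_heads, head_size = 8, 96                   # 768 = 8 * 96

query = torch.randn(batch_size, seq_len, embed_dim)             # (1, 4, 768)
query = query.view(batch_size, seq_len, num_heads, head_size)   # (1, 4, 8, 96)
query = query.transpose(1, 2)                                   # (1, 8, 4, 96)
print(query.shape)  # torch.Size([1, 8, 4, 96])
```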
The attention equation as introduced by the original paper is as follows :
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
First we need to apply :
- K^T : the key transpose will be of shape (1, 8, 96, 4)
- QK^T : (1, 8, 4, 96) x (1, 8, 96, 4) = (1, 8, 4, 4) (reminder : what you are looking at is the tensor shapes and not the actual values)
Please visit the matrix multiplication space embedded above to verify all dimensionalities and get a better visual understanding of what is going on.
- Masking : the mask is applied so that each token is represented by itself and by the previous tokens only; this way the generative model does not cheat when generating the next token
- Attention weights : the attention weights are calculated using the formula softmax(QK^T / sqrt(head_size)), where we divide by the square root of the head_size to scale the values and avoid leaving a huge gap between them, while the softmax makes it so that the vector representation of each token sums up to 1; by doing so the -inf values from the mask zero out and the rest of the values stay in the equation, all of them positive. This will not affect the tensor shapes, only their values.
- Calculate the attention : we then multiply the attention weights against the values, (1, 8, 4, 4) x (1, 8, 4, 96) = (1, 8, 4, 96)
Concat : here we restore the number of heads to their original place in the tensor by applying a transpose along the dimensions that have 4 and 8, giving (1, 4, 8, 96), then merge the heads back into the embedding dimension to get (1, 4, 768). A sketch of all these attention steps is shown below.
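Putting the steps above together, here is a minimal sketch of the masked attention computation (the variable names and the use of torch.tril for the causal mask are my own choices):

```python
import math
import torch

batch_size, num_heads, seq_len, head_size = 1, 8, 4, 96
query = torch.randn(batch_size, num_heads, seq_len, head_size)   # (1, 8, 4, 96)
key = torch.randn(batch_size, num_heads, seq_len, head_size)     # (1, 8, 4, 96)
value = torch.randn(batch_size, num_heads, seq_len, head_size)   # (1, 8, 4, 96)

scores = query @ key.transpose(-2, -1) / math.sqrt(head_size)    # (1, 8, 4, 4)

# Causal mask: each token may only attend to itself and to the previous tokens
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))

attention_weights = torch.softmax(scores, dim=-1)                 # (1, 8, 4, 4), rows sum to 1
attention = attention_weights @ value                             # (1, 8, 4, 96)

# Concat: move the heads back and merge them into the embedding dimension
output = attention.transpose(1, 2).reshape(batch_size, seq_len, num_heads * head_size)
print(output.shape)  # torch.Size([1, 4, 768])
```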
Projection Layer
This is a linear layer with the same embedding dimension as input and output, nn.Linear(768, 768),
meaning that the shape stays at (1, 4, 768).
Observation : we have returned to the same shape that we started with before the attention. This is crucial, as we need to conserve the same input shape for future calculations.
Add and normalize
Here there is a skip connection : the values from before and after the attention layer are added together and normalized. The reason why we add these 2 tensors together is that we want to update the values of the tensor, not replace them; as for the normalization, it is there to keep the values from growing too large.
These add-and-normalize operations are applied after each layer to preserve the original tensor characteristics.
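A minimal sketch of this residual connection followed by layer normalization:

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(768)

residual = torch.randn(1, 4, 768)         # tensor before the attention layer
attn_out = torch.randn(1, 4, 768)         # tensor after the attention layer
hidden = layer_norm(residual + attn_out)  # (1, 4, 768), shape is unchanged
print(hidden.shape)  # torch.Size([1, 4, 768])
```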
Feed-Forward
This is usually composed of 2 consecutive linear layers, one that expands and another that retracts, with the possibility of having a dropout layer as well. Together with the activation between them, these layers add a non-linear transformation on top of the embed_dim (number_heads x head_size), enabling the model to capture richer and more complex patterns.
The expansion is usually by some factor, here 3, and the retraction is by the inverse proportion, 1/3, meaning that we would have 2 linear layers that look something like this :
nn.Linear(768, 3*768)
nn.Linear(3*768, 768)
The final output shape will go back to the same input shape, (1, 4, 768). Conserving the shape allows us to also apply an add-and-normalize layer after this feed-forward layer, and since the final output of the decoder has the same shape as the input it was given, we can stack multiple consecutive decoder layers.
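A minimal sketch of such a feed-forward block (the GELU activation and the dropout rate are assumptions, not fixed by the text above):

```python
import torch
import torch.nn as nn

feed_forward = nn.Sequential(
    nn.Linear(768, 3 * 768),   # expansion: (1, 4, 768) -> (1, 4, 2304)
    nn.GELU(),                 # non-linearity (assumed choice)
    nn.Linear(3 * 768, 768),   # retraction: (1, 4, 2304) -> (1, 4, 768)
    nn.Dropout(0.1),           # optional dropout (assumed rate)
)

hidden = torch.randn(1, 4, 768)
print(feed_forward(hidden).shape)  # torch.Size([1, 4, 768])
```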
Language-Model Head
After a series of decoder layers we finally reach a final linear layer that transforms the embed_dim into the vocab_size.
This means that our tensor will be of shape (1, 4, 9735) (in case our vocab size was 9735), where :
- 1 : batch_size
- 4 : sequence_length
- 9735 : vocab_size
If we apply a softmax function and calculate the loss between the model output and the ground truth, as depicted in the picture below, we get the error (or you can call it the loss), which can later be used by the optimizer to update the model weights.
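A minimal sketch of this head and of the loss computation (the ground-truth ids are random placeholders, and F.cross_entropy applies the softmax internally through log-softmax):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 9735
lm_head = nn.Linear(768, vocab_size)

hidden = torch.randn(1, 4, 768)              # output of the decoder stack
logits = lm_head(hidden)                     # (1, 4, 9735)

targets = torch.randint(0, vocab_size, (1, 4))   # placeholder ground-truth token ids, (1, 4)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(logits.shape, loss.item())             # torch.Size([1, 4, 9735]) and a scalar loss
```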
If we go back to the masked multi-head attention layer, we mentioned that the attention for each token is calculated using only the current input token and the previous tokens in that sentence. Since the input is shifted right, this means that to generate a single new token the model will have to look at the vector representations of all the previous tokens in that sentence, each represented by itself and its preceding tokens, which explains how generative AI models work, as represented in the gif below :
Transformers and Cross-Attention
The transformer architecture is composed of an encoder stack and a decoder stack. It is generally used when the context and the output do not share the same representation, for example in translation; nowadays, however, it is not adopted as much, even for translation, in favor of decoder-only architectures.
In the encoder layers, shapes propagate similarly to what we described above, with the only exception that in the attention layer we do not apply an attention mask, allowing each token to be represented by itself and by all of the surrounding tokens in the sentence, both before and after.
Let's take an example and try to check the tensor shapes. If our context is I am at home, the target would be "je suis à la maison"; shifted right, that becomes <bos> je suis à la maison. Tokenizing both (one token per word, for simplicity) gives :
- I am at home : (1, 4)
- <bos> je suis à la maison : (1, 6)
You might have noticed that the input and the target do not have the same sequence lengths, so let's see how this is handled by the cross-attention layer. To save some time, the encoder output would be of shape (1, 4, 768) and the masked multi-head attention layer output would be (1, 6, 768). Note that the Key and Value come from the encoder and the Query comes from the decoder. It is highly recommended that you try calculating the tensor dimensions on your own in the following step, but in case you need to verify your calculations you can find the solution below.
Solution :
The attention formula is as follows : Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
- Query : (1, 6, 768)
- Key & Value : (1, 4, 768)
Split and transpose dims :
- Query : (1, 8, 6, 96)
- Key & Value : (1, 8, 4, 96)
Calculate attention :
- QK^T : (1, 8, 6, 96) x (1, 8, 96, 4) = (1, 8, 6, 4)
- softmax(QK^T / sqrt(d_k)) x V : (1, 8, 6, 4) x (1, 8, 4, 96) = (1, 8, 6, 96)
Concat : (1, 6, 768)
Observation
We have gone back to the same shape as that of the masked multi-head attention layer output, (1, 6, 768).
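A minimal sketch verifying these cross-attention shapes (random tensors stand in for the encoder output and the decoder's masked-attention output, and the Q/K/V linear projections are omitted for brevity):

```python
import math
import torch

num_heads, head_size = 8, 96

def split_heads(x: torch.Tensor) -> torch.Tensor:
    """(batch, seq_len, 768) -> (batch, num_heads, seq_len, head_size)."""
    batch, seq_len, _ = x.shape
    return x.view(batch, seq_len, num_heads, head_size).transpose(1, 2)

encoder_out = torch.randn(1, 4, 768)     # "I am at home"
decoder_hidden = torch.randn(1, 6, 768)  # "<bos> je suis à la maison"

query = split_heads(decoder_hidden)      # (1, 8, 6, 96), from the decoder
key = split_heads(encoder_out)           # (1, 8, 4, 96), from the encoder
value = split_heads(encoder_out)         # (1, 8, 4, 96), from the encoder

scores = query @ key.transpose(-2, -1) / math.sqrt(head_size)   # (1, 8, 6, 4)
attention = torch.softmax(scores, dim=-1) @ value               # (1, 8, 6, 96)
output = attention.transpose(1, 2).reshape(1, 6, 768)           # (1, 6, 768)
print(output.shape)  # torch.Size([1, 6, 768])
```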
Final Words
Hope this blogpost helped you gain a bit more understanding of how attention works, as well as how tensor shapes propagate along the architecture.
Consider clicking on the upvote button if you find this blogpost helpful 🤗
Do not hesitate to reach out through any of my contact information as mentioned in my portfolio not-lain.github.io if you have any feedback or questions.