Visual Guide to Transformer Neural Networks - Position Embeddings
- Shivam Chhetry
- Mar 15, 2022
- 4 min read

In this blog, we'll discuss the input that gets fed to the transformer encoder. We'll start with basic input processing, then move on to the embedding layer, and finally to a very important concept: position embeddings.
Let us understand these concepts better by creating a fun dialogue completer. Using transformers, we feed our model the first halves of thousands of incomplete movie dialogues, and it learns to complete them by generating their second halves.

Let us pick one example dialogue to make things more concrete...

The first step in any machine learning task is input data processing. As decided, the input in our case is going to be the first half of a dialogue. Computers don't understand the English language; the only language they understand is that of matrices and numbers. We therefore have to transform the input text into that language. To do that, we first take all the words present in our training data and create a vocabulary dictionary out of them. If our training data were as big as the whole of Wikipedia, our vocabulary could contain practically every word in the English language.

Next, we assign a numeric index to each word.

Then we pick only the words that occur in the current input text.

What gets fed into the transformer are not the English words but their corresponding indices. Let us denote each index with the letter "X".
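Here is a minimal sketch of this preprocessing step in Python; the example dialogues and the resulting tiny vocabulary are made up purely for illustration:

```python
# Build a toy vocabulary from the training dialogues (made-up example data).
training_dialogues = [
    "may the force be with you",
    "i will be back",
    "you talking to me",
]

vocab = {}  # word -> numeric index
for dialogue in training_dialogues:
    for word in dialogue.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Convert the current input text (first half of a dialogue) into indices "X".
input_text = "may the force"
X = [vocab[word] for word in input_text.split()]
print(X)  # [0, 1, 2]
```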

We're done with processing the input. Next, these indices are passed on to the next layer, the embedding layer. The embedding layer, too, has an index for every word in our vocabulary, and against each of those indices a vector is attached. Initially these vectors are filled with random numbers; later, during the training phase, the model updates them with values that better help it with the assigned task. I've chosen a word embedding size of 5 just for illustration; the original transformer paper, on the other hand, went with an embedding size of 512.
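As a sketch of that lookup table (using PyTorch's nn.Embedding; the vocabulary size here is a placeholder, and the embedding size of 5 is just this post's toy choice):

```python
import torch
import torch.nn as nn

vocab_size = 10_000   # placeholder vocabulary size
embedding_dim = 5     # this post's toy size; the original paper used 512

# One randomly initialised vector per vocabulary index; these weights
# get updated during the training phase.
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Look up the embeddings for some input indices "X" from the previous step.
X = torch.tensor([0, 1, 2])
e = embedding_layer(X)
print(e.shape)  # (3, 5) -- one 5-dimensional vector per input word
```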

What is an Embedding?

Well, these are just vector representations of a given word. Each dimension of the word embedding tries to capture some linguistic feature about that word; these could be things like whether the word is a verb, or an entity, or something else. In reality, since the model decides these features itself during training, it can be non-trivial to find out exactly what information each dimension captures.

Graphically, the values of these dimensions represent the coordinates of the given word in some hyperspace. If two words share similar linguistic features and appear in similar contexts, their embedding values are updated to become closer and closer during the training process.
For example...

Consider the two words "play" and "game". Initially their embeddings are randomly initialised, but during the course of training they may become more and more similar, since these two words often appear in similar contexts.
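As a rough numeric illustration of what "closer" means, here is a cosine-similarity check on invented 5-dimensional vectors (the numbers are made up, not taken from any trained model):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 5-d embeddings, purely for illustration.
play_random = np.array([ 0.8, -0.3,  0.1, -0.9,  0.4])   # at random initialisation
game_random = np.array([-0.5,  0.7, -0.6,  0.2, -0.1])

play_trained = np.array([0.9, 0.1, 0.4, 0.7, 0.2])        # after (hypothetical) training
game_trained = np.array([0.8, 0.2, 0.5, 0.6, 0.3])

print(cosine_similarity(play_random, game_random))    # far from 1: quite different directions
print(cosine_similarity(play_trained, game_trained))  # close to 1: very similar
```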
Hence the embedding layer selects the embeddings corresponding to the input text and passes them further on. Let's denote these embeddings by the letter "e".

Recap

The embedding layer takes input indices and converts them into word embeddings, which then get passed further on to the next layer.
The final component to be discussed is the positional embedding.

Positional Embedding

I will come to what a positional embedding is later, but first:

Why do we need Positional Embeddings?
Consider this: if an LSTM were to take up these embeddings, it would do so sequentially, one embedding at a time, which is why LSTMs are so slow. There is a positive side to this, however: since LSTMs take the embeddings sequentially, in their designated order, they know which word came first, which word came second, and so on. Transformers, on the other hand, take up all the embeddings at once. Even though this is a huge plus and makes transformers much faster, the downside is that they lose the critical information related to word ordering. In simple words, they are not aware of which word came first in the sequence and which came last. That is a problem, and here is why position information matters.
We'll come to positional embeddings in a moment, but first consider two example sentences: "The movie was not good, it was boring" and "The movie was good, it was not boring".

Notice how the position of the single word "not" not only changes the sentiment but also the meaning of the sentence. So what do we do to bring the word-order information back to transformers, without having to make them recurrent like LSTMs?

How about we introduce a new set of vectors containing the position information? Let us call them the position embeddings. We can start by simply adding the word embeddings to their corresponding position embeddings to create new, order-aware word embeddings. Simple enough, but what values should our position embeddings contain?
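Mechanically, the combination is just an element-wise addition of two same-shaped matrices. A minimal sketch, with the position values left as placeholder zeros since choosing them is exactly the open question:

```python
import torch

seq_len, embedding_dim = 4, 5

# Word embeddings "e" for a 4-word input (random here, as before training).
word_embeddings = torch.randn(seq_len, embedding_dim)

# Position embeddings: one vector per position, same size as the word embeddings.
# Placeholder zeros for now -- what values to put here is discussed next.
position_embeddings = torch.zeros(seq_len, embedding_dim)

# Order-aware embeddings: simple element-wise addition.
order_aware_embeddings = word_embeddings + position_embeddings
print(order_aware_embeddings.shape)  # (4, 5)
```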
How about we start by literally filling in the word position numbers, so the first position embedding is all zeros, the next all ones, and so on? Will that work? Not really: adding the position information like that may significantly distort the embedding information, especially for the words that appear later in the text.
What if instead we added fractions? That is, if the text is composed of four words, our position embeddings could simply represent each word's position as a fraction of the total length. That way, the maximum position embedding value will never surpass 1. That doesn't work either: making the position embeddings a function of the total text length means that sentences of different lengths, which they often are, would have different position embeddings for the same position. This may in turn confuse our model, and we don't want that either.
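To make the problem with these two naive schemes concrete, here is a toy sketch (one value per position shown instead of a full vector; the helper names are just for illustration):

```python
# Naive scheme 1: fill position p's embedding with the integer p.
# Values grow without bound and can swamp the word embeddings of later positions.
def integer_positions(seq_len, dim):
    return [[float(pos)] * dim for pos in range(seq_len)]

# Naive scheme 2: fill position p's embedding with p / seq_len.
# Bounded below 1, but the value for a given position now depends on text length.
def fractional_positions(seq_len, dim):
    return [[pos / seq_len] * dim for pos in range(seq_len)]

print(integer_positions(6, 1))        # [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
print(fractional_positions(4, 1)[2])  # position 2 in a 4-word text  -> [0.5]
print(fractional_positions(8, 1)[2])  # position 2 in an 8-word text -> [0.25]
```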
Ideally, the position embedding values at a given position should remain the same irrespective of the total text length or any other factor. So what should we do? Well, the authors of the original transformer paper came up with a clever idea: they used wave frequencies to capture position information.
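In the original paper ("Attention Is All You Need"), each position gets a vector of sines and cosines at different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A compact sketch of that scheme:

```python
import numpy as np

def sinusoidal_position_embeddings(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model) # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_position_embeddings(seq_len=4, d_model=6)
print(pe.shape)  # (4, 6)
```

Because these values depend only on the position and the embedding size, the embedding for, say, position 2 is identical whether the sentence has four words or forty, which is exactly the property we wanted.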