Positional embedding is a technique used in NLP to encode the position of each token in a sequence. Such embeddings are necessary in tasks like machine translation and text summarization, where the model has to understand the order of the tokens.
There are two main types of positional embeddings: absolute and relative. Absolute positional embeddings encode the absolute position of each element in the sequence, using a simple position counter. Relative positional embeddings encode the position of each element relative to the other elements, using a matrix that stores the distance between every pair of elements, as in the sketch below.
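To make the distinction concrete, here is a minimal NumPy sketch (the sequence length of 5 is just an assumed example) showing absolute positions as a simple counter and relative positions as a matrix of pairwise distances:

```python
import numpy as np

seq_len = 5

# Absolute positions: a simple counter over the sequence.
absolute_positions = np.arange(seq_len)          # [0, 1, 2, 3, 4]

# Relative positions: a matrix of pairwise offsets, where entry (i, j)
# is the signed distance from position i to position j.
relative_positions = absolute_positions[None, :] - absolute_positions[:, None]

print(absolute_positions)
print(relative_positions)
```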
Absolute positional embeddings are the simpler and more common choice in standard NLP models, while relative positional embeddings are often preferred when the distance between elements matters more than their absolute location, for example in long-sequence language models and some computer-vision models.
Positional embeddings also facilitate the model's ability to learn long-range dependencies.
Such embeddings are required for transformers because the self-attention mechanism only looks at the content of each word and carries no information about its position. Without this information, the model struggles to understand the order of the words and their relationships within a sentence.
One option is to encode positions with sine and cosine functions, known as sinusoidal positional embeddings; these capture long-range dependencies in sequences without any learned parameters. Alternatively, learned positional embeddings are estimated from data during training, and can be more effective for some tasks, though they require more training data than the fixed sine-and-cosine encoding.
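As a rough sketch of the learned variant (assuming PyTorch, with max_len and d_model chosen arbitrarily for illustration), a learned positional embedding is simply a trainable lookup table with one vector per position:

```python
import torch
import torch.nn as nn

max_len, d_model = 512, 64   # assumed maximum sequence length and embedding size

# Learned positional embeddings: one trainable vector per position,
# updated by backpropagation just like word embeddings.
position_embedding = nn.Embedding(max_len, d_model)

seq_len = 10
positions = torch.arange(seq_len)              # [0, 1, ..., 9]
pos_vectors = position_embedding(positions)    # shape: (10, 64)
print(pos_vectors.shape)
```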
Positional embeddings are added to the token embeddings before being fed into the transformer. Whether fixed or learned, these position vectors are summed with the word embeddings, which lets the model capture relationships between tokens both by meaning and by position.
The combined embeddings are then passed through a series of self-attention and feed-forward layers.
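A minimal end-to-end sketch of this step, assuming PyTorch and arbitrary example sizes (vocab_size, max_len, d_model, and the number of attention heads are all assumptions), could look like the following: the positional vectors are added element-wise to the token embeddings, and the sum is passed to a standard encoder layer.

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 10_000, 512, 64   # assumed example sizes

token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(max_len, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 4))            # e.g. "I love machine learning"
positions = torch.arange(token_ids.size(1)).unsqueeze(0)    # [[0, 1, 2, 3]]

# Positional embeddings are added to the token embeddings
# before the combined vectors enter the self-attention layers.
x = token_embedding(token_ids) + position_embedding(positions)
out = encoder_layer(x)                                       # shape: (1, 4, 64)
print(out.shape)
```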
In sinusoidal positional embeddings, each position in the sequence is represented by a vector of sine and cosine values with different frequencies, and these frequencies vary geometrically across the embedding dimensions. As a result, nearby positions receive similar embeddings, while positions far apart receive clearly different ones.
Let us consider an illustration with the sentence "I love machine learning" (the embedding values below are simplified for illustration).
Token      Position   Positional embedding
I          1          0.1, 0.2, 0.3, 0.4, 0.5
love       2          0.2, 0.4, 0.6, 0.8, 1.0
machine    3          0.4, 0.8, 1.2, 1.6, 2.0
learning   4          0.8, 1.6, 2.4, 3.2, 4.0
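For comparison, here is a minimal NumPy sketch of the standard sinusoidal scheme described above (the sin/cos formulation from the original Transformer paper; seq_len and d_model are just example parameters):

```python
import numpy as np

def sinusoidal_positional_embedding(seq_len, d_model):
    """Sin/cos positional embeddings:
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)     # wavelengths grow geometrically

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

print(sinusoidal_positional_embedding(4, 6).round(3))
```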
The benefits of positional embeddings include improved model performance on tasks such as machine translation and language modelling, better generalization, and more interpretable results.
However, positional embeddings add complexity to the model and increase its training time, and fixed-length learned embeddings do not extrapolate well to sequences longer than those seen during training. It is necessary to weigh these limitations before using them in a model.