Transformer-based models, or transformers, work with numbers and linear algebra rather than processing natural language directly. They therefore convert textual inputs into numerical representations, a process called embedding or encoding. The numerical representations produced from input text are called transformer embeddings.
Numerical representations produced by word2vec suffer from one major drawback: a lack of contextual information. These are static embeddings, and they pre-date transformers. Transformers overcome this issue by producing their own context-aware embeddings. Fixed word embeddings are augmented with positional information (the order in which the words occur in the input) and contextual information (how the words are used).
There are two mechanisms for doing this: the positional encoder and the self-attention blocks. The result is a more powerful vector representation of each word.
Transformers store the initial vector representation of each token in the weights of a linear layer. In a transformer, these are called learned embeddings. Though in practice they are similar to static embeddings, the different name emphasizes that these representations are only a starting point, not the end product.
The linear layer contains only weights and no biases; the bias of every neuron is zero.
The layer weights form a matrix of size V x d_model, where V is the vocabulary size (the number of unique words in the training data) and d_model is the number of embedding dimensions.
The original transformer model was proposed with a d_model size of 512 dimensions. In practice, we can use any reasonable value.
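As a concrete illustration, the sketch below (in PyTorch, with an assumed toy vocabulary of 10,000 words and the 512-dimensional d_model mentioned above) creates such an embedding layer; it is equivalent to a bias-free linear layer applied to one-hot vectors.

```python
import torch.nn as nn

# Assumed sizes for illustration: a toy vocabulary and d_model = 512.
V = 10_000        # vocabulary size
d_model = 512     # number of embedding dimensions

# The embedding layer stores a weight matrix of shape (V, d_model)
# and has no bias term.
embedding = nn.Embedding(num_embeddings=V, embedding_dim=d_model)
print(embedding.weight.shape)  # torch.Size([10000, 512])
```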
It is the training that distinguishes static and learned embeddings. Static embeddings are trained using the Skip-Gram or Continuous Bag of Words architectures. Learned embeddings are an integral part of the transformer and are trained using backpropagation.
Training Process for Learned Embeddings
The embedding layer (with weights for each neuron and zero bias) stores the learned embeddings.
The weights form a matrix of size V x d_model. The embedding of each word is stored along the rows: the first word in the first row, the second in the second row, and so on.
During training, the aim is to predict, for an input word, the next word in the sequence. This is called Next Token Prediction (NTP). Initially, the predictions are poor; they improve as the weights are updated to reduce the loss function over several iterations. The learned embeddings then become a strong vector representation of each word in the vocabulary.
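The sketch below shows, in a simplified and hypothetical form, how learned embeddings could be trained with a next-token-prediction objective; the toy sizes, random token IDs and single linear prediction head are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

V, d_model = 100, 32                  # assumed toy sizes
embedding = nn.Embedding(V, d_model)  # the learned embeddings
head = nn.Linear(d_model, V)          # scores every word in the vocabulary
params = list(embedding.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: each input token ID should predict the token ID that follows it.
inputs = torch.randint(0, V, (8,))
targets = torch.randint(0, V, (8,))

for step in range(100):
    logits = head(embedding(inputs))  # shape (8, V)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()                   # backpropagation updates the embedding weights
    optimizer.step()
```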
New input sequences are tokenized, and each token has an associated token ID that corresponds to the position of the token in the tokenizer's vocabulary. For example, the word 'cat' may have a token ID of 349.
Token IDs are turned into one-hot encoded vectors that extract the correct learned embeddings from the weights matrix.
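As a sketch of this lookup (with assumed sizes and a random weight matrix), multiplying a one-hot vector by the weight matrix selects exactly one row, which is the learned embedding of that token; in practice the row is simply read out directly.

```python
import torch
import torch.nn.functional as F

V, d_model = 10_000, 512           # assumed sizes
weights = torch.randn(V, d_model)  # stands in for the trained weight matrix

token_id = 349                     # e.g. the token ID for 'cat'
one_hot = F.one_hot(torch.tensor(token_id), num_classes=V).float()

# The one-hot multiplication picks out row 349 of the weight matrix,
# which is the same as indexing the row directly.
via_matmul = one_hot @ weights
via_lookup = weights[token_id]
assert torch.allclose(via_matmul, via_lookup)
```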
Once the learned embeddings have been trained, the weights in the embedding layer do not change.
To preserve the word order, positional encoding vectors are generated and added to the learned embeddings of each word.
The last step is to add contextual information using self-attention.
Vaswani’s original transformer model proposed the following positional encoding.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Each dimension of the positional encoding corresponds to a sinusoid.
The positional encodings shown above are deterministic and fixed. It is also possible to use learned positional encodings by randomly initializing them and training them with backpropagation.
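A small sketch of the sinusoidal formula above is shown below; the sequence length of 10 is an assumed value, and the resulting vectors are simply added to the learned embeddings.

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # Sinusoidal positional encodings: even dimensions use sin, odd use cos.
    pe = torch.zeros(max_len, d_model)
    positions = torch.arange(max_len).unsqueeze(1).float()        # (max_len, 1)
    div_term = 10000 ** (torch.arange(0, d_model, 2).float() / d_model)
    pe[:, 0::2] = torch.sin(positions / div_term)
    pe[:, 1::2] = torch.cos(positions / div_term)
    return pe

pe = positional_encoding(max_len=10, d_model=512)  # assumed sequence length
# embeddings_with_position = word_embeddings + pe
```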
The self-attention mechanism modifies the vector representations of words to capture the context of their usage in an input sequence.
The 'self' in self-attention indicates that it uses the surrounding words within a single sequence to provide context. All the words in the sequence are processed in parallel, which improves performance.
Another type of attention is cross-attention. Whereas self-attention operates within a single sequence, cross-attention compares each word in the output sequence to each word in the input sequence.
In self-attention, the similarity between words is calculated using the dot product. The similarity scores are then scaled, and attention weights are calculated using the softmax function. Lastly, the transformer embedding is computed as a weighted sum.
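The sketch below walks through these steps without any trainable parameters, using the position-aware embeddings themselves for the similarity computation and the weighted sum; the sequence length and d_model are assumed values.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 512            # assumed sizes
x = torch.randn(seq_len, d_model)    # position-aware embeddings of one sequence

scores = x @ x.T                     # dot-product similarity between every pair of words
scores = scores / d_model ** 0.5     # scale the similarity scores
weights = F.softmax(scores, dim=-1)  # attention weights sum to 1 across each row
context_embeddings = weights @ x     # weighted sum: context-aware embeddings
```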
A simple weighted sum contains no trainable parameters, but these can be introduced. The self-attention input is used three times to calculate the new embeddings: when the inputs are pre-multiplied by their respective trainable weight matrices, the results form the query, key and value matrices (Q, K and V). In the database analogy, a query is what you are searching for, keys are the attributes or columns being searched against, and values correspond to the actual data in the database.
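A minimal sketch of the same computation with trainable projections is given below; the names W_q, W_k and W_v are illustrative, and the sizes are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 512                                  # assumed size
W_q = nn.Linear(d_model, d_model, bias=False)  # trainable query projection
W_k = nn.Linear(d_model, d_model, bias=False)  # trainable key projection
W_v = nn.Linear(d_model, d_model, bias=False)  # trainable value projection

x = torch.randn(5, d_model)                    # position-aware embeddings of 5 tokens

Q, K, V = W_q(x), W_k(x), W_v(x)               # inputs multiplied by the weight matrices
scores = Q @ K.T / d_model ** 0.5              # scaled dot-product similarity
attention_weights = F.softmax(scores, dim=-1)
output = attention_weights @ V                 # context-aware transformer embeddings
```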
Self-attention is expanded to Multi-Head Attention in the original paper. We have covered it in a separate blog.