Consider the input sequence ‘The quick brown fox’. The sequence is first vectorised, and the resulting vectors are used to calculate attention weights. These weights produce a weighted sum of the encoder’s hidden states, which is then passed to the decoder. In this example the attention mechanism focuses on the words ‘quick’ and ‘brown’. Using the output of the attention mechanism, the decoder generates a probability distribution over the output vocabulary of possible next words, and the word ‘jumps’ is predicted. The decoder then repeats this process: the predicted word is appended to the sequence and used as input for the next iteration, and generation continues until the decoder predicts the end of the sequence.
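The generation loop described above can be sketched in a few lines of Python. This is a minimal illustration only: `encode`, `decoder_step`, and `eos_id` are hypothetical stand-ins for a real model, not any particular library’s API.

```python
# Minimal sketch of autoregressive decoding with an encoder-decoder model.
# `encode` and `decoder_step` are hypothetical stand-ins for real model components.

def generate(input_tokens, encode, decoder_step, eos_id, max_len=50):
    # Encode the input sequence (e.g. 'The quick brown fox') into hidden states.
    encoder_states = encode(input_tokens)

    output = []                                   # tokens decoded so far
    for _ in range(max_len):
        # The decoder attends over the encoder states and the tokens generated
        # so far, returning a probability distribution over the vocabulary.
        probs = decoder_step(encoder_states, output)

        next_token = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        if next_token == eos_id:                  # stop at end-of-sequence
            break
        output.append(next_token)                 # feed the prediction back in
    return output
```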
During training, the decoder can be subjected to masked language modelling: some words in the input sequence are masked out and the decoder learns to predict them. This encourages the model to use the context surrounding the current word when predicting the next word.
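Building a masked training example can be sketched as follows. This is a toy sketch, assuming a `[MASK]` placeholder and a fixed masking rate; real implementations work on token ids and use more elaborate masking strategies.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Randomly hide a fraction of tokens; return the masked sequence
    and the positions the model must learn to predict."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)   # hide the original word
            targets[i] = tok            # the model is trained to recover it
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
```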
Since Vaswani et al.’s paper on the attention mechanism, ‘Attention Is All You Need’ (2017), the transformer has become the standard model. It has since evolved into the decoder-only transformer, popularised around 2019, which drops the separate encoder and is widely used for text generation. The decoder-only transformer takes the previous words in the sequence as input and produces a sequence of hidden states, which are used to predict the next word based on the words that came before it. First, a score is computed for each token in the input sequence, based on how well that token matches the current state of the decoder. The tokens with the highest scores are then used to generate the next token in the output sequence. The most common scoring function is the dot product.
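The dot-product scoring step can be sketched with NumPy. This is an illustrative sketch only: `query` stands for the decoder’s current state and `keys` for the token representations; real models add scaling, learned projections, and multiple attention heads.

```python
import numpy as np

def dot_product_scores(query, keys):
    """Score each token representation against the decoder's current state."""
    scores = keys @ query                       # one dot product per token
    # Softmax turns raw scores into attention weights that sum to 1.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

query = np.array([0.2, 0.9, 0.4])               # current decoder state (toy values)
keys = np.random.rand(5, 3)                     # representations of 5 input tokens
weights = dot_product_scores(query, keys)       # highest weight = best-matching token
```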
Attention weights are calculated for the hidden states; they indicate how much attention is to be paid to each word in the sequence. The weights are then used to combine the hidden states into a single representation, which is used to predict the next word in the sequence.
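Continuing the sketch above, the weights combine the hidden states into one context vector, which is projected onto the vocabulary to score possible next words. The shapes, the projection matrix `W_vocab`, and the helper name are illustrative assumptions, not a particular library’s API.

```python
import numpy as np

def next_word_distribution(hidden_states, weights, W_vocab):
    # Weighted sum of hidden states -> single context representation.
    context = weights @ hidden_states            # shape: (hidden_dim,)
    # Project the context onto the vocabulary and normalise.
    logits = W_vocab @ context                   # one score per vocabulary word
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                       # probability of each next word

hidden_states = np.random.rand(5, 3)             # 5 tokens, hidden size 3 (toy values)
weights = np.array([0.1, 0.4, 0.3, 0.1, 0.1])    # attention weights from the previous step
W_vocab = np.random.rand(10, 3)                  # toy vocabulary of 10 words
probs = next_word_distribution(hidden_states, weights, W_vocab)
```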
LLMs apply deep learning techniques to vast amounts of text data, using it to learn the relationships between words and phrases. When a pre-trained model is then adapted to a specific task, this is called ‘transfer learning’. During training, the models process large swaths of text, learning its structure and meaning and identifying the relationships between words, which helps them understand context better.
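A common form of transfer learning is fine-tuning a pre-trained model on a downstream task. The sketch below uses the Hugging Face transformers library as one possible example; the model name and label count are arbitrary choices, and the fine-tuning loop itself (data loading, optimiser, epochs) is omitted.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Start from a model that has already learned general language structure...
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",   # pre-trained weights
    num_labels=2,          # ...and attach a fresh task-specific head (e.g. sentiment)
)

# From here, an ordinary training loop updates the weights on labelled
# task-specific data, reusing everything the model learned during pre-training.
```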
Vectorised models can use a distributed representation, in which words with similar meanings have similar representations and are close together in vector space.
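Closeness in vector space is usually measured with cosine similarity. The vectors below are made-up toy values purely for illustration; real embeddings have hundreds of dimensions and are learned from data.

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means identical direction; values near 0 mean unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional 'embeddings' (illustrative values only).
king  = np.array([0.80, 0.65, 0.10])
queen = np.array([0.78, 0.70, 0.12])
apple = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(king, queen))   # high: similar meanings, close in space
print(cosine_similarity(king, apple))   # lower: unrelated meanings
```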
A unigram model treats each word in a sentence independently. A bigram model estimates the probability of each word conditioned on the previous word, and a trigram model conditions on the two previous words. In general, an n-gram model considers the n-1 words of preceding context.
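As a concrete illustration, bigram probabilities can be estimated by counting word pairs in a corpus. The tiny corpus below is invented for the example; real n-gram models are trained on far more text and use smoothing to handle unseen pairs.

```python
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog the quick fox".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "quick"))   # 2 of the 3 occurrences of 'the' are followed by 'quick'
```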