Transformers

The transformer arrived in 2017 as a new architecture for sequence transduction models. A sequence model transforms an input sequence into an output sequence. The input sequence may consist of words, characters, tokens, bytes, numbers or phonemes, or it may be multi-modal.

Prior to the advent of transformers, sequence models were based on RNNs, LSTMs, gated recurrent units (GRUs) and CNNs. To account for context, they incorporated some form of attention mechanism.

The transformer relies entirely on the attention mechanism and does away with recurrence and convolutions.

Attention is used to focus on different parts of the input sequence at each step of generating the output.

The transformer allows parallelization (no sequential processing), which results in faster training without losing long-term dependencies.

The important components of a transformer are tokenization, the embedding layer, the attention mechanism, the encoder and the decoder. Each token is represented by an embedding, a vector that captures some notion of its meaning. An appropriate embedding dimension is chosen; it is the size of the vector representation of each token.

For a vocabulary size V and embedding dimension D, the embedding matrix has dimensions V x D; each row is the D-dimensional vector for one token.
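
As a rough sketch (the vocabulary size and embedding dimension below are illustrative, not values from any particular model), the embedding layer is simply a V x D matrix indexed by token id:

import numpy as np

V, D = 10000, 512                                  # vocabulary size and embedding dimension (illustrative)
embedding_matrix = np.random.randn(V, D) * 0.02    # in a real model these values are learned

token_ids = np.array([42, 7, 1337])                # token ids produced by the tokenizer
token_embeddings = embedding_matrix[token_ids]     # one D-dimensional row per token
print(token_embeddings.shape)                      # (3, 512)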

Positional encodings are added to these embeddings, since the transformer does not have a built-in sense of the order of tokens.
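
A minimal sketch of the sinusoidal positional encoding used in the original transformer paper; seq_len and d_model below are illustrative values:

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    positions = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
# the encoding is simply added to the token embeddings: x = token_embeddings + pe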

Attention Mechanism

In self-attention, each token in a sequence computes attention scores with every other token in the sequence, capturing relationships between all tokens irrespective of where they are placed.

Attention scores produce a new set of representations for each token, which are used in the next layer of processing. During training, the weight matrices are updated through backpropagation, so the model then accounts better for relationships between tokens.
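
A minimal sketch of single-head scaled dot-product self-attention; the matrices Wq, Wk and Wv stand in for the learned weight matrices mentioned above, and the sizes are illustrative:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, d_model) token representations
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # score between every pair of tokens
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V                               # new contextualized representation per token

d_model, d_head, seq_len = 512, 64, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)           # (5, 64)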

Multi-head attention is an extension of self-attention. Several heads compute attention scores in parallel, and the results are concatenated and linearly transformed. The resulting representation enhances the model's ability to capture complex relationships between tokens.
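
Continuing the same kind of sketch, multi-head attention runs several such heads in parallel, concatenates their outputs and applies one more learned projection (the head count and sizes below are illustrative):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads, Wo):
    # heads: list of (Wq, Wk, Wv) tuples, one per attention head
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(weights @ V)                   # each head computes its own attention scores
    return np.concatenate(outputs, axis=-1) @ Wo      # concatenate and transform back to d_model

d_model, n_heads = 512, 8
d_head = d_model // n_heads
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(x, heads, Wo).shape)       # (5, 512)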

Input embeddings with positional encodings are fed into the encoder, which consists of 6 layers. Each layer has 2 sub-layers: multi-head attention and a feed-forward network.
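
A simplified sketch of the encoder stack described above; single-head attention is used for brevity, and the residual connections and layer normalization of the real architecture are omitted:

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(x, p):
    # sub-layer 1: self-attention (single head here, multi-head in the real model)
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    x = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V
    # sub-layer 2: position-wise feed-forward network
    return np.maximum(0, x @ p["W1"]) @ p["W2"]

def encoder(x, layers):
    for p in layers:              # the original architecture stacks 6 such layers
        x = encoder_layer(x, p)
    return x

d = 64
rng = np.random.default_rng(0)
layers = [{k: rng.normal(size=(d, d)) for k in ("Wq", "Wk", "Wv", "W1", "W2")} for _ in range(6)]
print(encoder(rng.normal(size=(5, d)), layers).shape)    # (5, 64)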

The output of the encoder is a sequence of vectors that are contextualized representations of the inputs (after accounting for attention scores). These are fed to the decoder.

Output embeddings with positional encodings are fed into the decoder, which also contains 6 layers. The output embeddings go through masked multi-head attention, meaning embeddings from subsequent positions in the sequence are ignored when computing attention scores: when generating the current token (at position i), the model should ignore all output tokens at positions after i. In addition, the output embeddings are offset to the right by one position.
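
A small sketch of the causal mask used in masked multi-head attention: positions after the current one receive a score of minus infinity, so their attention weight becomes zero after the softmax (the sequence length is illustrative):

import numpy as np

seq_len = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))             # raw attention scores (Q K^T / sqrt(d))
mask = np.triu(np.ones((seq_len, seq_len)), k=1)          # 1s mark positions after the current one
masked_scores = np.where(mask == 1, -np.inf, scores)      # those positions cannot be attended to
print(np.round(masked_scores, 2))                         # upper triangle is -inf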

The second multi-head attention layer in the decoder takes in the contextualized representations of the inputs from the encoder before passing the result to the feed-forward network. This ensures the output representation captures the full context of the input tokens and the prior outputs.
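
A minimal sketch of this second (cross-) attention sub-layer: the queries come from the decoder's own representations, while the keys and values come from the encoder's output (all sizes are illustrative):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, Wq, Wk, Wv):
    Q = decoder_states @ Wq        # queries come from the decoder (prior outputs)
    K = encoder_outputs @ Wk       # keys and values come from the encoder (input context)
    V = encoder_outputs @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V             # each output position now attends over the full input

d, rng = 64, np.random.default_rng(0)
dec = rng.normal(size=(3, d))       # 3 target positions so far
enc = rng.normal(size=(7, d))       # 7 encoded input positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(dec, enc, Wq, Wk, Wv).shape)   # (3, 64)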

We want to figure out what the next token is, using the contextualized target representations.

The linear layer projects each of these vectors into a logits vector whose length equals the size of the model's vocabulary.

The linear layer contains a weight matrix which, when multiplied with the decoder output and added to a bias vector, produces a logits vector of size 1 x V, where V is the vocabulary size.

Each cell holds the score of a unique token. The softmax layer then normalizes this vector so that it sums to one, and each cell then represents the probability of its token. The highest-probability token is chosen as the predicted token.
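
A small sketch of the final projection and softmax for a single position; d_model and the vocabulary size V are illustrative:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_model, V = 512, 10000
rng = np.random.default_rng(0)
decoder_output = rng.normal(size=(1, d_model))    # contextualized representation for one position
W = rng.normal(size=(d_model, V))                 # linear layer weight matrix
b = np.zeros(V)                                   # bias vector

logits = decoder_output @ W + b                   # shape (1, V): one score per vocabulary token
probs = softmax(logits[0])                        # normalized so the whole vector sums to one
predicted_token = int(np.argmax(probs))           # highest-probability token is the prediction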

While training the model, the predicted token probabilities are compared with the actual token probabilities. We calculate a loss for each token prediction and average this loss over the entire target sequence. This loss is backpropagated through all of the model's parameters to calculate gradients, and the parameters are updated.
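
A minimal sketch of this training loss, assuming the standard cross-entropy between the predicted distribution and the actual token, averaged over the target sequence (all shapes are illustrative):

import numpy as np

def cross_entropy(probs, target_ids):
    # probs: (seq_len, V) predicted probabilities; target_ids: (seq_len,) actual tokens
    per_token_loss = -np.log(probs[np.arange(len(target_ids)), target_ids])
    return per_token_loss.mean()                  # average over the entire target sequence

V, seq_len = 10000, 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(seq_len, V))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
target_ids = rng.integers(0, V, size=seq_len)
print(cross_entropy(probs, target_ids))           # this scalar loss is what gets backpropagated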

The GPT architecture was introduced by OpenAI in 2018. GPTs do not contain an encoder stack in their architecture; they are designed to focus on generative capabilities. They are trained on a large corpus of text, learning the relationships between words and tokens in an unsupervised manner.
