Transformers are the pre-trained backbone of GPT and other such large language models. They undergo semi-supervised learning: unsupervised pre-training followed by supervised fine-tuning. The pre-training dataset is much larger than the one used for fine-tuning.
Transformers use an attention mechanism that processes all tokens simultaneously, calculating attention weights between them in successive layers. Because the attention mechanism only uses information about other tokens from the lower layers, it can be computed for all tokens in parallel, which improves training speed.
Adaptive optimizers such as Adam are used to train transformers. Adam recursively estimates the momentum and the learning rate separately for each parameter at each step. In practice, large batch sizes are used; batches of more than 1,000 are usually employed.
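A minimal sketch of this setup, assuming PyTorch; the toy model, learning rate, and batch size are illustrative choices, not prescribed values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy model and data; only the optimizer setup matters here.
model = nn.Linear(128, 10)
data = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))

# Adam keeps running estimates of the first and second moments of each
# parameter's gradient, giving every parameter its own effective step size.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))

# Large batches (here 1024) are common when training transformers.
loader = DataLoader(data, batch_size=1024, shuffle=True)
loss_fn = nn.CrossEntropyLoss()

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```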
Semi-Supervised Learning
This is a machine learning (ML) technique that falls between supervised and unsupervised learning. It uses a small amount of labelled data, together with a larger amount of unlabelled data, to train a model. The aim is to learn a function that can adequately predict the output variable from the input data.
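A minimal sketch of the idea, assuming scikit-learn: unlabelled points are marked with the label -1, and the model propagates labels from the few labelled points to the rest.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

# Toy dataset: 200 points, but only 10 of them keep their labels.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
y_partial = np.full_like(y, -1)          # -1 marks unlabelled samples
labelled_idx = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y_partial[labelled_idx] = y[labelled_idx]

# Label propagation spreads the few known labels through the data.
model = LabelPropagation()
model.fit(X, y_partial)
print("accuracy on all points:", model.score(X, y))
```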
Unsupervised Learning
These are algorithms that learn patterns from unlabelled data. Three main tasks are performed: clustering, association, and dimensionality reduction.
Clustering is a data mining technique that groups unlabelled data based on their similarities or differences. Clustering algorithms process raw, unclassified data objects into groups represented by structures or patterns in the data.
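A minimal clustering sketch, assuming scikit-learn; k-means groups unlabelled points by similarity to learned cluster centres.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means assigns each point to the nearest of k learned cluster centres.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels[:10], kmeans.cluster_centers_.shape)
```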
Association is also a data mining technique; it discovers the probability of co-occurrence of items in a collection. It is used to find patterns in the data that can be expressed as rules. Association rules are if-then statements that show probability relationships between data items within large data sets.
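A minimal sketch of an if-then rule computed by hand (no library assumed): support and confidence estimate how often items co-occur across a collection of transactions.

```python
# Toy market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

# Rule under consideration: IF bread THEN butter.
n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n

# Confidence is the conditional probability P(butter | bread).
confidence = support_both / support_bread
print(f"support={support_both:.2f}, confidence={confidence:.2f}")
```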
Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. Feature selection chooses a subset of relevant features for use in model construction. Feature extraction transforms data from a high-dimensional space into a space of fewer dimensions.
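A minimal feature-extraction sketch, assuming scikit-learn: PCA projects high-dimensional data onto a small set of principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

# 500 samples in a 64-dimensional space.
X = np.random.RandomState(0).randn(500, 64)

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)
print(X_low.shape, pca.explained_variance_ratio_)
```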
Training the transformer involves passing the input sequence through the transformer encoder and generating the output sequence with the decoder block. During training, the model is optimized using backpropagation and a loss function such as cross-entropy.
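A minimal sketch of one optimization step, assuming PyTorch and its built-in `nn.Transformer`; the vocabulary size, shapes, and shifting of the target sequence are illustrative.

```python
import torch
from torch import nn

vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
head = nn.Linear(d_model, vocab_size)
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(transformer.parameters()) + list(head.parameters()),
    lr=1e-4,
)
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, vocab_size, (8, 20))   # encoder input
tgt = torch.randint(0, vocab_size, (8, 21))   # decoder input + next-token targets

# The encoder reads src; the decoder generates the output sequence one step ahead.
out = transformer(embed(src), embed(tgt[:, :-1]))
logits = head(out)

# Cross-entropy between predicted distributions and the shifted targets,
# followed by backpropagation through the whole model.
loss = loss_fn(logits.reshape(-1, vocab_size), tgt[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```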
The attention mechanism tells us to process the information that is important and ignore the rest. Attention is a similarity(q, k), where q represents a query and k represents a key; the key that best matches the query should be detected. It is analogous to accessing a database.
In full, Attention(Q, K, V) = softmax(QKᵀ / √d)·V, where √d is the square root of the dimensionality of the key vectors. The softmax function turns the scaled similarity scores into a probability distribution, telling the algorithm where to pay attention. If an attention score is 0, no attention is paid to that part; if it is 1, we care fully about that pixel or one-hot vector. Having a probability distribution makes the computation easier.
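A minimal sketch of this formula, assuming PyTorch; shapes are illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # so the softmax does not saturate for large dimensionalities.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)

    # Softmax turns the scores into a probability distribution over keys:
    # weights near 0 mean "ignore", weights near 1 mean "attend fully".
    weights = torch.softmax(scores, dim=-1)

    # The output is a weighted sum of the values.
    return weights @ v, weights

q = torch.randn(2, 5, 16)   # (batch, queries, d_k)
k = torch.randn(2, 7, 16)   # (batch, keys, d_k)
v = torch.randn(2, 7, 32)   # (batch, keys, d_v)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)   # torch.Size([2, 5, 32]) torch.Size([2, 5, 7])
```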
In vision transformers, we need a few more components, even though Vaswani et al. contend that 'Attention Is All You Need.'
We start at the bottom of the architecture and move up. First there are input embeddings, which require some experimentation: take the data and represent it in a different way. You can do this with patching, convolutions, linear networks, or something else. The embedded data are referred to as tokens.
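A minimal sketch of one way to tokenize images for a vision transformer, assuming PyTorch: non-overlapping patches are embedded with a strided convolution.

```python
import torch
from torch import nn

# Split a 32x32 RGB image into 4x4 patches and embed each patch as a
# 64-dimensional token using a convolution with stride equal to the patch size.
patch_embed = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=4, stride=4)

images = torch.randn(8, 3, 32, 32)
tokens = patch_embed(images)                 # (8, 64, 8, 8)
tokens = tokens.flatten(2).transpose(1, 2)   # (8, 64 tokens, 64 dims)
print(tokens.shape)
```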
After this there is the positional embedding (PE), a way to express how the data line up in a positional relationship. It is necessary because we deal with a sequence of data; if we do not add information about where each element lies within the sequence, it is difficult to learn the relationships in that sequence.
PE is commonly done in terms of sine and cosine.
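A minimal sketch of the sine/cosine construction, assuming PyTorch: even embedding dimensions use sine and odd dimensions use cosine of the position, scaled by different frequencies.

```python
import math
import torch

def sinusoidal_positional_embedding(seq_len, d_model):
    # position: (seq_len, 1), frequencies: (d_model / 2,)
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

tokens = torch.randn(8, 64, 64)                            # (batch, tokens, dims)
tokens = tokens + sinusoidal_positional_embedding(64, 64)  # broadcast over batch
print(tokens.shape)
```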
Multi-headed Attention
Attention shows relations between pairs of data. With two heads of attention, it can show relationships among pairs of pairs of data; with three heads, pairs of pairs of pairs; and so on. The number of heads is a hyper-parameter. In code, the attention layer is replaced by a multi-head attention (MHA) layer.
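A minimal sketch using PyTorch's built-in multi-head attention layer; the number of heads is the hyper-parameter mentioned above, and the embedding dimension must divide evenly by it.

```python
import torch
from torch import nn

# Four heads over 64-dimensional tokens (64 / 4 = 16 dims per head).
mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

tokens = torch.randn(8, 64, 64)   # (batch, tokens, dims)
# Self-attention: the same tokens serve as queries, keys, and values.
out, attn_weights = mha(tokens, tokens, tokens)
print(out.shape, attn_weights.shape)   # (8, 64, 64) and (8, 64, 64)
```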
The attention output is then added to its input and normalized. It is sent through a feed-forward network, added back through a residual connection, and normalized again. This block is repeated.
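A minimal sketch of one such encoder block combining the pieces above (multi-head attention, residual connections, layer normalization, and a feed-forward network), assuming PyTorch; the sizes are illustrative.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention output is added to its input (residual) and normalized.
        attn_out, _ = self.mha(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward output is added through another residual and normalized.
        x = self.norm2(x + self.ff(x))
        return x

tokens = torch.randn(8, 64, 64)
block = EncoderBlock()
print(block(tokens).shape)   # torch.Size([8, 64, 64])

# In a full transformer, this block is repeated (stacked) several times.
```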