During pre-training, LLMs are trained on a massive amount of unlabelled text data, e.g. web pages, articles and books. This is unsupervised training: the idea is to make the model learn the statistical patterns and structure of language.
The most common pre-training task is next-word prediction. Here the LLM is given a sequence of words and is asked to predict the next word in the sequence. This teaches the LLM the relationships between words and how they are used in different contexts. Alternatively, certain words in the sequence are masked, and the LLM has to predict the masked words. A toy illustration of both objectives is sketched below.
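As a rough sketch (the sentence and variable names are made up for illustration), here is how a single sentence can be turned into next-word training pairs and into a masked-word example:

```python
# Toy illustration of the two pre-training objectives described above.
sentence = "the cat sat on the mat".split()

# Next-word prediction: each prefix of the sentence becomes an input,
# and the word that follows it becomes the target to predict.
next_word_pairs = [
    (sentence[:i], sentence[i]) for i in range(1, len(sentence))
]
# e.g. (["the", "cat", "sat"], "on")

# Masked prediction: hide one word and ask the model to recover it.
masked_input = sentence.copy()
masked_input[2] = "[MASK]"      # "the cat [MASK] on the mat"
masked_target = sentence[2]     # "sat"

print(next_word_pairs[2], masked_input, masked_target)
```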
After pre-training, the LLM is fine-tuned for a specific task, e.g. translation from one language to another or answering questions. Here labelled data specific to the task is used to train the LLM. This is supervised learning, in which the LLM learns task-specific patterns and relationships between words.
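A fine-tuning dataset is simply a collection of labelled examples for the target task. The field names below are illustrative assumptions, not a fixed schema of any particular library:

```python
# Minimal sketch of task-specific labelled data used for fine-tuning.
translation_data = [
    {"source": "Good morning", "target": "Bonjour"},
    {"source": "Thank you",    "target": "Merci"},
]

qa_data = [
    {"question": "What is the capital of France?", "answer": "Paris"},
]

# During fine-tuning, each example is passed through the model and the
# loss is computed against the labelled target, just as in pre-training,
# but on this smaller, task-specific dataset.
```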
Both pre-training and fine-tuning rely on the forward pass and backpropagation. In the forward pass, the input data is fed into the LLM and the output is computed: the data passes through layers of neurons, with weights and an activation function applied at each layer. The output of the forward pass is the LLM's prediction.
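A minimal sketch of a forward pass, assuming a tiny two-layer network rather than a real LLM (the shapes and the ReLU/softmax choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))                        # one input example with 8 features

W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)    # layer 1 weights
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)     # layer 2 weights

h = np.maximum(0, x @ W1 + b1)                     # weights + ReLU activation
logits = h @ W2 + b2                               # raw scores for 4 output classes

# Softmax turns the scores into a probability distribution: the prediction.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs)
```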
Backpropagation is the process of using the error between the LLM's prediction and the true label to adjust the LLM's weights. This is done by computing the gradient of the error with respect to the weights, and then using that gradient to update the weights in a way that reduces the error.
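For a single weight and a squared-error loss, one gradient step looks like the sketch below (the numbers are made up for illustration):

```python
# One gradient step for a single weight with a squared-error loss.
w = 0.5                    # current weight
x, y_true = 2.0, 3.0       # input and true label

y_pred = w * x                         # forward pass: prediction
loss = (y_pred - y_true) ** 2          # error between prediction and label

grad = 2 * (y_pred - y_true) * x       # dLoss/dw, via the chain rule
learning_rate = 0.1
w = w - learning_rate * grad           # update the weight to reduce the loss

print(loss, grad, w)
```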
Both the forward pass and backpropagation are essential to an LLM's training. The forward pass allows it to make predictions, and backpropagation allows it to learn from its mistakes and improve its predictions over time. In other words, its accuracy improves.
This training process is iterative. In each iteration the LLM receives a batch of data, the forward pass and backpropagation are applied, and the weights are updated in light of the backpropagation results. Then the next iteration begins, as sketched below.
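The loop below is a sketch of that cycle, assuming a toy one-weight linear model and synthetic data rather than a real LLM:

```python
import numpy as np

# Iterative training: for each batch, run the forward pass, compute the
# loss, backpropagate, then update the weight. Data is synthetic (y = 3x).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

w, learning_rate, batch_size = 0.0, 0.1, 10

for epoch in range(5):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size, 0]
        yb = y[start:start + batch_size]

        y_pred = w * xb                          # forward pass
        loss = np.mean((y_pred - yb) ** 2)       # loss on this batch
        grad = np.mean(2 * (y_pred - yb) * xb)   # backpropagation (chain rule)
        w -= learning_rate * grad                # weight update

    print(f"epoch {epoch}: loss={loss:.4f}, w={w:.3f}")   # w approaches 3.0
```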
First, the training data is prepared: the data is cleaned, noise is removed and the text is tokenized. The model weights are initialized, either randomly or by using pretrained weights from another model. A batch of data is then fed to the model. In the forward pass it flows through the layers of neurons, with the LLM's weights and activation functions applied at each layer; the output of the forward pass is the LLM's prediction for the input data. At this stage, the loss between the LLM's prediction and the true label (the desired output) is calculated. This loss indicates how bad the LLM's prediction was.
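The preparation steps can be sketched as below; the cleaning rules, toy vocabulary and embedding size are assumptions for illustration, not how any specific LLM tokenizer works:

```python
import re
import numpy as np

raw_text = "The   cat sat on the mat!!  <html>noise</html>"

# Cleaning: strip markup, lowercase, drop stray punctuation.
cleaned = re.sub(r"<[^>]+>", " ", raw_text)
cleaned = re.sub(r"[^a-z ]", " ", cleaned.lower())
tokens = cleaned.split()                      # ["the", "cat", "sat", ...]

# Tokenization: map each word to an integer id.
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
token_ids = [vocab[word] for word in tokens]

# Weight initialization: small random values (or weights copied from a
# pretrained model, the other option mentioned above).
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.02, size=(len(vocab), 16))

print(token_ids, embedding.shape)
```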
In backpropagation, the gradient of the loss with respect to the LLM's weights is computed, and that gradient is used to update the weights so as to reduce the loss.
The cycle of forward pass, loss calculation and backpropagation is repeated until the LLM has learned to make accurate predictions on the training data. Later, the model is fine-tuned for specific tasks on a smaller amount of task-specific data.
This process is continuous, since predictions are constantly updated as new input data arrives. In subsequent steps, the LLM uses the word it has just predicted, together with the current context, to generate the next prediction. It learns long-range dependencies in the language and makes more accurate predictions. This iterative prediction is sketched below.
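In the sketch below, `predict_next_word` is a hypothetical stand-in for a trained model, not a real API; the point is only that each predicted word is appended to the context and used for the next prediction:

```python
def predict_next_word(context):
    # Toy stand-in: a real LLM would score every word in its vocabulary
    # given the full context and return the most likely next word.
    canned = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return canned.get(context[-1], "mat")

context = ["the"]
for _ in range(5):
    next_word = predict_next_word(context)   # prediction from current context
    context.append(next_word)                # prediction becomes part of the context

print(" ".join(context))                     # the toy output cycles through the canned phrase
```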
Geoffrey Hinton, a British computer scientist, is known for his work on developing backpropagation. Yann LeCun, a French computer scientist, pioneered CNNs, which are well suited for image recognition and have also been applied to NLP tasks such as machine translation and question answering. Yoshua Bengio, a Canadian scientist, made significant contributions to training neural networks. Nitish Srivastava is known for the regularization technique called dropout, which prevents overfitting. Tomáš Mikolov, a Czech scientist, is known for his work on word2vec. Kyunghyun Cho, a Korean scientist working at New York University, is known for his work on RNNs, which are well suited for sequential data and are used for NLP tasks including machine translation and speech recognition.