LLM Parameters and Learning

We often hear that an LLM has a billion or even a trillion-plus parameters. These parameters are numerical values, namely weights and biases, which determine the connections and activations of the neurons in the model. The more parameters a model has, the more expressive and effective it can be. However, a larger model also requires more data and computational resources for training.

OpenAI’s GPT-3, for example, has 175 billion parameters, and Google’s PaLM has 540 billion; the parameter counts of GPT-4 and PaLM 2 have not been officially disclosed.
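To make the idea of a parameter count concrete, here is a minimal sketch that counts the weights and biases of a single fully connected layer; the layer sizes are illustrative and not taken from any particular model:

```python
# A fully connected layer mapping d_in inputs to d_out outputs has
# one weight per input-output pair plus one bias per output.
d_in, d_out = 4096, 4096           # illustrative sizes, not from a real model

weights = d_in * d_out             # 16,777,216 weights
biases = d_out                     # 4,096 biases
parameters = weights + biases

print(f"One {d_in}x{d_out} layer alone has {parameters:,} parameters")
# A billion-parameter LLM stacks many such layers (plus attention and
# embedding matrices), which is how the totals grow so large.
```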

Weights are assigned based on the relevance and context of words. Words are first converted into word embeddings, numerical representations of words as vectors, and weights are then applied to these vectors. An LLM learns these word embeddings from massive data sources: the internet, books, news articles and so on.
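A minimal sketch of what an embedding lookup looks like; the vocabulary, embedding size and values below are made up for illustration, whereas in a real LLM the table is learned during training and has tens of thousands of rows:

```python
import numpy as np

# Toy vocabulary and a randomly initialised embedding table.
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_dim = 4
embedding_table = np.random.randn(len(vocab), embedding_dim)

def embed(word: str) -> np.ndarray:
    """Map a word to its vector by a simple row lookup."""
    return embedding_table[vocab[word]]

sentence = ["the", "cat", "sat"]
vectors = np.stack([embed(w) for w in sentence])  # shape (3, 4)
print(vectors)
```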

LLMs use an attention mechanism, which computes weights for these vectors using multiple attention heads. Each attention head has its own relevance criterion. Attention lets the LLM focus on the most important words in both the input and output sequences, capturing long-range dependencies and complex relationships between words.
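A minimal sketch of single-head scaled dot-product attention (the multi-head version simply runs several of these in parallel and concatenates the results); the query, key and value matrices here are random placeholders:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights from queries and keys, then use them
    to take a weighted average of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V, weights

seq_len, d_k = 3, 4
Q = np.random.randn(seq_len, d_k)   # queries (placeholder values)
K = np.random.randn(seq_len, d_k)   # keys
V = np.random.randn(seq_len, d_k)   # values
output, attn = scaled_dot_product_attention(Q, K, V)
print(attn)   # each row sums to 1: how much each word attends to the others
```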

The attention weights give us a weighted average of the word embeddings. This weighted average is fed into a feedforward neural network. The output can be a probability distribution over the next word in the sequence, or a vector representation of the input sequence that can be used for other NLP tasks.
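Continuing the sketch above, the attention output for a position can be passed through a small feedforward network and a softmax to obtain next-word probabilities; the sizes and weights are again placeholders:

```python
import numpy as np

vocab_size, d_model, d_ff = 10, 4, 8   # toy sizes, not from a real model
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, vocab_size), np.zeros(vocab_size)

def feedforward_next_word(x):
    """Feedforward layer followed by a softmax over the vocabulary."""
    hidden = np.maximum(0, x @ W1 + b1)        # ReLU activation
    logits = hidden @ W2 + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()                 # probability of each next word

x = np.random.randn(d_model)                   # e.g. the attention output for one position
print(feedforward_next_word(x))                # sums to 1
```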

A backward pass in a neural network is used to adjust the weights of the network so as to minimize the loss function. It is also called backpropagation, because it propagates the error from the output layer back to the input layer. Along the way, the weights are adjusted using gradient descent.

In the training of an LLM, the first step is a forward pass; the backward pass is the second step. The forward pass computes an output from the input data using the currently assigned weights. Its outcome is a prediction, which is compared to the desired target value to calculate the loss.
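A minimal sketch of a forward pass and loss calculation for a single linear neuron; the numbers are arbitrary and stand in for a much larger network:

```python
# Forward pass: compute a prediction with the current weights,
# then compare it to the target to get the loss.
w, b = 0.5, 0.1          # current weights (arbitrary starting values)
x, target = 2.0, 3.0     # one training example

prediction = w * x + b               # forward pass
loss = (prediction - target) ** 2    # squared-error loss
print(prediction, loss)              # 1.1, 3.61
```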

The backward pass adjusts the weights to reduce the loss. To do so, it has to assess how much each weight contributes to the loss. It uses the chain rule to compute the derivative of a complex function by breaking it into simpler functions. The derivative indicates how a small change in one variable affects another.
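A small worked example of the chain rule on the single-neuron loss above: the derivative of the loss with respect to the weight is built from the derivatives of the simpler pieces.

```python
# Loss as a composition of simple functions:
#   prediction = w * x + b
#   loss       = (prediction - target) ** 2
# Chain rule:  dloss/dw = dloss/dprediction * dprediction/dw
w, b = 0.5, 0.1
x, target = 2.0, 3.0

prediction = w * x + b
dloss_dprediction = 2 * (prediction - target)   # derivative of the squared error
dprediction_dw = x                              # derivative of w*x + b w.r.t. w
dloss_dw = dloss_dprediction * dprediction_dw   # chain rule

print(dloss_dw)   # -7.6: increasing w slightly would decrease the loss
```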

The backward pass moves from the output layer to the input layer. Each weight is adjusted by subtracting a fraction of its derivative from its current value. This fraction is known as the learning rate, and it controls how fast or slow the network learns.
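Continuing the same toy example, the gradient-descent update subtracts the learning rate times the derivative from the weight; the learning rate of 0.05 is just an illustrative choice:

```python
learning_rate = 0.05        # illustrative value; real training tunes this carefully

w, dloss_dw = 0.5, -7.6     # current weight and its derivative from the chain rule
w = w - learning_rate * dloss_dw
print(w)                    # 0.88: the weight moves in the direction that reduces the loss
```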

The forward and backward passes are repeated for each batch of data until the network reaches a satisfactory level of performance or a maximum number of iterations.
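Putting the pieces together, a minimal training loop for the single-neuron example; the data, learning rate and stopping criteria are all placeholders:

```python
# Repeated forward and backward passes over small batches of data.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy (x, target) pairs: target = 2x
w, b = 0.0, 0.0
learning_rate = 0.05
max_iterations = 500
loss_threshold = 1e-6

for step in range(max_iterations):
    total_loss = 0.0
    for x, target in data:
        prediction = w * x + b                   # forward pass
        error = prediction - target
        total_loss += error ** 2
        w -= learning_rate * 2 * error * x       # backward pass: chain rule + update
        b -= learning_rate * 2 * error
    if total_loss < loss_threshold:              # satisfactory performance reached
        break

print(w, b)   # w approaches 2.0, b approaches 0.0
```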

The Vaswani et al. (2017) paper "Attention Is All You Need" led researchers to adopt the Transformer model for NLP, leaving behind the older RNNs and CNNs. The Transformer processes sequences in parallel, scales well, and can capture long-range dependencies and context. The paper introduced scaled dot-product attention and multi-head attention. It injects positional information into the word embeddings to preserve the order of words in a sequence; this is called positional encoding and uses a sinusoidal position representation. It also introduced the concepts of self-attention and cross-attention, along with layer normalization and residual connections.
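A minimal sketch of the sinusoidal positional encoding described in the paper: each position gets a vector of sines and cosines at different frequencies, which is added to the word embedding. The sequence length and model size below are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimensions
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 8, 16          # illustrative sizes
pe = sinusoidal_positional_encoding(seq_len, d_model)
# word_embeddings + pe preserves word-order information
print(pe.shape)                   # (8, 16)
```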
