By the parameters of a neural network, we mean the internal values, the weights and biases, that the model uses to make predictions or generate text. They are learned during training, and they determine how each layer transforms input data (words or tokens) into output (the next word or sentence).
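As an illustration, here is a minimal sketch in PyTorch (the layer sizes are arbitrary, chosen only for the example) showing that a single layer's parameters are simply its weight matrix and bias vector:

```python
import torch.nn as nn

# One fully connected layer: its parameters are a weight matrix and a bias vector.
layer = nn.Linear(in_features=512, out_features=1024)

# weight: 1024 x 512 values, bias: 1024 values
n_params = sum(p.numel() for p in layer.parameters())
print(n_params)  # 1024*512 + 1024 = 525,312 parameters
```

A full model is just many such blocks, so its parameter count is the sum over all of its weights and biases.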
The number of parameters is a rough measure of a model's size and capacity. More parameters generally mean a better ability to capture language patterns, greater memorization, and stronger generalization. They also mean greater demands on computational power and hardware.
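To make the hardware cost concrete, here is a rough back-of-the-envelope calculation (the 7-billion-parameter figure is only an illustrative assumption) of how much memory the weights alone would need at common numeric precisions:

```python
# Rough, illustrative arithmetic: memory needed just to store the weights
# of a hypothetical 7-billion-parameter model at different precisions.
params = 7e9
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# fp32: 28 GB, fp16: 14 GB, int8: 7 GB (weights only; activations and
# optimizer state during training add considerably more)
```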
A small language model has on the order of 100 million parameters (e.g. GPT-2). A medium model has roughly 1-6 billion parameters and offers balanced performance. A large language model has tens of billions of parameters or more and is correspondingly more powerful. Frontier models such as GPT-4, Gemini 1.5, and Claude 3 Opus are reported to be in the range of 100 billion to 1 trillion parameters.
In transformer models, words are first turned into vectors (embeddings) and then passed through stacked layers of computation. Each layer is typically a combination of attention, feed-forward, and normalization blocks and contains millions to billions of weights. These parameters are tuned during training over trillions of words so that the model captures grammar, logic, facts, and more. Small models have only a few such layers; large models stack many of them.
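A minimal sketch, using PyTorch's built-in TransformerEncoderLayer as a stand-in for one such layer (the dimensions are illustrative), shows how the attention, feed-forward, and normalization blocks each contribute parameters:

```python
import torch.nn as nn

# One transformer layer: attention + feed-forward + normalization blocks.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)

# List every named parameter tensor and its shape.
for name, p in block.named_parameters():
    print(name, tuple(p.shape))

total = sum(p.numel() for p in block.parameters())
print(f"~{total / 1e6:.1f}M parameters in a single layer")
```

Stacking a dozen layers of roughly this size, plus the token embeddings, already lands near the 100-million-parameter range of the small models mentioned above.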
Parameters are learned from a training dataset that may contain billions of sentences. When the model makes a prediction, a loss function measures how wrong it is, and backpropagation adjusts the weights to reduce that error. This process updates each parameter a little at a time across millions of iterations, gradually making the predictions better. Inference is the later use of the trained parameters to predict or generate text.
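A minimal sketch of one such training iteration, again in PyTorch and with a tiny stand-in model in place of a real language model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)                      # stand-in for a language model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(32, 10)                  # a batch of toy inputs
targets = torch.randint(0, 5, (32,))          # the "correct" outputs

logits = model(inputs)                        # forward pass: make predictions
loss = loss_fn(logits, targets)               # loss function: how wrong were they?
loss.backward()                               # backpropagation: compute gradients
optimizer.step()                              # adjust each weight a little
optimizer.zero_grad()                         # reset gradients for the next batch
```

Real training repeats this loop over millions of batches; inference runs only the forward pass with the learned weights frozen.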