Large language models now dominate natural language processing, yet researchers argue that it is still worth building smaller models, trained on massive amounts of data, because smaller models consume far less compute. For example, the LLaMA model with 7 billion parameters, trained on 1 trillion tokens, produces results superior to GPT-3 even though it is roughly 25 times smaller.
LLMs are compressed so that they fit on devices such as laptops, tablets and mobile phones, and they must do so without diluting their generative ability. Because LLMs generate text sequentially, even small per-token errors can compound and degrade the output. Quantization shrinks the model: 3- to 4-bit quantization techniques are now applied to models with 1 to 10 billion parameters, so that instead of storing 16-bit weights the model stores low-bit weights, provided this does not hurt accuracy.
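To make the trade-off concrete, here is a minimal sketch (plain NumPy, not any particular library's API) of round-to-nearest uniform quantization: fewer bits shrink storage but increase the reconstruction error on the weights.

```python
# Minimal sketch of round-to-nearest uniform quantization (illustrative only).
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Quantize a float weight vector to 2**bits levels, then map it back to floats."""
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels          # step size between quantized levels
    q = np.round((weights - w_min) / scale)   # integer codes in [0, levels]
    return q * scale + w_min                  # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight row
for bits in (16, 4, 3):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit: mean abs error = {err:.6f}")
```

Running this shows the mean reconstruction error growing as the bit width drops from 16 to 3, which is exactly the accuracy loss that careful quantization schemes try to keep negligible.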
Sparse-Quantized Representation (SpQR) addresses this problem. It compresses a pretrained LLM to 3-4 bits per parameter almost losslessly, with an end-to-end accuracy error of less than 1%.
SpQR starts by locating outlier weights, which would cause large errors if quantized, and stores them in high precision. The remaining weights are stored in a lower-bit format (say 3 bits). In addition, SpQR uses group quantization: small groups of 16 contiguous elements share a scale and are represented in a 3-bit format. A simplified sketch of this idea follows.
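The sketch below is an assumed, simplified illustration rather than the authors' implementation: contiguous groups of 16 weights are quantized to 3 bits with a per-group scale, and the weights with the largest quantization error are kept in full precision as a sparse set of outliers. (SpQR itself selects outliers with a sensitivity measure and also compresses the per-group scales; those details are omitted here.)

```python
# Simplified, assumed sketch of outlier isolation plus small-group 3-bit quantization.
import numpy as np

GROUP_SIZE = 16
BITS = 3
OUTLIER_FRACTION = 0.01  # assumed: keep the worst ~1% of weights in full precision

def quantize_group(g: np.ndarray) -> np.ndarray:
    """Quantize one group of contiguous weights to 3 bits and dequantize it."""
    levels = 2 ** BITS - 1
    g_min, g_max = g.min(), g.max()
    scale = (g_max - g_min) / levels
    if scale == 0:                            # flat group: nothing to quantize
        return g.copy()
    q = np.round((g - g_min) / scale)
    return q * scale + g_min

def spqr_like(weights: np.ndarray):
    # Quantize each group of 16 contiguous weights with its own scale.
    approx = np.concatenate([
        quantize_group(weights[i:i + GROUP_SIZE])
        for i in range(0, len(weights), GROUP_SIZE)
    ])
    # Keep the highest-error weights in full precision as sparse outliers.
    error = np.abs(weights - approx)
    k = max(1, int(OUTLIER_FRACTION * len(weights)))
    outlier_idx = np.argsort(error)[-k:]
    approx[outlier_idx] = weights[outlier_idx]
    return approx, outlier_idx

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
w[rng.choice(4096, 8, replace=False)] *= 50   # inject a few large outlier weights
approx, outliers = spqr_like(w)
print("mean abs error:", np.abs(w - approx).mean(), "| outliers kept:", len(outliers))
```

The point of the design is that the sparse outlier set is tiny, so almost all storage is 3-bit, while the few weights that would otherwise dominate the quantization error stay exact.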
The LLM is converted into the SpQR format through a post-training quantization (PTQ) approach, i.e. without retraining the model. The quantized model can then run on a single 24 GB GPU without any deterioration in performance.
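As a rough, illustrative calculation of why this fits (the 33-billion-parameter scale is the one highlighted in the SpQR paper; the overhead of scales and outliers is ignored here), weight storage drops from tens of gigabytes at 16 bits to well under 24 GB at roughly 4 bits per parameter:

```python
# Back-of-the-envelope weight-memory estimate (illustrative arithmetic only).
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given parameter count and bit width."""
    return params_billion * 1e9 * bits_per_param / 8 / 1024**3

for params in (7, 33):
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4.0)   # ~3-4 bits/weight plus small metadata overhead
    print(f"{params}B params: {fp16:.1f} GiB at fp16 -> ~{q4:.1f} GiB at ~4 bits")
```

At 16 bits a 33B-parameter model needs over 60 GiB just for its weights, while at roughly 4 bits per parameter it needs about 16 GiB, which is why a single 24 GB GPU becomes sufficient.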