Fine-tuning LLMs

Since its release in November 2022, ChatGPT has captured users' imagination and made them wonder about the capabilities of Large Language Models (LLMs) in particular and AI in general. It is difficult to come across someone who has not experienced the power of ChatGPT. While models such as GPT, Gemini and Claude are powerful, with hundreds of billions of parameters and pre-training on vast corpora of text, they are not omnipotent. They can fall short on specific tasks; however, they can be adapted to such tasks by fine-tuning them with techniques such as quantization and LoRA, and several libraries exist to support fine-tuning.

Fine-tuning is an expensive process, especially for a model with a large number of parameters. Models with fewer than 10 billion parameters can be fine-tuned without significant infrastructure changes. For larger models, we require approximately 1.5 terabytes of GPU VRAM, equivalent to a cluster of 20 Nvidia A100s, each with 80 GB of VRAM. Such a setup costs around $400,000, and that assumes the hardware is even available.

Alternatively, one can use a cloud provider (AWS, Azure or GCP). This approach is also expensive. An hour of A100 GPU time on AWS costs about $40. Fine-tuning a model on 20 GPUs for 5 days (20 GPUs × $40/hour × 24 hours × 5 days) would cost roughly $100,000.

That is why researchers often use smaller LLMs with fewer than 10 billion parameters. A Mistral 7B model, for example, can be fine-tuned on a single Nvidia A10 on AWS in about 10 hours, costing less than $20. However, the model must first be quantized to fit in the A10's memory.

Quantization converts a model's parameters to low-precision data types such as 8-bit or 4-bit integers. This reduces memory consumption and speeds up execution. The 32-bit floating-point values are mapped to a small, finite set of values: 256 possible values in the 8-bit case.
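As a minimal illustration of this idea (not how production libraries implement it), the sketch below maps a float32 weight tensor onto 256 integer levels using a single per-tensor scale. The helper names quantize_int8 and dequantize are purely illustrative.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map 32-bit float weights onto 256 integer levels (int8)."""
    scale = np.abs(weights).max() / 127.0                 # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)              # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())                            # small rounding error
```

The reconstruction is only approximate, which is exactly the trade-off quantization makes: a little precision in exchange for a quarter (int8) or an eighth (4-bit) of the memory.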

Another technique, LoRA (low-rank adaptation), updates the model's weights through a form of matrix dimensionality reduction. Transformers, as we know, rely on large weight matrices, and fine-tuning adjusts the parameters within these matrices. In LoRA, two much smaller matrices are trained and their product is used to update the original weights.

A matrix with 1000×1000 parameters (totaling 1 million parameters) can be approximated by the product of a 1000×100 matrix and a 100×1000 matrix. This reduces the trainable parameter count to 2×100,000 = 200,000 (a reduction of 80 per cent). The approximation is less precise, but it still improves memory and computational efficiency significantly.
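The toy sketch below, using the same 1000×1000 example, shows the idea: the pre-trained matrix W stays frozen, while only the two low-rank factors A and B are trained. Real LoRA implementations (for example in the peft library) apply this per attention projection matrix rather than to one big matrix.

```python
import numpy as np

d, r = 1000, 100                       # full dimension and chosen low rank

W = np.random.randn(d, d)              # frozen pre-trained weights (1,000,000 params)
A = np.random.randn(d, r) * 0.01       # trainable low-rank factor   (100,000 params)
B = np.zeros((r, d))                   # trainable low-rank factor   (100,000 params), starts at zero

# During fine-tuning only A and B are updated; the effective weight is W + A @ B.
W_adapted = W + A @ B

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params, 1 - lora_params / full_params)   # 1000000 200000 0.8
```

Initializing B to zero means the adapted model starts out identical to the pre-trained one, and the update grows only as training proceeds.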

Quantization and LoRA can be used in combination; the combined technique is called QLoRA.

To begin fine-tuning anew, the Unsloth Python library can be used. After pre-training an LLM, the next stage is supervised fine-tuning (SFT). The usual tools are the SFTTrainer from Hugging Face's TRL library and PEFT (Parameter-Efficient Fine-Tuning). In addition, LoRA and quantization can easily be applied using bitsandbytes (by Tim Dettmers).
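A hedged sketch of how these pieces fit together on the Hugging Face stack (transformers, peft and bitsandbytes) is shown below. The model name and the LoRA hyperparameters (r, lora_alpha, target_modules) are illustrative choices, not values from the text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization settings in the QLoRA style: NF4 weights, bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "mistralai/Mistral-7B-v0.1"   # any small causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA matrices to the attention projections;
# the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only a small fraction of parameters is trainable
```

The resulting PEFT model can then be trained on an instruction dataset with TRL's SFTTrainer or a standard Trainer loop.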

Lastly, after pre-training and supervised fine-tuning, the model is taught which generated outputs are desirable and which are not. This stage is called preference optimization. The techniques used are reinforcement learning from human feedback (RLHF) and Direct Preference Optimization (DPO).
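To make DPO concrete, here is a minimal, self-contained sketch of its objective: the policy is pushed to assign higher probability to the preferred (chosen) answer than to the rejected one, relative to a frozen reference model. In practice one would use a library such as TRL rather than this toy function; the log-probability values below are made up for illustration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with sequence log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)
```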

In March 2024, a new technique emerged called Odds Ratio Preference Optimization (ORPO), which combines supervised fine-tuning and preference alignment into a single step.
