We have already learnt that an LLM is trained on vast amounts of text. It learns to predict the next word, and each prediction triggers small adjustments to its parameters to improve its chances of getting the next one right. All this training gives the LLM a statistical grasp of language, and it is what we call pre-training. However, a model trained only this way fumbles when asked, say, to crack a joke to lighten the mood. This is where reinforcement learning from human feedback (RLHF) comes in. OpenAI detailed the technique, as applied to its GPT models, in March 2022. (As you know, ChatGPT was released in November 2022, eight months later.)
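To make the idea of pre-training concrete, here is a minimal sketch, in PyTorch, of a single training step: a toy model (the layer sizes and the random "text" below are made up purely for illustration) guesses the next token, and its weights are nudged a little based on how wrong the guess was.

```python
# A minimal sketch (not a production setup) of one pre-training step:
# predict the next token, measure the error, make a small adjustment.
import torch
import torch.nn as nn

vocab_size, embed_dim, seq_len, batch = 100, 32, 16, 4

# Toy "language model": an embedding followed by a single linear layer.
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for real text
inputs, targets = tokens[:, :-1], tokens[:, 1:]          # predict the next token

logits = model(inputs)                                   # (batch, seq_len-1, vocab)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))  # how wrong were the guesses?
loss.backward()
optimizer.step()                                         # the "small adjustment"
optimizer.zero_grad()
```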
There are three steps in RLHF. First, human volunteers are shown two potential LLM responses to a given prompt and asked to pick the better one; this is repeated thousands of times. Second, the collected preferences are used to train a second model, called the reward model, which learns to assign higher scores to responses a human would like (and lower scores to everything else). Third, the original LLM's knobs and levers are tweaked so that the behaviours that earn a high reward are reinforced.
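As a rough illustration of the second step, here is a minimal sketch, in PyTorch, of how a reward model can be trained on such preference pairs. The tiny scorer and the random data below are stand-ins for illustration, not what any lab actually ships.

```python
# A minimal sketch of reward-model training on human preference pairs.
# The model learns to score the human-preferred response above the rejected one.
import torch
import torch.nn as nn

embed_dim = 32

# Toy reward model: maps a response representation to a single score.
reward_model = nn.Linear(embed_dim, 1)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

# Stand-ins for embedded (prompt, response) pairs labelled by volunteers.
chosen = torch.randn(8, embed_dim)    # responses humans preferred
rejected = torch.randn(8, embed_dim)  # responses humans rejected

score_chosen = reward_model(chosen)   # higher means "a human would like this"
score_rejected = reward_model(rejected)

# Pairwise loss: push each chosen score above its rejected counterpart.
loss = -nn.functional.logsigmoid(score_chosen - score_rejected).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```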
All this takes time and is cumbersome. The same results can be achieved with less effort through a technique called Direct Preference Optimization (DPO), which Archit Sharma, Eric Mitchell and their colleagues presented in December 2023.
DPO rests on an observation: for every reward model there is a specific theoretical LLM that would score full marks on it, which means every LLM conceals an implicit reward model of its own. Researchers can work with that implicit reward model directly, so the LLM learns straight from the human preference data rather than from a separately trained reward model. Removing the intermediary makes the process far more efficient. DPO is now used extensively by leading models: Meta (formerly Facebook) has integrated it into its Llama models, and the French company Mistral AI uses it too.
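For the curious, here is a minimal sketch of the DPO loss itself, assuming we already have the log-probabilities that each chosen and rejected response receives from the model being trained and from a frozen reference copy of it; the numbers below are dummies for illustration.

```python
# A minimal sketch of the DPO loss on one batch of preference pairs.
import torch
import torch.nn.functional as F

beta = 0.1  # how strongly to pull the model toward the human preferences

# Summed log-probabilities of each full response (batch of 8 preference pairs).
policy_chosen_logps = torch.randn(8)    # model being trained, preferred responses
policy_rejected_logps = torch.randn(8)  # model being trained, rejected responses
ref_chosen_logps = torch.randn(8)       # frozen reference copy, preferred responses
ref_rejected_logps = torch.randn(8)     # frozen reference copy, rejected responses

# The model's own "implicit reward": how much more likely it makes a response
# than the frozen reference copy does.
chosen_reward = policy_chosen_logps - ref_chosen_logps
rejected_reward = policy_rejected_logps - ref_rejected_logps

# Learn to prefer the chosen response directly from the preference data --
# no separate reward model, no reinforcement-learning loop.
loss = -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()
print(loss)
```

Notice that a reward model never appears: the gap between the model and its frozen reference copy plays that role, which is exactly the intermediary DPO removes.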