There is an ongoing discussion about the role of data in model training, particularly around data quality and synthetic data. A recent Microsoft paper, Textbooks Are All You Need, focuses on training models to write Python code, but its implications extend beyond coding.
The models examined in the paper do not owe their success to any pathbreaking design or training methods; the architecture and training procedure are conventional. The innovation lies in the training data, which improves the learning efficiency of language models for code.
The Scaling Laws for Neural Language Models paper (2020) focused on model size: large models trained on a modest amount of data. DeepMind's Training Compute-Optimal Large Language Models shifted the focus to data size, arguing that existing large models were undertrained. In 2023, the focus has shifted to data quality; a leaked Google memo asserted that data quality scales better than data size.
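For a rough sense of what these scaling-law papers quantify, the DeepMind (Chinchilla) analysis models pretraining loss as a function of parameter count N and training tokens D. The form below is the paper's fitted loss curve, with the exponents quoted only approximately as an illustration:

$$
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad \alpha \approx 0.34,\;\; \beta \approx 0.28
$$

Here E, A, and B are fitted constants. Minimizing this loss under a fixed compute budget (roughly C ≈ 6ND) pushes N and D up in nearly equal proportion, which works out to on the order of 20 training tokens per parameter; models trained on far fewer tokens than that are what the paper calls undertrained.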
Microsoft's Textbooks Are All You Need builds on this shift toward data quality. The paper demonstrates that a capable LLM for Python code can be trained on a selection of 'textbook quality' data filtered from the web together with synthetically generated data.
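The filtering step can be pictured as a small classifier trained on a seed set of labeled examples and then applied to a large raw corpus. The sketch below is only illustrative of that structure; TF-IDF features and a toy two-snippet seed set stand in for the paper's reported pipeline (an LLM-annotated seed set and a classifier over code embeddings), and the threshold is an arbitrary assumption:

```python
# Illustrative sketch of "textbook quality" filtering: label a small seed
# set, train a cheap classifier, then score and filter a large raw corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed set: 1 = reads like a clear, self-contained textbook
# example, 0 = opaque or low educational value.
seed_snippets = [
    'def factorial(n):\n    """Return n! computed iteratively."""\n'
    "    result = 1\n    for i in range(2, n + 1):\n        result *= i\n    return result",
    "x=q(3);y=q(7);print(z(x,y))  # opaque helper calls, no context",
]
seed_labels = [1, 0]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
clf = LogisticRegression().fit(vectorizer.fit_transform(seed_snippets), seed_labels)

def filter_corpus(snippets, threshold=0.5):
    """Keep only snippets the classifier scores as 'textbook quality'."""
    scores = clf.predict_proba(vectorizer.transform(snippets))[:, 1]
    return [s for s, p in zip(snippets, scores) if p >= threshold]
```

The design point is that labeling a small seed set is relatively cheap, and the resulting classifier is inexpensive enough to score a very large raw web corpus.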
Less Is More for Alignment (LIMA) likewise shows that a small, high-quality dataset can produce impressive results.
A separate question concerns synthetic data, that is, model-generated output used for training. Smaller models have been trained on the output of larger models, e.g. Alpaca and Vicuna. It is worth considering whether larger models can benefit from training on their own output. In either case, synthetic training data for LLMs should have sufficient diversity.
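One common way to inject diversity into synthetic data is to randomize the prompt given to the teacher model, for example by sampling topics and target audiences before each generation call. The sketch below is a generic illustration of that idea, not the paper's exact recipe; the `generate` function, topic list, and prompt wording are all hypothetical placeholders:

```python
# Sketch of generating diverse synthetic training data with a teacher model.
# Seeding each prompt with randomly chosen topics and audiences is one simple
# way to avoid the repetitive output that a single fixed prompt tends to produce.
import random

TOPICS = ["string parsing", "recursion", "file I/O", "sorting", "dictionaries"]
AUDIENCES = ["a beginner", "an interview candidate", "a data analyst"]

def generate(prompt: str) -> str:
    """Placeholder for a call to a teacher LLM or API."""
    raise NotImplementedError("plug in your model/API call here")

def make_synthetic_examples(n: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        topic = rng.choice(TOPICS)
        audience = rng.choice(AUDIENCES)
        prompt = (
            f"Write a short textbook-style Python lesson on {topic} "
            f"for {audience}, with one worked exercise and its solution."
        )
        examples.append(generate(prompt))
    return examples
```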
Textbooks Are All You Need provides evidence that data quality can compensate for data quantity and model size. The discussion around synthetic data will persist; it has already shown good results in image processing. The paper also found that language models trained on textbook-quality synthetic data achieved state-of-the-art performance on a variety of tasks, suggesting that synthetic data can be a valuable resource for training language models, especially when high-quality real-world data is not available.