Epoch is a nonprofit research institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism. It conducts research on the supply of training data for LLMs. Its study released in June 2024 predicts that the supply of publicly available text data for training will be exhausted roughly by the turn of the decade, sometime between 2026 and 2032. Text data is thus becoming a finite resource, and a 'gold rush' for it is already under way.
LLMs have grown smarter as they are trained on ever larger amounts of data; the amount of text fed into them has been growing about 2.5 times every year. LLMs require two key ingredients: vast stores of internet text and computing power. Computing power has grown about 4 times per year. Meta's latest Llama 3 models have been trained on up to 15 trillion tokens.
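A rough back-of-the-envelope projection shows how exhaustion estimates of this kind arise. The sketch below is illustrative only: the starting dataset size, the assumed stock of public human-written text, and the growth rate are stand-in numbers, not figures from the Epoch study.

```python
# Illustrative projection of training-data demand against a fixed stock of public text.
# All numbers are assumptions for demonstration, not figures from the Epoch study.

TOKENS_USED_2024 = 15e12    # assume a frontier model already consumes ~15 trillion tokens
GROWTH_PER_YEAR = 2.5       # training data reportedly grows ~2.5x per year
PUBLIC_TEXT_STOCK = 3e14    # assumed stock of usable public human text (~300 trillion tokens)

year, demand = 2024, TOKENS_USED_2024
while demand < PUBLIC_TEXT_STOCK:
    year += 1
    demand *= GROWTH_PER_YEAR

print(f"Under these assumptions, demand overtakes the stock around {year} "
      f"({demand / 1e12:.0f} trillion tokens needed).")
```

With these stand-in numbers the crossover lands in the late 2020s, inside the 2026 to 2032 window the study describes; different assumptions shift the year, which is why the study reports a range rather than a single date.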
AI companies keep signing deals with publishers to meet their data requirements, securing a steady flow of sentences from Reddit forums and news media outlets. In the long term, however, there will not be enough new blogs, news articles and social media comments to sustain the current trajectory.
AI companies could then try to tap private data such as emails and text messages, or rely on less-reliable synthetic data generated by chatbots themselves. OpenAI has been experimenting with generating large volumes of synthetic data for training. However, LLMs need high-quality data, and synthetic text tends to be of lower quality than human-written text.
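As an illustration of what chatbot-generated synthetic training text can look like, here is a minimal sketch using the OpenAI Python client. The model name, prompt, and output handling are assumptions for demonstration; this is not OpenAI's actual synthetic-data pipeline.

```python
# Minimal sketch: generating synthetic training text with a chat model.
# Assumes the official `openai` Python package (v1.x) and OPENAI_API_KEY in the environment.
# Model name and prompt are illustrative, not OpenAI's actual setup.
from openai import OpenAI

client = OpenAI()

def generate_synthetic_paragraphs(topic: str, n: int = 3) -> list[str]:
    """Ask a chat model for short, self-contained paragraphs on a topic."""
    samples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{
                "role": "user",
                "content": f"Write one short encyclopedic paragraph about {topic}.",
            }],
        )
        samples.append(resp.choices[0].message.content)
    return samples

if __name__ == "__main__":
    for paragraph in generate_synthetic_paragraphs("photosynthesis"):
        print(paragraph, "\n")
```

The quality concern mentioned above is visible even in a toy setup like this: the output mirrors the model's existing knowledge and style, so filtering and curation matter as much as volume.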
At present, models are scaled up to expand their capabilities and improve the quality of their output. Researchers had earlier predicted 2020 as the cut-off year for obtaining high-quality text data. Since then, new techniques have been employed to make better use of existing data, and models are sometimes 'overtrained' on the same sources multiple times.
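To make concrete what overtraining on the same sources means, the sketch below computes how many passes (epochs) over a fixed corpus are needed to reach a given token budget. The corpus size and token budget are assumed numbers for demonstration.

```python
import math

# Illustrative numbers only: a fixed corpus reused for several epochs to hit a token budget.
corpus_tokens = 5e12   # assumed size of the available high-quality corpus (5 trillion tokens)
token_budget = 15e12   # assumed total number of tokens the training run should see

epochs = math.ceil(token_budget / corpus_tokens)
print(f"Repeating the corpus {epochs} times covers the {token_budget / 1e12:.0f}T-token budget.")
# Published work on data repetition suggests a few such passes help,
# with diminishing returns beyond that.
```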
How much we should worry about this data scarcity is debatable. Is it necessary to keep training larger and larger models? Models can instead be trained for specialized domains and specific tasks.