AI is transforming our world. ChatGPT and Gemini are known for generating human-like text.
Researchers have observed that when AI models are trained on data generated by previous versions of those models, they drift away from the original data distribution and their output becomes distorted and unreliable. This phenomenon is called model collapse.
AI models are trained on vast volumes of data, much of it scraped from the internet. That data is, to begin with, generated by humans, and the model learns its patterns from it. Future generations of the model, however, are trained on data from two sources: human-generated data and data generated by previous models. As the share of model-generated data grows, data quality declines and the output degrades with it. Each version dilutes the original detail, until the output becomes a hazy, less accurate description of the world around us. It is a slow process, but it is like making a copy of a copy: some fidelity is lost at every step, and that loss is inevitable.
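To make the mechanism concrete, here is a minimal sketch of recursive training, assuming a toy setup in which each "generation" fits a simple Gaussian model to samples produced by the previous generation and then generates the data for the next one. The sample size and number of generations are illustrative choices, not values from any real training pipeline, but the drift they produce mirrors the copy-of-a-copy loss described above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100      # data points available to each generation (illustrative)
n_generations = 50   # number of successive model generations (illustrative)

# Generation 0: the "human" data follows a standard normal distribution.
mu, sigma = 0.0, 1.0

for gen in range(1, n_generations + 1):
    # The current model generates synthetic data from its learned distribution...
    data = rng.normal(mu, sigma, n_samples)
    # ...and the next model is fitted only on that synthetic data.
    mu, sigma = data.mean(), data.std()
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")

# The fitted standard deviation drifts downward across generations: the tails
# of the original distribution are progressively lost, a toy form of collapse.
```

With no fresh human data entering the loop, the fitted spread shrinks generation after generation, which is the statistical counterpart of the increasingly blurry copies described above.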
The resulting content is less creative, more stereotyped and less useful.
This is not a limited problem; it has far-reaching impacts. The model's performance declines, it becomes less reliable, and it may commit costly errors. The issue of bias also raises its ugly head, since skews in earlier outputs are fed back into later training and reinforced.
One solution is to restrict the model's training to human-generated data. However, much of today's internet content is itself model-generated, and distinguishing human-generated content from AI-generated content is difficult: at times AI-generated content closely mimics human writing.
The use of human-generated data has its own problems, chief among them ethical and legal issues. Pioneering models also enjoy a first-mover advantage, since they were trained when there was less contaminated data available; early adopters are thus in an advantageous position.
Access to human-generated data is therefore crucial, but it has to be balanced against the rights of those whose data is being used. This calls for cooperation at the industry level. Models should also be continuously exposed to fresh human-generated data, as the sketch below illustrates.
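As a rough illustration of that last point, the toy simulation above can be rerun with a fixed share of fresh "human" data mixed into every generation's training set. The mixing fraction and other parameters here are assumptions chosen purely for illustration; the point is only that anchoring each generation to original data keeps the learned distribution from drifting.

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples = 100
n_generations = 50
human_fraction = 0.3   # share of each generation's data kept human-generated (assumed)

mu, sigma = 0.0, 1.0   # the learned distribution, starting from the human one

for gen in range(1, n_generations + 1):
    n_human = int(human_fraction * n_samples)
    fresh_human = rng.normal(0.0, 1.0, n_human)              # fresh human-generated data
    synthetic = rng.normal(mu, sigma, n_samples - n_human)   # data from the previous model
    mixed = np.concatenate([fresh_human, synthetic])
    mu, sigma = mixed.mean(), mixed.std()

print(f"after {n_generations} generations: mean={mu:+.3f}, std={sigma:.3f}")
# Because every generation sees some original data, the fitted distribution
# stays close to the human one instead of collapsing as in the earlier loop.
```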
Though AI models are powerful, they are still dependent on the quality of data they are trained on.
Though model collapse poses a problem, we can overcome it by implementing the right strategies.