Issues of Data Lifting by Gen AI Models

Generative AI tools such as ChatGPT can seem magical, yet they still raise many issues. If they are pressed hard for a prolonged period, they may produce erroneous responses, called hallucinations. OpenAI, which launched ChatGPT, scraped the internet to train its model without seeking the consent of content creators. OpenAI's competitors in the generative AI field did the same. The issue is whether these companies have the right to scrape content from the internet without express permission. The scraped content is later used to generate fresh content through the models, and no payment is made to the content creators.

Prominent authors have sued OpenAI, and many more are likely to join the suit. Other content creators, including painters and photographers, are also aggrieved.

Google has now given website publishers a switch that keeps a website available to web crawlers for search but not for generative AI training. With this tool, a website can choose to be available for search only, or for both search and model training.
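To make the idea of such a switch concrete, here is a minimal sketch. It assumes the switch is Google's `Google-Extended` robots.txt token, which this article does not name, and it uses a placeholder site, `example.com`. The helper name `crawler_permissions` is hypothetical. The sketch uses Python's standard `urllib.robotparser` to check what a robots.txt file of this kind permits; it only illustrates how the rules are read and is not Google's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not taken from any real site):
# opt out of generative AI training via the Google-Extended token,
# while leaving the site open to every other crawler, including search.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def crawler_permissions(robots_txt, url, agents):
    """Return {agent: allowed} for each user-agent token, per robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawler_permissions(
    ROBOTS_TXT,
    "https://example.com/article.html",  # placeholder URL
    ["Googlebot", "Google-Extended"],
)
print(permissions)  # {'Googlebot': True, 'Google-Extended': False}
```

Under these assumptions the page stays open to the search crawler while the training token is refused. A robots.txt file is a request, not an enforcement mechanism, so it works only when crawlers choose to honour it.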

Google may be sincere in its effort to give websites control over their published material, or it may have acted to avoid litigation. In either case, the rise of generative AI models has brought the issue of content ownership to the fore.

Media organisations abroad have taken steps to protect their websites and have adopted tools that prevent their content from being lifted for model training. Indian media organisations, too, have safeguarded their websites against the crawlers of generative AI companies.

Such denial is good news for publishers, but AI models must be fed content if they are to be trained and refined. Big tech has the capability to bypass the safeguards that websites introduce. Doing so, however, would weaken its legal position.

Another option is to compensate authors, content creators and web publishers. The question is whether this is practical: can AI companies afford to pay for all the content they need for training and fine-tuning? AI models are hungry for more and more content.

Synthetic data, meaning machine-generated content, has been proposed as an alternative. However, it has not yielded good results. Ultimately, AI companies are likely to accept paying compensation for legal access to content on the net, not willingly but under pressure from the courts. Such an arrangement may not benefit small-time content creators and may work in favour of the big fish.

Countries are grappling with framing laws to address these issues, but India lags behind. Indian policymakers must realise that India generates a huge volume of digital data, second only to China. This data is valuable to AI companies. India therefore needs a legal framework to deal with it.
