Data for Training Generative AI

Generative AI companies are being criticized for training their models on data that does not belong to them. Many websites now shut the door on them: a short directive added to the site's 'robots.txt' file, the standard mechanism for regulating the bots and spiders that scour the internet, tells the AI company's crawler to go away.
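For example, OpenAI has published the user-agent token for its crawler, GPTBot, so a site that wants to keep it out can add two lines to its robots.txt:

    User-agent: GPTBot
    Disallow: /

The first line names the crawler; the second tells it that no path on the site may be fetched.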

How can an AI company take content that does not belong to it? And it will take much more than a single line of code to stop that from happening.

Other online platforms exercise the block selectively. Stack Overflow blocks the crawler, but GitHub, owned by Microsoft, does not.
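Whether a given site blocks a given crawler is easy to check. Below is a minimal sketch using Python's standard library (the helper name crawler_allowed is mine); it assumes the site publishes its rules at the conventional /robots.txt path:

    from urllib.robotparser import RobotFileParser

    def crawler_allowed(site: str, user_agent: str) -> bool:
        # Fetch and parse the site's robots.txt, then ask whether
        # the named crawler may fetch the site's front page.
        rp = RobotFileParser()
        rp.set_url(f"{site}/robots.txt")
        rp.read()
        return rp.can_fetch(user_agent, f"{site}/")

    print(crawler_allowed("https://stackoverflow.com", "GPTBot"))

Note that robots.txt is purely advisory: this check is the same one a well-behaved crawler performs before fetching, but nothing forces a crawler to perform it.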

OpenAI’s offer, giving sites the ability to prevent their content from being siphoned off by adding a line of code, must be set against the long data-gathering exercise it has carried out for quite some time. It is trying to be helpful after the horses have bolted. The burglary is already over.

At the same time, it must be acknowledged that OpenAI expects other companies to follow suit.

Web crawlers are not the only method AI companies use to collect data. They also draw on bulk datasets supplied by third parties: Books3, for example, is a dataset of nearly 200,000 books. Authors have grounds to sue over the use of their books this way.

There is no way to do anything about data that has already been scooped up, and no guarantee that, despite the blocking code, data will not be collected by alternative methods. OpenAI has acknowledged that consent will be vital for any future scraping.

And there are many other bots out there that offer no such safeguard, leaving sites with no option to opt out.

Google, the maker of Bard, has expressed interest in a discussion on developing a mechanism for administering consent for AI training.

What has already been done cannot be undone. The data is in the digital blender.
