AI Data Theft

LLMs are built on vast corpora of data, and to build ever more capable models, firms have used virtually the entire public web. Other public and private data sources have also been tapped, including books, research papers and private data.

In the hunt for more data, firms may have used transcripts of YouTube videos, even though doing so could breach the law. When OpenAI's CTO was asked whether the company had done this, she said she was not sure. Google itself has scraped transcription data (private data) from YouTube videos to train its own models.

Data harvesting has long been ingrained in the business models of large tech firms such as Google and Facebook (Meta). Licensing copyrighted book material is possible, but it takes a long time.

A crucial difference between a capable model such as ChatGPT and other models is the volume of data it is trained on.

In the past, there were instances of Facebook sharing user data with third parties. Facebook is, of course, in an advantageous position in the AI race, as it sits on a mountain of data from Facebook and Instagram.

It is also important to train LLMs on unique data. Social media hosts billions of posts, images and videos contributed by the public, and the platforms try to remain transparent about how such data is used to build products and features.

This traffic of data sharing runs across the players, with one using the data of the other and vice versa. It is a scratch-and-grab mentality. Mining data has become a multi-trillion-dollar business.

After all, these firms all live in glass houses. Who will throw stones at others? The thrower could be at the receiving end tomorrow.
