LLMs are trained on massive amounts of data. Stuart Russell of the University of California, Berkeley, believes that there will soon be no data left to ingest and that bots like ChatGPT may hit a brick wall. In the near future, the whole field of generative AI may be adversely affected by a paucity of data. It is this anxiety that compels companies to resort to data harvesting. Data collection practices are already under scrutiny from those whose copyrighted material is being used, and much of the collection is being done without consent. The most worrying factor is the shortage itself: all high-quality data could be exhausted by 2026. Such high-quality data is sourced from books, news articles, scientific papers, encyclopedias and web content.
OpenAI has bought datasets from private sources. From this we can infer that there is an acute shortage of high-quality data.
GPT-4 was created using public as well as private data. OpenAI has not, however, revealed the sources of the data used for GPT-4.
Sam Altman, CEO of OpenAI, has no plans to take the company public, as its unorthodox structure and decision-making could lead to conflicts with investors.