OpenAI justifies its data collection for training LLMs by arguing that such models cannot feasibly be built without copyrighted material. It made this argument in a submission to the UK House of Lords Communications and Digital Select Committee.
Copyright covers virtually every sort of human expression, including blogs, posts, photos, software code and government documents. It applies to both fiction and non-fiction.
Generative AI tools such as ChatGPT and image generation tools such as Stable Diffusion are trained on vast amounts of data, most of which is protected by copyright. This has angered both the original publishers and the creators of that data, who contend that their work is being used without consent or compensation.
The New York Times lawsuit against OpenAI and Microsoft is before the courts. OpenAI argues that copyright law does not deprive it of the right to train its models. Some authors have also taken legal action.
Generative AI companies act as if copyright laws do not exist. The line between mere adaptation and the creation of something new is blurry, and AI exposes how poorly adaptations and new content are defined.
Data mining should be treated as a safe harbour. Using data to develop AI technology is fundamentally not an act of infringement. Merely copying material is not enough to constitute infringement; there must be unauthorized use of the work for its expressive purpose. Using material for technical or non-communicative purposes is not use for an expressive purpose, and therefore does not violate copyright.
Models do not redistribute or recommunicate the data they are trained on. They use it to recognize patterns and associations. Machines are trained to learn, reason and act as humans do, and the models generate new text based on what they have learned from the data they ingest.