The basic idea behind copyright is to ensure that creators have an incentive to produce new work. It also leaves some space for derivative use, such as fair use for criticism, comment, reporting, teaching, scholarship or research, where a small sample of the copyrighted work may be duplicated.
Of late, AI has been testing the boundaries of copyright. It generates music, visuals, lyrics and scripts after ingesting the previous work of creative people. On receiving a prompt, AI processes the ingested material and delivers an output.
The New York Times has filed a suit against OpenAI and Microsoft, alleging that both companies used large language models (LLMs) trained on copyrighted NYT articles. This deprives the NYT of an audience that would otherwise have reached it; the output substitutes for its content without permission or payment. According to the NYT, this is not fair use, since the models compete with the newspaper and closely mimic it, having been trained on its content.
The lawsuit cites examples of NYT articles reproduced word for word. This bypasses the subscription paywall, which is critical to the paper's survival.
Microsoft's Bing search engine, too, generates detailed summaries and excerpts from the articles. This goes far beyond fair use.
The NYT demands not only compensation and restrictions but also the destruction of all tools and models that incorporate its work.
One question that can be raised is why the NYT did not block access to its content. The answer is that ChatGPT went live in November 2022, and by then its underlying model, with 175 billion parameters, had already been trained on about 45 terabytes of data from various datasets. By the time the problem was recognised, the model had already ingested the data.
The NYT points out that although the information is public, that does not mean it is free to copy.
Apple, by contrast, is negotiating with publishers and offering them monetary compensation for licensing their content to train its AI tools. If such transactions go through, they will strengthen the NYT's case.
The NYT case was filed three months after the Authors Guild went to court against OpenAI.
Content created by someone is being used without acknowledgement or payment. Both the Authors Guild and the NYT have accused the tech companies of free-riding.
Traditional AI used data for pattern recognition; those models were mostly predictive. Generative AI creates, or generates, content, taking the technology to another level. To do this, it uses extraordinarily large datasets: ChatGPT-like apps draw on some 45 terabytes of data. Such a model is trained on content created by others, and its answers closely resemble, and substitute for, the original.
The issue is not stifling innovation. It is the use of data without express permission or payment. The principle is simple: pay for what you use.
This is an untested legal area.
The vernacular-language models being developed in India must respect copyright. The government can frame laws to prevent free-riding. Original content creators must not be taken for a ride.