As we have already observed, the New York Times has filed a copyright infringement suit against OpenAI and Microsoft. Here we discuss some additional points related to this suit.
The models are alleged to memorize copyrighted material. That brings another term to center stage: approximate retrieval. The model does not repeat exactly the same information that has gone into it. The crux of approximate retrieval lies in the fact that LLMs do not fit the mould of traditional databases, where precision and exact matches are paramount. LLMs operate, in spirit, like very large n-gram models, predicting the next token, and so they inject an element of unpredictability into the retrieval process.
Prompts are not keys into a structured database; they merely serve as cues for the model to generate the next token based on context. LLMs do not promise exact retrieval.
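The contrast can be made concrete with a toy sketch. Below, a hand-made next-token distribution (the contexts, tokens, and probabilities are all invented for illustration) is sampled repeatedly: the same "query" yields different continuations across draws, unlike a database lookup, which would return one fixed record.

```python
import random

# Toy next-token model: for a given two-word context, a probability
# distribution over possible next tokens (numbers invented for illustration).
NEXT_TOKEN_PROBS = {
    ("the", "court"): [("ruled", 0.5), ("found", 0.3), ("held", 0.2)],
}

def sample_next(context, rng):
    """Sample the next token: the prompt conditions a probability
    distribution; it is not a key that fetches a stored record."""
    tokens, weights = zip(*NEXT_TOKEN_PROBS[context])
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
# Twenty draws from the same context produce more than one continuation:
samples = {sample_next(("the", "court"), rng) for _ in range(20)}
print(samples)
```

A relational database queried twenty times with the same key would return the same row twenty times; the sampler above, by design, does not.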
The lawsuit revolves around the issue of memorization. LLMs do not guarantee verbatim reproduction. However, the extensive context window and network capacity leave space for potential memorization, which can surface as unintended plagiarism. If prompted again and again, an LLM could generate exact sentences from its training data. Fine-tuning adds to this: the task of the LLM shifts toward memory-based retrieval rather than autonomous planning on its part. An expanded context window makes memorization even more pronounced.
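A tiny model makes the memorization point vivid. The sketch below trains a character-level trigram model on a single invented sentence; with greedy decoding (always picking the most likely next character), the model reproduces its training text verbatim, because the training data was small enough to memorize completely.

```python
from collections import Counter, defaultdict

def train_trigram(text):
    """For each two-character context, count the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts

def greedy_generate(counts, seed, length):
    """Greedy decoding: always emit the most frequent next character."""
    out = seed
    for _ in range(length):
        ctx = out[-2:]
        if ctx not in counts:
            break  # no continuation seen in training
        out += counts[ctx].most_common(1)[0][0]
    return out

# Trained on one sentence, the model has effectively memorized it:
sentence = "the times sued over copying"
model = train_trigram(sentence)
print(greedy_generate(model, sentence[:2], len(sentence)))  # reproduces it
```

Real LLMs train on vastly more data, so most text is generalized rather than memorized; but passages seen many times can behave like this toy case.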
In legal discussions, the focus should be on the inability of LLMs to achieve exact retrieval. That inability is a defense against the copyright-infringement claim.
LLMs may behave both as memorization devices and as generation devices in their own right. This creates a dilemma in news generation: if an LLM is too creative, it can generate fake or inaccurate news; if it offers exact news, it violates copyright. Another concept has emerged in response: retrieval-augmented generation (RAG). RAG gives LLMs a structured approach to information retrieval, striking a balance between the spontaneity of LLMs and disciplined traditional search-engine methods, thereby reducing hallucinations.
In such a pipeline, material such as the NYT's articles is converted into a vector database, which facilitates retrieval-augmented generation.
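A minimal sketch of the retrieval half of RAG follows. The articles and query are invented, and bag-of-words counts stand in for the learned embeddings a real vector database would store; the point is only the shape of the pipeline, namely embed, find the nearest passage, and ground the prompt in it.

```python
import math
from collections import Counter

# Invented stand-in corpus; a real system would index actual articles.
ARTICLES = [
    "the city council approved a new transit budget",
    "scientists report record ocean temperatures this year",
    "the team won the championship after extra time",
]

def embed(text):
    """Toy 'embedding': word counts instead of a learned dense vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, articles):
    """Return the stored passage most similar to the query vector."""
    q = embed(query)
    return max(articles, key=lambda art: cosine(q, embed(art)))

query = "what happened with ocean temperatures"
context = retrieve(query, ARTICLES)
# Grounding: the retrieved passage is prepended to constrain generation.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(context)
```

The design choice is the one described above: the retrieval step is deterministic and auditable like a search engine, while the generation step retains the model's fluency, constrained by the supplied context.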
Given the way such next-token (n-gram-style) models work, the probability of an entire NYT article being reproduced unaltered is vanishingly small. The case sustains only if there is actual lifting of NYT articles, and such lifting dents the revenue of the NYT.
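A back-of-the-envelope calculation shows why verbatim reproduction of a long passage is unlikely under sampling. If, at each step, a sampling decoder happens to emit the "correct" next token with probability p (the values of p and n below are purely illustrative), an n-token passage comes out verbatim with probability p to the power n, which decays geometrically.

```python
# Illustrative arithmetic: probability of emitting an n-token passage
# verbatim when each token independently matches with probability p.
def verbatim_probability(p_per_token: float, n_tokens: int) -> float:
    return p_per_token ** n_tokens

for p in (0.99, 0.9):
    for n in (100, 1000):
        print(f"p={p}, n={n}: {verbatim_probability(p, n):.3g}")
```

One caveat, consistent with the memorization discussion above: this reasoning applies to sampling. Under near-greedy decoding of a heavily memorized passage, the per-token match probability approaches 1, and verbatim output stops being improbable.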