Content Deals with AI Companies

After the commotion over the use of copyrighted material to train LLMs, especially the NYT suit against OpenAI, content deals between AI companies and top publishers are coming fast and furious. On Wednesday, May 22, 2024, OpenAI signed a deal with Rupert Murdoch's News Corp, said to be worth $250 million over five years. It is the biggest content deal so far.

It is an acknowledgement of the maxim that premium journalism commands a premium. The exact terms of the deal are not known, but the sum is assumed to include compensation in the form of cash as well as credits for the use of OpenAI technology.

In April 2024, the Financial Times struck a deal with OpenAI. Other OpenAI deals have included the Associated Press, Axel Springer and Le Monde. Without a deal, OpenAI and other AI companies are likely to use publishers' content clandestinely without paying for it.

Though deals are being struck now, it is highly likely that OpenAI ingested this content long ago. The current payments to 'get access' paper over its earlier brazen scraping of information in the public domain. In effect, these deals are settlements, and one of their terms could be that the publishers will not drag OpenAI to court.

Litigation is expensive and time-consuming. The NYT suit against OpenAI, if fought all the way up to the Supreme Court, will take a long time. It is sensible for publishers to take the deal. There is no opting out of AI; it guzzles their content, deal or no deal.

These deals are neither transparent nor ethical. There should be an equitable system based on public policy rather than secretive deals. One option is a centralized platform controlled by a consortium of publishers: it could aggregate content and make it available to AI companies for a fee based on agreed criteria.

The principle that should prevail is that those who create content should have control over it.

OpenAI has also signed content and product partnerships with The Atlantic and Vox Media. Several media firms are signing deals with OpenAI that give it access to their news content and archives, following its earlier deal with News Corp, the media conglomerate that owns the Wall Street Journal.

LLMs: Labour Intensive

Generative AI does facilitate work and improve productivity; it can generate reports and write code. At the same time, however, LLMs by themselves may require more human labour than the effort they save.

Peter Cappelli, a management professor at the Wharton School of the University of Pennsylvania, spoke at an MIT event. In his opinion, LLMs create more work for people. AI is praised as a game-changing technology, and there are rosy predictions about autonomous cars and trucks, yet these have not seen the light of day. Many things are lost in the details: regulation of such vehicles, insurance liability, software issues. Besides, a truck driver does not just drive; he does a lot of other tasks. Similarly, programmers do many things apart from programming, such as setting project goals and negotiating budgets. The technological possibilities exist, but the roll-out is slow on account of realities on the ground.

AI generates new work. Databases have to be managed. Materials need to be organized. Reports have to be prepared. There are issues of validity. All this means many new tasks.

We are already familiar with operational AI, and it is still a work in progress. ML remains underused. There are data science issues: data has to be analyzed, and data in silos has to be integrated.

LLMs can do many tasks, but there are tasks they should not handle alone. Letters generated by LLMs that have legal implications require vetting by lawyers. Is that a time saver then?

LLMs are expensive. They require space, electricity and manpower. It is not necessary to replace rote automation with AI.

Generative AI output requires validation for accuracy and has to be vetted by an expert. At times there are hallucinations and quirky responses. This is an issue of reliability.

LLMs can also give different and varied responses to the same prompt, which is another reliability issue.

People still prefer to make decisions on the basis of gut feelings or personal preferences.

In the near future, generative AI will mostly be used to sift data and do analysis to facilitate decision-making.

Personality Rights

In India, celebrities such as Rajnikanth, Anil Kapoor and Jackie Shroff have approached the courts over personality rights, also called publicity rights. These are a part of celebrity rights: the name, voice, signature, images or any other feature of a celebrity's personality that the public identifies with lie at the heart of personality rights. These could include poses, mannerisms or any distinct aspect of their public persona.

Celebrities at times register aspects of their personalities as trademarks to use them commercially.

The unauthorized use of these characteristics for commercial purposes not only infringes upon these rights but also dilutes the brand equity.

These rights are not defined under Indian law, but are read into the rights to privacy and property. Concepts from intellectual property (IPR), such as passing off or deception, are also applied to ascertain whether protection is warranted. A court can grant an injunction restraining violators of personality rights from using tools such as AI, face morphing and GIFs for commercial purposes.

LLMs have used the media's copyrighted data, and the media is suing them. The Authors Guild filed a suit on behalf of writers such as George R. R. Martin and John Grisham, alleging illegal use of the authors' copyrighted works by OpenAI. Hollywood actors went on strike, believing that their faces and voices could be used to create new films and shows without their consent and without compensation. Finally, the studios assured the actors that their consent would be sought.

Of late, Hollywood actor Scarlett Johansson has been shocked that OpenAI's GPT-4o sounds 'eerily similar' to her voice. She had earlier declined OpenAI's request to use her voice. GPT-4o's Voice Mode feature allows users to have voice conversations with the AI chatbot and lets them choose from five voices. One of the voices, called Sky, resembles Johansson's. OpenAI later paused the availability of Sky, adding that it is not Johansson's voice but that of another actor, and was not intended to resemble hers. OpenAI's behaviour is emblematic of the haughty sense of impunity that pervades the AI industry.

AI-powered deepfake voices have emerged as a potent tool in political campaigns, corporate espionage and cyber fraud.

In its landmark judgement (Puttaswamy, 2017), the Supreme Court held that an individual has a fundamental right to privacy under Article 21 of the Constitution. In the Ritesh Sinha case (2019), the SC ruled that voice samples, though protected under the right to privacy, can be legally compelled for criminal investigation in the public interest.

A voice per se cannot get copyright protection, but an artist's voice can be protected as a performance right.

It is not clear whether a voice (recorded or cloned) will be treated as digital personal data. If so, it will require explicit consent under the yet-to-be-implemented Digital Personal Data Protection Act, 2023.

The IT Act and its Rules do not permit illegal content. They require labelling of synthetic content and removal of deepfake content within 24 to 36 hours of receiving a report from either a user or a government authority. In case of non-compliance, remedies are available under the IT Act and the IPC.

Pharma Quality

India has been described as the 'pharmacy of the world'. Though quality is of paramount importance in every industry, it is a must in pharma, where lives are at stake. The Department of Pharmaceuticals (DoP) has taken several steps to upgrade manufacturing practices. The approach paper on the National Pharmaceutical Policy, 2023 intends to promote Indian pharma exports. India manufactures more than 60,000 generic brands and accounts for 20 per cent of the global supply of generics. India's pharma exports stood at $27.9 billion in 2024.

The pharma manufacturing hubs in India are Gujarat, Maharashtra, Himachal Pradesh and Telangana.

Recently, the apex drug regulatory body, the Central Drugs Standard Control Organisation (CDSCO), withdrew the power of state licensing authorities (SLAs) to issue clearances to export-only drug manufacturing units. SLAs used to issue no-objection certificates (NOCs).

Banned drugs are those that are banned for sale in India but allowed in the importing country. Unapproved drugs are those that are not yet approved in India but are approved elsewhere.

SLAs are ill-equipped to monitor manufacturing standards: their testing labs are under-equipped, drug inspectors are in short supply, understanding of specific rules is poor, surveillance is patchy, and they lack the legal expertise to take action against violators.

Schedule M guidelines are aligned with WHO-GMP (good manufacturing practices) standards. Last year, only about 2,000 of the 10,500 manufacturing units were compliant with WHO-GMP standards.

India is a key player in the generics market. With about $250 billion worth of drugs going off patent in the coming decade, India must strengthen its manufacturing practices to avail itself of this opportunity.

No Work on LLMs: LeCun

At Viva Tech, the annual technology conference for startups held in Paris, Yann LeCun, chief AI scientist at Meta, advised young students interested in AI systems not to work on LLMs.

Youngsters should aim to build next-generation AI systems. Hence, they should not work on LLMs, which are controlled by Big Tech; instead, they should build AI systems that overcome the limitations of LLMs.

Mufeed, the young creator of Devika (a Devin alternative), spoke about moving away from the Transformer architecture and developing new architectures, e.g. RWKV (an RNN architecture), which offers an expanded context window and faster inference. Such an approach could lead to building something as impressive as GPT-4.

LeCun also recommends open source. Ultimately, all our interactions with the digital world will be through AI assistants, and there should be a large number of them.

Though LeCun is not in favour of Transformer models, they too are evolving. To illustrate, GPT-4o understands video and audio natively.

How much smarter can AI get? Much, much smarter. Sam Altman says data will not be a problem anymore, addressing a key concern in training LLMs.

Pichai’s Advice to Software Engineers

At Google's recent I/O conference (May 2024), Sundar Pichai, CEO of Google, is interviewed by Varun Mayya, a content creator. They talk about AI. Pichai tells Mayya that there is an entire industry in India helping young Indians crack FAANG interviews. He complains that many smart students do not focus on the fundamentals. Mayya asks Pichai for his advice to budding software engineers. Many of them have a 'competitive mindset' and have to prepare themselves for the future; they should come out of that mindset. According to Pichai, real success comes when things are understood deeply. There is a difference between knowing something and understanding it. Once a technology is fully understood, one can make transitions and get things done.

They also talk about AI in India, wrapper startups and the creative adoption of AI. The conversation is pure gold.

Windows-Mac Rivalry

There is a long-running rivalry between Windows PCs and Apple's Macs. It will grow further on account of AI-powered chips that deliver more efficient performance. Microsoft is looking forward to bringing real competition back to the Windows-versus-Mac contest.

Microsoft's AI-powered PCs are called Copilot+ PCs. These machines are claimed to be more powerful and 58 per cent faster than Apple's MacBook Air M3. The new hardware will start at $1,000 and will be ready for shipping from June 18, 2024.

Apple is trying to catch up. The company is placing high-end chips in cloud-computing servers created to process the most advanced AI tasks coming to Apple devices. Other AI-related features will be processed directly on iPhones, iPads and Macs.

Microsoft is leveraging its tie-up with OpenAI to pioneer an early entry into generative AI. Government officials are likely to examine whether such tie-ups impinge on competition or whether they should be regulated as mergers. According to Microsoft, the tie-up is stimulating competition rather than suppressing it.

Rich Embeddings of Transformer Models

Transformer-based models, or transformers, work with numbers and linear algebra rather than processing natural language directly. They therefore convert textual inputs into numerical representations, enriching them through the self-attention mechanism. This conversion is called embedding or encoding, and the resulting numerical representations of the input text are called transformer embeddings.

Numerical representations produced with word2vec suffer from one major drawback: a lack of contextual information. These are static embeddings, and they pre-date transformers. Transformers overcome this issue by producing their own context-aware embeddings: fixed word embeddings are augmented with positional information (the order in which words occur in the input) and contextual information (how the words are used).
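
To see the drawback concretely, here is a toy sketch in Python (the words and vectors are hypothetical): a static table assigns 'bank' one fixed vector, whether the input is 'river bank' or 'bank account'.

```python
# A toy sketch of static embeddings (hypothetical 2-d vectors).
# The word 'bank' gets one fixed vector regardless of context,
# which is exactly the limitation transformers address.
static_embeddings = {
    "river":   [0.1, 0.9],
    "bank":    [0.5, 0.5],
    "account": [0.8, 0.2],
}

print(static_embeddings["bank"])  # same vector in both contexts
```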

Two mechanisms do this: positional encoding and self-attention blocks. The result is a more powerful vector representation of words.

Transformers store the initial vector representation of each token in the weights of a linear layer. In transformers, these are learned embeddings. Though in practice they are similar to static embeddings, the different nomenclature underlines the fact that these representations are just a starting point, not the end product.

The linear layer contains only weights and no biases; the bias is zero for every neuron.

The layer weights form a matrix of size V × d_model, where V is the vocabulary size (the number of unique words in the training data) and d_model is the number of embedding dimensions.

The original transformer model was proposed with a d_model of 512 dimensions. In practice, any reasonable value can be used.
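
A minimal sketch of this layer, assuming a hypothetical vocabulary of 10,000 tokens and the original d_model of 512; the 'layer' is nothing more than a matrix whose rows are looked up by token ID:

```python
import numpy as np

V, d_model = 10_000, 512   # hypothetical vocabulary size; original d_model

# The 'linear layer' is just a V x d_model weight matrix with no bias;
# row i holds the learned embedding for token ID i. Here it is randomly
# initialized, as it would be before training.
embedding_weights = np.random.randn(V, d_model).astype(np.float32)

def embed(token_ids):
    """Look up the learned embedding of each token ID (a row selection)."""
    return embedding_weights[token_ids]

print(embed([349, 12, 7]).shape)   # (3, 512)
```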

It is the training that distinguishes static and learned embeddings. Static embeddings are trained using the Skip-Gram or Continuous Bag of Words architectures. Learned embeddings are an integral part of the transformer and are trained along with it using backpropagation.

Training Process for Learned Embeddings

The embedding layer (with weights for each neuron and zero bias) stores the learned embeddings.

The weights form a matrix of size V × d_model. The embedding for each word is stored along the rows: the first word in the first row, the second in the second row, and so on.

During training, the aim for each input word is to predict the next word in the sequence.

This is called Next Token Prediction (NTP). Initially the predictions are poor, but they improve as the loss function is minimized over many iterations. The learned embeddings then become a strong vector representation of each word in the vocabulary.
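
Here is a minimal sketch of the NTP objective, assuming a hypothetical model output probs that assigns a probability distribution over the vocabulary at each position:

```python
import numpy as np

def ntp_loss(token_ids, probs):
    """Cross-entropy for next-token prediction: the target at
    position t is the token at position t + 1."""
    targets = token_ids[1:]            # the 'next' tokens
    preds = probs[:-1]                 # the model's predictions for them
    picked = preds[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked + 1e-9))

# Stand-in for a model's output: random distributions over a toy vocabulary.
vocab_size = 50
seq = np.array([3, 7, 7, 1])
probs = np.random.dirichlet(np.ones(vocab_size), size=len(seq))
print(ntp_loss(seq, probs))
# Training repeats: predict, compute this loss, backpropagate, and update
# the V x d_model embedding weights (along with all other parameters).
```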

New input sequences are tokenized, and each token has an associated token ID corresponding to its position in the tokenizer's vocabulary. For instance, the word 'cat' might have the token ID 349.

Token IDs are converted into one-hot encoded vectors, which extract the correct learned embeddings from the weights matrix.
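
A small sketch (with toy sizes) of why multiplying by a one-hot vector extracts exactly the right row of the weights matrix:

```python
import numpy as np

V, d_model = 6, 4                       # toy sizes for illustration
W = np.arange(V * d_model, dtype=float).reshape(V, d_model)

token_id = 3
one_hot = np.eye(V)[token_id]           # [0., 0., 0., 1., 0., 0.]

# Multiplying the one-hot vector by the weights matrix selects row 3,
# i.e. the learned embedding stored for token ID 3.
assert np.allclose(one_hot @ W, W[token_id])
print(one_hot @ W)                      # [12. 13. 14. 15.]
```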

Once the learned embeddings have been trained, the weights in the embedding layer do not change.

To preserve the word order, positional encoding vectors are generated and added to the learned embeddings of each word.

The last step is to add contextual information using self-attention.

Vaswani et al.'s original transformer paper proposed the following positional encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each dimension of the positional encoding corresponds to a sinusoid.

The positional encodings shown above are deterministic and fixed. It is also possible to use learned positional encodings by randomly initializing them and training them with backpropagation.
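
A minimal sketch of the sinusoidal encoding above, assuming an even d_model:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Build the seq_len x d_model matrix of sinusoidal encodings."""
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

# Added element-wise to the learned embeddings of a 10-token input:
# x = embed(token_ids) + positional_encoding(10, 512)
print(positional_encoding(10, 512).shape)        # (10, 512)
```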

The self-attention mechanism modifies the vector representations of words to capture the context of their usage in the input sequence.

The 'self' in self-attention means that it uses the surrounding words within a single sequence to provide context.

All words are therefore processed in parallel, which enhances performance.

Another type of attention is cross-attention. While self-attention operates within a single sequence, cross-attention compares each word in the output sequence with each word in the input sequence.

In self-attention, the similarity between words is calculated using the dot product. The similarity scores are then scaled. Attention weights are calculated using the softmax function. Lastly, the transformer embedding is calculated as a weighted sum of the values.
A simple weighted sum has no trainable parameters, so these are introduced: the self-attention input is used three times to calculate the new embeddings. Three weight matrices, when pre-multiplied by their respective inputs, form the query, key and value matrices (Q, K and V). A query is what you are looking for when searching a database. Keys are the attributes or columns being searched against. Values correspond to the actual data in the database.
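
Putting the steps together, here is a minimal single-head self-attention sketch; the projection matrices Wq, Wk and Wv are hypothetical stand-ins, randomly initialized here but learned during training in practice:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one input sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot-product similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                             # toy sizes
X = rng.normal(size=(seq_len, d_model))             # embeddings + positions
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)          # (5, 8)
```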

Self-attention is expanded to Multi-Head Attention in the original paper. We have covered it in a separate blog.

Safe Superintelligence: SSI

OpenAI cofounder Ilya Sutskever, after quitting the company, announced a new AI startup, Safe Superintelligence, on June 19, 2024. Its focus is safe superintelligence: one goal and one product. The company is abbreviated as SSI.

Sutskever was part of OpenAI's Superalignment team along with Jan Leike, who also left the company in May 2024 to join Anthropic. Though the team was tasked with steering and controlling AI systems, it was dissolved shortly after the two departed.

The new startup wants to concentrate on safety and to keep itself immune from short-term commercial pressures.

His other associates in the new company are Daniel Gross (formerly Apple AI) and Daniel Levy (formerly OpenAI).

Fake Science Studies

Scientific research is documented through research papers published in peer-reviewed journals. A large number of papers, somewhere between 2 million and 6 million, are published every year. It is estimated that at least 2 per cent of papers are fake, and that number adds up to a lot.

These fake papers are churned out by so-called paper mills. The papers have either all or part of their data fudged. Paper mills operating in a field approach scientists, offering, for a price, to write papers with credit given to the scientists.

Paper mills have proliferated because the quantity of research is rewarded rather than its quality. Paper-mill studies get cited in legitimate review papers if the authors writing the reviews are not careful and are chasing volume of work.

Funding agencies are impressed by the bolstered resumes of such scientists, and precious resources are routed to them rather than to genuine scientists.

Some papers are generated by AI; these days the mills use tools like ChatGPT.

There is no rigorous evaluation system for these papers. The evaluation is done by recruitment committees or grant committees, which do not have the wherewithal to make a proper assessment. Scientists are rewarded on the basis of the number of papers they write and the number of publications that cite them.

Even legitimate research papers often do not advance the state of knowledge. Researchers add some additional data to an ongoing project and convert it into a new paper. The majority of these papers make no contribution and are not worth reading.

The fake papers use a template; only the data and words are filled in. Paper mills fabricate papers in fields that tend to be formulaic (nanotechnology, computer science and mRNA research).

Many a time, fake papers are retracted. However, their impact persists, as they continue to be cited and mentioned in review papers.

Funding should be made conditional on rigorous peer review.