Commoditization of AI

In the IT field, it does not matter which brand of PC, laptop or smartphone one uses. All these devices can be swapped easily. Even databases and cloud systems are not unique. Could AI go the same way?

Commoditization occurs when products or services become interchangeable, because the offerings are not distinctive enough.

AI tools and platforms are extensively used. No single firm can derive a significant advantage, as AI is available to all. Is AI in danger of being commoditized?

AI will cease to be a buzzword or a novelty. It will be taken for granted.

Some corporations will leverage AI faster and better than their competitors. However, this advantage is going to be short-lived. The same tools and techniques will soon be used by others.

Business edge will result from an innovative culture and forward-looking thinking. Anyone can use a smartphone to make movies, but one cannot become a Spielberg just by using a smartphone.

MLOps and LLMOps

MLOps (Machine Learning Operations) and LLMOps (Large Language Model Operations) are concepts concerned with managing the life cycles of ML models. Their areas of focus differ.

MLOps is a broader term that covers the operational processes for all types of ML models: efficient development, deployment and monitoring of these models.

LLMOps is specifically designed for LLMs. These models are used for NLP, and LLMOps addresses the unique challenges associated with the life cycle of these complex models.

Both these concepts have some common goals: efficiency, reliability and fairness. They also have distinct considerations. MLOps relies on metrics such as accuracy, precision and recall. LLMOps uses more nuanced metrics such as BLEU and ROUGE to assess language fluency and coherence. LLMOps, in addition, puts a premium on interpretability, fairness and bias mitigation.
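
The contrast can be made concrete with a small, hedged sketch: classical classification metrics computed with scikit-learn versus a ROUGE score computed with the rouge-score package (both assumed to be installed; the labels and sentences are invented for illustration).

```python
# Classical ML metrics (typical in MLOps monitoring) vs. a text-generation metric
# (typical in LLMOps evaluation). Data below is invented for illustration.
from sklearn.metrics import accuracy_score, precision_score, recall_score
from rouge_score import rouge_scorer

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score("the cat sat on the mat",          # reference text
                   "a cat was sitting on the mat"))   # model output
```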

MLOps is adaptable across various ML domains. LLMOps is specialized.

Frameworks for LLMs

To interact with LLMs and make them more accessible for various applications, there are frameworks such as LangChain, LlamaIndex and frameworks for LLM serving.

LangChain provides a standardized interface for interacting with multiple LLMs. It offers tools for building apps with LLMs.
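
As a hedged sketch of what such a standardized interface looks like, here is a minimal LangChain chain (prompt, model, output parser). It assumes a recent LangChain release with the langchain-openai package; the model name is only a placeholder, and the API surface changes between versions.

```python
# Minimal LangChain sketch: prompt -> LLM -> plain-string output.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini")   # placeholder model; any supported LLM could be swapped in
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")

chain = prompt | llm | StrOutputParser()   # chaining components with the pipe operator
print(chain.invoke({"text": "LLMOps manages the life cycle of large language models."}))
```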

LlamaIndex helps to organize and curate data sources for LLMs.

LLM serving frameworks are designed to optimize the process of deploying LLMs in production environments. They handle tasks such as model loading, inference and routing requests.
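
To show what such frameworks automate, here is a hedged, minimal sketch of serving a model behind an HTTP endpoint with FastAPI and the Hugging Face transformers pipeline (the model name "gpt2" is only a stand-in; a real serving framework adds batching, routing and scaling on top of this).

```python
# A toy LLM-serving endpoint: load the model once, then serve generation requests.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")   # model loading happens once at startup

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(req: GenerateRequest):
    # Inference: run the loaded model on the incoming prompt.
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

# Run with, for example: uvicorn server:app --port 8000
```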

Essentially, LLM frameworks are toolkits that help developers to interact with and leverage the LLMs more effectively.

These frameworks offer standardized interfaces to different LLMs (irrespective of their architecture or API). They support prompt engineering to get the desired output from an LLM, with tools and libraries that help developers design and optimize prompts. They also provide performance optimization, with tools to speed up LLM inference.

Some frameworks enable chaining multiple LLMs together to create more complex workflows and apps. Frameworks also integrate with other development tools and libraries.

Apart from LangChain and LlamaIndex, we have OpenLLM and Ray Serve frameworks.

Ries Passed Away

A well-known name in the marketing field for his work on positioning, Al Ries passed away on October 7, 2022 at his home in Atlanta at the age of 95. Al Ries and his colleague Jack Trout (at Trout and Ries, Manhattan) proposed to their clients that creative advertising was not enough to persuade consumers to buy. They advocated positioning: find a slot in the mind and be the first to occupy that slot. IBM owned the slot for computers. Volvo owned safety. FedEx owned overnight delivery. Burger King’s burgers were flame-broiled, not fried.

In 2005, Ad Age ranked the most important marketing ideas of the last 75 years. Positioning stood at No. 56. In 2009, Ad Age conducted a survey on the best books on marketing. The No. 1 ranked book was Ries and Trout’s book ‘Positioning: The Battle for Your Mind’ (1981).

Mr. Ries was inducted into the American Marketing Association’s Marketing Hall of Fame in 2016. Positioning as a concept is a milestone in the evolution of modern marketing. It has influenced a whole generation of marketers.

Ries had started an advertising firm in 1963. Trout joined in 1967. Together they developed positioning as a concept. In essence, it was a position in the mind.

Trout passed away in 2017 at the age of 82.

In 1979, the firm was renamed Trout and Ries. They converted it to a consulting firm in 1989. They moved to Greenwich, Conn.

Ries is survived by his wife, Mary Lou Ries; a daughter, Laura; two other daughters, Dorothy and Barbara; and a son, Charles.

Ries and Trout separated in 1994. Both set up separate firms.

Laura added emphasis on visual imagery while extending her father’s concept of positioning. Laura referred to her father’s contention that a brand should own a word in the mind. She says mere words are not enough; a visual is much more powerful.

Patenting of Medicines

In healthcare, about 50 per cent of the cost is that of the medicines used. Some medicines are priced exorbitantly high on account of patenting. These days, generic medicines from certain pharma companies compete with patented medicines, bringing the prices of patented medicines down to some extent. Generics make medicines affordable.

In early 2024, Indian patent rules were modified: raising objections to patents at the pre-grant stage has become more difficult. That has made patenting easier, and that has increased the prices of drugs.

There are provisions in the Indian Patents Act to oppose the grant of a patent. A successful opposition results in generic companies being allowed to produce the same drug, which brings in competition. A patent can be opposed even after it has been granted.

In the early 1970s, the Patents Act was changed, making drugs affordable. India granted process patents, not product patents. Any Indian company could thus produce a patented drug by using an alternative process. This made generics available, which were also exported. India emerged as a leading exporter of generics by the late 1980s, and a leading producer of generics by the 1990s.

In 1995, the TRIPS Agreement (on trade-related intellectual property rights) came into effect. It reintroduced product patents for inventions that are novel and inventive. In 2005, the Patents Act was amended in the light of TRIPS. Most of the drugs patented in the US and Europe, however, were only new forms of existing drugs (me-too drugs), with no significant increase in therapeutic benefit. India risked slipping back towards a regime of expensively priced patented drugs.

Political parties and civil society here introduced an amendment to the Patents Act, Section 3(d). It was to ensure that an old drug in a new form would not be patented unless its therapeutic efficacy is significantly better.

The Indian Patents Act was amended to allow opposition to a patent at the pre-grant and post-grant stages. Patents could also be revoked. Rules were framed to accommodate this. The government can issue compulsory licences to other companies without the consent of the patent holder. The flexibilities of TRIPS were thus leveraged.

The rules are now sought to be altered because of the pressure of big pharma. There is a demand for the repeal of Section 3(d). This demand is being raised while negotiating FTAs with the US, UK and EU.

PGOs (pre-grant oppositions) come from civil society and patients’ groups. Generic companies are reluctant to file PGOs. Earlier, a PGO was replied to by the applicant, the opponent filed a rejoinder, and the patent controller then took a decision. The modified rules allow the patent controller to decide whether a PGO is maintainable at all; it can be dismissed outright. It is an arbitrary power. In the past, oppositions scuttled many frivolous patents. The amendments open the door to undeserving patents. The opponent now also has to pay fees, which is a financial burden.

Earlier, the patent holder had to report to the patent controller every year how the patent was being worked. This now has to be done only once every three years. The working of a patent is one of the bases for seeking a compulsory licence. With less frequent reporting, the public is in the dark, and that makes compulsory licensing difficult.

Microsoft’s Deal with G42

Microsoft’s CEO Satya Nadella has taken an important step by signing a deal with G42, an Abu Dhabi-based AI company. The deal is on the lines of the deal with OpenAI. Nadella is known for his negotiating and deal-making skills. His deal with OpenAI has left the likes of Facebook, Google, Apple and even Tesla’s Musk amazed. As far as AI is concerned, Microsoft happens to be in the driving seat. Nadella is not a man who prefers to rest on his laurels. He signed a deal with Inflection AI and hired Mustafa Suleyman (co-founder of DeepMind and CEO of Inflection AI) to head Microsoft’s AI division. He thus gets some of the best technical brains in the world to work for Microsoft.

Nadella has added a new feather to his cap by signing a $1.5 billion deal with G42, the Abu Dhabi-based AI company. This company has had Chinese connections, including with the much-maligned Huawei. You can imagine how difficult it must have been to broach this subject with the US authorities. Despite this, he got the deal and, in the bargain, persuaded G42 to sever its Chinese connections. There is co-ordination between the governments of both the UAE and the USA.

G42 is furthering its mission to deliver cutting-edge AI technologies at scale, leveraging the strategic investment of Microsoft.

Backpropagation

A key algorithm in the training of artificial neural networks is backpropagation. It calculates the gradient of the loss function with respect to the weights of the network.

The gradient is used to update the weights. That enables the model to learn.

Backpropagation takes place during the training phase, after the forward pass in which the input data is propagated through the network to make predictions. Backpropagation constitutes the backward pass, where gradients are calculated and weights are updated.

The network’s output, or prediction, is compared with the actual correct output. Iteratively, the weights are adjusted. The network learns to map the input to the desired correct output. The network learns from its mistakes.

Say an input image is identified (prediction). This is compared to the actual answer. The difference between the prediction and the answer is then propagated backward through the network. While thus travelling backward, the weights between neurons are adjusted to minimize the error for future predictions.

Here calculus is leveraged. The chain rule is utilized. It determines how much each weight contributes to the overall error. These gradients are calculated. This way the algorithm identifies how the weights are to be adjusted so as to minimize the error and improve the network’s performance.

It is complex math, but the idea is to make the network learn iteratively, refining its internal connections based on the errors it makes.

The weights are adjusted using gradient descent (a common optimization algorithm). The specific calculations use the chain rule of calculus repeatedly to differentiate the error function through the layers of the network.

Gradients provide the direction and magnitude of how much the function changes in response to changes in its inputs. In this context, the inputs are the weights connecting the neurons.
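
A minimal numpy sketch may help tie these ideas together: a tiny two-layer network, a forward pass, a backward pass that applies the chain rule layer by layer, and a gradient-descent update. The shapes, data and learning rate are invented for illustration.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(4, 3)            # 4 samples, 3 input features (toy data)
y = np.array([[0.], [1.], [1.], [0.]])

W1 = np.random.randn(3, 5) * 0.1     # input -> hidden weights
W2 = np.random.randn(5, 1) * 0.1     # hidden -> output weights
lr = 0.1                             # learning rate

for step in range(1000):
    # Forward pass
    h = np.tanh(X @ W1)              # hidden activations
    y_hat = h @ W2                   # predictions
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass (backpropagation): chain rule, layer by layer
    d_yhat = 2 * (y_hat - y) / y.size        # dLoss/dy_hat
    dW2 = h.T @ d_yhat                       # dLoss/dW2
    d_h = d_yhat @ W2.T                      # gradient flowing back into the hidden layer
    dW1 = X.T @ (d_h * (1 - h ** 2))         # tanh'(z) = 1 - tanh(z)^2

    # Gradient descent update
    W1 -= lr * dW1
    W2 -= lr * dW2
```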

Issue of Vanishing Gradient

In neural network training, we come across the problem of the vanishing gradient. The gradients become extremely small during backpropagation. It hinders the training, especially when there are many layers, since the weights of the earlier layers are not effectively updated.

Consequently, these layers are very slow to learn, or do not learn at all. It results in suboptimal performance, or a failure to converge.

Some techniques are used to overcome this issue: careful weight initialization, batch normalization, and skip connections.

The issue commonly occurs in RNNs at the time of training. Here the gradients used to update the weights of the network become very small as they are backpropagated through the network layers. This is particularly so when saturating activation functions such as the sigmoid are used, since they squash values into a narrow range.

ReLU and leaky ReLU are less prone to vanishing gradients. Weights should be initialized so that gradients flow easily through the network; one can use techniques such as Xavier initialization or He initialization. Gradient clipping can limit the magnitude of the gradients (mainly to prevent them from becoming too large).

As ReLU has a gradient of either 1 or 0, the vanishing effect is mitigated.
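
A hedged PyTorch sketch of these mitigations: ReLU activations, He (Kaiming) initialization and gradient clipping. The layer sizes and data are arbitrary; it only illustrates where each technique plugs in.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

# He initialization suits ReLU; Xavier (nn.init.xavier_uniform_) would suit tanh/sigmoid.
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")

x, y = torch.randn(32, 64), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the overall gradient norm (mainly a guard against exploding gradients).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```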

On this blog, this is the 2,800th write-up.

Hidden Layers in a Transformer

The hidden layers refer to the layers between the input and output layers in a transformer. These hidden layers are where most of the computation and transformations occur.

In the transformer architecture, the hidden layers consist of self-attention mechanisms followed by feedforward neural networks.

First, there is the self-attention mechanism. It contains multiple self-attention heads. It weighs the significance of different input tokens when processing each token, enabling the model to capture relationships between tokens in the input sequence.

After the self-attention mechanism, there is a feedforward network layer. It consists of two linear transformations with a non-linear activation function (usually a ReLU) in between. The model thus learns more complex patterns and relationships.

In addition, both the self-attention and feedforward layers are augmented with residual connections and layer normalization. Residual connections facilitate the flow of gradients during training. Layer normalization stabilizes the training process.

Hidden layers transform an input sequence into a representation that captures its semantic and contextual information.
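
The following is a minimal PyTorch sketch of one such hidden layer (encoder-style): multi-head self-attention followed by a feedforward network, each wrapped with a residual connection and layer normalization. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention over the whole sequence
        x = self.norm1(x + attn_out)       # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))     # feedforward + residual + layer normalization
        return x

out = TransformerBlock()(torch.randn(2, 10, 64))   # (batch, sequence length, embedding dim)
```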

Parallel Computations in Transformer Model

Parallel computations occur in a transformer model in the self-attention mechanism and in the feedforward neural network layers.

The self-attention mechanism involves computing attention scores between all pairs of tokens in the input sequence. This is done through matrix multiplication operations.

Within the self-attention mechanism, scaled dot-product attention is computed to obtain the attention scores. These computations are parallelized across all tokens in the sequence.

In the feedforward layers, there are linear transformations (matrix multiplications) and an element-wise non-linear activation function (ReLU). These are parallelized across all tokens in the sequence.

During training, whole sequences flow through the multiple layers of the transformer in a single pass, with the computation inside each layer carried out in parallel. This makes the process efficient and speeds it up.

Thus, parallelization facilitates efficient processing of input sequences and faster training.

Illustration

We have four tokens in the input sequence: [Token1, Token2, Token3, Token4]

We compute three matrices, Query, Key and Value, from the input embeddings. Each matrix has dimensions (sequence length) x (embedding dimension).

Let us assume an embedding dimension of 3.

Input Embeddings:

Token 1: [1, 2, 3]

Token 2: [4, 5, 6]

Token 3: [7, 8, 9]

Token 4: [10, 11, 12]

For this simplified illustration, take the Query, Key and Value matrices to be equal to the input embeddings (in practice, each is obtained from the embeddings through its own learned projection). Each is then a 4 x 3 matrix, with one row per token.

Query Matrix (Q):

[ 1, 2, 3 ]

[ 4, 5, 6 ]

[ 7, 8, 9 ]

[ 10, 11, 12 ]

Key Matrix (K):

[ 1, 2, 3 ]

[ 4, 5, 6 ]

[ 7, 8, 9 ]

[ 10, 11, 12 ]

Value Matrix (V):

[ 1, 2, 3 ]

[ 4, 5, 6 ]

[ 7, 8, 9 ]

[ 10, 11, 12 ]

Calculate the attention scores by taking the dot product of the Query matrix with the transpose of the Key matrix and scaling the result by the square root of the embedding dimension. Apply a softmax to these scaled scores to obtain the attention weights.

Finally, compute the weighted sum of the Value matrix based on these attention weights.
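
A numpy sketch of this illustration, with Q = K = V taken as the input embeddings as above, shows that every step is a whole-matrix operation covering all four tokens at once.

```python
import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]], dtype=float)   # 4 tokens x embedding dimension 3
Q, K, V = X, X, X                           # simplified: no learned projections

scores = Q @ K.T / np.sqrt(Q.shape[-1])     # all pairwise attention scores in one matrix multiply
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
output = weights @ V                        # weighted sum of values for all tokens at once
```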

Parallelization

At each step of the self-attention computation, the operations are performed in parallel across all tokens in the sequence. While computing the attention scores for Token 1, the scores for the other tokens, viz. Token 2, Token 3 and Token 4, are calculated simultaneously.

This parallelization brings in efficiency and makes the models scalable.

Transformer models are more amenable to parallel computation as compared to RNNs (where information processing happens one step at a time).

The feedforward layers (in both the encoder and the decoder) are also amenable to parallelization. In training, computations across different sequences within a batch can be parallelized as well.

It should be noted that not all aspects of transformers are parallelizable (e.g. autoregressive decoding has to generate output tokens one at a time, which is a sequential dependency).