Blog

  • Old AI Models and New LLMs

    ChatGPT appeared in November 2022, and almost a year has elapsed since then. Until then, machine learning was used for specific tasks, say fraud protection or loan approval. Then came the LLMs, and the old approach took a back seat. LLMs are generalized models capable of performing many tasks. Still, the task-based models have not gone away. Amazon’s CTO calls them ‘good old-fashioned AI’, but they still solve many of the problems at hand.

    Before the advent of LLMs, we had a task-specific world. These days, enterprises plug in LLMs via APIs. LLMs will become better and more robust, and will develop powers of reasoning and other emergent abilities.

    Still, there is a role for task models, since they are smaller, faster, cheaper, and well suited to a particular task.

    However, it makes little sense to train many different task-specific models when all-purpose models are available to the enterprise.

    Machine learning platforms are still the preserve of data scientists, not of developers. It is not necessary to give up this large number of machine learning models. The appearance of a new technology still leaves the old one relevant for some time.

    Before LLMs, task models were the flavour of the day, and enterprises used the services of data scientists to build them. Data scientists will continue to focus critically on data and will help people understand the relationship between AI and data within their organisations.

    Both AI and data have pros and cons, and both are relevant whether an LLM or a task model is being developed. The two will co-exist for some time to come, since bigger does not always mean better.

  • Backpropagation in LLMs

    Weights play a major role in determining an LLM’s behaviour and performance. They represent the strength of the connections between neurons and are adjusted during training to improve the model’s performance.

    Each connection between two neurons has a specific weight associated with it, representing the importance or strength of the connection. A large positive weight means the output of the first neuron has a big effect on the activation of the second neuron. A negative weight, on the other hand, indicates that the output of the first neuron will decrease the activation of the second neuron.

    In essence, backpropagation is the algorithm used to train neural networks. It is a supervised learning method. It involves calculating the gradient of the loss function with respect to the parameters of the model; the parameters are then updated in the direction opposite to the gradient. The technique became established in the 1980s.

    First, the error in the model’s output is calculated. This error is then propagated backwards through the network, layer by layer, which tells us how much each layer contributed to the overall error.

    This information is used to update the weights in each layer of the network.

    The formula for this update is:

    weight_new = weight_old - learning_rate * delta_weight

    where

    • weight_new is the updated weight
    • weight_old is the current weight
    • learning_rate is a hyperparameter that controls the size of the update.
    • delta_weight is the rate of change of the error with respect to the weight.
    • delta_weight is calculated using the chain rule of calculus. It takes into account the contributions of the weight to all subsequent layers of the network.
    • Adjusting the weights improves the learning and performance of the model. Weights are updated iteratively based on the training data. The LLM gradually learns to associate patterns and relationships in the data, and makes better predictions and generalizations on unseen data.
    • Weight initialization and optimization play a vital role in the effectiveness of backpropagation. One has to choose appropriate initial values for the weights, and hyperparameters such as the learning rate must be carefully chosen. These affect training significantly, as well as the overall performance of the LLM. A minimal sketch of this update rule is given below.
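
    To make the update rule concrete, here is a minimal sketch in Python, assuming a tiny two-layer network with a mean-squared-error loss. Names such as w1, w2 and learning_rate are illustrative, not taken from any particular library.

    ```python
    # Minimal backpropagation sketch: forward pass, error, backward pass
    # (chain rule), then the update weight_new = weight_old - learning_rate * delta_weight.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3))          # 4 training examples, 3 features
    y = rng.normal(size=(4, 1))          # target values

    w1 = rng.normal(size=(3, 5)) * 0.1   # weights: input -> hidden
    w2 = rng.normal(size=(5, 1)) * 0.1   # weights: hidden -> output
    learning_rate = 0.01

    for step in range(100):
        # Forward pass
        h = np.tanh(x @ w1)              # hidden activations
        y_hat = h @ w2                   # model output
        error = y_hat - y

        # Backward pass: propagate the error layer by layer (chain rule)
        grad_out = 2 * error / len(x)                 # dLoss/dOutput for MSE
        delta_w2 = h.T @ grad_out                     # gradient for w2
        grad_h = grad_out @ w2.T * (1 - h ** 2)       # through the tanh derivative
        delta_w1 = x.T @ grad_h                       # gradient for w1

        # Update: move each weight opposite to its gradient
        w2 -= learning_rate * delta_w2
        w1 -= learning_rate * delta_w1
    ```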
  • Positional Embeddings

    Positional embedding is a technique in NLP for encoding the position of each token in a sequence. Such embeddings are necessary in tasks like machine translation and text summarization, since the model has to understand the order of the tokens.

    There are two main types of positional embeddings: absolute and relative. Absolute positional embeddings encode the absolute position of each element in the sequence, typically using a simple counter. Relative positional embeddings encode the position of each element relative to the other elements in the sequence, typically using a matrix that stores the distance between each pair of elements.

    Absolute positioning is commonly used for NLP tasks, while relative positioning is often used for tasks involving computer vision.

    Positional embeddings also facilitate the model’s ability to learn long-range dependencies.

    Such embeddings are required for transformers because the self-attention mechanism captures the content of each word but carries no information about its position. Without this information, the model struggles to understand the order of the words and their relationships within a sentence.

    One option is to use sine and cosine functions to encode token positions; these are called sinusoidal positional embeddings, and they capture long-range dependencies in sequences. Alternatively, learned positional embeddings are learned by the model from the data during training and are effective for some tasks. Though often more effective than sine and cosine encodings, they require more training data.

    Positional embeddings are added to the token embeddings before being fed into the transformer. In the learned case, they are vectors that are added to the word embeddings. This lets the model learn relationships between tokens in the sequence, both in meaning and in position.

    The embedded tokens are passed through a series of self-attention and feedforward layers.
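
    As a rough sketch, assuming PyTorch, this is how learned positional embeddings can be added to token embeddings before the self-attention and feedforward layers; the vocabulary size, sequence length and model dimensions here are arbitrary illustrative choices.

    ```python
    import torch
    import torch.nn as nn

    vocab_size, max_len, d_model = 1000, 128, 64

    token_emb = nn.Embedding(vocab_size, d_model)   # word embeddings
    pos_emb = nn.Embedding(max_len, d_model)        # learned positional embeddings
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    token_ids = torch.randint(0, vocab_size, (1, 10))        # a batch with 10 tokens
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)

    x = token_emb(token_ids) + pos_emb(positions)   # add positional information
    out = encoder(x)                                # self-attention + feedforward layers
    print(out.shape)                                # torch.Size([1, 10, 64])
    ```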

    In sinusoidal positional embeddings, each position in the sequence is represented as a vector of sine and cosine values with different frequencies. The frequencies are chosen to vary exponentially across the embedding dimensions. As a result, positions that are close together have similar embeddings, while positions far apart have embeddings that differ more.
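
    The following is a small sketch of sinusoidal positional embeddings along the lines popularized by the original transformer paper; the constant 10000 and the dimension d_model = 8 are illustrative choices.

    ```python
    import numpy as np

    def sinusoidal_positional_embeddings(seq_len, d_model):
        positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # even embedding dimensions
        freqs = 1.0 / (10000 ** (dims / d_model))      # exponentially spaced frequencies
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(positions * freqs)        # sine on even indices
        pe[:, 1::2] = np.cos(positions * freqs)        # cosine on odd indices
        return pe

    # Nearby positions get similar vectors; distant positions differ more.
    pe = sinusoidal_positional_embeddings(seq_len=4, d_model=8)
    print(np.round(pe[:2], 3))
    ```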

    Let us consider an illustration with the sentence “I love machine learning.” Each word gets a position and a (toy) positional embedding vector:

    • I: position 1, positional embedding (0.1, 0.2, 0.3, 0.4, 0.5)
    • love: position 2, positional embedding (0.2, 0.4, 0.6, 0.8, 1.0)
    • machine: position 3, positional embedding (0.4, 0.8, 1.2, 1.6, 2.0)
    • learning: position 4, positional embedding (0.8, 1.6, 2.4, 3.2, 4.0)

    The benefits of positional embeddings include improved model performance, for example in machine translation and language modelling, better generalization, and more interpretable results.

    However, positional embeddings add complexity to the model and increase training time, and some schemes are not effective for very long sequences. It is necessary to consider these limitations before using them in a model.

  • Emergent Abilities in LLMs

    The journey of artificial intelligence began in the 1950s. To begin with, we dealt with Artificial Narrow Intelligence (ANI), restricted to a specific skill. Later we reached the stage of generative AI, where large language models generate output using what they have learned from the datasets they were trained on. We are on our way to artificial general intelligence (AGI), where models will perform as well as human beings and at times even surpass them. When AI matches the intelligence of humans, we reach the singularity, and when AI surpasses human intelligence, it is called superintelligence.

    Against this background, LLMs have started showing some surprising, unpredictable behaviours, referred to as ‘emergent abilities’. Some of these pertain to basic math skills, some to computer coding, and others to decoding movie titles from emojis. It is interesting to learn about these emergent abilities and why and how they arise.

    By emergent abilities, I mean abilities for which the model has not been explicitly programmed. They emerge from the way the model processes and generates language, and arise from its ability to learn from the data.

    Examples of emergent abilities include answering questions using search engines while keeping the model aligned with the search results, summarizing text into smaller, concise pieces, and translating between quite different languages. LLMs also create poems, code, scripts, musical pieces and so on.

    It is a moot point to what extent LLMs truly show emergent abilities. Some say they are only pattern-matching models, and that their abilities cannot really be called emergent.

    It should be noted that emergent abilities in LLMs are not on par with intelligence. Intelligence involves the ability to apply knowledge and skills. Some behaviours of LLMs border on intelligence, but still lack the level of understanding or reasoning that humans have.

    We are not able to predict the emergent abilities of LLMs; a model could develop an ability that was unforeseen by its designers. This makes LLMs fascinating, and at the same time risky.

  • LS Digital

    LS Digital is India’s largest digital marketing services firm and positions itself as a digital transformation company. It offers services such as media, user interface (UI), user experience (UX), creative and communication, data and insights, consumer experience (CX), and tech and innovation.

    The company has offices in the Middle East, has set up an office in the UK, and intends to set up an office in the US too.

    Ad agencies need to evolve continuously. Disruptions such as AI call for fundamental change: agencies will have to alter their own business models in areas such as people, processes, services, delivery and pricing. Agencies that cannot transform will become redundant.

    Rather than being just an advertising company and looking at growth in terms of digital media spend, the focus has to be on spending on data analytics, AI and martech solutions. All solutions are now integrated.

  • StreamingLLM

    We have already studied this concept in a previous blog; here we discuss some additional points. The StreamingLLM framework enables LLMs to tackle inputs of effectively unlimited length. It improves context handling by using attention sinks, generally the initial tokens, which capture a large share of the attention even if they are not semantically relevant (say, punctuation marks). The KV values of these tokens are retained so as to maintain the performance of the model when the text length exceeds the cache size.

    When streamed this way, LLMs generalize to texts longer than their training sequence length, and it is not necessary to fine-tune the model for this.

    In this scheme, some tokens are evicted, meaning they are removed from the cache of previous key and value states, which otherwise consumes extensive memory. There are two approaches to eviction: token-by-token eviction and batched token eviction. A sketch of both appears below.
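
    As a toy sketch (not the actual StreamingLLM code), the two eviction strategies can be pictured on a cache represented simply as a list of token ids; cache_size and batch are illustrative parameters.

    ```python
    def evict_token_by_token(cache, cache_size):
        """Drop the single oldest entry whenever the cache overflows."""
        while len(cache) > cache_size:
            cache.pop(0)
        return cache

    def evict_batched(cache, cache_size, batch=4):
        """Drop a whole batch of the oldest entries once the cache overflows."""
        if len(cache) > cache_size:
            del cache[:batch]
        return cache

    print(evict_token_by_token(list(range(10)), cache_size=8))  # [2, 3, ..., 9]
    print(evict_batched(list(range(10)), cache_size=8))         # [4, 5, ..., 9]
    ```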

  • Influencer Marketing

    The influencer marketing industry in India is estimated at over Rs. 12 billion in 2022, and is expected to grow at a CAGR of 25 per cent over the next five years, touching Rs. 28 billion by 2026.

    There are more than 150 registered and unregistered influencer agencies in India. Larger agencies handle strategy development and use downstream agencies to execute the campaigns.

    Bigger agencies acquire smaller influencer marketing firms to stay relevant in a rapidly changing landscape. Marketers in India spend over 10 per cent of their digital marketing budgets on influencer marketing to tap its highly local reach.

    When a bigger agency acquires a small one, the focus should be on integration, especially integration of technology. Otherwise, the smaller agency may lose some of the attributes that make it dynamic and growth oriented.

  • The Early Days of AI

    Long before the term artificial intelligence, or AI, found its place in technological history, human beings conceived of machines with human-like attributes. This fascination with building autonomous machines can be traced far back in history. A workshop organized at Dartmouth, USA in 1956 is considered the pioneering event of modern AI research and development.

    Since then, AI has advanced steadily, with applications in finance, banking, insurance and healthcare. Large Language Models (LLMs) ultimately gave us a chatbot, ChatGPT, and made AI a prominent term.

  • StreamingLLM

    LLMs handle long text sequences, but while doing so they may reach their context limit. At times, it is necessary to extend the context of the model to longer sequences.

    Existing solutions are computationally intensive, memory intensive or not very precise. One breakthrough is StreamingLLM, developed by a team of researchers from Facebook, Carnegie Mellon and MIT. This technique extends the context to millions of tokens without consuming vast compute and memory resources, while preserving the performance of the model. StreamingLLM is very useful for processing long-sequence text.

    We have already studied context windows as a concept. A Llama-2 model has a context window of roughly 4,000 tokens, or about 3,000 words. As long as the interaction with the model remains within this context, performance is not affected, but the limit is finite and imposes restrictions on the model.

    One option is to extend the context window length. This approach modifies the architecture of the model and requires retraining, which is expensive, and many organizations cannot afford it. Context-window lengthening also runs into quadratic costs: double the window size and you quadruple the memory and compute costs.

    Alternatively, a sliding context window can be used. Here a model with a 4,000-token context window is fed 4,000 - x tokens, where x is the number of tokens it is expected to generate. This practice has drawbacks. Auto-regressive LLMs use ‘KV caching’ to improve efficiency: a mechanism that computes and stores the key and value states of previous tokens, eliminating the need to recompute them for each new token. Since the attention value of each token depends on its preceding tokens, shifting the context window means recomputing the whole KV cache, which reduces the model’s throughput.

    There is one more solution: move the window but keep the cached values for the tokens that overlap between the new and old context. This is a better method, but it has its own flaw. The quality of the model declines quickly once the context starts to deviate from the initial setting.

    The researchers instead focus on attention sinks. A substantial proportion of the attention score in autoregressive LLMs is allocated to the initial tokens, irrespective of their relevance to the language modeling task. Such initial tokens are called attention sinks.

    A model’s perplexity rises sharply when the text length exceeds the cache size, on account of the exclusion of the initial tokens. Perplexity measures the uncertainty in the model’s predictions; lower perplexity means higher precision. Thus attention sinks, however far they are from the tokens being predicted, play a vital role in maintaining the stability of LLMs. This is intuitive: language modeling is autoregressive, and the initial tokens are visible to almost all subsequent tokens, so they are readily trained to act as attention sinks and capture a disproportionate amount of attention.

    Remove the attention values of the first few tokens from the context, and the result is a deterioration in the model’s performance, since a significant amount of attention value is lost. The StreamingLLM technique therefore preserves these attention sinks.

    This allows the model to perform well without fine-tuning. Because the attention sinks are preserved, the attention-score distribution stays close to normal. When the interaction surpasses the model’s context length, StreamingLLM retains the KV cache for the attention-sink tokens (four initial tokens are enough). This extends the model’s context and stabilizes its performance, with no need to recompute the entire KV cache.

    Under the StreamingLLM framework, the KV cache consists of the attention sinks plus a rolling KV cache that retains the most recent tokens (vital for language modeling). It is a versatile design and can be incorporated into any autoregressive language model that employs relative positional encoding. A toy sketch of such a cache follows, and the token-by-token illustration after it shows the same eviction pattern.
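
    As a toy sketch (not the actual StreamingLLM implementation), the cache can be pictured as a fixed set of attention-sink tokens plus a rolling window of the most recent tokens; num_sinks and window are illustrative parameters, and plain token ids stand in for cached key/value states.

    ```python
    class SinkRollingCache:
        def __init__(self, num_sinks=4, window=3):
            self.num_sinks = num_sinks    # initial tokens kept permanently (attention sinks)
            self.window = window          # number of most recent tokens kept
            self.tokens = []              # stands in for cached key/value states

        def add(self, token):
            self.tokens.append(token)
            sinks = self.tokens[:self.num_sinks]
            rest = self.tokens[self.num_sinks:]
            # Evict the oldest non-sink tokens once the rolling window overflows.
            self.tokens = sinks + rest[-self.window:]

        def __repr__(self):
            return f"sinks={self.tokens[:self.num_sinks]} rolling={self.tokens[self.num_sinks:]}"
    ```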

    Generation of token 6

    0 1 2 3 (4 5) 6. Here 0 1 2 3 are the attention sinks, 4 5 is the rolling KV cache, and 6 is the token generated.

    Generation of token 7

    0 1 2 3 (4 5 6) 7. Here 0 1 2 3 are the attention sinks, 4 5 6 is the rolling KV cache, and 7 is the token generated.

    Generation of token 8

    0 1 2 3 4 (5 6 7) 8. Here 0 1 2 3 are the attention sinks, 4 has been evicted, 5 6 7 is the rolling cache, and 8 is the token generated.

    Generation of token 9

    0 1 2 3 4 5 (6 7 8) 9. Here 0 1 2 3 are the attention sinks, 4 and 5 have been evicted, 6 7 8 is the rolling cache, and 9 is the token generated.
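
    Feeding tokens 0 to 8 through the toy SinkRollingCache sketched above reproduces exactly this eviction pattern.

    ```python
    cache = SinkRollingCache(num_sinks=4, window=3)
    for t in range(9):        # tokens 0..8 enter the cache as they are processed
        cache.add(t)
    print(cache)              # sinks=[0, 1, 2, 3] rolling=[6, 7, 8], ready to generate token 9
    ```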

    The code for StreamingLLM is accessible on GitHub, and Hugging Face is closely monitoring its development.

  • In-Context Learning (ICL) in Transformer Neural Networks

    In-Context Learning (ICL) refers to the ability of attention-based neural networks such as transformers to predict the response to a query by learning from illustrative examples presented to them in context. To illustrate, we can give a transformer a few examples of English-to-Marathi translation; it learns from them and then translates on its own, even though it was not previously trained for this task. An MIT study suggests that GPT-3 can learn a new task from a few examples without any new training: smaller linear models inside the hidden layers effectively get trained to complete the new task using simple learning algorithms. In other words, the model infers from its inputs, without updating its weights, how to tackle problems not encountered during training. ICL is the ability to infer, from a short prompt of tokens drawn from an unseen task, relevant per-token and next-token predictions; the model performs well by remembering the exemplar-label mappings given in context. A hedged sketch of such a prompt follows.
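
    As a hedged sketch, an in-context-learning prompt can be assembled as follows; the task, the examples and the expected completion are purely illustrative, and no particular model or API is assumed.

    ```python
    examples = [
        ("The movie was wonderful", "positive"),
        ("The food was awful", "negative"),
        ("I really enjoyed the concert", "positive"),
    ]
    query = "The service was terrible"

    # Build a few-shot prompt: exemplar-label pairs followed by the query.
    prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    prompt += f"\nReview: {query}\nSentiment:"
    print(prompt)

    # A capable LLM typically completes this with "negative", purely from the
    # pattern in the prompt and without any weight update; that is ICL.
    ```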

    GPT-3 demonstrated ICL after being trained purely auto-regressively. Research indicates that ICL emerges in transformers and is influenced by characteristics of the linguistic data, e.g. burstiness and skewed distributions. When transformers are trained on data lacking these characteristics, they instead rely on in-weights learning (IWL), where the information stored in the model’s weights is used.

    When the training data is bursty, that is, objects appear in clusters and there is a large number of tokens or classes, ICL is worth investigating. ICL capability arises as the training loss keeps declining. Research indicates that although ICL is an emergent phenomenon, it may only last temporarily; this is called transience.

    ICL as a concept is different from the context window, which we have already explained: a fixed-size window that slides over the input sequence to capture the context of each token.