Author: Shabbir Chunawalla

  • Redesigning Transformer Architecture

    As we know, LLMs such as ChatGPT consume extensive memory and have huge computational demands. Such models are therefore expensive to build, and their operating costs are very high.

    It is possible to simplify the architecture of LLMs to make them more economical. The underlying architecture is the transformer, and it is this architecture that researchers at ETH Zurich have targeted. They have come out with a new, streamlined design of the transformer block while retaining its accuracy and inference ability.

    LLMs, as we know, operate on a foundation of transformer blocks, which process data sequences. In each transformer block there are two key sub-blocks: the attention mechanism and the multi-layer perceptron (MLP). The attention layer selectively focuses on different parts of the input data (say, tokens in a sequence) to capture their context and relative importance. Even when tokens are far apart, the model learns how they relate to each other.

    The transformer block thus processes sequential data: the attention mechanism highlights the important information, and the MLP sub-block further refines and processes it. Together, the two capture the relationships in the data.

    There are additional components such as residual connections and normalization layers. These speed up learning and mitigate issues such as vanishing gradients.

    Transformer blocks are stacked to increase the model's capacity to capture complex relationships in the training data. However, the fundamental design of the transformer block has remained unaltered since its inception.

    Given the excessive cost of training and deploying these models, any efficiency we can bring to training and inference results in substantial savings.

    The transformer block can be simplified by eliminating unnecessary components. This reduces the parameter count and increases the throughput of the model.

    According to the research team, the stripped-down version of the transformer compromises neither training speed nor performance on downstream tasks.

    There are multiple attention heads in a transformer model, each with key (K), query (Q) and value (V) parameters. Together, these map the interactions among the input tokens. The researchers found that the V parameters, along with the projection layer that synthesizes the values for the MLP block, can be eliminated with no loss of effectiveness.

    At the same time, the researchers removed skip connections, which are normally used to avoid vanishing gradients. Vanishing gradients make training difficult, since the gradient becomes too small to bring about significant learning in earlier layers.

    The transformer block has been redesigned to process the attention heads and the MLP concurrently, rather than one after the other. It is this parallel processing that deviates from the conventional architecture.
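
    To make the parallel layout concrete, here is a minimal, hypothetical PyTorch-style sketch: the attention and MLP branches both read the same normalized input and their outputs are summed, instead of the MLP consuming the attention output. It only illustrates the parallel idea; it keeps the normalization and residual connection that the researchers simplify further, and the class and parameter names are my own.

    ```python
    import torch
    from torch import nn

    class ParallelTransformerBlock(nn.Module):
        """Sketch of a block where attention and MLP run side by side."""

        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.norm(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            # Parallel: both branches see the same input and their outputs are
            # added, rather than feeding the attention output into the MLP.
            return x + attn_out + self.mlp(h)

    # Usage: a batch of 2 sequences, 16 tokens each, model width 512.
    block = ParallelTransformerBlock()
    y = block(torch.randn(2, 16, 512))
    ```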

    The reduction in parameters has been compensated for by adjusting non-learnable parameters, refining the training procedure, and implementing architectural tweaks. Put together, these alterations maintain the model's learning capabilities despite the leaner structure.

    The researchers have tested the new transformer block. The transformer shrank in size by as much as 16 per cent without diluting its capabilities. Extended to a large model with billions of parameters, this could result in massive memory savings.

    At greater depth, the simplified model trains faster and makes use of the extra capacity the depth provides. However, the approach has only been tested at a smaller scale; it remains untested on larger models.

  • Old AI Models and New LLMs

    ChatGPT appeared in November 2022, and almost a year has elapsed since. Until then, machine learning was used for specific tasks, say fraud protection or loan approval. Then came the LLMs, and the old approach took a back seat. LLMs are generalized models capable of performing many tasks. Still, the task-based models have not gone away; they are alive and well. Amazon's CTO calls them 'good old-fashioned AI', but they still solve many of the problems at hand.

    Before the advent of LLMs, we had a task-specific world. These days, enterprises plug in LLMs via APIs. LLMs will become better and more robust, and will develop reasoning power and other emerging abilities.

    Still, there is a role for task models, since they are smaller, faster, cheaper, and well suited to a particular task.

    However, it makes little sense to train many different task-specific models when all-purpose models are available to the enterprise.

    Machine learning platforms are still the preserve of data scientists, not of developers. It is not necessary to give up this large number of machine learning models. The appearance of a new technology keeps the old one relevant for some time.

    Before LLMs, task models were the flavour of the day, and enterprises used the services of data scientists to build them. Data scientists will continue to focus critically on data and will help people understand the relationship between AI and data within their organisations.

    Both AI and data have pros and cons, and both are relevant whether an LLM or a task model is being developed. The two kinds of models will co-exist for some time to come, since bigger does not always mean better.

  • Backpropagation in LLMs

    Weights play a major role in deciding an LLM’s behaviour and performance. The weights represent the strength of connections between neurons. These weights are adjusted during training to improve the performance of the model.

    Each connection between two neurons has a specific weight associated with it, representing the importance or strength of the connection. A high positive weight means the output of the first neuron will strongly increase the activation of the second neuron. A negative weight, on the other hand, indicates that the output of the first neuron will decrease the activation of the second neuron.
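
    A toy numeric illustration of this (the numbers are hypothetical and chosen only to show the effect of a weight's sign and size):

    ```python
    output_of_first_neuron = 0.8

    strong_positive_weight = 2.0    # large positive weight
    negative_weight = -1.5          # negative weight

    # Contribution of the first neuron to the second neuron's pre-activation sum:
    print(output_of_first_neuron * strong_positive_weight)  # 1.6  -> pushes activation up
    print(output_of_first_neuron * negative_weight)         # -1.2 -> pushes activation down
    ```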

    In essence, backpropagation is the training algorithm used to train neural networks. It is a supervised learning method that involves calculating the gradient of the loss function with respect to the model's parameters. This gradient is then used to update the parameters in the direction opposite to the gradient. The technique became established in the 1980s.

    First, the error in the model's output is calculated. This error is propagated backwards through the network, layer by layer. This tells us how much each layer contributed to the overall error.

    This information is used to update weights in each layer of the network.

    The formula for this update is: weight_new = weight_old - learning_rate * delta_weight

    where

    • weight_new is the updated weight.
    • weight_old is the current weight.
    • learning_rate is a hyperparameter that controls the size of the update.
    • delta_weight is the rate of change of the error with respect to the weight. It is calculated using the chain rule of calculus, taking into account the contributions of the weight in all subsequent layers of the network.

    Adjusting the weights improves the learning and performance of the model. Weights are updated iteratively based on the training data (see the sketch after this list). The LLM slowly learns to associate patterns and relationships in the data, and the model makes better predictions and generalizations on unseen data.

    Weight initialization and optimization play a vital role in the effectiveness of backpropagation. Appropriate initial values must be chosen for the weights, and hyperparameters such as the learning rate must be tuned carefully. Both significantly affect training and the overall performance of the LLM.
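
    Here is a minimal sketch of the update rule for a single weight, using a toy one-neuron model with a squared-error loss. All numbers are hypothetical and chosen only to show the mechanics of the chain rule and the update.

    ```python
    # Toy setting: prediction = weight * x, loss = 0.5 * (prediction - target) ** 2
    x, target = 2.0, 1.0
    weight_old = 0.6
    learning_rate = 0.1

    prediction = weight_old * x       # forward pass: 1.2
    error = prediction - target       # dLoss/dPrediction: 0.2
    delta_weight = error * x          # chain rule: dLoss/dWeight = 0.2 * 2.0 = 0.4

    weight_new = weight_old - learning_rate * delta_weight
    print(weight_new)                 # 0.6 - 0.1 * 0.4 = 0.56
    ```
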
  • Positional Embeddings

    Positional embedding is a technique used in NLP to encode the position of each token in a sequence. Such embedding is necessary in tasks such as machine translation and text summarization, since the model has to understand the order of the tokens.

    There are two main types of positional embeddings: absolute and relative. Absolute positional embeddings encode the absolute position of each element in the sequence, using a simple counter. Relative positional embeddings encode the position of each element relative to the other elements in the sequence, using a matrix that stores the distances between each pair of elements.

    Absolute positioning is commonly used for NLP tasks, and relative positioning for tasks involving computer vision.

    Positional embeddings also facilitate the model's ability to learn long-range dependencies.

    Such embeddings are required for transformers that use the self-attention mechanism, which captures the content of each word but carries no information about its position. Without this information, the model struggles to understand the order of the words and their relationships within a sentence.

    One can use sine and cosine functions to encode token positions; this is called sinusoidal positional embedding, and it captures long-range dependencies in sequences. Alternatively, learned positional embeddings rely on the data: the encodings are learned by the model during training. Though often more effective than sine and cosine encoding, learned embeddings require more training data.

    Positional embeddings are added to the token embeddings before being fed into the transformer. They are vectors (fixed or learned) added to the word embeddings, which lets the model learn the relationships between tokens in the sequence, both in meaning and in position.

    The embedded tokens are passed through a series of self-attention and feedforward layers.

    In sinusoidal positional embeddings, each position in the sequence is represented as a vector of sine and cosine values with different frequencies, and the frequencies follow a geometric progression across the embedding dimensions. As a result, positions that are close together have similar embeddings, while positions that are far apart have distinct embeddings.

    Let us consider an illustration with the sentence "I love machine learning":

    Token      Position   Position embedding
    I          1          0.1, 0.2, 0.3, 0.4, 0.5
    love       2          0.2, 0.4, 0.6, 0.8, 1.0
    machine    3          0.4, 0.8, 1.2, 1.6, 2.0
    learning   4          0.8, 1.6, 2.4, 3.2, 4.0
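
    The values above are simplified for illustration. As a rough sketch, the standard sine/cosine scheme can be computed as follows; the function name and the choice of NumPy are mine, and the constant 10000 follows the original transformer paper.

    ```python
    import numpy as np

    def sinusoidal_positional_embeddings(seq_len: int, d_model: int) -> np.ndarray:
        """Sine/cosine positional embeddings; frequencies form a geometric progression."""
        positions = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
        even_dims = np.arange(0, d_model, 2)[None, :]    # even embedding dimensions
        angle_rates = 1.0 / (10000 ** (even_dims / d_model))
        angles = positions * angle_rates                 # shape (seq_len, d_model/2)

        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                     # sine on even dimensions
        pe[:, 1::2] = np.cos(angles)                     # cosine on odd dimensions
        return pe

    # Four tokens ("I love machine learning"), embedding size 8.
    pe = sinusoidal_positional_embeddings(seq_len=4, d_model=8)
    # These vectors would be added element-wise to the token embeddings
    # before the sequence is fed into the transformer:
    # model_input = token_embeddings + pe
    ```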

    The benefits of positional embeddings are improved model performance, say in machine translation and language modelling, better generalization, and more interpretable results.

    However, positional embeddings add complexity to the model and increase its training time. They can also be less effective for very long sequences. It is necessary to weigh these limitations before using them in a model.

  • Emergent Abilities in LLMs

    The journey of artificial intelligence started in the 1950s. To begin with, we dealt with Artificial Narrow Intelligence (ANI), restricted to a specific skill. Later we reached the stage of generative AI, where large language models generate output using their ability to learn from the datasets they were trained on. We are on our way to artificial general intelligence (AGI), where models will perform as well as human beings and at times even surpass them. When AI matches human intelligence, we reach the singularity, and when AI surpasses human intelligence, it is called superintelligence.

    Against this background, LLMs have started showing some surprising, unpredictable behaviours, referred to as 'emergent abilities'. Some pertain to basic maths skills, some to computer coding, and others to decoding movies based on emojis. It is interesting to learn about these emergent abilities and why and how they arise.

    By emergent abilities, I mean abilities for which the model has not been explicitly programmed. They emerge from the way the model processes and generates language, and arise from the model's ability to learn from data.

    Examples of emergent abilities include answering questions using search engines while keeping the model aligned with the search results, summarizing text into smaller, concise pieces, and translating between very different languages. LLMs also create beautiful poems, code, scripts, musical pieces and more.

    It is a moot point to what extent LLMs truly show emergent abilities. Some say they are only pattern-matching models, and their abilities cannot really be called emergent.

    It is to be noted that emergent abilities in LLMs are not on a par with intelligence. Intelligence involves the ability to apply knowledge and skills. Some behaviours of LLMs border on intelligence, but still lack the level of understanding or reasoning that humans have.

    We are not able to predict these emergent abilities. An LLM could develop an ability that was unforeseen by its designers. This makes LLMs fascinating, and at the same time risky.

  • LS Digital

    LS Digital is India's largest digital marketing services firm and positions itself as a digital transformation company. It offers services such as media, user interface (UI), user experience (UX), creative and communication, data and insights, consumer experience (CX), and tech and innovation.

    The company has offices in the Middle East, has set up an office in the UK, and intends to set up an office in the US too.

    Ad agencies need to evolve continuously in the face of disruptions such as AI, and this calls for fundamental change. Agencies will have to alter their own business models in areas such as people, processes, services, delivery and pricing. Agencies that cannot transform will become redundant.

    Rather than being just an advertising company and looking at growth in terms of digital media spend, the focus has to shift to spending on data analytics, AI and martech solutions. All solutions are now integrated.

  • StreamingLLM: Additional Points

    We have already studied this concept in a previous blog; here we discuss some additional points. The StreamingLLM framework enables LLMs to tackle inputs of effectively unbounded length. It improves handling of long contexts by using attention sinks, generally the initial tokens, which capture a large share of attention even when they are not semantically relevant (they may simply be punctuation marks). The key-value (KV) entries of these tokens are retained so as to maintain the model's performance when the text length exceeds the cache size.

    When streamed this way, LLMs generalize to texts longer than their training sequence length, and it is not necessary to fine-tune the model for this.

    In this scheme, some tokens are evicted, that is, removed from the cache of previous key and value states, which otherwise consumes extensive memory. There are two approaches to eviction: token-by-token eviction and batched token eviction.
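
    A very rough sketch of the two eviction styles, using a plain Python deque to stand in for the cache of key/value entries; the budget and function names are hypothetical.

    ```python
    from collections import deque

    CACHE_BUDGET = 6   # hypothetical maximum number of cached key/value entries

    def evict_token_by_token(cache: deque) -> None:
        """Drop the single oldest entry each time the budget is exceeded."""
        while len(cache) > CACHE_BUDGET:
            cache.popleft()

    def evict_batched(cache: deque, batch_size: int = 4) -> None:
        """Once over budget, drop a whole batch of the oldest entries at once,
        so eviction happens less often."""
        if len(cache) > CACHE_BUDGET:
            for _ in range(min(batch_size, len(cache))):
                cache.popleft()
    ```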

  • Influencer Marketing

    The influencer marketing industry in India was estimated at over Rs 12 billion in 2022 and is expected to grow at a CAGR of 25 per cent over the next five years, touching Rs 28 billion by 2026.

    There are more than 150 registered and unregistered influencer agencies in India. Larger agencies handle strategy development and use downstream agencies to execute the campaigns.

    Bigger agencies acquire smaller influencer marketers to stay relevant in a rapidly changing landscape. Marketers in India spend over 10 per cent of their digital marketing budgets on influencer marketing to tap its highly local reach.

    When a bigger agency acquires a smaller one, the focus should be on integration, especially integration of technology. There is a risk that the smaller agency loses some of the attributes that make it dynamic and growth oriented.

  • The Early Days of AI

    Long before the term artificial intelligence, or AI, found its place in technological history, human beings conceived machines with human-like attributes. This fascination with building autonomous machines can be traced far back in history. A workshop organized at Dartmouth, USA in 1956 is considered the pioneering event of modern AI research and development.

    Since then, AI has advanced steadily, with applications in finance, banking, insurance and healthcare. Large Language Models (LLMs) ultimately gave us a chatbot, ChatGPT, which made AI a prominent term.

  • StreamingLLM

    LLMs handle long text sequences, but while doing so they may reach their context limit. At times, it is necessary to extend the context of the model to longer sequences.

    Existing solutions are computationally intensive, memory intensive or not very precise. One breakthrough is StreamingLLM, developed by a team of researchers from Facebook, Carnegie Mellon and MIT. This technique extends the context to millions of tokens without consuming vast compute and memory resources, and the performance of the model is preserved. StreamingLLM is very useful for processing long-sequence text.

    We have already studied context windows as a concept. A Llama-2 model has a context window of 4,000 tokens, or about 3,000 words. As long as the interaction with the model remains within this context, performance is not affected; but the limit is finite, and the model is restricted by it.

    We can extend the context window length, but this approach modifies the architecture of the model and requires retraining, which is expensive; many organizations may not be able to afford it. Moreover, lengthening the context window increases costs quadratically: double the window size and you quadruple its memory and compute costs.
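
    A quick back-of-the-envelope check of that quadratic growth; the numbers are purely illustrative and count attention scores per head, per layer.

    ```python
    for context_length in (4_000, 8_000, 16_000):
        attention_scores = context_length ** 2
        print(context_length, attention_scores)
    # 4000 16000000
    # 8000 64000000    (double the window, four times the attention scores)
    # 16000 256000000
    ```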

    Alternatively, a sliding context window could be used. Here a model with a 4,000-token context window is fed 4000-x tokens, where x is the number of tokens it is expected to generate. The practice has certain drawbacks. Autoregressive LLMs use 'KV caching' to improve efficiency: a mechanism that computes and stores the keys and values of previous tokens, eliminating the need to recompute them for each new token. The attention value of each token depends on its preceding tokens, so on shifting the context window, one has to recompute the whole KV cache, which reduces the model's throughput.
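
    A minimal sketch of the KV caching idea, assuming a single attention head and using identity projections in place of the learned K, Q, V projections purely for brevity:

    ```python
    import torch

    d_model = 8
    key_cache, value_cache = [], []     # one entry per previously processed token

    def attend(new_token_embedding: torch.Tensor) -> torch.Tensor:
        """Compute attention for one new token, reusing cached keys/values."""
        k = v = q = new_token_embedding          # real models apply learned projections here
        key_cache.append(k)                      # only the new token's K and V are computed
        value_cache.append(v)
        keys = torch.stack(key_cache)            # (num_tokens, d_model)
        values = torch.stack(value_cache)
        scores = torch.softmax(keys @ q / d_model ** 0.5, dim=0)
        return scores @ values                   # weighted sum over all cached tokens

    for _ in range(5):                           # stream five tokens through
        out = attend(torch.randn(d_model))
    # If the context window were shifted, these cached entries would have to be
    # recomputed, which is the throughput problem described above.
    ```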

    There is one more solution: move the window but keep the cached values as they are for the tokens that overlap between the new and the old context. It is a better method, but it has a flaw: the quality of the model declines quickly once the context starts to deviate from the initial setting.

    Researchers now focus on attention sinks. In autoregressive LLMs, a substantial proportion of the attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. Such initial tokens are called attention sinks.

    A model becomes much more perplexed when the text length exceeds the cache size, on account of the exclusion of the initial tokens. Perplexity measures the uncertainty in a model's predictions; when it is low, the model is more precise. Thus the attention sinks, irrespective of how far they are from the tokens being predicted, play a vital role in maintaining the stability of LLMs. This is intuitive: language modeling is autoregressive, so the initial tokens are visible to almost all subsequent tokens. They are therefore readily trained to act as attention sinks, and they capture a disproportionate amount of attention.

    Remove the attention values of the first few tokens from the context, and the result is a deterioration in the model's performance, because a significant share of attention value is lost. The StreamingLLM technique therefore preserves these attention sinks.

    This allows the model to perform well without fine-tuning. Because the attention sinks are preserved, the attention-score distribution remains close to normal. When the interaction surpasses the model's context length, StreamingLLM retains the KV cache for the attention sink tokens (four initial tokens are enough). It extends the model's context and stabilizes its performance, with no need to recompute the entire KV cache.

    Under the StreamingLLM framework, the KV cache consists of the attention sinks plus a rolling KV cache that retains the most recent tokens (vital for language modeling). It is a versatile design that can be incorporated in any autoregressive language model that employs relative positional encoding.

    Generation of token 6

    0 1 2 3 (4 5) 6. Here 0 1 2 3 are attention sinks, 4 5 form the rolling KV cache, and 6 is the token generated.

    Generation of token 7

    0 1 2 3 (4 5 6) 7. Here 0 1 2 3 are attention sinks, 4 5 6 form the rolling KV cache, and 7 is the token generated.

    Generation of token 8

    0 1 2 3 4 (5 6 7) 8. Here 0 1 2 3 are attention sinks, 4 has been evicted, 5 6 7 form the rolling cache, and 8 is the token generated.

    Generation of token 9

    0 1 2 3 4 5 (6 7 8) 9. Here 0 1 2 3 are attention sinks, 4 and 5 have been evicted, 6 7 8 form the rolling cache, and 9 is the token generated.
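
    The walkthrough above can be mimicked with a tiny sketch: the first four token positions are kept permanently as attention sinks, and a fixed-size deque stands in for the rolling KV cache. The names and the tiny cache size are mine, chosen to match the example; real caches hold thousands of entries.

    ```python
    from collections import deque

    NUM_SINKS = 4        # initial tokens kept permanently as attention sinks
    ROLLING_SIZE = 3     # rolling window size used in the walkthrough above

    sinks = []                               # entries for tokens 0..3, never evicted
    rolling = deque(maxlen=ROLLING_SIZE)     # most recent tokens; oldest evicted automatically

    def cache_token(position: int) -> None:
        """Store a placeholder KV entry for the token at this position."""
        if len(sinks) < NUM_SINKS:
            sinks.append(position)
        else:
            rolling.append(position)         # a full deque silently drops its oldest entry

    for position in range(9):                # tokens 0..8 have been generated and cached
        cache_token(position)

    print(sinks)           # [0, 1, 2, 3]  -> attention sinks
    print(list(rolling))   # [6, 7, 8]     -> rolling cache when predicting token 9 (4 and 5 evicted)
    ```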

    The code for StreamingLLM is accessible on GitHub, and Hugging Face is closely monitoring its development.