Emergent Abilities in LLMs

The journey of artificial intelligence began in the 1950s. To begin with, we dealt with Artificial Narrow Intelligence (ANI), restricted to a specific skill. Later we reached the stage of generative AI, where large language models generate output by drawing on the datasets they were trained on. We are on our way to artificial general intelligence (AGI), where models will perform as well as human beings and at times even surpass them. When AI matches human intelligence, we reach singularity; when AI surpasses human intelligence, it is called superintelligence.

Against this background, LLMs have started showing some surprising, unpredictable behaviours, referred to as 'emergent abilities'. Some of these pertain to basic math skills, some to computer coding, and some to decoding movie titles from emojis. It is interesting to learn what these emergent abilities are and why and how they arise.

By emergent abilities, I mean abilities for which the model has not been explicitly programmed. They emerge from the way the model processes and generates language, and arise from the model's ability to learn from data.

Examples of emergent abilities include answering questions using search engines while keeping the model aligned with the search results; summarizing text into smaller, concise pieces; and translating between languages quite different from each other. LLMs also create poems, code, scripts, musical pieces and more.

It is a moot point to what extent LLMs truly show emergent abilities. Some argue that these are only pattern-matching models, and that their abilities cannot genuinely be called emergent.

It should be noted that emergent abilities in LLMs are not on par with intelligence. Intelligence involves the ability to apply knowledge and skills. Some behaviours of LLMs border on intelligence, but they still lack the level of understanding and reasoning that humans have.

We are not able to predict these emergent abilities. An LLM could develop an ability that was unforeseen by its designers. This makes LLMs fascinating, and at the same time risky.

LS Digital

LS Digital is India's largest digital marketing services firm, though it positions itself as a digital transformation company. It offers services such as media, user interface (UI), user experience (UX), creative and communication, data and insights, consumer experience (CX), and tech and innovation.

It has offices in the Middle East and has set up an office in the UK. It intends to set up an office in the US too.

Ad agencies need to evolve continuously in the face of disruptions such as AI, which demand fundamental change. Agencies will have to alter their business models in areas such as people, processes, services, delivery and pricing. Agencies that cannot transform will become redundant.

Rather than remaining just an advertising company and measuring growth in terms of digital media spend, the focus has to shift to spending on data analytics, AI and martech solutions. All solutions are now integrated.

StreamingLLM

We have already studied this concept in a previous blog; here we discuss some additional points. The StreamingLLM framework enables LLMs to tackle inputs of effectively unlimited length. It improves context handling by using attention sinks (generally the initial tokens), which capture a large share of attention even when they are not semantically relevant, say when they are punctuation marks. The keys and values (KV) of these tokens are retained so as to maintain the model's performance when the text length exceeds the cache size.

When streamed, LLMs generalize to texts longer than their training sequence length. It is not necessary to fine-tune the model for this.

In this framework, some tokens are evicted, that is, removed from the cache of previous key and value states, which otherwise consumes extensive memory. There are two approaches to eviction: token-by-token eviction and batched-token eviction.
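The two eviction strategies can be sketched in a few lines of Python. This is an illustrative toy (plain Python lists standing in for key/value tensors, and the function name and signature are my own), not the actual StreamingLLM implementation:

```python
def evict(cache, n_sinks, capacity, batch_size=1):
    """Drop the oldest non-sink entries until the cache fits `capacity`.

    cache:      list of cached entries (one per token) in token order
    n_sinks:    initial attention-sink tokens that are never evicted
    batch_size: 1 -> token-by-token eviction; >1 -> batched eviction,
                which removes a whole block of old tokens at once
    """
    while len(cache) > capacity:
        n = min(batch_size, len(cache) - n_sinks)
        del cache[n_sinks:n_sinks + n]
    return cache
```

Batched eviction trades a little extra cache head-room for fewer eviction operations, since it clears several old tokens in one go.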

Influencer Marketing

The influencer marketing industry in India was estimated at over Rs. 12 billion in 2022 and is expected to grow at a CAGR of 25 per cent over the next five years, touching Rs. 28 billion by 2026.

There are more than 150 registered and unregistered influencer agencies in India. Larger agencies handle strategy development and use downstream agencies to execute campaigns.

Bigger agencies acquire smaller influencer marketers to stay relevant in a rapidly changing landscape. Marketers in India spend over 10 per cent of their digital marketing budgets on influencer marketing to tap influencers' highly local reach.

When a bigger agency acquires a small one, the focus should be on integration, especially of technology. There is a risk that the smaller agency loses some of the attributes that made it dynamic and growth-oriented.

The Early Days of AI

Long before the term artificial intelligence (AI) found its place in technological history, human beings conceived of machines with human-like attributes. This fascination with building autonomous machines can be traced back through history. A seminar organized at Dartmouth, USA in 1956 is considered the pioneering event of modern AI research and development.

Since then, AI has advanced steadily, with applications in finance, banking, insurance and healthcare. Large Language Models (LLMs) ultimately gave us a chatbot, ChatGPT, and made AI a prominent term.

StreamingLLM

LLMs handle long text sequences, but while doing so they may reach their context limit. At times, it is necessary to extend the model's context to longer sequences.

Existing solutions are computationally intensive, memory intensive, or imprecise. One breakthrough is StreamingLLM, developed by a team of researchers from Meta AI, Carnegie Mellon University and MIT. This technique extends the context to millions of tokens without consuming vast compute and memory resources, and the model's performance is preserved. StreamingLLM is very useful for processing long-sequence text.

We have already studied context windows as a concept. A Llama-2 model has a context window of 4,000 tokens, or about 3,000 words. As long as the interaction with the model remains within this context, performance is not affected. But the limit is finite, and the model is restricted by it.

One option is to extend the context-window length. This approach modifies the model's architecture and requires retraining, which is expensive; many organizations cannot afford it. Lengthening the context window also increases costs quadratically: double the window size, and you roughly quadruple its memory and compute costs.
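The quadratic scaling can be checked with one line of arithmetic. In self-attention, every token is compared with every other token, so cost grows with the square of the context length (a simplified cost model that ignores constants and other terms):

```python
def attention_cost(context_len):
    """Simplified self-attention cost: every token attends to every
    other token, so compute and memory scale as context_len squared."""
    return context_len * context_len

# Doubling a 4,000-token window quadruples the cost.
assert attention_cost(8000) == 4 * attention_cost(4000)
```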

Alternatively, a sliding context window could be used. Here a model with a 4,000-token context window is fed 4,000 minus x tokens, where x is the number of tokens it is expected to generate. This practice has certain drawbacks. Auto-regressive LLMs use 'KV caching' to improve efficiency: a mechanism that computes and stores the key and value vectors of the attention heads for previous tokens, eliminating the need to recompute them for each new token. The attention value of each token depends on the preceding tokens, so on shifting the context window, one has to recompute the whole KV cache, which reduces the model's throughput.
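A minimal sketch of KV caching, using scalar 'keys' and 'values' in place of vectors (the class and its API are my own illustration, not any real library's): each generation step appends the new token's key and value and attends over everything cached, so earlier keys and values are computed once and never recomputed.

```python
import math

class KVCache:
    """Toy KV cache for autoregressive attention with scalar keys/values."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Store the new token's key/value once; reuse all earlier ones.
        self.keys.append(k)
        self.values.append(v)
        scores = [q * ki for ki in self.keys]       # toy dot products
        m = max(scores)                             # stabilized softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        return sum(w * vi for w, vi in zip(weights, self.values)) / z
```

Shifting the window invalidates this stored state, which is exactly why a naive sliding window forces an expensive full recomputation.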

There is one more solution: move the window but keep the cached values as they are for the tokens that overlap between the new and old context. It is a better method, but it has a flaw. The model's quality declines quickly once the context starts to deviate from the initial setting.

Researchers therefore focused on attention sinks. A substantial proportion of the attention score in autoregressive LLMs is allocated to the initial tokens, irrespective of their relevance to the language-modeling task. Such initial tokens are called attention sinks.

A model becomes much more perplexed when the text length exceeds the cache size, on account of the exclusion of the initial tokens. Perplexity measures the uncertainty in a model's predictions; if it is low, the model predicts with higher confidence. Thus the attention sinks, irrespective of how far they are from the tokens being predicted, play a vital role in maintaining the stability of LLMs. This is intuitive: language modeling is autoregressive, so the initial tokens are visible to almost all subsequent tokens. The initial tokens are therefore readily trained to act as attention sinks, and they capture a disproportionate amount of attention.
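Perplexity can be made concrete in a few lines. It is the exponential of the average negative log-probability that the model assigned to each actual next token (a standard definition; the helper function below is my own):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens.
    Low perplexity = confident, precise predictions."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that spreads probability uniformly over four choices has perplexity 4; one that always assigns probability 1 to the correct token has perplexity 1.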

Remove the attention values of the first few tokens from the context, and the result is a deterioration in the model's performance, since a significant share of attention value is lost. StreamingLLM preserves these attention sinks.

This allows the model to perform well without fine-tuning. Because the attention sinks are preserved, the attention-score distribution remains near-normal. When the interaction surpasses the model's context length, StreamingLLM retains the KV cache for the attention-sink tokens (four initial tokens are enough). This extends the model's context and stabilizes its performance, with no need to recompute the entire KV cache.

Under the StreamingLLM framework, the KV cache consists of the attention sinks plus a rolling KV cache that retains the most recent tokens (vital for language modeling). It is a versatile design: it can be incorporated into any autoregressive language model that employs relative positional encoding.

Generation of token 6

0 1 2 3 (4 5) 6. Here 0 1 2 3 are attention sinks, 4 5 is the rolling KV cache, and 6 is the token generated.

Generation of token 7

0 1 2 3 (4 5 6) 7. Here 0 1 2 3 are attention sinks, 4 5 6 is the rolling cache, and 7 is the token generated.

Generation of token 8

0 1 2 3 4 (5 6 7) 8. Here 0 1 2 3 are attention sinks, 4 is evicted, 5 6 7 is the rolling cache, and 8 is the token generated.

Generation of token 9

0 1 2 3 4 5 (6 7 8) 9. Here 0 1 2 3 are attention sinks, 4 and 5 are evicted, 6 7 8 is the rolling cache, and 9 is the token generated.
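The worked examples above all follow one simple rule, which can be sketched as a small helper (my own illustrative function, assuming four sink tokens and a three-token rolling window as in the examples):

```python
def streaming_cache(step, n_sinks=4, window=3):
    """Token indices held in the KV cache while generating token `step`:
    the first `n_sinks` tokens (attention sinks) plus at most `window`
    of the most recent tokens (the rolling cache). Tokens in between
    are evicted."""
    sinks = list(range(min(n_sinks, step)))
    rolling = list(range(max(n_sinks, step - window), step))
    return sinks, rolling
```

For example, `streaming_cache(8)` returns the sinks `[0, 1, 2, 3]` and the rolling cache `[5, 6, 7]`, with token 4 evicted, matching the third worked example.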

The code for StreamingLLM is accessible on GitHub. Hugging Face is closely monitoring the development of StreamingLLM.

In-Context Learning (ICL) in Transformer Neural Networks

In-context learning refers to the ability of attention-based neural networks such as transformers to predict the response to a query by learning from illustrative examples presented in context. To illustrate, we can give a few examples of English-to-Marathi translation, and the transformer learns from them and then translates on its own, though it was not previously trained for this task. An MIT study says that GPT-3 can learn a new task from a few examples without the need for any new training: smaller linear models inside the hidden layers get trained to complete the new task using simple learning algorithms. In ICL, the model infers from its inputs without updating its weights, tackling problems not encountered in training. From a short prompt of tokens drawn from an unseen task, it formulates relevant per-token and next-token predictions, remembering exemplar-label mappings from the context to make its predictions.
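What the model 'sees' during in-context learning is simply a prompt that interleaves exemplars with their labels. A minimal sketch of building such a few-shot prompt (the format and function name are my own; real prompt formats vary):

```python
def build_icl_prompt(examples, query):
    """Assemble a few-shot prompt: exemplar input/output pairs followed
    by the new query. The model must infer the task from the examples
    alone; its weights are never updated."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)
```

For the translation example above, `examples` would be (English, Marathi) sentence pairs and `query` a new English sentence; the model is left to supply the final output.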

GPT-3 demonstrated ICL after being trained auto-regressively. Research indicates that ICL in transformers is emergent and is influenced by linguistic data characteristics such as burstiness and skewed distributions. Transformers fall back on in-weights learning (IWL) when trained on data lacking these characteristics; here, knowledge stored in the model's weights is used instead.

It is necessary to investigate ICL when the training data is bursty (objects appear in clusters) and has a large number of tokens or classes. ICL capability arises as training losses keep declining. Research indicates that although ICL is an emergent phenomenon, it may last only temporarily. This is called transience.

ICL as a concept is different from the context window, which we have already explained: the latter is a fixed-size window that slides over the input sequence to capture the context of each token.

Deepfakes

A deepfake is a video (or audio recording) that has been edited using an algorithm to replace the person in the original with someone else. Deepfakes are not real; they are an example of manipulation with the help of AI technologies.

Technologies used to create deepfakes include machine learning, deep learning and neural networks.

Recently, a deepfake video of actress Rashmika Mandanna surfaced on social media platforms. Pending a new law, the Digital India Act, the government asked social media firms to take down deepfake content and similar misinformation from their platforms, even in the absence of a formal complaint, under the provisions of the existing Information Technology Act. Failure to do so attracts Section 66D of the IT Act, 2000 (punishment for cheating by personation using computer resources: imprisonment of up to 3 years and a fine of up to Rs. 1 lakh).

Deepfakes can upend society. There were instances of fake political propaganda, news and pornography even before AI existed to assist.

The word 'deepfake' derives from 'deep learning' and 'fake'. A deepfake is synthetic media: media that is manipulated or wholly generated by AI.

With the advent of AI, many apps are now accessible that facilitate the creation of deepfakes. Professional assistance is even available at meagre cost.

Social media and smartphones spread deepfakes at lightning speed.

Most deepfake content produced is non-consensual pornography or image-based abuse.

Deepfakes can affect elections. Videos of candidates saying something unpalatable, released in the final moments of an election, have the potential to change its outcome.

Deepfakes can cause instability in the business sector.

Disinformation campaigns can threaten the democratic processes.

In a world of deepfakes, it is difficult to distinguish reality from fakery. The public comes to doubt the authenticity of even genuine media; the line between real and fake becomes hazy, and common people stop taking anything on social media at face value. This is 'infocalypse', the greatest danger posed by AI and deepfakes.

Gemini : A New LLM from Google

In 2024, Google is going to launch a new LLM called Gemini. The launch was announced at the Google I/O developer conference in May 2023. The LLM is a joint project of the Google DeepMind team, which combines the resources of Google Brain and DeepMind.

It will be a more powerful LLM, with capabilities to generate text, code and other creative content, translate languages, answer questions comprehensively, and produce poems, scripts, musical pieces, emails, letters and more.

Gemini is to leverage training techniques borrowed from AlphaGo, including reinforcement learning and tree search. Thus Gemini should be more efficient as well as more effective.

Google's Gemini is a multi-modal model that goes beyond text. It integrates several aspects of LLMs, which are deep learning algorithms trained on huge natural-language datasets to generate new content.

Gemini is expected to be helpful to API integrators.

Google's Gemini could compete with ChatGPT, which is backed by rival Microsoft. Gemini's multi-modal capabilities will be its USP: even the input can be multi-modal. This will turbocharge generative AI and result in superior model performance.

Gemini may gain an edge over existing models by using tree traversal and reinforcement learning (RL).

Gemini could also use generative adversarial networks (GANs), consisting of a pair of competing neural networks, one generative and one discriminative.

In fact, ChatGPT and Bard are on par. Bard has been integrated with Google Workspace, but it is essentially a text-based AI system. Gemini differs by being multi-modal.

Inflection-2 LLM

Inflection AI was launched publicly in March 2022. It was founded by Reid Hoffman (co-founder of LinkedIn), Mustafa Suleyman (co-founder of DeepMind) and Karen Simonyan (a former DeepMind researcher).

As a startup, Inflection AI has drawn funds from many stalwarts and companies.

Inflection-2 is a large language model. It outperforms competitors such as PaLM 2 and Claude 2, and falls only a shade short of GPT-4. The current version is more powerful than Inflection-1, which was comparable to GPT-3.5 and PaLM-540B. Inflection-2 may soon catch up with GPT-4.

Inflection-2 was trained on Nvidia H100 GPUs; about 5,000 chips were used. It scored 89.0 on the 10-shot HellaSwag benchmark, nearing GPT-4's score of 95.3. Inflection-2 outperforms Claude 2 with chain-of-thought reasoning, but falls short of GPT-4 on coding and math tasks.

Inflection-2 will soon power the company's Pi chatbot.