NLP aims to create computer programmes that understand, generate and work with human languages. We know languages through words and sentences, but computers understand only numbers.
We therefore map words or sentences to vectors, which are simply lists of numbers. This is called text vectorization. It does not mean we assign an arbitrary number to each word; the idea is to build a vector of numbers that represents a word and the information it carries. In text vectorization, text is converted into numeric vectors, and ML algorithms take such numeric feature vectors as input.
In ML, the words in the training data, called the vocabulary, are considered. Say there are 2,000 words; we give them an order from 0 to 1,999. We then define the vector of the ith word as all zeros except for a 1 in position i. This is one-hot encoding.
If the vocabulary consists of three words, monkey, ape and banana, the vector for monkey will be (1,0,0), the vector for ape will be (0,1,0), and the one for banana will be (0,0,1).
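A minimal sketch of this idea in Python, assuming the same toy three-word vocabulary (the function name is just for illustration):

```python
# Minimal sketch: one-hot vectors for a toy three-word vocabulary.
vocabulary = ["monkey", "ape", "banana"]  # the order fixes each word's index

def one_hot(word, vocabulary):
    """Return a vector of zeros with a 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("monkey", vocabulary))  # [1, 0, 0]
print(one_hot("ape", vocabulary))     # [0, 1, 0]
print(one_hot("banana", vocabulary))  # [0, 0, 1]
```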
The internet tells us that monkeys eat bananas, and that apes eat fruits. Intuitively we feel these two pieces of information are similar, and an algorithm should be able to recognise that similarity.
When the one-hot vectors are compared, however, they show hardly any overlap between the phrases, so our programme will treat them as two completely different pieces of information. Human beings, in contrast, can guess the meaning of a word from its context. Words with similar meanings can be interchanged in similar contexts; this is called the distributional hypothesis.
This is the foundation of how word vectors are created. As Firth put it, a word is characterised by the company it keeps.
In a vector representation, similar words should ideally end up with similar vectors. To achieve this we create dense vectors, whose values are not only 0s and 1s but real numbers. Monkey's one-hot vector (1,0,0) could become something like (0.96, 0.55, 0.32), with a dimensionality (the number of components) that we choose.
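A rough sketch of why this matters, using cosine similarity. The dense vectors below are made-up values purely for illustration, not outputs of any real model:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = very similar, 0 = unrelated."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words scores exactly 0.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0 (monkey vs ape)

# Dense vectors (invented values): related words can score high.
monkey = [0.96, 0.55, 0.32]
ape    = [0.90, 0.60, 0.30]
banana = [0.10, 0.80, 0.95]
print(cosine_similarity(monkey, ape))     # close to 1.0
print(cosine_similarity(monkey, banana))  # noticeably lower
```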
Similar representations are also needed when words share common properties, such as plural or singular, verb or adjective, or gender. All of these can be encoded in vectors. This line of thinking led to Word2vec (2013) and changed the field of text vectorization.
We also need ways to check how well our vectors represent words. There are several methods to do so; some are explicit.
Human evaluation measures the distance between vectors and asks a linguist whether it matches their intuition. This is, however, time consuming and not scalable; besides, a large number of words would need to be judged every time new vectors are created.
Syntactic analogies test whether vectors can be added and subtracted meaningfully. Play is to playing as eat is to eating: we compute playing - play + eat and find the vector most similar to the result. It should be the one corresponding to eating.
Semantic analogies work much like syntactic ones. An example is 'donkey is to animal as pineapple is to __', with the expected answer fruit; or 'monkey is to animal as banana is to fruit'. A more complex analogy could be monkey + food = banana.
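A sketch of the analogy arithmetic. The embeddings here are hypothetical hand-written vectors, only to show the mechanics; real vectors would come from a trained model such as Word2vec:

```python
import numpy as np

# Hypothetical embeddings, invented for illustration.
embeddings = {
    "play":    np.array([0.10, 0.90, 0.10]),
    "playing": np.array([0.15, 0.92, 0.80]),
    "eat":     np.array([0.85, 0.20, 0.12]),
    "eating":  np.array([0.88, 0.22, 0.82]),
}

def most_similar(query, embeddings):
    """Return the word whose vector has the highest cosine similarity to the query."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(embeddings, key=lambda w: cosine(embeddings[w], query))

# Syntactic analogy: playing - play + eat should land near eating.
query = embeddings["playing"] - embeddings["play"] + embeddings["eat"]
print(most_similar(query, embeddings))  # expected: eating
```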
The vectors we create must pass such tests; only then can we surmise that they capture the meaning of words.
Those were the explicit methods; let us now consider the implicit ones. Here the word vectors are created and their impact on a downstream task is measured. Take a sentiment analysis classifier: instead of one-hot encodings, we can feed it word embedding vectors. If the results improve, the vectors suit our problem.
Such testing, however, is expensive: it is time consuming and uses considerable computational resources too.
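As a rough illustration of the implicit route, the sketch below averages word vectors into phrase features and trains a tiny scikit-learn classifier. The embeddings and labelled phrases are invented for illustration; a real evaluation would use pretrained vectors and a proper labelled dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 2-dimensional embeddings, only to show the plumbing.
embeddings = {
    "good": np.array([0.9, 0.1]), "great": np.array([0.8, 0.2]),
    "bad":  np.array([0.1, 0.9]), "awful": np.array([0.2, 0.8]),
    "movie": np.array([0.5, 0.5]),
}

def phrase_vector(phrase):
    """Average the word vectors of a phrase to get one feature vector."""
    vectors = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vectors, axis=0)

phrases = ["good movie", "great movie", "bad movie", "awful movie"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = np.stack([phrase_vector(p) for p in phrases])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([phrase_vector("great movie")]))  # expected: [1]
```

If swapping these features in for one-hot counts improves accuracy on a held-out set, the embeddings are judged useful for the task.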
Word Vectors Creation
Creating word vectors requires a lot of text, on the order of billions of words. There are two big families of methods: statistical and predictive.
In statistical methods, a co-occurrence matrix is created and a matrix dimensionality reduction method is then applied. Because some words occur very often, the raw counts are reweighted with PMI (pointwise mutual information).
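A compact sketch of that pipeline, using a toy corpus and sentence-level co-occurrence (real systems use sliding windows over billions of words):

```python
import numpy as np
from itertools import combinations

# Toy corpus; each sentence contributes co-occurrence counts.
corpus = [
    "monkeys eat bananas", "apes eat fruits",
    "monkeys like bananas", "apes like fruits",
]

vocab = sorted({w for sentence in corpus for w in sentence.split()})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts: every pair of words appearing in the same sentence.
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for a, b in combinations(sentence.split(), 2):
        counts[index[a], index[b]] += 1
        counts[index[b], index[a]] += 1

# Positive pointwise mutual information (PPMI) downweights very frequent words.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts * total) / (row * col))
ppmi = np.nan_to_num(np.maximum(pmi, 0))

# Truncated SVD gives dense, low-dimensional word vectors.
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :2] * S[:2]
print(dict(zip(vocab, word_vectors.round(2))))
```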
In predictive methods, ML algorithms make predictions based on words and their context. The algorithm learns a set of weights that come to represent each word; these methods are built on neural networks.
The networks are trained to predict the next word in a text given the previous N words.
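A tiny sketch of the supervision signal such a network learns from, namely (previous N words, next word) pairs extracted with a sliding window; the function name is just for illustration:

```python
# Sketch: turning raw text into (previous N words -> next word) training pairs.
def next_word_pairs(text, n=2):
    words = text.split()
    return [(tuple(words[i:i + n]), words[i + n]) for i in range(len(words) - n)]

print(next_word_pairs("monkeys eat bananas every single day", n=2))
# [(('monkeys', 'eat'), 'bananas'), (('eat', 'bananas'), 'every'),
#  (('bananas', 'every'), 'single'), (('every', 'single'), 'day')]
```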
Bengio's 2003 paper used this approach, with a hidden layer in the network.
The breakthrough work, however, is Word2vec.
Continuous Bag of Words (CBOW) sets up a sliding window of size N over a huge text; the network is trained to predict the word in the middle from the N words on each side of it. Skip-gram is similar to CBOW, except that instead of predicting the middle word from all the others, all the others are predicted from the middle word. At times the methods are combined to get good results.
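A sketch of training both variants, assuming the gensim library (parameter names as in gensim 4.x); a real run needs a far larger corpus than this toy one:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenised sentences.
sentences = [
    ["monkeys", "eat", "bananas"],
    ["apes", "eat", "fruits"],
    ["monkeys", "like", "bananas"],
]

# sg=0 selects CBOW (predict the middle word from its context),
# sg=1 selects skip-gram (predict the context from the middle word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["monkeys"][:5])               # first 5 components of a word vector
print(skipgram.wv.most_similar("monkeys"))  # nearest neighbours in vector space
```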
Other popular vectorization methods are Binary Term Frequency, L1-normalized Term Frequency, L2-normalized Term Frequency, TF-IDF and Word2vec.
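For completeness, a sketch of the count-based variants using scikit-learn (an assumed choice of library; the toy corpus is just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["monkeys eat bananas", "apes eat fruits", "monkeys like bananas"]

binary_tf = CountVectorizer(binary=True).fit_transform(corpus)              # Binary Term Frequency
tf_l1 = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(corpus)     # L1-normalized Term Frequency
tf_l2 = TfidfVectorizer(use_idf=False, norm="l2").fit_transform(corpus)     # L2-normalized Term Frequency
tfidf = TfidfVectorizer(norm="l2").fit_transform(corpus)                    # TF-IDF

print(binary_tf.toarray())
print(tfidf.toarray().round(2))
```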