Blog

  • Monty Hall Problem

    AI makes use of probability-based predictive analysis. This is true for generative AI as well as for focused models built for specific predictions and pattern recognition. It also holds for other AI applications, e.g. product recommendations to prospective buyers.

    Let us consider the probability puzzle called the Monty Hall Problem. The host of the popular game show Let’s Make a Deal presents a contestant with three doors. There is a car behind one door, and there are goats behind the other two. The host knows what is behind each door. The contestant is asked to pick a door. The host then opens one of the remaining two doors to reveal a goat. The contestant can stick to the original choice of door or switch to the other unopened door. What should the contestant do?

    At first glance, with two doors remaining, it looks like a 50 per cent chance of winning the car whether the contestant sticks or switches. Mathematically, however, the chances of winning are doubled by switching. The host has removed one of the losing doors, leaving two doors, one of which hides the car. If the contestant's initial pick was wrong, which happens two times out of three, the car must be behind the remaining unopened door, so switching wins.

    In other words, if the initial choice was door one, each of the three doors initially carried a one-third chance. Say the host then opens door three: its probability drops to zero, and door two now carries a probability of 2/3, since the probability that was on door three transfers to it, while door one keeps its original 1/3. The reasoning is counterintuitive.
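
    A quick simulation makes the two-thirds figure concrete. The sketch below (plain Python, with an arbitrary number of trials) plays the game many times and compares the two strategies.

    ```python
    import random

    def monty_hall(trials=100_000):
        """Simulate the Monty Hall game and compare stay vs. switch win rates."""
        stay_wins = switch_wins = 0
        for _ in range(trials):
            car = random.randrange(3)       # door hiding the car
            choice = random.randrange(3)    # contestant's first pick
            # Host opens a door that is neither the pick nor the car.
            opened = next(d for d in range(3) if d != choice and d != car)
            # The only other unopened door, i.e. the switch option.
            switched = next(d for d in range(3) if d != choice and d != opened)
            stay_wins += (choice == car)
            switch_wins += (switched == car)
        print(f"stay   : {stay_wins / trials:.3f}")    # close to 0.333
        print(f"switch : {switch_wins / trials:.3f}")  # close to 0.667

    if __name__ == "__main__":
        monty_hall()
    ```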

    This problem highlights the importance of understanding probabilities and their limitations in decision making. The same caution applies to the application of AI.

  • Language Models

    A language model is a probability distribution over sequences of words. Given a sequence of words of length m, the model assigns a probability to the whole sequence. Language models learn these probabilities by training on text in one or many languages.

    Thus it is a statistical method that predicts the next word in a sequence. From massive amounts of text, it learns the probability of each word appearing after a given sequence.
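
    As a toy illustration of 'learning the probability of the next word from text', the sketch below counts bigrams in a tiny made-up corpus and turns them into relative frequencies. A real language model uses a neural network and vastly more data, but the underlying idea is the same.

    ```python
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the cat ate the fish .".split()

    # Count how often each word follows each preceding word (bigram counts).
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def next_word_probs(prev):
        """Relative-frequency estimate of P(next word | previous word)."""
        counts = bigrams[prev]
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
    ```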

    A translation model is a type of language model that gives the conditional probability of the next token, given the source sequence and the partial target sentence.
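
    In symbols (standard notation, not taken from any particular paper): a language model factorizes the probability of a whole sequence of length m into next-word conditionals, and a translation model conditions each next token on the source sequence x as well.

    ```latex
    P(w_1, \dots, w_m) \;=\; \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
    \qquad\text{and}\qquad
    P(y_t \mid x,\, y_1, \dots, y_{t-1})
    ```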

    A large language model (LLM) is a neural network-based language model with a large number of parameters. These models are trained on massive amounts of data and can generate text that rivals human writing.

    What, then, is a small language model? The distinction rests on the number of parameters: LLMs simply have more. The more parameters a model has, the more it can learn from data and, in general, the better it performs.

    Large language models are being used today for text generation, translation and question answering. They can be used to automate tasks that are currently being done by humans.

    A model is a mathematical representation of a system or process. Its job here is to predict the text that should follow a particular sequence. The parameters of a model are the values that define its skill; they are what transform input data into the desired output.

    In a neural network model, the weights and biases are the parameters. In a clustering model, the centroids of the clusters are the parameters. In a linear regression model, the coefficients and the intercept are the parameters.

    The values of parameters are estimated by the system during training. The values of hyper-parameters are pre-set and independent of the dataset; they do not change during training. A hyper-parameter is not part of the trained or final model. Hyper-parameters specify the model family, and they may control the training algorithm used to set the parameters.
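
    A minimal sketch of the distinction, using plain Python and invented numbers: gradient descent on a one-variable linear regression, where the slope and intercept are parameters learned from the data, while the learning rate and number of epochs are hyper-parameters fixed before training begins.

    ```python
    # Toy data: y is roughly 2x + 1 plus a little noise (values invented).
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 2.9, 5.2, 7.1, 8.8]

    # Hyper-parameters: chosen before training, never updated by it.
    learning_rate = 0.01
    epochs = 2000

    # Parameters: estimated from the data during training.
    w, b = 0.0, 0.0

    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

    print(f"learned parameters: w = {w:.2f}, b = {b:.2f}")  # roughly 2 and 1
    ```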

  • Role of Data in Model Training

    There is an ongoing discussion about the role of data in model training, in terms of both data quality and synthetic data. A recent Microsoft paper, Textbooks Are All You Need, focuses on training models to write Python code. However, the paper has implications beyond coding.

    The models examined in the paper do not owe their success to any pathbreaking design or training method; the architecture and training methods are conventional. The innovation lies in the training data, which improves the learning efficiency of language models for code.

    The Scaling Laws for Neural Language Models paper (2020) focused on model size: large models trained on a modest quantum of data. DeepMind’s Training Compute-Optimal Large Language Models focused instead on data size, arguing that the large models of the day were undertrained. In 2023, the focus shifted to data quality, with a leaked Google memo asserting that data quality scales better than data size.

    Microsoft’s Textbooks Are All You Need sits squarely in this move towards data quality. The paper demonstrates the feasibility of training a capable LLM for Python code with a selection of ‘textbook quality’ data from the web plus synthetically generated data.

    Less Is More for Alignment (LIMA) likewise shows that a small, high-quality dataset can produce impressive results.

    Then there is the issue of synthetic data, i.e. the output of the models themselves. Smaller models have been trained on the output of larger models, e.g. Alpaca and Vicuna. Thought should be given to whether larger models can benefit from training on their own output. In any case, LLM training data should have sufficient diversity.

    Textbooks Are All You Need brings out evidence that data quality can compensate for data quantity and model size. The discussion around synthetic data will persist; it has already shown good results in image processing. The paper also found that language models trained on synthetic data of textbook quality achieved state-of-the-art performance on a variety of tasks. Synthetic data can thus be a valuable resource for training language models, especially when high-quality real-world data is not available.

  • LENS : Contextual AI

    Apart from generative natural-language tasks, LLMs can be put to work on tasks involving vision. One option is to train an optical encoder to represent pictures as a series of continuous embeddings. Alternatively, a contrastively trained vision encoder can be kept frozen and a lightweight transformer aligned to it.

    Either way, there are costs involved in pretraining, since textual and visual datasets have to be aligned to an existing LLM.

    Flamingo adds cross-attention layers to a pre-trained LLM in order to incorporate visual features. The multi-modal pre-training uses roughly 2 billion picture-text pairs and 43 million web pages.

    Researchers from Contextual AI and Stanford University have developed the LENS (Large Language Models ENhanced to See) approach. Here an LLM functions as a reasoning module operating over the outputs of vision modules.

    First, rich textual information is extracted using pretrained vision modules. It is then sent to the LLM, which carries out tasks such as object recognition and vision-and-language reasoning. LENS thus becomes a bridge between the modalities, without additional expense, since the multi-modal pre-training stages are eliminated. In addition, the integration lets us take advantage of the most recent developments in both computer vision and NLP.

    LENS taps the language model’s few-shot, in-context learning capabilities through natural-language descriptions of the visual input. It gives an off-the-shelf LLM the ability to see: frozen LLMs can handle object recognition and visual reasoning tasks, with no need to align multi-modal data.
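
    How the pieces fit together can be sketched roughly as follows. This is an illustrative mock-up, not the actual LENS code: the function names, tags, caption and prompt format are all placeholders standing in for pretrained vision modules and a frozen, off-the-shelf LLM.

    ```python
    def tag_objects(image):
        """Placeholder for a pretrained object tagger (e.g. a CLIP-style classifier)."""
        return ["dog", "frisbee", "park"]

    def caption(image):
        """Placeholder for a pretrained image captioner."""
        return "a dog jumping to catch a frisbee on a lawn"

    def frozen_llm(prompt):
        """Placeholder for a call to an off-the-shelf, frozen LLM."""
        return "The dog is playing fetch."

    def lens_answer(image, question):
        # 1. Extract rich textual information with pretrained vision modules.
        visual_text = f"Tags: {', '.join(tag_objects(image))}\nCaption: {caption(image)}"
        # 2. Hand the text to the LLM, which does the reasoning in-context.
        prompt = f"{visual_text}\nQuestion: {question}\nAnswer:"
        return frozen_llm(prompt)

    print(lens_answer(image=None, question="What is the dog doing?"))
    ```

    The point of the sketch is that the only interface between vision and language is plain text, which is why no multi-modal alignment training is needed.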

  • Recognition Model

    As we have already observed, Facebook has developed recommendation models with around 10 trillion parameters. Some of Facebook’s DLRMs (deep learning recommendation models) are thus larger than dense generative models such as GPT-3 (175 billion parameters).

    A recognition model determines which class a pattern belongs to, given the observed value x of a random variable X. In wartime, aircraft were identified using such models. The approach combines pattern matching and mental simulation.
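
    In the simplest probabilistic reading, the observed value x is assigned to the class with the highest posterior probability. The numbers below are invented purely for illustration; they are not drawn from any model mentioned here.

    ```python
    # Toy example: two classes with known priors and class-conditional likelihoods at x.
    priors = {"plane": 0.3, "bird": 0.7}
    likelihood_at_x = {"plane": 0.8, "bird": 0.1}   # invented densities p(x | class)

    # Assign x to the class with the largest (unnormalised) posterior p(class) * p(x | class).
    posterior = {c: priors[c] * likelihood_at_x[c] for c in priors}
    print(max(posterior, key=posterior.get))        # "plane"
    ```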

    Recognition and classification models both identify the patterns in the data. However, recognition focuses on identifying and locating specific objects or patterns in the given input. Classification assigns an input to a category based on its content. Recognition is more about detection, and classification more about categorization.

    Recognition models are applied in handwriting recognition, speech recognition, image recognition, object recognition, human activity recognition, document processing and recognition.

  • Sparse-Quantized Representation : SpQR

    These days, large language models dominate natural language processing. Still, researchers feel it is necessary to develop smaller models, even if they are trained on massive data, because smaller models consume fewer computational resources. The LLaMA-7B model, for example, has 7 billion parameters and was trained on 1 trillion tokens, yet it produces results superior to GPT-3 even though it is 25 times smaller.

    LLMs are compressed so that they fit on devices such as laptops, mobiles and tablets, and they should do so without diluting their generative ability. LLM generation is sequential, so even small errors can accumulate and affect the output. Quantization makes a model smaller, say by applying 3- to 4-bit quantization to models with 1 to 10 billion parameters. Instead of storing 16-bit weights, low-bit quantization shrinks the model, but it should not hurt accuracy.

    Sparse-Quantized Representation (SpQR) is an answer to this problem. It offers near-lossless compression of a pretrained LLM to 3-4 bits per parameter, with an end-to-end accuracy error of less than 1%.

    SpQR starts by locating outlier weights, the ones that lead to high errors when quantized. These weights are stored in high precision, while the remaining weights are stored in a lower-precision format (say 3 bits). SpQR also makes use of group quantization, with small groups of 16 contiguous elements represented in a 3-bit format.
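
    A rough numpy sketch of the two ingredients, outlier isolation and small-group 3-bit quantization, is given below. It is a simplification for illustration only: SpQR’s actual outlier detection is based on quantization error rather than the simple z-score used here, and the real format stores compact codes and scales instead of dequantizing in place.

    ```python
    import numpy as np

    def quantize_group(w, bits=3):
        """Uniform round-to-nearest quantization of one small weight group."""
        levels = 2 ** bits - 1
        lo, hi = w.min(), w.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = np.round((w - lo) / scale)   # 3-bit integer codes, 0..7
        return codes * scale + lo             # dequantized approximation

    def spqr_like(weights, group_size=16, bits=3, z_threshold=3.0):
        """Sketch: keep outlier weights in full precision, quantize the rest per group."""
        w = weights.astype(np.float64)
        # 1. Locate outlier weights; quantizing them would cause large errors,
        #    so they stay in high precision (here: simply left untouched).
        z = np.abs(w - w.mean()) / (w.std() + 1e-8)
        is_outlier = z > z_threshold
        # 2. Quantize the remaining weights in small contiguous groups of 16.
        for start in range(0, w.size, group_size):
            grp = slice(start, start + group_size)
            keep = ~is_outlier[grp]
            if keep.any():
                w[grp][keep] = quantize_group(w[grp][keep], bits)
        return w, is_outlier

    rng = np.random.default_rng(0)
    weights = rng.normal(size=256)
    weights[::50] *= 10                        # plant a few outliers
    approx, outliers = spqr_like(weights)
    print("outliers kept in high precision:", int(outliers.sum()))
    print("mean absolute error:", float(np.abs(approx - weights).mean()))
    ```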

    The LLM is converted into the SpQR format through a post-training quantization (PTQ) approach. The quantized LLM can then run on a single 24 GB GPU without any deterioration in performance.

  • Quantum Computing

    Quantum computing focuses on developing computer technology based on the principles of quantum theory. It makes use of the extraordinary ability of sub-atomic particles to exist in many states, such as 0 and 1, simultaneously, which lets quantum machines represent and process far more possibilities than conventional computers of comparable size. Quantum computing operations use an object’s quantum state to create a qubit, the basic unit of data in quantum computing. It serves the same purpose as a bit in conventional computing, but behaves differently: a traditional binary bit can only hold a value of 0 or 1, whereas a qubit can be in a superposition of both states.

    Compared with supercomputers, quantum computers are smaller, consume less power and arguably look more elegant.

    Qubits are used to execute multi-dimensional quantum computations. They operate at incredibly low temperatures, about a hundredth of a degree above absolute zero, achieved with supercooled superfluids used as refrigerants. At such low temperatures, certain materials in the processors exhibit an important quantum mechanical property: electrons move through them without resistance, which makes them superconductors.

    Superconducting qubits are built from Josephson junctions. Microwave photons are directed at these qubits to control their behaviour, allowing individual units of quantum information to be stored, changed and read.

    If two qubits are entangled, changes to one immediately affect the other. Quantum algorithms exploit these connections to solve complex problems.
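
    Superposition and entanglement can be illustrated with a tiny state-vector simulation in numpy (amplitudes only; real hardware is analog and noisy). The Hadamard gate puts a qubit into an equal superposition, and a CNOT gate then entangles it with a second qubit to form a Bell state.

    ```python
    import numpy as np

    # Single-qubit basis states and the Hadamard gate.
    zero = np.array([1.0, 0.0])
    one = np.array([0.0, 1.0])
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    # Superposition: H|0> gives equal amplitude on 0 and 1.
    plus = H @ zero
    print("P(0), P(1) after Hadamard:", np.abs(plus) ** 2)      # [0.5, 0.5]

    # Entanglement: CNOT applied to (H|0>) tensor |0> gives the Bell state
    # (|00> + |11>) / sqrt(2); measuring one qubit fixes the other.
    CNOT = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]])
    bell = CNOT @ np.kron(plus, zero)
    print("P(00), P(01), P(10), P(11):", np.abs(bell) ** 2)     # [0.5, 0, 0, 0.5]
    ```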

    Microsoft offers quantum technology through its Azure Quantum platform, and Google too allows access to its quantum computers.

  • Select LLMs

    The first LLM that comes to mind is GPT-4, which was released in March 2023. It accepts both text and images as input, and it hallucinates far less than GPT-3.5. GPT-4 has been aligned with reinforcement learning from human feedback. It reportedly has more than 1 trillion parameters and supports a context length of 32,000 tokens. Its architecture is rumoured to consist of 8 separate models of about 220 billion parameters each. Its weakness is that it is slow to respond; its inference time is much higher.

    GPT-3.5 is another LLM. It is incredibly fast, generating responses within seconds, and has a context length of 16,000 tokens. Its weakness is that it hallucinates a lot.

    The third LLM is PaLM 2 from Google. Its forte is logic, math and coding in 20-plus languages. It has 540 billion parameters and a context length of 4,100 tokens. Google has released four models based on PaLM 2 in different sizes: Gecko, Otter, Bison and Unicorn. The model is multi-lingual; it can understand idioms, riddles and the nuances of different languages, and it is quick to respond.

    You may not be aware of it, but Anthropic, which is backed by Google, has developed an LLM called Claude v1. It is built to be a helpful, honest and harmless assistant. Its largest context window is 100,000 tokens, into which about 75,000 words can be loaded. Cohere offers another model, with just 6 billion parameters, aimed at enterprises.

    The Technology Innovation Institute (TII) has introduced Falcon, an open-source LLM. Facebook has released LLaMA models in various sizes, from 7 billion to 65 billion parameters, all open source. Guanaco-65B is a LLaMA-derived model, and Vicuna-33B is another open-source LLM derived from LLaMA. MPT-30B is an open-source model that competes with the LLaMA-derived models and has a context length of 8,000 tokens. 30B-Lazarus, developed by CalderaAI, uses LLaMA as its foundation model. WizardLM is an open-source LLM built to follow complex instructions. GPT4All runs local LLMs on your own computer, without any dedicated GPU or internet connectivity.

  • Machine Take Over : Just Prevent It

    Geoffrey Hinton, formerly of Google and now at the University of Toronto, is one of the godfathers of AI, along with LeCun and Bengio. On Wednesday, 28 June 2023, he spoke at the Collision tech conference in Toronto to a packed audience of 30,000 startup founders, investors and techies. He urged governments across the world to make sure that machines do not take control of society. The audience, however, had come to explore how to ride the AI wave and was not much interested in hearing about the dangers of AI.

    Hinton strongly feels that AI can take control away from human beings. Critics may feel he is overplaying the risks, but to him the risk is real. Besides, AI deepens inequality, making the rich richer and the poor poorer. He is also worried about fake news spread by ChatGPT-like bots.

    Hinton feels AI-generated content should be marked, much as watermarks are used on currency. The European Union may consider such a move in its legislation.

    The conference discussions stayed far away from the threats posed by AI; they were about the opportunities created by the new technology. For many, it is premature to treat AI as an existential threat. As Andrew Ng puts it, it is ‘like talking about overpopulation on Mars.’

  • Facebook’s Recommendation Models of AI

    Facebook made a bold claim on 28 June 2023 that the recommendation models it is working on could surpass the biggest LLMs of today. Facebook is researching multi-modal AI, combining visual and auditory signals to better comprehend a piece of content. Some of these models are in the public domain, and some are used internally to improve the relevance or targeting of messages. These advanced models, said to understand people’s preferences, have tens of trillions of parameters; in other words, they are orders of magnitude larger than the biggest language models of today. Is the company talking about a theoretical possibility, the potential of a model? It claims that such very large models can already be trained and deployed efficiently at scale. Is it ready to build the infrastructure for such a model? Perhaps what it is aiming at is aspirational.

    Preference understanding and modelling is a sort of behavioral analysis. Are they aiming at training the models on practically every written work available?

    The 100 trillion parameter claim, though somewhat exaggerated, still shows that Facebook is aiming at something scarily big.

    Facebook is conceiving a model larger than anything yet created, and it would like to dazzle advertisers with science. There would be large-scale attention models, possibly graph neural networks, few-shot learning and other techniques, with an architecture built around a hierarchical deep neural retrieval network.

    Researchers may not be impressed; they are familiar with such ideas. Users either do not understand or do not care. Advertisers, however, do want to put their money where it is well spent, and Facebook is trying to convince them that it excels at understanding consumer behaviour. The primary aim of social media and tech platforms is to sell ads with more granular, precise targeting. Even as users revolt against all this, the platforms try to impress upon advertisers the value and legitimacy of targeting. Advertising becomes more prolific, but the question is whether it gets any better.

    These platforms do not do market research to help their users. Have they ever done research to tell us which ten advertising books are the best for media students? Instead, they look over our shoulders while we surf the net, and if we buy some toffees they bombard us with toffee ads the next day.

    Do we really need a model with 10 trillion parameters just to tell us what people like? And do we need to spend a fortune building it?