Blog

  • Working of Vector-based Models

    When vectors are compared for similarity, one often hears the term high-dimensional space. In maths and computer science, we deal with a large number of dimensions, where each dimension corresponds to a different variable or feature. In 3-D space, the three dimensions are length, width and height. In high-dimensional space, there are many more dimensions. Imagine a dataset where each data point represents a person, with features such as age, height, weight, income and education. Each of these features constitutes a dimension in the dataset. Thus, a dataset with five features exists in a 5-D space. By high-dimensional space, we mean datasets with a large number of features or dimensions: maybe dozens, hundreds or even thousands.

    It is difficult to visualize such spaces since we experience only three dimensions in the physical world. In high-dimensional space, certain phenomena occur which do not occur in lower dimensions, e.g. the curse of dimensionality (the distances between points become less meaningful as the number of dimensions increases).
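
    The effect on distances can be seen in a small experiment. The sketch below is only an illustration under assumed settings (uniformly random points, 200 points per run, a fixed seed, and the helper name relative_contrast, none of which come from the text above). It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

    When the contrast is small, the nearest and the farthest neighbour are almost equally far away, which is why distance-based comparisons lose meaning in very high dimensions.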

    While comparing vectors in high-dimensional spaces, measures like cosine similarity or Euclidean distance are commonly used to quantify how similar or dissimilar the vectors are. These measures help us in tasks such as clustering, classification and information retrieval.

    Cosine similarity measures the cosine of the angle between two vectors in the space and ranges from -1 to 1. A value of 1 means the vectors point in the same direction (perfect similarity), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions (complete dissimilarity).
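
    A minimal sketch of the two measures in plain Python may help. The function name cosine_similarity and the sample vectors are chosen for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2], [2, 4]))    # same direction: close to 1
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal: 0
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction: close to -1

# Euclidean distance also reacts to length, not only to direction.
print(math.dist([1, 2], [2, 4]))
```

    Note that [1, 2] and [2, 4] point the same way, so their cosine similarity is about 1, while their Euclidean distance is not zero. Cosine similarity ignores the length of the vectors; Euclidean distance does not.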

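    Sine similarity measures the sine of the angle between two vectors. Since that angle lies between 0 and 180 degrees, the sine ranges from 0 to 1. It is 0 when the vectors point in the same or in opposite directions and 1 when they are orthogonal, so it is read the other way round from cosine similarity.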

    Cosine similarity measures alignment, while sine similarity measures misalignment. Sine similarity is less commonly used than cosine similarity. It is, however, useful in certain contexts such as image processing or signal analysis.
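
    The contrast between alignment and misalignment can be sketched as follows. This assumes the angle between two vectors is taken between 0 and 180 degrees, so the sine is computed as the square root of 1 minus the squared cosine. The function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def sine_similarity(a, b):
    """Sine of the angle between a and b, with the angle in [0, 180] degrees."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.sqrt(1.0 - cos * cos)

pairs = {
    "same direction": ([1, 2], [2, 4]),
    "orthogonal": ([1, 0], [0, 1]),
    "opposite": ([1, 2], [-1, -2]),
}
for name, (a, b) in pairs.items():
    print(name, round(cosine_similarity(a, b), 3), round(sine_similarity(a, b), 3))
```

    Computed this way, the sine is near 0 for vectors on the same line (whether they point the same or the opposite way) and 1 for orthogonal vectors.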

    A large language model such as GPT uses vector comparison both during training and during text generation. In training, the model learns to associate words, phrases and sentences with corresponding vectors in a high-dimensional space. These vectors capture semantic and syntactic information about the language and are used to make predictions about the next word in a sequence, given some input context. While generating text, the model uses these learned representations to determine the likelihood of different words or sequences of words. It can compare the vectors representing different words or sequences to determine which are more similar or relevant in the current context. Thus, the model produces coherent and contextually appropriate responses.

    Vector comparison is an integral part of both the training and generation processes for LLMs. These techniques capture semantic relationships between words and texts in a continuous vector space.

    The idea that vector comparison could facilitate NLP tasks evolved over a period of time. It originated from the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. It goes back to the work of linguists and cognitive scientists such as Zellig Harris (1950s). It was later formalized in the field of computational linguistics.

    The milestone in this development was the Word2Vec model proposed by Tomas Mikolov and his colleagues at Google in 2013.

    Since then, more sophisticated methods for representing and comparing vectors have evolved. All this led to the emergence of large language models (LLMs) such as GPT and BERT. These models leverage vector comparison as a fundamental component of their architecture.

    Can NLP keep advancing along the vector-based line, or will it require deviations and non-linear breakthroughs? To achieve AGI, we may require new paradigms, new architectures and innovative techniques. There should be advances in symbolic reasoning, commonsense understanding and context-aware processing. All this cannot rely solely on vector-based representations.

    In future, we may require a hybrid approach that combines the strengths of vector-based methods with other AI techniques such as symbolic reasoning, probabilistic modelling and neurosymbolic approaches.

    The future will both sustain vector-based advances and search for breakthroughs in new paradigms and approaches.

  • AI and Copyright

    Though AI has affected our lives, it has been alleged that it ingests copyrighted works while being trained. Several authors have already sued, and the New York Times sued OpenAI and Microsoft for copyright violation in December 2023. The NYT alleges that LLMs have been built by copying and using copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides and so on.

    This deprives the NYT of the fruits of its labour. Instead, the LLMs enjoy the fruits of the NYT’s labour.

    These LLMs at times reproduce the copyrighted content verbatim, closely summarize it or mimic its writing style, as demonstrated by examples.

    In short, the LLMs use intellectual property without paying for it, while the AI companies enrich themselves in terms of valuation. Specimens of NYT articles reproduced verbatim are attached to the complaint.

    The AI companies call the use of material for training fair use. The NYT counters that there is nothing transformative about using its content without paying for it.

    OpenAI responds that the Times paid someone to hack OpenAI’s products. The ‘anomalous results’ were generated only after tens of thousands of deliberately crafted prompts, which violates the terms of use of the model. Besides, these articles have appeared on multiple public websites.

    Microsoft compares the suit to Hollywood’s legal challenge to the VCR, where copyright infringement was also alleged. The courts ruled in favour of the technology. That decision did not destroy Hollywood. Instead, the entertainment industry flourished.

    Not all news organizations have chosen to fight. Some have joined hands with AI companies by striking deals, such as a licensing deal to use archives of news stories.

  • Cosine Similarity

    In similarity searches over vectors, cosine similarity is the most widely used measure. It is a measure of similarity between two non-zero vectors of an inner product space: the cosine of the angle between them. In NLP, cosine similarity is often used to compare words or documents represented as vectors in a high-dimensional space (word embeddings or document embeddings).
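
    Written out as a formula, with A and B standing for any two non-zero n-dimensional vectors (the symbols and the worked example are chosen here for illustration), the definition reads:

```latex
\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{i=1}^{n} A_i B_i}
                  {\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}

% Worked example with A = (1, 0, 1) and B = (1, 1, 0):
\cos\theta = \frac{1\cdot 1 + 0\cdot 1 + 1\cdot 0}{\sqrt{2}\,\sqrt{2}}
           = \frac{1}{2}, \qquad \theta = 60^\circ
```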

  • Learning Dependencies

    A large language model learns dependencies. More precisely, it learns to understand and capture relationships between different elements of language: words, phrases or sentences. These relationships include both syntactic dependencies (subject-verb agreement) and semantic dependencies (‘tiger’ and ‘animal’ are related concepts).

    When there is a sentence with a missing word, the model is able to predict that word based on the context and dependencies it has learnt from the vast amount of text it has been trained on. It also understands the meaning of a sentence and generates coherent responses based on this understanding.

    The model’s learning of dependencies is attributed to pattern recognition in the text it is trained on, capturing statistical regularities and adjusting its internal parameters accordingly while being trained.

    It thus captures the complex structure of the language and can perform various NLP tasks.

  • Gecko: Text Embedding Model

    Google has revealed a text embedding model, Gecko, which is trained on FRet, a synthetic dataset generated by an LLM. What are text embedding models? They represent natural language as dense vectors and place semantically similar texts close to each other within the embedding space.

    In other words, text embedding models act as translators for computers — they convert text into numbers which a computer understands.

    As we now know, embeddings are numerical representations. They capture semantic information about the words and sentences in a text, which enables computers to process natural language. Such processing supports a wide range of tasks (document retrieval, sentence similarity, classification, clustering). Instead of building a separate model for each of these tasks, a single model is being pushed for a variety of tasks.
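
    The idea of turning text into numbers and then comparing the numbers can be shown with a deliberately crude sketch. It uses word counts as the vector, which is not how Gecko or any learned embedding model works (those produce dense vectors that also capture meaning). The function names, documents and query are all made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return [counts[word] for word in vocabulary]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

documents = [
    "the tiger is a large wild animal",
    "the stock market fell sharply today",
    "a cat is a small domestic animal",
]
query = "which animal is a tiger"

# Shared vocabulary built from every word that appears anywhere.
vocabulary = sorted({w for text in documents + [query]
                     for w in re.findall(r"\w+", text.lower())})

query_vec = embed(query, vocabulary)
ranked = sorted(documents,
                key=lambda doc: cosine(query_vec, embed(doc, vocabulary)),
                reverse=True)
print(ranked[0])
```

    The same vectors could feed retrieval, similarity, classification or clustering, which is the sense in which one embedding model serves many tasks.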

    Being a general-purpose model, it requires a huge amount of training data. It is here that LLMs come in handy. This is what Google has done: it leverages LLMs to train Gecko. Gecko is built in two LLM-powered steps. First, synthetic data is generated using an LLM. Second, the data is refined by retrieving a set of candidate passages for each query and relabeling the positive and negative passages using the same LLM, which re-ranks the passages based on its scores. With this approach Gecko achieves strong retrieval performance. It also performs strongly as a zero-shot embedding model on the Massive Text Embedding Benchmark (MTEB).

    In Gecko, LLM-generated and LLM-ranked data is combined with human-annotated data. It achieved the best performance on the MTEB benchmark (average score 66.31).
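
    The relabeling step can be sketched roughly as below. This is not Gecko's actual implementation: the function names are hypothetical, the word-overlap scorer is only a stand-in for the LLM's relevance score, and taking the lowest-ranked candidate as the negative is a simplifying assumption.

```python
def overlap_score(query, passage):
    """Stand-in for an LLM relevance score: share of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)

def relabel(query, candidates, score):
    """Re-rank retrieved candidates by score and pick a positive and a negative passage."""
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    positive = ranked[0]
    negative = ranked[-1] if len(ranked) > 1 else None
    return positive, negative, ranked

query = "how do tigers hunt prey"
candidates = [
    "deer are prey animals",
    "tigers hunt prey by ambush at night",
    "how tigers mark their territory",
]
positive, negative, ranked = relabel(query, candidates, overlap_score)
print("positive:", positive)
print("negative:", negative)
```

    The point of the step is that the passage a query was generated from is not assumed to be the best answer; the scorer decides which candidate is treated as the positive.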

  • Stargate Supercomputer

    Microsoft and OpenAI propose to collaborate on building a supercomputer called Stargate to power AI. This supercomputer could be 100 times more expensive than the largest data centers currently in operation.

    This computer will use millions of specialized server chips and could cost up to $100 billion. It will take the next five or six years to build and could be launched in 2028. There could be a series of separate installations. It will push the frontiers of AI.

    Microsoft could fund Stargate. Microsoft has already committed more than $13 billion to OpenAI. OpenAI at present uses Microsoft data centers to power ChatGPT. In return, Microsoft gets to exclusively resell OpenAI’s technology to its own customers.

    This supercomputer will require more computing power than what is currently supplied by Microsoft to OpenAI. At the same time, it will require several gigawatts of electric power, for which a nuclear energy option could be considered.

    The project is expensive as it involves the acquisition of millions of specialized chips. Also, the cost of power makes it expensive.

  • Next Word Prediction

    To generate text, a model predicts the next word in a sequence using a probability distribution. This is the basic task in NLP (natural language processing).

    The first step is tokenization of the input text into individual words or parts of words. Each token is a unit of language: a word or a sub-word.

    The language model is a neural network. The input it takes is the sequence of tokens. It is trained on a vast corpus of text. It learns relationships between words and their context.

    Context representation involves processing the input sequence up to the current token to create a representation of the context. What is captured is the information carried by the preceding words in the sequence.

    After context representation, the model calculates the probability distribution over the vocabulary for the next word. A probability is assigned to each word in the vocabulary. It indicates the likelihood of that word being the next word in the sequence, in the light of the context.
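
    As a rough illustration of this step, the sketch below splits a sentence into word and punctuation tokens and maps each token to an integer id. It is a toy scheme with made-up function names. Production models typically use learned sub-word tokenizers, which this sketch does not attempt to reproduce.

```python
import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens (a toy scheme)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

tokens = tokenize("The cat sat on the mat.")
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(ids)     # [0, 1, 2, 3, 0, 4, 5]
```

    The model never sees the text itself, only the sequence of ids.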

    Lastly, the model can either sample from this probability distribution to generate a predicted next word stochastically or simply choose the word with the highest probability as the next predicted word.

    The probability distribution is computed using a softmax activation over the output layer of the model. It converts the raw scores into probabilities. While being trained, the model is optimized to maximize the probability of the correct next word being predicted for each sequence in the training data.

    Here the model leverages its learned relationships between words and their contexts. This enables it to generate coherent and contextually relevant predictions for the next word in a given sequence.
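
    These last steps can be put together in a short sketch. The four-word vocabulary and the raw scores are invented for illustration. A real model has a vocabulary of many thousands of tokens and computes the scores with a neural network.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_choice(vocab, probs):
    """Pick the word with the highest probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best]

def sample_choice(vocab, probs, rng):
    """Draw one word at random, weighted by its probability."""
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["mat", "sofa", "moon", "piano"]  # hypothetical tiny vocabulary
scores = [3.2, 2.1, 0.3, -1.0]            # hypothetical raw scores for "The cat sat on the ..."
probs = softmax(scores)

print([round(p, 3) for p in probs])
print("greedy:", greedy_choice(vocab, probs))
print("sampled:", sample_choice(vocab, probs, random.Random(0)))

# Training pushes this value down, i.e. pushes the correct word's probability up.
loss = -math.log(probs[vocab.index("mat")])
print("loss if 'mat' is correct:", round(loss, 3))
```

    Greedy choice always returns the same word for the same context, while sampling can return less likely words and so gives more varied text.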

  • Neural Network Architecture

    A neural network is basically a mathematical model implemented in software. It runs on computer hardware. Physical manifestations of neural networks can be seen in neuromorphic chips. The architecture itself is a computational model represented in code.

    Even transformers and their encoder-decoder blocks are conceptual components rather than physical objects. They consist of layers of neural network computations. They are implemented through software libraries such as PyTorch or TensorFlow. The architecture can be visualized in diagrams. One can inspect its code implementation.

    Nodes in such an architecture are computational units that perform specific operations on the input data. In the transformer architecture, they represent neurons. These neurons perform operations such as matrix multiplications, activation functions (ReLU, softmax) and layer normalization.

    Each node receives input from the previous layer. It processes the input as per a defined function. It passes the result to the next layer. The collective behaviour of these nodes enables the model to learn complex patterns and make predictions on new data.
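
    The flow from layer to layer can be sketched in a few lines of plain Python. The weights, biases and layer sizes below are arbitrary numbers picked for illustration, not taken from any trained model, and libraries such as PyTorch or TensorFlow would do the same with matrix operations.

```python
import math

def dense(inputs, weights, biases):
    """One layer: each node takes a weighted sum of its inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Activation: keep positive values, replace negative ones with 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn the final scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0]  # input with two features

# Hidden layer with three nodes (hypothetical weights and biases).
hidden = relu(dense(x, [[0.5, -1.0], [1.0, 1.0], [-1.0, 0.5]], [0.0, 0.1, 0.0]))

# Output layer with two nodes, followed by softmax.
output = softmax(dense(hidden, [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, 0.0]))

print(hidden)
print(output)
```

    Training consists of adjusting the weights and biases so that the output probabilities match the data.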

  • Chips and Indian EVs

    India is also moving towards non-ICE cars, mainly electric vehicles (EVs). Electric vehicles are governed by automotive chips used in their battery management systems. Though India has good expertise in chip design, we hear about cases of electric vehicles blowing up. These vehicles blow up because of the wrong chips in their batteries. NXP, a spin-off of Philips, makes automotive chips in India which are the size of a thumbnail and have 4 billion transistors. The node size of the chip is 5 nanometers.

    The reason why electric scooters catch fire is their defective battery management systems (BMS). Think about a laptop chip. A laptop is not exposed to sub-zero temperatures or monsoon rains. It also does not work in high temperatures. The Indian market cannot afford high prices, and therefore it uses the cheaper consumer electronics chips, which are not built for such conditions.

    As we have learnt in another article on autonomous cars, there are five levels of autonomy, where the fifth and last level gives you a fully autonomous car without a steering wheel. Right now, the focus is on level three cars, which can drive themselves on highways where traffic is less complex.

    Just as a human body is a brain on shoes, a car is a brain on wheels. Between the various functional levels, there is an entire body. It requires real-time electronics. It is not only about ML or AI. A car needs creativity when it gets stuck. It cannot stop behind a parked car for hours; it must have the creativity to move ahead.

    Car chip makers are taking baby steps when it comes to generative AI. They do it carefully.

  • Rule 7-38-55

    Albert Mehrabian, a psychologist at the University of California, Los Angeles, put forward a rule in 1967, some 57 years ago, which reads 7-38-55. It means 7 per cent of communication is conveyed by our words, 38 per cent through our tone of voice and 55 per cent through our body language.

    It is a simple rule but is captivating. Some say it is not true, but it cannot be discarded outright. If you understand it, it enhances your emotional intelligence.

    It makes you a better communicator. It facilitates business negotiations. You become aware of body language and tone of voice. You also recognize non-verbal gestures. Yawning indicates the listener is bored. At times, the communicator is aggressive. However, the idea is not to look at sales pitches or interviews. It aims at scanning feelings, words and expressions. It adds up verbal liking, vocal liking and facial liking. These equations apply only when the conversation focuses on feelings or attitudes, not on factual content.

    Maybe your girlfriend is irritated with you and still she says she is fine. Here the rule applies. It applies even when you are not sure whether you and your employees are on the same page. It comes in handy when it is hard to read the intent of the other person. Whenever in doubt, lean on tone and body language. If your words are in conflict with your tone or body language, people will pay attention to what they see rather than what they hear.

    An essential component of emotional intelligence is to read the emotions of others and convey your own more accurately. The rule is catchy and is a good reminder that these skills must be improved. It is not enough to know only what someone says.