SoftMax

SoftMax is a mathematical function that converts a vector of real numbers into a probability distribution. It is frequently used as the output layer activation function in neural networks for classification tasks.

The SoftMax function takes the exponential of each element of the input vector and then normalizes them by dividing by the sum of the exponentials. The result is a probability distribution over multiple classes, making it useful for determining the likelihood of each class being the correct one.

SoftMax is a function used to transform a vector of arbitrary real-valued scores into a probability distribution over classes.

The formula applied to the vector \(\mathbf{z} = (z_1, z_2, \ldots, z_n)\) is:

\[
\mathrm{SoftMax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad i = 1, \ldots, n
\]

SoftMax ensures that the output values are non-negative and sum to 1, making the output a probability distribution. It exponentiates the input scores, which amplifies the differences between them. The function is smooth and differentiable, which makes it suitable for gradient-based training.
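As a minimal sketch of this behaviour, the scores below are arbitrary values chosen purely for illustration; the snippet computes SoftMax both by hand and with PyTorch's built-in function to show that the outputs are non-negative and sum to 1.

import torch

scores = torch.tensor([2.0, 1.0, 0.1])  # arbitrary raw scores for three classes

# Manual SoftMax: exponentiate, then normalize by the sum of exponentials
manual = torch.exp(scores) / torch.exp(scores).sum()

# Built-in SoftMax for comparison
builtin = torch.softmax(scores, dim=0)

print(manual)        # tensor([0.6590, 0.2424, 0.0986])
print(builtin)       # same values as the manual computation
print(manual.sum())  # sums to 1: the outputs form a probability distribution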

It transforms raw scores into probabilities and is useful in various ML algorithms. The raw scores at the output layer are the activations produced by neurons in the output layer before applying any activation function. These raw scores are often represented as a vector of real numbers, where each element corresponds to a different class in the classification task.

In an image classification network with 10 classes, the output layer may have 10 neurons, each producing a raw score representing the network's confidence that the input image belongs to a particular class. These raw scores are passed through the SoftMax function to convert them into probabilities.

SoftMax is used during the training of LLMs and other neural networks. It is a common activation function in the output layer of a neural network for classification tasks.

In LLMs, SoftMax is used in the output layer to compute a probability distribution over the vocabulary of words. During training, the model learns to predict the next word in a sequence based on the input context, and SoftMax helps normalize the model's output into a probability distribution.

When LLMs generate text, SoftMax alone does not select the output; a sampling strategy chooses the next word based on the probabilities predicted by the model, using techniques such as greedy sampling, beam search or nucleus sampling. During training, SoftMax is used to teach the model to produce these probabilities, and during generation the model uses these learned probabilities to guide the sampling process.
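As a small illustrative sketch (the vocabulary and scores below are invented for the example), the snippet contrasts greedy selection with stochastic sampling from a SoftMax distribution over a tiny vocabulary.

import torch

vocab = ["the", "cat", "sat", "mat"]
logits = torch.tensor([2.5, 0.3, 1.2, 0.1])  # hypothetical raw model scores
probs = torch.softmax(logits, dim=0)         # probability distribution over the vocabulary

# Greedy sampling: always pick the most probable word
greedy_idx = torch.argmax(probs).item()

# Stochastic sampling: draw a word according to the probabilities
sampled_idx = torch.multinomial(probs, num_samples=1).item()

print("greedy :", vocab[greedy_idx])
print("sampled:", vocab[sampled_idx])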

PyTorch

PyTorch is an open-source ML framework used for building deep learning models. It provides a flexible, dynamic computational graph, which makes it suitable for research, experimentation and production deployment.

In LLMs, PyTorch is used to build GPT-like pre-trained transformers, and it serves two purposes. The first is model development: PyTorch is used to design and implement the architecture of an LLM. It provides a high-level interface and facilitates the inclusion of components such as layers, activation functions and optimization algorithms. Because it is dynamic, it allows easy experimentation and prototyping. The second purpose is to train and fine-tune the model.

LLMs are trained and fine-tuned on vast amounts of data. The models are pre-trained using unsupervised and self-supervised learning. PyTorch facilitates the implementation of training algorithms, supports distributed computing and enables evaluation of model performance during training.

Let us see how PyTorch is used in building a simple language model. Let us create an RNN model for generating text character by character.

First of all, install PyTorch. Here we write a Python script to define, train and use a simple character-level language model. It trains the model on dummy data, and then generates characters using the trained model.

import torch
import torch.nn as nn
import torch.optim as optim

# Define the RNN model
class CharRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(CharRNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)
        return out, hidden

# Define some hyperparameters
input_size = 100   # Size of input vocabulary
hidden_size = 128  # Number of hidden units
output_size = 100  # Size of output vocabulary
seq_length = 20    # Length of input sequences
num_epochs = 100   # Number of training epochs

# Create an instance of the model
model = CharRNN(input_size, hidden_size, output_size)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Dummy training data: random inputs and random target characters
data = torch.randn(100, seq_length, input_size)
labels = torch.randint(0, output_size, (100, seq_length))

# Training loop
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    # Forward pass
    outputs, _ = model(data)
    outputs = outputs.view(-1, output_size)
    loss = criterion(outputs, labels.view(-1))

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item():.4f}')

# Generating text
def generate_text(model, start_char='A', length=100):
    model.eval()
    with torch.no_grad():
        input_char = torch.zeros(1, 1, input_size)
        input_char[0, 0, ord(start_char) % input_size] = 1
        hidden = None
        output_text = start_char
        for _ in range(length):
            output, hidden = model(input_char, hidden)
            output_probs = torch.softmax(output[0, -1], dim=-1)
            predicted_char = torch.argmax(output_probs).item()
            output_text += chr(predicted_char)
            input_char.fill_(0)
            input_char[0, 0, predicted_char] = 1
        return output_text

# Generate text using the trained model
generated_text = generate_text(model, start_char='A', length=200)
print('Generated Text:')
print(generated_text)

An alternative to PyTorch for designing neural networks is TensorFlow, which offers similar capabilities and is widely used in both research and industry.

Prior to the advent of PyTorch and TensorFlow, neural networks were designed using lower-level libraries and frameworks such as Theano and Caffe. These provided the building blocks for constructing neural networks, but they often required manual coding and were less flexible than the modern frameworks. Researchers and developers had to implement neural network architectures and algorithms from scratch, which was not only time consuming but also error prone. Besides, it was necessary to use specialized hardware and software optimization to achieve acceptable performance for training and inference. Of course, real networks could be designed in the absence of PyTorch and TensorFlow, but the process was cumbersome and less accessible to a wider audience.

PyTorch emerged in 2016, developed by Facebook's AI Research Lab. TensorFlow was released by Google in 2015. The two arrived within about a year of each other, and both have been used widely for deep learning tasks. They can be used by anyone, as they are open-source frameworks with extensive documentation, tutorials and community support.

Working of Vector-based Models

When vectors are compared for similarity, one hears the term high-dimensional space. In maths and computer science, we deal with a large number of dimensions, where each dimension refers to a different variable or feature. In 3-D space, the three dimensions are length, width and height. In high-dimensional space, there are many more dimensions. Imagine a dataset where each data point represents a person, with features such as age, height, weight, income and education. Each of these features constitutes a dimension in the dataset; thus, a dataset with five features exists in a 5-D space. In high-dimensional space, we refer to datasets with a large number of features or dimensions, perhaps dozens, hundreds or even thousands.

It is difficult to visualize such spaces, since we deal with only three dimensions in the physical world. In high-dimensional space, certain phenomena occur which do not occur in lower dimensions, e.g. the curse of dimensionality, where the distances between points become less meaningful as the number of dimensions increases.

While comparing vectors in high-dimensional spaces, techniques like cosine similarity or Euclidean distance are commonly used to measure how similar or dissimilar they are. These measures help us in tasks such as clustering, classification and information retrieval.

Cosine similarity measures the cosine of the angle between two vectors in the space and ranges from -1 to 1. A value of 1 means the vectors point in the same direction (perfect similarity), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions (complete dissimilarity).
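Expressed as a formula, for two non-zero vectors \(\mathbf{a}\) and \(\mathbf{b}\) the cosine similarity is:

\[
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
\]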

Sine similarity measures the sine of the angle between two vectors and ranges from -1 to 1. Its interpretation is similar to that of cosine similarity in terms of direction, but it is calculated differently.

Cosine similarity measures alignment; sine similarity measures misalignment. Sine similarity is less commonly used than cosine similarity, but it is useful in certain contexts such as image processing or signal analysis.

A large language model such as GPT uses vector comparison both during training and during text generation. In training, the model learns to associate words, phrases and sentences with corresponding vectors in a high-dimensional space. These vectors capture semantic and syntactic information about the language and are used to make predictions about the next word in a sequence, given some input context. While generating text, the model uses these learned representations to determine the likelihood of different words or sequences of words. It can compare the vectors representing different words or sequences to determine which are more similar or relevant in the current context. Thus, the model produces coherent and contextually appropriate responses.

Vector comparison is an integral part of both the training and generation processes for LLMs. These techniques capture semantic relationships (between words and texts in a continuous vector space).

The idea that vector comparison could facilitate NLP tasks evolved over a period of time. It originated from the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. It goes back to the work of linguists and cognitive scientists such as Zellig Harris (1950s) and was later formalized in the field of computational linguistics.

The milestone event was the Word2Vec model, proposed by Tomas Mikolov and his colleagues at Google in 2013.

Since then, more sophisticated methods for representing and comparing vectors have evolved. All this led to the emergence of large language models (LLMs) such as GPT and BERT, which leverage vector comparison as a fundamental component of their architecture.

Can NLP continue along a purely vector-based line, or will it require deviations and non-linear breakthroughs? To achieve AGI, we may require new paradigms, new architectures and innovative techniques, including advances in symbolic reasoning, commonsense understanding and context-aware processing. All this cannot rely solely on vector-based representations.

In future, we may require a hybrid approach that combines the strengths of vector-based methods with other AI techniques such as symbolic reasoning, probabilistic modelling and neurosymbolic approaches.

The future will likely sustain vector-based advances while searching for breakthroughs in new paradigms and approaches.

AI and Copyright

Though AI has affected our lives, it has been alleged that it ingests copyrighted works while being trained. Already, several authors and the New York Times have sued OpenAI and Microsoft (December 2023) for copyright violation. The NYT alleges that the LLMs have been built by copying and using its copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides and so on.

This deprives the NYT of the fruits of its labour; instead, the LLMs enjoy those fruits.

These LLMs at times reproduce the copyrighted content verbatim, closely summarize it, or mimic its writing style, as demonstrated by examples.

In short, the LLMs use intellectual property without paying for it, and AI companies enrich themselves in terms of valuation. Specimens of NYT articles reproduced verbatim are attached to the suit.

The AI companies describe the use of such material for training as fair use. The NYT counters that there is nothing transformative about using its content without paying for it.

OpenAI responds that the Times paid someone to hack OpenAI's products, and that the 'anomalous results' were generated only after tens of thousands of motivated prompts, which violates the model's terms of use. Besides, these articles have appeared on multiple public websites.

Microsoft compares the suit to the copyright case Hollywood brought against the VCR. The courts ruled in favour of the technology, and that decision did not destroy Hollywood; instead, the entertainment industry flourished.

Not all news corporations have chosen to fight. Some have joined hands with AI companies by striking deals, such as licensing deals to use their archives of news stories.

Cosine Similarity

In similarity searches over vectors, cosine similarity is the most widely used measure. It is a measure of similarity between two non-zero vectors of an inner product space; it measures the cosine of the angle between them. In NLP, cosine similarity is often used to compare the similarity of words or documents represented as vectors in a high-dimensional space (word embeddings or document embeddings).
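A minimal sketch of such a comparison, using made-up three-dimensional 'embeddings' purely for illustration (real word embeddings typically have hundreds of dimensions):

import torch
import torch.nn.functional as F

# Hypothetical low-dimensional embeddings; real embeddings are much larger
tiger = torch.tensor([0.9, 0.2, 0.4])
animal = torch.tensor([0.8, 0.3, 0.5])
car = torch.tensor([-0.1, 0.9, -0.6])

print(F.cosine_similarity(tiger, animal, dim=0))  # close to 1: similar concepts
print(F.cosine_similarity(tiger, car, dim=0))     # much lower: dissimilar concepts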

Learning Dependencies

A large language model learns dependencies. More precisely, it learns to understand and capture relationships between different elements of language: words, phrases or sentences. These relationships include syntactic dependencies (subject-verb agreement) and semantic dependencies ('tiger' and 'animal' are related concepts).

When there is a sentence with a missing word, the model is able to predict that word based on the context and dependencies it has learnt from the vast amount of text it has been trained on. It also understands the meaning of a sentence and generates coherent responses based on this understanding.

The model’s learning of dependencies is attributed to pattern recognition in the text it is trained on, capturing statistical regularities and adjusting its internal parameters accordingly while being trained.

It is thus equipped to handle the complex structure of the language and to perform various NLP tasks.

Gecko: Text Embedding Model

Google has revealed a text embedding model, Gecko, which is trained on FRet, an LLM-generated synthetic dataset. What are text embedding models? They represent natural language as dense vectors, placing semantically similar text close together within the embedding space.

In other words, text embedding models act as translators for computers: they convert text into numbers which a computer understands.

As we know by now, embeddings are numerical representations that capture semantic information about words and sentences in the text, enabling computers to process natural language. Such processing supports a wide range of tasks (document retrieval, sentence similarity, classification, clustering). Instead of building a separate model for each of these tasks, a single model is used for a variety of tasks.

Being a general-purpose model, it requires huge amounts of data for training. It is here that LLMs come in handy, and this is what Google has done: leverage LLMs for training Gecko. Gecko is a two-step, LLM-powered embedding model. Synthetic data is generated using an LLM and is refined by retrieving a set of candidate passages for each query. The passages are then relabeled as positive and negative using the same LLM, which re-ranks them based on its scores. Gecko uses this approach to achieve strong retrieval performance as a zero-shot embedding model on the Massive Text Embedding Benchmark (MTEB).

In Gecko, LLM-generated and LLM-ranked data is combined with human-annotated data. It achieved the best performance on the MTEB benchmark (average score 66.31).

Stargate Supercomputer

Microsoft and OpenAI propose to collaborate on building a supercomputer called Stargate to power AI. This supercomputer could be 100 times more expensive than the largest data centers currently in operation.

The computer will use millions of specialized server chips and could cost up to $100 billion. It will take the next five or six years to build and could be launched in 2028, possibly as a series of separate installations. It will push the frontiers of AI further.

Microsoft could fund Stargate; it has already committed more than $13 billion to OpenAI. OpenAI at present uses Microsoft data centers to power ChatGPT, and in return Microsoft gets to exclusively resell OpenAI's technology to its own customers.

This supercomputer will require more computing power than Microsoft currently supplies to OpenAI. It will also require several gigawatts of electric power, for which a nuclear energy option could be considered.

The project is expensive both because it involves the acquisition of millions of specialized chips and because of the cost of power.

Next Word Prediction

To generate text, a model predicts the next word in a sequence using a probability distribution. This is a basic task in NLP (natural language processing).

The first step is tokenization of the input text into individual words or parts of words. Each token is a unit of language, a word or a sub-word.

The language model is a neural network that takes the sequence of tokens as input. It is trained on a vast corpus of data and learns relationships between words and their context.

Context representation involves processing the input sequence up to the current token to create a representation of the context, capturing the information carried by the preceding words in the sequence.

After context representation, the model calculates a probability distribution over the vocabulary for the next word. A probability is assigned to each word in the vocabulary, indicating the likelihood of that word being the next one in the sequence, in light of the context.

Lastly, the model can either sample from this probability distribution to generate a predicted next word stochastically or simply choose the word with the highest probability as the next predicted word.

The probability distribution is computed using a SoftMax activation over the output layer of the model, which converts the raw scores into probabilities. During training, the model is optimized to maximize the probability of the correct next word for each sequence in the training data.
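As a rough sketch with an invented four-word vocabulary and made-up raw scores, the snippet below converts the output-layer scores into probabilities with SoftMax and computes the cross-entropy training loss, which is the negative log-probability of the correct next word:

import torch
import torch.nn.functional as F

vocab = ["the", "cat", "sat", "mat"]
logits = torch.tensor([1.8, 0.4, 2.6, 0.2])  # hypothetical raw scores from the output layer
target = torch.tensor(2)                     # index of the correct next word ("sat")

probs = F.softmax(logits, dim=0)             # probability distribution over the vocabulary
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))  # equals -log(probs[target])

print(probs)
print(loss, -torch.log(probs[target]))       # the two values match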

In this way the model leverages the relationships it has learned between words and their contexts, enabling it to generate coherent and contextually relevant predictions for the next word in a given sequence.

Neural Network Architecture

A neural network is basically a mathematical model implemented in software. It runs on computer hardware; physical manifestations of neural networks are seen in neuromorphic chips. The architecture itself is a computational model represented in code.

Even transformers and their encoder-decoder blocks are conceptual components rather than physical objects. They consist of layers of neural network computations implemented through software libraries such as PyTorch or TensorFlow. The architecture can be visualized in diagrams, and one can inspect its code implementation.

Nodes in such an architecture are computational units that perform specific operations on the input data. In the transformer architecture, they represent neurons that perform operations such as matrix multiplications, activation functions (ReLU, SoftMax) and layer normalization.

Each node receives input from the previous layer, processes it according to its defined function, and passes the result to the next layer. The collective behaviour of these nodes enables the model to learn complex patterns and make predictions on new data.
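A minimal sketch of one such layer of nodes in PyTorch, combining a matrix multiplication (a linear layer), a ReLU activation and layer normalization; the sizes are arbitrary and chosen only for illustration:

import torch
import torch.nn as nn

class SimpleBlock(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)  # matrix multiplication plus bias
        self.norm = nn.LayerNorm(dim_out)         # layer normalization

    def forward(self, x):
        x = self.linear(x)   # each node combines its inputs from the previous layer
        x = torch.relu(x)    # non-linear activation
        return self.norm(x)  # normalized output passed on to the next layer

block = SimpleBlock(16, 32)
out = block(torch.randn(4, 16))  # a batch of 4 input vectors
print(out.shape)                 # torch.Size([4, 32])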