Backpropagation

A key algorithm in the training of artificial neural networks is backpropagation. It calculates the gradient of the loss function with respect to the weights of the network.

The gradient is used to update the weights. That enables the model to learn.

Backpropagation takes place during the training phase, after the forward pass in which the input data is propagated through the network to make predictions. Backpropagation constitutes the backward pass, where gradients are calculated and weights are updated.

The network’s output, or prediction, and the actual correct output are compared. Iteratively, the weights are adjusted, so the network learns to map the input to the desired correct output. The network learns from its mistakes.

Say an input image is classified; that is the prediction. It is compared to the actual answer. The difference between the prediction and the answer is then propagated backward through the network. While travelling backward, the weights between neurons are adjusted to minimize the error for future predictions.

Here calculus is leveraged. The chain rule is used to determine how much each weight contributes to the overall error. These gradients are calculated. This way the algorithm identifies how the weights should be adjusted so as to minimize the error and improve the network’s performance.

It is complex math, but the idea is to make the network learn iteratively, refining its internal connections based on the errors it makes.

The weights are adjusted using gradient descent (a common optimization algorithm). The specific calculations apply the chain rule of calculus repeatedly to differentiate the error function through the layers of the network.

Gradients provide the direction and magnitude of how much the function changes in response to changes in the inputs. In this context, the inputs are the weights connecting neurons.
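
As a minimal sketch (not taken from any particular library), the chain rule and gradient descent steps can be written out for a single sigmoid neuron with a squared-error loss; the numbers and variable names are purely illustrative.

import numpy as np

# A minimal sketch of backpropagation with gradient descent:
# one input x, one weight w, a sigmoid neuron, and a squared-error loss.
x, target = 1.5, 0.0
w = 0.8                      # initial weight (arbitrary)
lr = 0.5                     # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(20):
    # Forward pass: prediction and loss
    z = w * x
    y = sigmoid(z)
    loss = 0.5 * (y - target) ** 2

    # Backward pass: chain rule
    # dL/dw = dL/dy * dy/dz * dz/dw
    dL_dy = (y - target)
    dy_dz = y * (1.0 - y)        # derivative of the sigmoid
    dz_dw = x
    dL_dw = dL_dy * dy_dz * dz_dw

    # Gradient descent update
    w -= lr * dL_dw

print("final weight:", w, "final loss:", loss)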

Issue of Vanishing Gradient

In neural network training, we come across the problem of the vanishing gradient. The gradients become extremely small during backpropagation. This hinders training, especially when there are many layers, since the weights of the early layers are not effectively updated.

Consequently, these layers are very slow to learn, or do not learn at all. It results in suboptimal performance, or a failure to converge.

Some techniques are used to overcome this issue: careful weight initialization, batch normalization, and skip connections.

The issue commonly occurs in RNNs during training. Here the gradients used to update the weights of the network become very small while they are being backpropagated through the network layers. This is particularly so when saturating activation functions such as the sigmoid are used.

ReLU and leaky ReLU are less prone to vanishing gradients. Weights should be initialized so that gradients flow easily through the network; one can use techniques such as Xavier initialization or He initialization. Gradient clipping can be used to limit the magnitude of the gradients (mainly to prevent them from exploding).

As ReLU has a gradient of either 1 or 0, the vanishing effect is mitigated.
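
A small illustrative sketch of this effect: multiplying a gradient by the sigmoid derivative (at most 0.25) at each of ten layers shrinks it towards zero, while ReLU, for positive activations, passes it through unchanged. The values below are arbitrary.

import numpy as np

# Illustrative sketch: how a gradient of 1.0 shrinks when multiplied by
# the sigmoid derivative at each of 10 layers, versus ReLU.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                      # an arbitrary pre-activation value
layers = 10

grad_sigmoid = 1.0
grad_relu = 1.0
for _ in range(layers):
    s = sigmoid(z)
    grad_sigmoid *= s * (1.0 - s)   # sigmoid derivative is at most 0.25
    grad_relu *= 1.0                # ReLU derivative is 1 for positive inputs

print("gradient after 10 sigmoid layers:", grad_sigmoid)  # a tiny number, ~5e-7
print("gradient after 10 ReLU layers:   ", grad_relu)     # 1.0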

On this blog, this is the 2,800th write-up.

Hidden Layers in a Transformer

The hidden layers refer to the layers between the input and output layers in a transformer. These hidden layers are where most of the computation and transformations occur.

In the transformer architecture, the hidden layers consist of self-attention mechanisms followed by feedforward neural networks.

First, there is the self-attention mechanism. It contains multiple self-attention heads. It weighs the significance of different input tokens when processing each token. It enables capturing relationships between tokens in the input sequence.

Right after the self-attention mechanism, there is the feedforward network layer. It consists of two linear transformations with a non-linear activation function (usually ReLU) in between. The model thus learns more complex patterns and relationships.

At the end, both the self-attention and feedforward layers are augmented with residual connections and layer normalization. Residual connections facilitate the flow of gradients during training. Layer normalization stabilizes the training process.

The hidden layers transform an input sequence into a representation that captures its semantic and contextual information.
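
The following is a compact sketch of how these pieces fit together in one hidden layer, written in plain NumPy with a single attention head and random weights standing in for learned parameters; the dimensions and function names are illustrative assumptions, not the exact Transformer implementation.

import numpy as np

# Illustrative sketch of one transformer hidden layer (single attention head):
# self-attention -> residual + layer norm -> feedforward -> residual + layer norm.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])       # scaled dot-product
    return softmax(scores) @ V

def feedforward(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # linear -> ReLU -> linear

# Random weights stand in for learned parameters.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))           # token representations entering the layer

# Sub-layer 1: self-attention with residual connection and layer normalization
x = layer_norm(x + self_attention(x, Wq, Wk, Wv))
# Sub-layer 2: feedforward network with residual connection and layer normalization
x = layer_norm(x + feedforward(x, W1, b1, W2, b2))

print(x.shape)   # (4, 8): same shape in, same shape out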

Parallel Computations in Transformer Model

Parallel computations occur in a transformer model during the self-attention mechanism and the feedforward neural network layers.

The self-attention mechanism involves computing attention scores between all pairs of tokens in the input sequence. This is done through matrix multiplication operations.

Within the self-attention mechanism, the scaled dot-product attention computation calculates the attention scores. These computations are parallelized across all tokens in the sequence.

In the feedforward layer, there is a linear transformation (a matrix multiplication) and an element-wise non-linear activation function (ReLU). These are parallelized across all tokens in the sequence.

During training, with multiple layers in the transformer, the computations within each layer are parallelized across all positions in the sequence and across examples in a batch. This makes the process efficient and speeds it up.

Thus, parallelization facilitates efficient processing of input sequences and faster training.

Illustration

We have four tokens in the input sequence: [Token1, Token2, Token3, Token4]

We compute three matrices — Query, Key and Value (from input embeddings). Each matrix has dimensions (sequence length) x (embedding dimension).

Let us assume an embedding dimension of 3.

Input Embeddings:

Token 1: [1, 2, 3]

Token 2: [4, 5, 6]

Token 3: [7, 8, 9]

Token 4: [10, 11, 12]

Each matrix is 4 x 3 (sequence length x embedding dimension). For this illustration, Q, K and V are taken to be the embeddings themselves; a real model would obtain them through learned projection matrices.

Query Matrix (Q):

[ 1, 2, 3 ]
[ 4, 5, 6 ]
[ 7, 8, 9 ]
[ 10, 11, 12 ]

Key Matrix (K):

[ 1, 2, 3 ]
[ 4, 5, 6 ]
[ 7, 8, 9 ]
[ 10, 11, 12 ]

Value Matrix (V):

[ 1, 2, 3 ]
[ 4, 5, 6 ]
[ 7, 8, 9 ]
[ 10, 11, 12 ]

Calculate the attention scores by taking the dot product of the Query and Key matrices, and scale them by the square root of the embedding dimension. Apply softmax to the scaled scores to obtain attention weights.

Finally, compute the weighted sum of the Value matrix based on these attention weights.
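
These steps can be written out directly. The sketch below uses the example embeddings, with Q, K and V taken as the embeddings themselves for simplicity (a real model would apply learned projections).

import numpy as np

# The four example token embeddings (sequence length 4, embedding dimension 3).
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]], dtype=float)

# For this illustration Q, K and V are simply the embeddings themselves.
Q, K, V = X, X, X
d_k = Q.shape[-1]

scores = Q @ K.T / np.sqrt(d_k)              # attention scores, one row per token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row

output = weights @ V                         # weighted sum of the Value vectors
print(output.shape)                          # (4, 3): one context vector per token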

Parallelization

At each step of the self-attention computation, parallel operations are performed across all tokens in the sequence. While computing the attention scores of Token 1, the scores for Token 2, Token 3 and Token 4 are calculated simultaneously.

This parallelization brings efficiency and makes the models scalable.

Transformer models are more amenable to parallel computation as compared to RNNs (where information processing happens one step at a time).

The feedforward layer is highly amenable to parallelization (in both the encoder and the decoder). In training, computations across different sequences within a batch can also be parallelized, as sketched below.
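
As a small sketch of that batch-level parallelism (random data, no learned projections), the attention scores for every token pair in every sequence of a batch come out of a single batched matrix multiplication rather than a per-token loop.

import numpy as np

# Batched attention scores: all token pairs, for all sequences in a batch,
# in one matrix multiplication (no per-token loop needed).
rng = np.random.default_rng(0)
batch, seq_len, d_k = 32, 4, 3

Q = rng.normal(size=(batch, seq_len, d_k))
K = rng.normal(size=(batch, seq_len, d_k))

# One call computes a (batch, seq_len, seq_len) score tensor.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
print(scores.shape)   # (32, 4, 4)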

It should be noted that not all aspects of transformers are parallelizable (e.g. autoregressive decoding at inference time generates tokens one at a time).

Masked Multi-head Attention

In a transformer model, masked multi-head attention in the decoder is a mechanism used to attend to different parts of the input sequence while preventing the model from attending to future tokens.

This is useful in language modelling or sequence generation where the model should only have access to past tokens during training and inference to maintain causality.

Multi-head attention is present in the decoder layer to capture different aspects of the input sequence (similar to the encoder).

Before the softmax is applied, a masking mechanism is applied to the attention scores to prevent the model from attending to future tokens. Masking ensures that each token can only attend to previous tokens in the sequence (preserving the autoregressive property).

During training, the masking is achieved by setting the attention scores for future tokens to a large negative value (effectively negative infinity) before applying the softmax. It effectively masks future tokens. Thus, the model attends only to past tokens.
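
A short sketch of this masking step, with arbitrary scores and a single head, might look as follows.

import numpy as np

# Illustrative causal masking: each token may only attend to itself
# and to earlier tokens.
rng = np.random.default_rng(0)
seq_len = 4
scores = rng.normal(size=(seq_len, seq_len))          # raw attention scores

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
scores = np.where(mask, -1e9, scores)                 # large negative value for future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)       # softmax

print(np.round(weights, 2))   # upper triangle is ~0: no attention to future tokens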

This mechanism is employed in the decoder block to ensure that the model attends to previously generated outputs during sequence generation tasks (machine translation or text summarization).

Masked attention restricts access to future information.

Feedforward

In a transformer model, feedforward refers to a neural network layer that processes each position in a sequence independently of others.

This layer consists of two linear transformations with a non-linear activation function in between (typically ReLU — rectified linear unit). It is necessary to capture complex patterns and relationships within the input sequence.

The feedforward layer in a transformer is a multi-layer perceptron (MLP) that acts on the output of the self-attention layer. It introduces non-linearity into the model. This enables the model to capture complex relationships between the input elements.

Thus, when we break down the feedforward layer, it takes the output from the self-attention layer, passes it through a linear layer, and applies a non-linear transformation (using a ReLU activation function). The transformed output is then fed through another linear layer to create a new representation of the input sequence. It is this enriched representation which is better suited for the next self-attention layer or the final output layer of the model.

Essentially, the feedforward layer enhances the contextual information extracted by the self-attention layer.
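
A minimal sketch of this position-wise computation, with random weights standing in for the learned ones, is given below.

import numpy as np

# Position-wise feedforward: the same two linear layers and ReLU are applied
# to each position in the sequence independently.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))    # output of the self-attention layer

hidden = np.maximum(0, x @ W1 + b1)        # first linear layer followed by ReLU
out = hidden @ W2 + b2                     # second linear layer

print(out.shape)                           # (4, 8): a new representation per position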

We have referred to ReLU as an activation function that introduces non-linearity, a critical property for neural networks to learn complex patterns in data.

If the input to the ReLU function is positive, the output remains unchanged.

If the input is negative, the output becomes zero.

It makes ReLU computationally efficient and helps address the vanishing gradient problem (which can hinder training in deep neural networks).

Mathematically, ReLU(x) = max(0, x), which means it returns 0 for negative input values and the input value itself for positive input values.

Illustration

An input vector has three features — x1, x2 and x3. There is a hidden layer with two neurons. We require a single output. Each connection between neurons has a weight associated with it, and each neuron has a bias term.

Input Layer              Hidden Layer             Output Layer
x1, x2, x3   ------->    h1, h2       ------->    output
          (weights W1,             (weights W2,
           biases B1)               biases B2)

The weighted sum of inputs plus bias is calculated for each neuron in the hidden layer. It is then passed through an activation function like ReLU.

Output from each neuron in the hidden layer is connected to the output neuron in the output layer (with associated weights and biases). Again, the weighted sum of inputs plus bias is calculated for the output neuron. This result is the final output of the network.

This is the process of one feedforward pass (through the neural network).
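
A short sketch of this single forward pass, with arbitrary example values for the weights W1, W2 and biases B1, B2, is given below.

import numpy as np

# One feedforward pass through the small 3-2-1 network described above,
# with arbitrary example weights and biases.
x = np.array([1.0, 2.0, 3.0])            # input features x1, x2, x3

W1 = np.array([[0.2, -0.5],
               [0.4,  0.1],
               [-0.3, 0.6]])             # weights from the 3 inputs to h1, h2
B1 = np.array([0.1, -0.2])               # biases of the hidden neurons

W2 = np.array([[0.7],
               [-0.4]])                  # weights from h1, h2 to the output neuron
B2 = np.array([0.05])                    # bias of the output neuron

h = np.maximum(0, x @ W1 + B1)           # weighted sum plus bias, then ReLU
output = h @ W2 + B2                     # weighted sum plus bias at the output

print("hidden activations:", h)
print("network output:", output)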

Self-attention and Multi-head Attention

Let us understand the concept of the self-attention mechanism in a transformer. It is a core part of the neural network architecture. It facilitates natural language processing. It makes the model focus on relevant parts of an input sequence while processing each element.

First, each word in the sequence is converted into a vector (input encoding). Thereafter, for each word three vectors are created: a Query vector (the focus of attention), a Key vector (the essence of each word) and a Value vector (the actual information for each word). Next, the similarity between the Query vector and each Key vector is computed. This gives us attention scores (how relevant each word is to the current focus). Lastly, a weighted sum of the Value vectors is computed, using the attention scores as weights. The result is a new representation for the current word, enriched by the context of related words in the sequence.

The whole process is repeated for every position in the sequence. It captures long-range dependencies and the relationships between words, even when they are far apart in the sentence. This facilitates machine translation, text summarization and question answering.

The weighted sum of the value vectors becomes the output for the current position in the sequence. This output captures the context of the current word based on the relevant parts of the sequence.

Self-attention provides contextual embeddings: the meaning of each word is incorporated in the context of surrounding words. To illustrate, consider ‘money and bank’ versus ‘river and bank’. In these two pairs, the meaning of the word ‘bank’ changes as per the context.

Multi-head attention is an extension of the self-attention mechanism in transformer models. Instead of a single attention mechanism, multiple attention mechanisms run in parallel. Each such attention mechanism is called a head.

The input embeddings are split into multiple heads. Each head has its own set of parameters (Query, Key and Value matrices). This is called splitting. Each head’s attention scores are independently calculated (between the Query and Key vectors).

Within each head, a weighted sum of the Value vectors is calculated using its attention scores. The outputs from all heads are then concatenated and linearly transformed, generating the output of the multi-head attention layer.

Here the multi-head attention mechanism allows the model to focus on two or more perspectives simultaneously. Diverse relationships are captured. It enhances the model’s representational capacity, enabling it to capture complex patterns in the data.
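
An illustrative sketch of the splitting, per-head attention and concatenation steps (two heads, random weights; the names are assumptions for the example) is given below.

import numpy as np

# Illustrative multi-head attention: split the model dimension into heads,
# run scaled dot-product attention per head, then concatenate and project.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

Q, K, V = x @ Wq, x @ Wk, x @ Wv

head_outputs = []
for h in range(num_heads):                       # each head sees its own slice
    sl = slice(h * d_head, (h + 1) * d_head)
    q, k, v = Q[:, sl], K[:, sl], V[:, sl]
    weights = softmax(q @ k.T / np.sqrt(d_head)) # attention weights for this head
    head_outputs.append(weights @ v)             # weighted sum of Values

concat = np.concatenate(head_outputs, axis=-1)   # join the heads back together
output = concat @ Wo                             # final linear transformation

print(output.shape)   # (4, 8)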

LeCun’s View on AGI

Not a day passes without AI receiving accolades. It is expected that, sooner rather than later, AI systems will outperform humans at various cognitive tasks.

Jensen Huang, CEO of Nvidia, suggested the arrival of AGI in the next five years. Ben Goertzel, often called the father of AGI, shortens the timeline to three years. Elon Musk sees AGI coming by the end of 2025. Not all are so bullish. Yann LeCun, Facebook’s chief AI scientist, argues there is no such thing as AGI, since human intelligence is nowhere near general. He prefers to call it human-level AI, and regards it as a distant port of call.

There are challenges — reasoning, planning, persistent memory and understanding of the physical world. These are essential requirements of human-level intelligence or even animal-level intelligence. Current AI systems cannot perform all these tasks.

LLMs are restricted to text for their knowledge. Their understanding of reality is superficial. They are trained on data; if a human being had to absorb so much data, it might take a lakh (100,000) years. However, this is not our primary method of learning. We learn about the world through our interactions with the physical world around us. A child takes in more data this way than the biggest LLM. LeCun calls his alternative objective-driven AI. There is training through our senses and training through visuals. These impact our actions. Our memory gets constantly updated.

Ultimately, machines will surpass human intelligence, but it will take a while. It is not just round the corner.

TensorFlow

TensorFlow is an open-source ML platform (developed by Google) used for building and training various types of ML models, including deep learning models. It has a flexible architecture that facilitates computation across a variety of platforms (desktops, servers, mobile and edge devices). It is used both by novices and professionals. It has rich resources and documentation.

To begin with, TensorFlow had a static computation graph. However, since TensorFlow 2.0, execution has become dynamic (eager), like PyTorch. It is easier to use and debug now. Still, while PyTorch is considered user-friendly with a simpler API, TensorFlow has a steeper learning curve, with a more complex API and more abstraction layers.
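
A tiny sketch of this eager, define-by-run style in TensorFlow 2.x (assuming TensorFlow is installed; the toy loss is illustrative):

import tensorflow as tf

# With TensorFlow 2.x eager execution, operations run immediately and
# gradients are recorded with GradientTape, similar in spirit to PyTorch.
w = tf.Variable(3.0)
x = tf.constant(2.0)

with tf.GradientTape() as tape:
    loss = (w * x - 1.0) ** 2          # a toy squared-error loss

grad = tape.gradient(loss, w)          # dloss/dw, computed by backpropagation
print(grad.numpy())                    # 20.0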

TensorFlow, of course, has a wider community of users — both industry and academia, since it is Google-backed. PyTorch is catching up.

TensorFlow has a mature ecosystem for deploying models in production. There are tools like TensorFlow Serving and TensorFlow Lite. They suit different environments (mobile and embedded devices). PyTorch is catching up.

Choosing between TensorFlow and PyTorch is a matter of personal choice.

Translation Platforms

In the past, we had Google Translate to translate text from one language to another. Of course, the translation output was not as good as it is today, when neural AI is used for translating. The translation was based on statistical techniques that detected patterns between the two languages.

After the arrival of neural networks, translation has become neural. The input sequence is encoded, and then decoded as the output sequence in the target language. The decoding works because the model has been trained on vast amounts of data in multiple languages. The model uses an attention mechanism too. The model learns the nuances and dependencies of these languages and is able to translate even idiomatic sequences from the source language to the target language.

Many language models are able to translate in 100 plus languages. Neural translation has supplanted statistical translation.

India has set up Bhashini, a language translation and database platform. It is doing proof-of-concepts for multi-lingual call centers or IVR setups. It will shortly offer real-time language translation services on a paid basis.

An app will be launched to demo real-time translation in open format.

Bhashini has been developed by the Digital India Bhashini division under the Digital India Corporation, a section 8 company. It can do text-to-text translation in 22 languages. Its capabilities include automatic speech recognition, text-to-speech synthesis, OCR, video translation, document translation, language detection and voice-based payments, among others.

It also provides API (application programming interface) integration to startups.

It handles, at present, 40 million inferences per month, that is, translations performed on the platform by users across its different features.

It also wants to facilitate e-commerce through ONDC.

Bhashini collects datasets through a crowdsourcing model called Bhashadaan.