Feedforward

In a transformer model, feedforward refers to a neural network layer that processes each position in a sequence independently of others.

This layer consists of two linear transformations with a non-linear activation function in between (typically ReLU — rectified linear unit). It enables the model to capture complex patterns and relationships within the input sequence.

The feedforward layer in a transformer is a multi-layer perceptron (MLP) that acts on the output of the self-attention layer. It introduces non-linearity into the model. This enables the model to capture complex relationships between the input elements.

Thus, when we break feedforward down, it takes the output from the self-attention layer and passes it through a first linear layer. It then applies a non-linear transformation to this output (using a ReLU activation function). The transformed output is then fed through a second linear layer to create a new representation of the input sequence. It is this enriched representation which is better suited for the next self-attention layer or the final output layer of the model.

Essentially, the feedforward layer enhances the contextual information extracted by the self-attention layer.
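
Below is a minimal NumPy sketch of such a position-wise feedforward sublayer — two linear transformations with a ReLU in between, applied independently at each position. The dimensions, random weights and function names are illustrative assumptions, not values from any actual model.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values unchanged and sets negative values to zero
    return np.maximum(0, x)

def feedforward(x, W1, b1, W2, b2):
    """Position-wise feedforward sublayer: two linear maps with ReLU in between."""
    hidden = relu(x @ W1 + b1)   # first linear transformation + non-linearity
    return hidden @ W2 + b2      # second linear transformation

# Illustrative (assumed) dimensions: model width 4, inner width 8
d_model, d_ff = 4, 8
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# One "sequence" of 3 positions; each position is processed independently
x = rng.normal(size=(3, d_model))
print(feedforward(x, W1, b1, W2, b2).shape)   # (3, 4)
```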

We have referred to ReLU as the activation function that introduces non-linearity — a critical property for neural networks to learn complex patterns in data.

If the input to the ReLU function is positive, the output remains unchanged.

If the input is negative, the output becomes zero.

This makes ReLU computationally efficient and helps address the vanishing gradient problem (which can hinder training in deep neural networks).

Mathematically, ReLU(x) = max(0, x), which means it returns 0 for negative input values and the input value itself for positive input values.

Illustration

An input vector has three features — x1, x2 and x3. There is a hidden layer with two neurons. We require a single output. Each connection between neurons has a weight associated with it, and each neuron has a bias term.

Input Layer          Hidden Layer          Output Layer
x1, x2, x3  ------>  h1, h2      ------>   Output
     ^                    ^                    ^
     |                    |                    |
 Weights (W1)        Weights (W2)         Weights (W3)
 Bias    [B1]        Bias    [B2]         Bias    [B3]

The weighted sum of inputs plus bias is calculated for each neuron in the hidden layer. It is then passed through an activation function like ReLU.

The output from each neuron in the hidden layer is connected to the output neuron in the output layer (with an associated weight, plus a bias for the output neuron). Again, the weighted sum of inputs plus bias is calculated for the output neuron. This result is the final output of the network.

This is the process of one feedforward pass (through the neural network).
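
The same pass can be traced with concrete numbers. The inputs, weights and biases below are assumed purely for illustration of the 3-input, 2-hidden-neuron, 1-output network described above.

```python
import numpy as np

# Illustrative (assumed) numbers for the 3-input, 2-hidden-neuron, 1-output network
x  = np.array([1.0, 2.0, 3.0])              # inputs x1, x2, x3
W1 = np.array([[0.1, 0.4],                  # weights from inputs to h1, h2
               [0.2, 0.5],
               [0.3, 0.6]])
b1 = np.array([0.1, 0.2])                   # biases of the hidden neurons
W2 = np.array([0.7, 0.8])                   # weights from h1, h2 to the output
b2 = 0.3                                    # bias of the output neuron

h = np.maximum(0, x @ W1 + b1)              # weighted sum + bias, then ReLU
y = h @ W2 + b2                             # weighted sum + bias at the output
print(h, y)                                 # h = [1.5, 3.4], y = 4.07
```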

Self-attention and Multi-head Attention

Let us understand the concept of the self-attention mechanism in a transformer. It is a core part of the neural network architecture that powers natural language processing. It lets the model focus on the relevant parts of an input sequence while processing each element.

First, each word in the sequence is converted into a vector (the input embedding). Thereafter, three vectors are created for each word — a query vector (the focus of attention), a key vector (the essence of each word) and a value vector (the actual information of each word). Afterwards, the similarity between the query vector and each key vector is computed. This gives us attention scores (how relevant each word is to the current focus). Lastly, a weighted sum of the value vectors is computed, using the attention scores as weights. The result is a new representation for the current word (enriched by the context of similar words in the sequence).

The whole process is repeated across the entire sequence. It captures long-range dependencies and the relationships between words (even when they are far apart in the sentence). This facilitates machine translation, text summarization and question answering.

The weighted sum of the value vectors becomes the output for the current position in the sequence. This output captures the context of the current word based on the relevant parts of the sequence.
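
The steps above can be condensed into a short sketch. This is a minimal NumPy illustration of scaled dot-product self-attention; the sequence length, embedding size and random weights are assumptions, and a real transformer adds masking, multiple heads and learned output projections.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X (one row per word)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # query, key and value vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of each query to each key
    weights = softmax(scores)                 # attention scores (each row sums to 1)
    return weights @ V                        # weighted sum of value vectors

# Illustrative (assumed) sizes: 4-word sequence, embedding size 6, head size 3
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))
Wq, Wk, Wv = (rng.normal(size=(6, 3)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 3): one output row per word
```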

Self-attention provides contextual embeddings — the meaning of each word is incorporated in the context of surrounding words. To illustrate, consider ‘money’ and ‘bank’ versus ‘river’ and ‘bank’. In these two pairs, the meaning of the word ‘bank’ changes as per the context.

Multi-head attention is an extension of the self-attention mechanism in transformer models. Instead of a single attention mechanism, several attention mechanisms run in parallel. Each such attention mechanism is called a head.

The input embeddings are split across multiple heads. Each head has its own set of parameters (query, key and value matrices). This is called splitting. Each head independently calculates its attention scores (between its query and key vectors).

Within each head, a weighted sum of the value vectors is calculated using these scores. The outputs from all heads are then concatenated and linearly transformed, generating the output of the multi-head attention layer.

Here the multi-head attention mechanism allows the model to attend to the input from two or more perspectives simultaneously. Diverse relationships are captured. This enhances the model’s representational capacity, enabling it to capture complex patterns in the data.
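
A compact sketch of the same idea with several heads, again with assumed sizes and random weights: each head computes its own attention and weighted sum of values, and the concatenated head outputs are linearly transformed.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """Each head has its own (Wq, Wk, Wv); head outputs are concatenated and projected."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)                 # per-head weighted sum of values
    return np.concatenate(outputs, axis=-1) @ Wo    # concatenate, then linear transform

# Illustrative (assumed) sizes: 4 words, embedding size 6, two heads of size 3
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 6))
heads = [tuple(rng.normal(size=(6, 3)) for _ in range(3)) for _ in range(2)]
Wo = rng.normal(size=(6, 6))                        # 2 heads x 3 dims = 6, back to 6
print(multi_head_attention(X, heads, Wo).shape)     # (4, 6)
```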

LeCun’s View on AGI

Not a day passes without AI being eulogized. It is expected that, sooner rather than later, AI systems will outperform humans at various cognitive tasks.

Jensen Huang, CEO of Nvidia, has suggested the arrival of AGI in the next five years. Ben Goertzel, the father of AGI, shortens the period to three years. Elon Musk sees AGI coming by the end of 2025. Not all are so bullish. Yann LeCun, Facebook’s chief AI scientist, argues there is no such thing as AGI, since human intelligence is nowhere near general. He prefers to call it human-level AI, and considers it a distant port of call.

There are challenges — reasoning, planning, persistent memory and understanding of the physical world. These are essential requirements of human-level intelligence or even animal-level intelligence. Current AI systems cannot perform all these tasks.

LLMs’ knowledge is restricted to text. Their understanding of reality is superficial. They are trained on data; for a human being to absorb that much data might take a lakh (100,000) years. However, this is not our primary method of learning. We learn about the world through our interactions with the physical world around us. A child takes in more data this way than the biggest LLM. LeCun calls the alternative objective-driven AI. There is training through our senses and training through visuals. These impact our actions, and our memory gets constantly updated.

Ultimately, machines will surpass human intelligence, but it will take a while. It is not just around the corner.

TensorFlow

TensorFlow is an open-source ML platform (developed by Google) used for building and training various types of ML models, including deep learning models. Its flexible architecture facilitates computation across a variety of platforms (desktops, servers, mobile and edge devices). It is used by both novices and professionals, and has rich resources and documentation.

To begin with, TensorFlow had a static computation graph. However, since TensorFlow 2.0, the graph has become dynamic, like PyTorch’s, which makes it easier to use and debug. Even so, while PyTorch is considered user-friendly, with a simpler API, TensorFlow has a steeper learning curve, with a more complex API and more abstraction layers.
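
A tiny illustration of this dynamic (eager) behaviour in TensorFlow 2.x — the values here are arbitrary and only meant to show that operations execute immediately and can be inspected like ordinary Python.

```python
import tensorflow as tf

# Eager execution (the default since TensorFlow 2.0): operations run immediately,
# so intermediate values can be printed and debugged without building a session.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.Variable([[0.5], [0.5]])
y = tf.matmul(x, w)          # computed right away, no static graph needed
print(y.numpy())             # [[1.5], [3.5]]
```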

TensorFlow, of course, has a wider community of users — both industry and academia, since it is Google-backed. PyTorch is catching up.

TensorFlow has a mature ecosystem for deploying models in production. There are tools like TensorFlow Serving and TensorFlow Lite. They suit different environments (mobile and embedded devices). PyTorch is catching up.

Choosing between TensorFlow and PyTorch is a matter of personal choice.

Translation Platforms

In the past, we had Google Translate to translate text from one language to another. Of course, the translation output was not as good as it is today, now that AI is used for translating. The translation was based on statistical techniques that detected patterns between the two languages.

After the arrival of neural networks, translation has become neural. The input sequence is encoded; these encodings are then decoded into the output sequence in the target language. The decoding works because the model has been trained on vast amounts of data in multiple languages. The model uses an attention mechanism too. It learns the nuances and dependencies of these languages and is able to translate even idiomatic expressions from the source language to the target language.
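
As an illustration of how such a pretrained neural (encoder-decoder) translation model can be called in practice, here is a minimal sketch using the Hugging Face transformers library; the chosen task, the default model it downloads and the example sentence are assumptions for demonstration only.

```python
from transformers import pipeline

# English-to-French translation using a default pretrained model for this task
translator = pipeline("translation_en_to_fr")
result = translator("It is raining cats and dogs.")   # idiomatic source sentence
print(result[0]["translation_text"])
```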

Many language models are able to translate between 100-plus languages. Neural translation has supplanted statistical translation.

India has set up Bhashini, a language translation and database platform. It is doing proofs of concept for multilingual call centres and IVR setups. It will shortly offer real-time language translation services on a paid basis.

An app will be launched to demo real-time translation in open format.

Bhashini has been developed by the Digital India Bhashini division under the Digital India Corporation as a Section 8 company. It can do text-to-text translation in 22 languages. Its capabilities include automatic speech recognition, text-to-speech synthesis, OCR, video translation, document translation, language detection and voice-based payments, among others.

It also provides API (application programming interface) integration to startups.

It handles, at present, 40 million inferences per month — that is, translations performed on the platform by users through its different features.

It also wants to facilitate e-commerce through ONDC.

Bhashini collects datasets through a crowdsourcing model called Bhashadaan.

Medical Innovations

As we know, the premier technical institutes in the country, such as the IITs, promote innovation through incubator programmes. On similar lines, the country’s leading medical institutions, such as AIIMS, have joined hands with young entrepreneurs to develop health-related products and software. These products could be useful to both doctors and patients. They should be scalable and feasible. AIIMS shares its infrastructure (samples and patients) to encourage these startups.

At present, 10 such projects are progressing at AIIMS. The whole thing started in 2021. Some projects are awaiting validation by clinical trials. Some are waiting for regulatory approval (from the CDSCO). The startups are being mentored by AIIMS faculty. There are training programmes and bootcamps for these startups so that they develop patient-friendly products.

One such project addresses weight loss: a clinically validated app (Zeigen ObesityRx) for people struggling to lose weight. Current weight-loss apps do not target a user’s psychology and are focused on workouts and nutrition plans.

A stem-cell-based product is being developed for treating traumatic injuries and burn wounds. It will promote tissue regeneration (without reconstruction or plastic surgery). Conventional stem-cell products are preserved through freezing and storage, which makes them less user-friendly. The AIIMS products are derived from stem-cell by-products (double-membrane vesicles measuring less than 200 nanometres). These are an alternative to stem cells, have similar properties and can be developed at a lower cost.

The product is in the form of a powder which could be sprinkled over the wound so as to speed up its healing and regeneration.

They are also working on sprays and gels of stem cells.

The powder form can be reconstituted into an injectable liquid which can be introduced into the knee joint to treat arthritis. It treats the underlying cause of the disease. It has been tested on animals such as pigs. It is awaiting human clinical trials.

A gut-microbiome-based product is being tested to boost immunity. It will protect heart and brain health.

Foundational Models

Foundational models refer to large language models (such as the GPT series, BERT and others) which are trained on a vast corpus of data and serve as the foundation for various NLP tasks (such as text generation, translation, sentiment analysis and more). They are the starting point for building specialized models for specific applications.

Some prominent scientists who contributed to NLP and ML research are:

Geoffrey Hinton: His work laid the foundation for modern deep learning techniques, including those used in foundational models.

Yoshua Bengio: His research has advanced our understanding of neural networks and their applications in NLP tasks.

Yann LeCun: His work on CNNs (convolutional neural networks) is well-known and has applications in computer vision (CV). His contributions are useful in the development of foundational models.

Google Team, OpenAI Team: These teams played crucial roles in the development and advancement of foundational models such as GPT and BERT.

Gen AI: Biggest Human Invention

Jeff Maggioncalda, CEO of Coursera, feels that generative AI is at the pinnacle of human inventions. As far as its impact on humanity is concerned, he rates it as high as language, the alphabet or writing. Just as the ability to speak changed the course of human history, generative AI will do so too, provided humans master it.

If you know how to use it to your advantage, you will stand out. It is a tech disruption. Coursera runs a course for CXOs — Navigating Generative AI. Then there is an umbrella course on AI. These are massive open online courses (MOOCs). They are free, but if you want a certificate, you have to pay a nominal amount (a couple of thousand rupees). MOOCs can count as credit towards college degrees.

Coursera has used AI to translate courses into Hindi.

Learn from Silicon Valley’s Gurus

Organizations these days appoint Chief Experience Officers (CXOs) who manage the overall customer experience of an organization. They encourage positive customer interactions. They have backgrounds in operations, marketing, sales and customer service, and are often MBAs or hold some other master’s degree.

As a CXO, you can never stop learning. Learning is not restricted to classrooms; it is much more. CXOs must have international connections with elite professionals. They must learn from industry leaders to push boundaries in their sectors. They must acquaint themselves with new business environments.

India and APAC, the high-quality education providers, would like to facilitate CXO education by starting a unique Global Executive Immersion programme for Indian CXOs in Silicon Valley. It is a 6-day, 7-night programme.

The cost of the programme is $15,000. Airfare and visa fees are on the participant. Silicon Valley is an ideal landscape for learning: it is the epicentre of technological innovation and serves as the incubator of cutting-edge ideas and breakthrough advancements. It has the highest concentration of tech companies and Fortune 500 firms.

Participants will arrive in the Silicon Valley on April 26. There will be a welcome reception.

On April 27, there will be a panel discussion on AI. Participants will learn about investment trends. They will identify future growth areas.

On April 28, they will visit the Stanford University campus and learn about its legacy in Silicon Valley and the tech industry. There will be a classroom session on Design Thinking by Barry Katz.

On April 29, participants will converse with OpenAI’s mentor/advisor. There will be a tour of Berkeley Campus and a tour of Intel Museum.

In fact, Silicon Valley has been shaped by technological prowess, and the Intel Museum pays tribute to it. Intel’s journey is depicted here — from the pioneering Intel 4004 to present-day processors.

On April 30, there will be a tour of a search giant’s office. It will be followed by a visit to Apple Park visitor center.

The next two days will be spent on a guided tour of the first chip manufacturer in the world and conversations with venture capitalists. Finally, there will be a conversation with OpenAI’s Zack Kass, former head of commercialization, and a tour of the Computer History Museum.

On May 2, the programme will end with a gala evening of reflection, connection and global insights. There will be a lavish dinner party set against the backdrop of the Bay Area.

The programme can be attended by CEOs, CXOs, MDs, presidents, founders, co-founders and partners with a minimum of 20 years of work experience.

Vector Representation of Words

Consider three words — cat, dog and bird. Each word can be represented by a numerical vector in a multi-dimensional space. For illustration, let the vector have just three dimensions — x, y and z.

Cat could be represented by [0.8, 0.2, 0.5]

Dog could be represented by [0.7, 0.3, 0.6]

Bird could be represented by [0.3, 0.9, 0.2]

Here x, y and z could represent size, animal type and habitat (each dimension captures a different aspect of the word’s meaning or usage).
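
With these made-up vectors, similarity between words can be measured, for example with cosine similarity. The short sketch below shows that cat and dog end up closer to each other than cat and bird in this illustrative space.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat  = np.array([0.8, 0.2, 0.5])
dog  = np.array([0.7, 0.3, 0.6])
bird = np.array([0.3, 0.9, 0.2])

print(cosine_similarity(cat, dog))    # ~0.98: cat and dog are close in this space
print(cosine_similarity(cat, bird))   # ~0.56: cat and bird are further apart
```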

Algorithms analyze large amounts of text data and construct these word embeddings, which encode semantic and syntactic information about words.

These representations are standardized to a certain extent; however, there is no single standard. Word embeddings (Word2Vec, GloVe and FastText) are popular approaches for generating vector representations of words. These vectors are of fixed length, which facilitates standardization across words and models. What varies are the specific dimensions and values within these vectors, depending on the algorithm and the training data used.
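
As an illustration, here is a minimal sketch of training Word2Vec embeddings with the gensim library; the toy corpus and parameters are assumptions and far too small to produce meaningful embeddings.

```python
from gensim.models import Word2Vec

# Tiny assumed corpus: a list of tokenized sentences
corpus = [["the", "cat", "chased", "the", "bird"],
          ["the", "dog", "chased", "the", "cat"]]

# Train fixed-length (50-dimensional) dense vectors for each word
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["cat"].shape)              # (50,) fixed-length dense vector
print(model.wv.similarity("cat", "dog"))  # cosine similarity learned from the corpus
```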

Without using these algorithms too, words can be converted into vectors. The approach is called one-hot encoding. Here, each word in the vocabulary is represented as a vector where all elements are 0 except for the element corresponding to the index of that word in the vocabulary, which is 1. Let us consider a small vocabulary with three words: cat, dog and bird.

Cat could be represented as [1, 0, 0]

Dog could be represented as [0, 1, 0]

Bird could be represented as [0, 0, 1]

The vectors created are sparse vectors, where most elements are zero. However, one-hot encodings do not capture the semantic relationships between words (like embeddings do), and they can result in very high-dimensional representations for large vocabularies.

The index here refers to the word’s position in a pre-defined vocabulary. Each word has a unique index. The element corresponding to the word’s index is set to 1; all other elements are set to 0.

In the three-word vocabulary of cat, dog and bird, let us consider the indices assigned.

Cat -> index 0

Dog -> index 1

Bird -> index 2

The vector for cat has three elements. The element at index 0 (corresponding to cat) is set to 1.

The other elements are set to 0.

For dog and bird, the 1 moves to their respective index. Each one-hot encoding is thus a binary vector with a single 1 (at the position of the word in the vocabulary).
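
A short sketch of one-hot encoding for this assumed three-word vocabulary:

```python
import numpy as np

# One-hot encoding for an assumed three-word vocabulary
vocab = {"cat": 0, "dog": 1, "bird": 2}   # word -> index in the vocabulary

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))            # all elements start at 0
    vec[vocab[word]] = 1                  # single 1 at the word's index
    return vec

print(one_hot("cat", vocab))    # [1. 0. 0.]
print(one_hot("dog", vocab))    # [0. 1. 0.]
print(one_hot("bird", vocab))   # [0. 0. 1.]
```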

We now know what a sparse vector is. Let us now consider a dense vector, where most of the elements are non-zero. In practice, dense vectors are typically used (rather than sparse vectors).

Dense vectors are often used in word embeddings. Each word is represented by a vector of real numbers (floats) in a continuous vector space. The real numbers capture nuanced relationships between words.

Each dimension of the vector might represent a different aspect of the word’s meaning or context. Dense vectors are generally lower-dimensional (compared to one-hot encodings). They are computationally more efficient and are able to capture subtle semantic relationships between words.

Though the data stored in the hardware is in vectors (word embeddings), the answers to our prompts are in natural text.

This involves a process of decoding these representations back into natural language.

The input prompt is processed and converted into the corresponding embeddings. These embeddings go into the model and are processed through its layers (RNNs, transformers or other architectures). The model learns to generate text based on the input embeddings and the context provided.

The output is a sequence of token or word embeddings. These are decoded back into text. This involves selecting the most probable word for each position in the sequence (based on the probabilities the model learned during training). It can also use techniques such as beam search or sampling.
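
A toy sketch of the greedy variant of this decoding step, with an assumed five-word vocabulary and made-up logits standing in for the model’s actual outputs:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]          # assumed tiny vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed logits (scores over the vocabulary) for three output positions;
# in practice these come from the model itself
logits = np.array([[2.1, 0.3, 0.2, 0.1, 0.0],
                   [0.1, 3.0, 0.2, 0.3, 0.1],
                   [0.0, 0.2, 2.5, 0.4, 0.3]])

# Greedy decoding: pick the most probable word at each position
tokens = [vocab[int(np.argmax(softmax(row)))] for row in logits]
print(" ".join(tokens))    # "the cat sat"
```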

In post-processing, the coherence and readability of the output are checked. Duplicate phrases are removed, grammatical mistakes are corrected, and the style is adjusted to match the input prompt or context.