Blog

  • Stochastic Gradient Descent (SGD)

    Stochastic Gradient Descent is an optimization technique, used widely in machine learning and other fields, that minimizes an objective function, say an error, with respect to its parameters. The method is iterative: it starts with an initial guess for the parameters and improves upon them step by step until a minimum is reached.

    Each iteration calculates the gradient of the objective function with respect to the parameters. The gradient points in the direction of steepest ascent; SGD uses it to decide the direction in which to update the parameters.

    To update the parameters, a scaled version of the gradient is subtracted from their current values. The scaling factor, called the learning rate, determines the size of the step taken along the gradient direction.
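
    In symbols (the notation is assumed here, not given in the text): with parameters θ, learning rate η and objective L, the update is

        θ ← θ − η ∇θ L(θ)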

    These two steps are repeated until the objective function converges to a minimum or a stopping criterion is met.
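
    As a rough illustration, here is a minimal mini-batch SGD loop in Python for a linear least-squares objective; the function and variable names (sgd, X, y, lr) are illustrative assumptions, not part of any particular library.

        import numpy as np

        def sgd(X, y, lr=0.01, batch_size=32, epochs=100):
            """Minimise the mean squared error of a linear model y ≈ X @ w with mini-batch SGD."""
            rng = np.random.default_rng(0)
            w = np.zeros(X.shape[1])           # initial guess for the parameters
            for _ in range(epochs):
                idx = rng.permutation(len(X))  # shuffle so mini-batches are random
                for start in range(0, len(X), batch_size):
                    batch = idx[start:start + batch_size]
                    Xb, yb = X[batch], y[batch]
                    # gradient of the mini-batch loss with respect to w
                    grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)
                    w -= lr * grad             # step against the gradient direction
            return w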

    SGD is faster than many other optimization algorithms, especially when dealing with large datasets, because in each iteration the gradient is computed only for a small subset of data points (a mini-batch) rather than the whole dataset.

    It is scalable, since it extends to large and complex models with many parameters.

    It is less sensitive to noise in the data.

    SGD’s limitation is that it can converge to a local minimum of the objective function instead of the global minimum. Another limitation is tuning: choosing the optimal learning rate and other hyperparameters requires a lot of experimentation.

    To avoid local minima, the momentum variant of SGD is used. The Adagrad variant adjusts the learning rate for each parameter individually to improve convergence. Both are sketched below.
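
    A minimal sketch of the two update rules mentioned above, written as plain Python/NumPy helpers (the names and hyperparameter defaults are illustrative assumptions):

        import numpy as np

        def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
            """Momentum: accumulate an exponentially decaying average of past gradients,
            which helps the iterate roll through shallow local minima."""
            velocity = beta * velocity + grad
            return w - lr * velocity, velocity

        def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
            """Adagrad: scale each parameter's learning rate by the inverse square root
            of its accumulated squared gradients."""
            accum = accum + grad ** 2
            return w - lr * grad / (np.sqrt(accum) + eps), accum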

    SGD is used in machine learning (ML) and neural networks, and in optimization problems in signal processing and finance.

    SGD is a variant of the gradient descent algorithm. Unlike standard gradient descent, SGD uses only a small batch of data points (a mini-batch) to estimate the gradient instead of the entire dataset. This stochasticity makes it more efficient and scalable.

    The noise introduced by mini-batch sampling can also act as a form of regularization, helping to prevent overfitting.

    It is called stochastic because of the randomness of the mini-batch updates, with batches chosen at random, and it typically uses a learning rate schedule in which the rate decreases over time. It works with an estimate of the true gradient rather than the exact gradient computed on the full dataset.
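
    As a small illustration of a decreasing learning rate schedule (the inverse-time form and the decay constant here are assumptions for illustration):

        def lr_schedule(initial_lr: float, step: int, decay: float = 1e-3) -> float:
            """Inverse-time decay: the learning rate shrinks as training progresses."""
            return initial_lr / (1.0 + decay * step)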

    This randomness allows it to explore the parameter space more effectively; otherwise, the algorithm follows a fixed set of rules.

  • AI in Consultancy and Law Firms

    Big consultancy firms and law firms hire juniors who spend the first few years of their careers on repetitive and time-consuming tasks. At PwC, these juniors spend their time preparing meeting documents for clients. Junior lawyers keep interpreting the complex contracts their seniors handle.

    Just three or four years later, they reach the prestigious partner level. Artificial intelligence speeds up the time it takes to get there. It is good to learn by preparing some of these documents, but should you do it for two or three years? Probably not. Do it two or three times and you are comfortable.

    The use of AI will bring about a seismic shift for professional services firms that subject their juniors to years of tedious work before making them partners.

    The partner title brings bigger client assignments and lucrative emoluments.

  • GPT-4 and Radiology

    One important application of GPT-4 is the processing of medical images, ranging from X-rays to MRIs. GPT-4 prepares summaries of the reports, and some of these summaries are preferable to those written by expert radiologists.

    GPT-4, as we know, is multi-modal. The image-reading application is now available for both Android and iOS.

    The reports first need to be scanned; GPT-4 is then used to interpret them. The generated report carries a summary, a diagnosis and medication suggestions.

    Medical practitioners can use ChatGPT to access these services through an API.
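
    As a hedged sketch of what such an integration might look like (the model name, prompt and helper function are assumptions for illustration, not a documented medical product), a Python call to the OpenAI chat API could be wrapped as follows:

        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def summarize_radiology_report(report_text: str) -> str:
            """Ask a GPT-4-class model to summarize a scanned radiology report.
            The output must still be verified by a qualified radiologist."""
            response = client.chat.completions.create(
                model="gpt-4",  # assumed model identifier
                messages=[
                    {"role": "system",
                     "content": "You are an assistant that summarizes radiology reports "
                                "into findings, impression and suggested follow-up."},
                    {"role": "user", "content": report_text},
                ],
            )
            return response.choices[0].message.content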

    GPT-4 structures the reports automatically. Most of the documents, say the clinical history of the patient and the radiological interpretation of medical images, remain unstructured, which makes interpretation difficult.

    If these reports are organized, they become easily searchable. This helps gather real-world data (RWD) and real-world evidence (RWE), and facilitates clinical trials.

    Microsoft is one company that uses generative AI in radiology. Other companies are joining too.

    GPT-4 can serve as a valuable assistant to radiologists. It does not supplant human judgement but supplements it; radiologists must verify these reports.

    There is a dearth of radiologists: there are about 20,000 radiologists for a population of 1.4 billion people, roughly one radiologist per 70,000 people. It is a below-par ratio. GPT-4 is thus a boon for the field of radiology.

  • AI Models with Planning and Strategizing Capacity

    As we have observed, Google launched its Gemini multi-modal model in December 2023. Google had already created a model that could beat champion Go players. Gemini will not only generate text and images but will also be able to do some planning and strategizing, and will use those skills for problem solving.

    On the other hand, we have heard about OpenAI’s Q*. Gemini will compete with ChatGPT. Q* can reportedly perform grade-school math. OpenAI is thus pushing ChatGPT in the direction of Gemini, combining mathematical capabilities with software that can generate text and images. This is unique, and the process resembles the thinking and problem-solving process of human beings.

    Such models can be asked to perform tasks like marketing research for a new product. They will come back with a market analysis and additional ideas. Maybe they require some hand-holding, but they carry out their responsibility and do not remain limited to one task. These models are thus capable of performing broader tasks rather than just single tasks, and this does affect the job market.

    Companies have just a handful of foundational models to choose from: OpenAI’s ChatGPT, Google’s Bard or Gemini, or Amazon’s model.

    There is one disturbing factor: these models carry entrenched biases against people with disabilities and racial minorities. Some of their operations are inscrutable; they are a black box.

    Microsoft 365 Copilot will be used by 7 million knowledge workers, and Google’s Duet AI by 3 billion users of the enterprise platform Workspace (Forrester Research).

    We now know the way things are moving; it could bring about disruption. Models with planning and strategizing capacity should be used with caution.

  • Training Large Language Models (LLMs)

    LLMs are trained on a massive amount of data in pre-training: unlabelled text such as web pages, articles and books. The training is unsupervised (more precisely, self-supervised). The idea is to make the model learn the statistical patterns and structure of language.

    The most common pre-training objective is prediction of the next word: the LLM is given a sequence of words and asked to predict the next word in the sequence. This teaches the LLM the relationships between words and how they are used in different contexts. Alternatively, certain words in the sequence are masked, and the LLM has to predict these masked words.
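
    A minimal PyTorch sketch of the next-word objective described above (the tiny embedding-plus-linear model and the random stand-in data are illustrative assumptions; real LLMs use stacked transformer layers):

        import torch
        import torch.nn as nn

        vocab_size, embed_dim = 1000, 64
        model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                              nn.Linear(embed_dim, vocab_size))

        tokens = torch.randint(0, vocab_size, (8, 33))   # a batch of 8 token sequences
        inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from token t

        logits = model(inputs)                           # shape: (batch, seq_len, vocab_size)
        loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                           targets.reshape(-1))
        print(loss.item())  # lower loss means better next-word predictions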

    After pre-training, the LLM is fine-tuned for a specific task, e.g. translation from one language to another or answering questions. Here labelled data specific to the task is used to train the LLM. This is supervised learning, in which the LLM learns task-specific patterns and relationships between words.

    In both pre-training and fine-tuning, a forward pass and backpropagation are used. In the forward pass, the input data is fed into the LLM and the output is computed. The data passes through layers of neurons; weights and an activation function are applied at each neuron. The output of the forward pass is the LLM’s prediction.

    Backpropagation is the process of using the error between the LLM’s prediction and the true label to adjust the LLM’s weights. This is accomplished by computing the gradient of the error with respect to the weights, and then using the gradient to update the weights in a way that reduces the error.

    Both the forward pass and backpropagation are important in an LLM’s training. The forward pass allows it to make predictions, and backpropagation allows it to learn from its mistakes and improve its predictions over time. In other words, it improves its accuracy.

    This training process is iterative. In each iteration the LLM receives a batch of data, the forward pass and backpropagation are applied, and the weights are updated in light of the backpropagation results. Then the next iteration begins.

    First, the training data is prepared: the data is cleaned, noise is removed and the text is tokenized. The model weights are initialized, either randomly or using pretrained weights from another model. A batch of data is then fed to the model. In the forward pass, it passes through the layers of neurons; the LLM’s weights and activation functions are applied at each neuron. The output of the forward pass is the LLM’s prediction for the input data. At this stage, a loss is calculated between the LLM’s prediction and the true label (the desired output). This loss indicates how bad the LLM’s prediction was.

    In backpropagation, the loss is used to compute the gradient of the loss with respect to the LLM’s weights. The gradient is then used to update the weights so as to reduce the loss.

    The cycle of forward pass, loss calculation and backpropagation is repeated until the LLM has learned to make accurate predictions on the training data. Later it is fine-tuned for specific tasks on a smaller amount of task-specific data.
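
    Putting the steps above together, a single training epoch in PyTorch might look like the following sketch (the model, optimizer, loss function and data loader are placeholders; the flow of forward pass, loss, backpropagation and weight update is the point):

        import torch

        def train_one_epoch(model, data_loader, optimizer, loss_fn):
            """One pass over the training data: forward pass, loss, backpropagation, update."""
            for inputs, targets in data_loader:
                optimizer.zero_grad()                 # clear gradients from the previous batch
                predictions = model(inputs)           # forward pass through the layers of neurons
                loss = loss_fn(predictions, targets)  # how bad the prediction was
                loss.backward()                       # backpropagation: gradients of loss w.r.t. weights
                optimizer.step()                      # update the weights to reduce the loss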

    This process is continuous, since predictions are constantly updated as new input data is received. In further iterations, the LLM uses the predicted word and the current context to generate the next prediction. It learns long-range dependencies in the language and makes more accurate predictions.

    Geoffrey Hinton, a British computer scientist, is known for his work on developing backpropagation. Yann LeCun, a French computer scientist, pioneered CNNs, which are well suited for image recognition and for NLP tasks including machine translation and question answering. Yoshua Bengio, a Canadian scientist, made significant contributions to the training of neural networks. Nitish Srivastava is known for the regularization technique dropout, which prevents overfitting. Tomáš Mikolov, a Czech scientist, is known for his work on word2vec. Kyunghyun Cho, a Korean scientist working at New York University, is known for his work on RNNs, which are well suited for sequential data and are used for NLP tasks including machine translation and speech recognition.

  • AI and Advertising

    Ad agencies these days commonly use tools such as ChatGPT for research, data analytics and code generation. Midjourney is used for visualization, Stable Diffusion can generate photo-realistic images, and Adobe Firefly is used for the seamless adaptation of creative assets across media.

    AI algorithms could be used to read reams of customer data.

    A company can supplement its promotional material with AI-generated videos. If similar videos were produced with human actors, they would cost far more; AI can bring about a reduction in content production costs.

    AI is used to create social media content. There are great savings.

    It is worth remembering what Piyush Pandey of Ogilvy says: behind even the most impressive AI system, there will always be people who use their uniquely human faculties to come up with Big Ideas.

  • AI and Music

    Music and technology go hand in hand; musicians performing on stage have long used technical equipment. Music has now received a new technological push on account of AI, which is getting better at generating music that sounds like the real thing; call these vocal deepfakes. There are pitch-perfect imitations of superstars, which generally appear without the express consent of the artist.

    AI has created issues for the $26 billion music industry. Artists fear their royalties will suffer as machines copy and replace recorded music catalogues. Machine-manufactured music would be a dampener for the industry.

    Of course, AI has also created opportunities for artists. Music companies want unauthorized deepfakes withdrawn. Grimes thought of a new business model: let the artist decide. Anyone can use her voice clone, but when a song is released officially, the royalties are shared equally (50-50). It is a way of partnering; there is no paperwork and there are no gatekeepers. Grimes casts her net wider and is capturing new markets.

    Tech augments jobs; it does not replace them. Grimes, it seems, believes in this economic theory.

    There is no clarity about the actual demand for AI music. Grimes’s AI song reached a respectable 1.6 million streams. Still, we are not in a machine-first world. Many US workers still identify their occupation as singing; they would resist the pressure of supercomputer music.

    Grimes too is cautious enough to insist on the artist’s consent and some control over her gains. Record labels are making AI music of their own as well.

    We know the music industry is not known for caring about its artists beyond a point. If musicians survive AI’s onslaught, there is hope for other professionals too.

  • In Retrospect 2023 : The AI Year

    ChatGPT, introduced in November 2022, turned one year old in 2023. ChatGPT is based on a large language model (LLM). LLMs, as we know, build on a seminal paper titled ‘Attention Is All You Need’, co-written by eight authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin. It was published in 2017 by researchers at Google.

    LLMs use the transformer architecture. ChatGPT provides a user-friendly conversational interface to the underlying LLM, GPT-3.5. It generated content and kicked off generative AI, which caused a global stir unlike anything since the iPhone’s debut. It acquired a million users within five days of its launch; at present, it has hundreds of millions of users. A series of similar bots from different organizations have appeared in the market, the most recent being Amazon Q.

    The technology has had a profound effect on creative and knowledge-based work. It reduced the time taken to generate content by a significant percentage and greatly enhanced output quality.

    The technology was put on par with electricity and fire, considering its tremendous transformative power. McKinsey estimates an addition of some $4 trillion a year to the global economy.

    There are concerns about the safety of generative AI, and it has been a matter of discussion from the US Congress to the Bletchley Park resolution in the UK. The debate has divided the world into two camps: those who want to speed up AI research and those who would like to go slow on it.

    Some adore its potential to benefit mankind; others are ‘doomers’ who are concerned about the harm it can cause.

    There are voices everywhere advocating regulation of AI.

    The whole of 2023 could be considered a year of AI. In the midst of all this, OpenAI’s board dismissed Sam Altman, the CEO, but he was reinstated in less than a week after employees threatened to leave en masse.

    It has been rumoured in this context that a secretive project called Q* (pronounced Q-star) was conceived at OpenAI in the days before Sam Altman was fired. Employees reportedly informed the board about the project, warning that Q* could be harmful to humanity. Perhaps the dismissal of the CEO could be attributed to this; it is, however, a remote possibility, since Sutskever denies receiving any such letter.

    Q* is said to build on a new neuro-symbolic architecture. Such a system could enable an AI model to learn from less data and to explain its behaviour and logic. This is a stepping stone towards ‘artificial general intelligence’ (AGI). Though ill-defined, AGI means an information-processing ability equalling or even surpassing that of a human being, exercised at machine speed.

    Q* may fall short of this breakthrough, but it takes us closer to it. Nvidia’s CEO Jensen Huang says this is achievable in the next five years. Microsoft’s president Brad Smith is not so optimistic about AGI; he feels it may take years, even decades.

    Both ChatGPT and Q* have given rise to speculation, apprehension, competition and regulation. The year gone by, 2023, is a milestone year and represents the human quest for knowledge and mastery over the universe.

    The coming years are going to be much more tumultuous. The path we choose to tread depends on the guidance the wise in the field offer us.

  • AI Superintelligence — Elusive in Near Future

    Facebook’s chief AI scientist Yann LeCun believes that AI today is far from reaching any semblance of sentience. Jensen Huang, Nvidia’s CEO, expresses exactly the opposite view and says that within five years AI will be fairly advanced. Maybe his forecast is coloured by his vested interest: he supplies the chips that power the AI race. LeCun says that society is likely to reach cat-level or dog-level AI competence before reaching human-level competence, and that the focus on language models alone is not sufficient to take us to human-level AI systems.

    Text alone cannot train a model to understand the distinction between A and B. Transformer models should be able to handle a variety of data: audio, images and video. There could be billions of correlations between these kinds of data, and the more we understand them, the more fascinating it will be.

    Multi-modal systems are thus the next frontier. However, they are an expensive proposition: Facebook’s Llama was trained on 16,000 Nvidia A100 GPUs. You can imagine the hardware requirements of future models.

    Currently, GPU technology is the gold standard for AI, but future chips may not be called GPUs; they would simply be deep-learning accelerators.

    LeCun has misgivings even about quantum computing. He believes that classical computing is enough to solve current problems efficiently, and that useful quantum computing has a long time horizon.

    A decade back, however, AI was already considered a commercializable technology.

  • Redesigning Transformer Architecture

    As we know, LLMs such as ChatGPT consume extensive memory and have huge computational demands. Such models are therefore expensive, and their operating costs are very high.

    It is possible to simplify the architecture of LLMs to make them more economical. The underlying architecture is the transformer, and researchers at ETH Zurich have targeted the transformer block. They have come out with a new, streamlined design that retains the transformer’s accuracy and inference ability.

    LLMs, as we know, are built from stacked transformer blocks that process data sequences. Each transformer block has two key sub-blocks: the attention mechanism and the multi-layer perceptron (MLP). The attention layer focuses selectively on different parts of the input data (say, the tokens in a sequence) to capture context and relative importance. Even though tokens may be far apart, the model learns how they relate to each other.

    The transformer block processes the sequential data, and the MLP sub-block further refines and processes the information highlighted by the attention mechanism. Together they capture the relationships in the data.

    There are additional features such as residual connections and normalization layers. These speed up learning and mitigate training problems such as vanishing gradients.
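
    For reference, a conventional transformer block with the two sub-blocks, residual connections and normalization layers looks roughly like this in PyTorch (the dimensions and names are illustrative assumptions):

        import torch.nn as nn

        class TransformerBlock(nn.Module):
            """Standard pre-norm block: attention sub-block, then MLP sub-block,
            each wrapped in a residual (skip) connection."""
            def __init__(self, dim=512, heads=8):
                super().__init__()
                self.norm1 = nn.LayerNorm(dim)
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.norm2 = nn.LayerNorm(dim)
                self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                         nn.Linear(4 * dim, dim))

            def forward(self, x):
                h = self.norm1(x)
                a, _ = self.attn(h, h, h)        # attention over the normalized input
                x = x + a                        # residual connection around attention
                x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
                return x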

    Transformer blocks are stacked to increase the model’s capacity to capture complex relationships in the training data. However, the fundamental design of the transformer block has remained unaltered since its inception.

    Given the excessive costs of training and deployment, any efficiency gains in training and inference translate into substantial savings.

    The transformer block can be simplified by eliminating unnecessary components. This reduces the parameter count and increases the throughput of the model.

    The stripped-down version of the transformer, according to the research team, does not compromise either training speed or performance on downstream tasks.

    A transformer model has multiple attention heads, each with key (K), query (Q) and value (V) parameters; together these map the interactions among the input tokens. The researchers found that the V parameters, along with the projection layer that synthesizes values for the MLP block, can be eliminated with no loss of effectiveness.

    At the same time, the researchers removed the skip connections, which conventionally help avoid vanishing gradients (gradients so small that they produce no significant learning in the preceding layers, making training difficult).

    The transformer block has also been redesigned to process the attention heads and the MLP concurrently rather than one after the other; see the sketch below. It is this parallel processing that deviates from the conventional architecture.
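
    A rough sketch of the parallel idea only (not the exact ETH Zurich design): attention and the MLP read the same normalized input and their outputs are combined, instead of being applied one after the other.

        import torch.nn as nn

        class ParallelBlock(nn.Module):
            """Simplified block: attention and MLP are computed concurrently from the
            same normalized input, illustrating the parallel layout described above."""
            def __init__(self, dim=512, heads=8):
                super().__init__()
                self.norm = nn.LayerNorm(dim)
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                         nn.Linear(4 * dim, dim))

            def forward(self, x):
                h = self.norm(x)
                a, _ = self.attn(h, h, h)
                return a + self.mlp(h)   # attention and MLP branches combined directly

    In a real simplified design, the removal of skip connections and value parameters is compensated by the training refinements and architectural tweaks described below.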

    The reduction of parameters has been compensated for by adjusting non-learnable parameters, refining the training procedure, and implementing architectural tweaks. Put together, these alterations maintain the model’s learning capabilities despite the leaner structure.

    The researchers have tested the new transformer block. The transformer shrinks in size by as much as 16 per cent without diluting its capabilities. Extended to a large model with billions of parameters, this could result in massive memory savings.

    The greater depth makes the model train faster and make use of the extra capacity that depth provides. However, the work has so far been tested only at a smaller scale and remains untested on larger models.