Blog

  • Medical Innovations

    As we know, the premier technical institutes in the country, such as the IITs, promote innovation through incubator programmes. On similar lines, the country’s leading medical institutions, such as AIIMS, have joined hands with young entrepreneurs to develop health-related products and software. These products could be useful to both doctors and patients, and they should be scalable and feasible. AIIMS shares its infrastructure (samples and patients) to encourage these startups.

    At present, 10 such projects are progressing at AIIMS. The whole effort started in 2021. Some projects are awaiting validation through clinical trials, while others are awaiting regulatory approval (from the CDSCO). The startups are mentored by AIIMS faculty, and there are training programmes and bootcamps to help them develop patient-friendly products.

    One such project addresses weight loss. The team is developing a clinically validated app (Zeigen ObesityRx) for people struggling to lose weight. Current weight-loss apps focus on workouts and nutrition plans and do not target a user’s psychology.

    A stem-cell-based product is being developed for treating traumatic injuries and burn wounds. It will promote tissue regeneration (without reconstruction or plastic surgery). Conventional stem-cell products are preserved through freezing and storage, which makes them less user-friendly. The AIIMS products are instead derived from stem-cell by-products (double-membrane vesicles measuring less than 200 nanometres). These are an alternative to stem cells, have similar properties and can be developed at a lower cost.

    The product is in the form of a powder which could be sprinkled over the wound so as to speed up its healing and regeneration.

    They are also working on sprays and gels of stem cells.

    The powder form can be reconstituted into an injectable liquid which can be introduced into the knee joint to treat arthritis. It treats the underlying cause of the disease. It has been tested on animals such as pigs. It is awaiting human clinical trials.

    A gut-microbiome-based product is also being tested to boost immunity and protect heart and brain health.

  • Foundational Models

    Foundational models refer to large language models (such as the GPT series), BERT and others that are trained on a vast corpus of data and serve as the foundation for various NLP tasks (such as text generation, translation, sentiment analysis and more). They are the starting point for building specialized models for specific applications.

    Some prominent scientists who contributed to NLP and ML research are:

    Geoffrey Hinton: His work laid the foundation for modern deep learning techniques, including those used in foundational models.

    Yoshua Bengio: His research has advanced our understanding of neural networks and their applications in NLP tasks.

    Yann LeCun: His work on CNNs is well-known and has applications in computer vision (CV). His contributions are useful in the development of foundational models.

    Google Team, OpenAI Team: These teams played crucial roles in the development and advancement of foundational models such as GPT and BERT.

  • Gen AI: Biggest Human Invention

    Jeff Maggioncalda, CEO, Coursera feels that generative AI is at the pinnacle of human inventions. As far as its impact on humanity is concerned, he rates it as high as language, the alphabet or writing. Just as the ability to speak changed the course of human history, generative AI will do so too, provided humans master it.

    If you know how to use it to your advantage, you will stand out. It is a tech disruption. Coursera runs a course for CXOs called Navigating Generative AI. Then there is an umbrella course on AI. These are massive open online courses (MOOCs). They are free, but if you want a certificate, you have to pay a nominal amount (a couple of thousand rupees). MOOCs can count as credit towards college degrees.

    Coursera has used AI to translate courses into Hindi.

  • Learn from Silicon Valley’s Gurus

    Organizations these days appoint Chief Experience Officers (CXOs), who manage the overall customer experience of an organization and encourage positive customer interactions. They typically have a background in operations, marketing, sales or customer service, and are often MBAs or hold some other master’s degree.

    As a CXO, you can never stop learning. Learning is not restricted to classrooms; it is much more. CXOs must build international connections with elite professionals, learn from industry leaders to push boundaries in their sectors, and acquaint themselves with new business environments.

    India and APAC, the high-quality education providers, would like to facilitate CXO education by starting a unique Global Executive Immersion programme for Indian CXOs in Silicon Valley. It is a 6-day, 7-night programme.

    The cost of the programme is $15,000; airfare and visa fees are borne by the participant. Silicon Valley is an ideal landscape for learning. It is the epicenter of technological innovation and serves as the incubator of cutting-edge ideas and breakthrough advancements. It has the highest concentration of tech companies and Fortune 500 firms.

    Participants will arrive in the Silicon Valley on April 26. There will be a welcome reception.

    On April 27, there will be a panel discussion on AI. Participants will learn about investment trends. They will identify future growth areas.

    On April 28, they will visit the Stanford University campus and learn about its legacy in Silicon Valley and the tech industry. There will be a classroom session on Design Thinking by Barry Katz.

    On April 29, participants will converse with OpenAI’s mentor/advisor. There will be a tour of Berkeley Campus and a tour of Intel Museum.

    In fact, Silicon Valley has been shaped by technological prowess, and the Intel Museum pays tribute to it. Intel’s journey is depicted here, from the pioneering Intel 4004 to present-day processors.

    On April 30, there will be a tour of a search giant’s office. It will be followed by a visit to Apple Park visitor center.

    The next two days will be spent on a guided tour of the first chip manufacturer in the world and conversations with venture capitalists. Finally, there will be a conversation with OpenAI’s Zack Kass, former head of commercialization, and a tour of the Computer History Museum.

    On May 2, the programme will end with a gala evening of reflection, connection and global insights, featuring a lavish dinner party set against the backdrop of the Bay Area.

    The programme can be attended by CEOs and CXOs, MDs, presidents, founders, co-founders and partners with a minimum of 20 years of work experience.

  • Vector Representation of Words

    Consider three words — cat, dog and bird. Each word can be represented by a numerical vector in a high-dimensional space. The vector captures three dimensions — x, y and z.

    Cat could be represented by [0.8, 0.2, 0.5]

    Dog could be represented by [0.7, 0.3, 0.6]

    Bird could be represented by [0.3, 0.9, 0.2]

    x, y and z could represent size, animal type and habitat (a different aspect of the word’s meaning or usage).

    Algorithms analyze large amounts of text data and construct these word embeddings. These encode semantic and syntactic information about words.

    These representations are standardized to a certain extent, but there is no single standard. Word embeddings (Word2Vec, GloVe and FastText) are popular approaches for generating vector representations of words. These vectors are of fixed length, which facilitates standardization across words and models. What varies are the specific dimensions and values within these vectors, depending on the algorithm and the training data used.
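
    To make this concrete, here is a minimal sketch (using NumPy) that compares the illustrative three-dimensional vectors above with cosine similarity. The dimensionality and values are purely for illustration; real embeddings from Word2Vec or GloVe typically have hundreds of dimensions.

    import numpy as np

    # Illustrative 3-dimensional "embeddings" from the example above;
    # real Word2Vec/GloVe vectors typically have 100-300 dimensions.
    cat = np.array([0.8, 0.2, 0.5])
    dog = np.array([0.7, 0.3, 0.6])
    bird = np.array([0.3, 0.9, 0.2])

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors: closer to 1 means more similar.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine_similarity(cat, dog))   # high: cat and dog point in similar directions
    print(cosine_similarity(cat, bird))  # lower: cat and bird are less similar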

    Even without using these algorithms, words can be converted into vectors. The approach is called one-hot encoding. Here, each word in the vocabulary is represented as a vector where all elements are 0 except for the element corresponding to the index of that word in the vocabulary, which is 1. Let us consider a small vocabulary with three words: cat, dog and bird.

    Cat could be represented as [1, 0, 0]

    Dog could be represented as [0, 1, 0]

    Bird could be represented as [0, 0, 1]

    The vectors created are sparse vectors, where most elements are zero. However, one-hot encodings do not capture the semantic relationships between words (like embeddings do). They can also result in very high-dimensional representations for large vocabularies.

    Index here refers to a word’s position in the pre-defined vocabulary. Each word has a unique index. The element corresponding to the word’s index is set to 1, and all other elements are set to 0.

    In the three-word vocabulary of cat, dog and bird, let us consider the indices assigned.

    Cat → index 0

    Dog → index 1

    Bird → index 2

    The vector for cat has three elements. The element at index 0 (corresponding to cat) would be set to 1.

    The other elements would be set to 0.

    For dog and bird, the 1 moves to their respective index. A one-hot vector is thus a binary vector with a single 1, whose position encodes the position of the word in the vocabulary.
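
    A minimal sketch of one-hot encoding for the three-word vocabulary above (the vocabulary and indices are the illustrative ones from the text):

    # One-hot encoding for a tiny illustrative vocabulary.
    vocabulary = ["cat", "dog", "bird"]                    # indices 0, 1, 2
    word_to_index = {word: i for i, word in enumerate(vocabulary)}

    def one_hot(word):
        vector = [0] * len(vocabulary)                     # all elements start at 0
        vector[word_to_index[word]] = 1                    # the word's own index becomes 1
        return vector

    print(one_hot("cat"))   # [1, 0, 0]
    print(one_hot("dog"))   # [0, 1, 0]
    print(one_hot("bird"))  # [0, 0, 1]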

    We now know what a sparse vector is. Let us now consider a dense vector, in which most of the elements are non-zero. In practice, dense vectors are typically used rather than sparse vectors.

    Dense vectors are often used in word embeddings. Each word is represented by a vector of real numbers (floats) in a continuous vector space. The real numbers capture nuanced relationships between words.

    Each dimension of the vector might represent a different aspect of the word’s meaning or context. Dense vectors are generally lower-dimensional (compared to one-hot encodings), computationally more efficient, and able to capture subtle semantic relationships between words.
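
    As a rough sketch of how such dense vectors are stored and looked up in practice, PyTorch’s nn.Embedding maps word indices to learnable dense vectors. The vocabulary size and embedding dimension below are made-up values for illustration; in an LLM these weights are learned during training.

    import torch
    import torch.nn as nn

    # Hypothetical sizes: a 10,000-word vocabulary, 300-dimensional dense vectors.
    embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=300)

    word_indices = torch.tensor([0, 1, 2])   # e.g. the indices for cat, dog, bird
    dense_vectors = embedding(word_indices)  # shape (3, 300); mostly non-zero floats

    print(dense_vectors.shape)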

    Though the data stored in the hardware is in the form of vectors (word embeddings), the answers to our prompts are in natural text.

    This involves a process of decoding these representations back into natural language.

    The input prompt is processed and converted into the corresponding embeddings. These embeddings go into the model, where they are processed through its layers (RNNs, transformers or other architectures). The model learns to generate text based on the input embeddings and the context provided.

    The output is a sequence of token or word embeddings, which are decoded back into text. Decoding involves selecting the most probable word for each position in the sequence (based on the probabilities the model learned during training). It can also use techniques such as beam search or sampling.

    In post-processing, the coherence and readability of the output are checked. Duplicate phrases are removed, grammatical mistakes are corrected, and the style is adjusted to match the input prompt or context.

  • Sampling Techniques in Generating Next Word

    LLMs use sampling techniques to generate the next word. Some common sampling techniques used are:

    Greedy Sampling Here the word with the highest probability is chosen. Though very straightforward, it tends to produce repetitive and less diverse output.
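
    A minimal sketch of greedy selection over a toy probability distribution (the vocabulary and probabilities are made up for illustration):

    import numpy as np

    # Toy next-word probabilities over a five-word vocabulary (illustrative values).
    vocab = ["the", "cat", "sat", "on", "mat"]
    probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])

    # Greedy sampling: always pick the single most probable word.
    next_word = vocab[int(np.argmax(probs))]
    print(next_word)  # "the"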

    Top-k Sampling Here the word is selected from the top k most likely words. It is a random technique that still ensures words with higher probability are favoured.

    First, the probabilities of all possible words in the vocabulary are calculated based on context. The probabilities are sorted, and the top k words with the highest probabilities are selected. From this reduced set of k words, the model randomly selects one word to be the next word in the generated sequence.

    There is a balance between selecting highly probable words (coherence) and some randomness (diversity). The value of k determines how many words are considered in this selection process.
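
    A minimal sketch of top-k sampling, reusing the toy vocabulary and probabilities from the greedy example above (all values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "cat", "sat", "on", "mat"]
    probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])

    def top_k_sample(probs, k):
        # Keep only the k most probable words, renormalize, and sample among them.
        top_indices = np.argsort(probs)[-k:]
        top_probs = probs[top_indices] / probs[top_indices].sum()
        return rng.choice(top_indices, p=top_probs)

    next_word = vocab[top_k_sample(probs, k=3)]
    print(next_word)  # one of "the", "cat", "sat"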

    Top-p (Nucleus) Sampling Here the word is selected from the smallest set of words whose cumulative probability exceeds a threshold p. The set of words is adjusted dynamically to maintain diversity (based on the changing probabilities).

    The smallest set of words is determined dynamically based on a cumulative probability threshold (denoted as p).

    First, the probabilities of all possible words are determined based on context and sorted in descending order. Starting from the word with the highest probability, the model calculates the cumulative probability while iterating through the sorted list.

    Once the cumulative probability exceeds the threshold p, the model stops considering additional words. It then selects randomly from this subset of words whose cumulative probability exceeds p, based on the original probabilities of those words.
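
    A minimal sketch of nucleus (top-p) sampling on the same toy distribution (the threshold, vocabulary and probabilities are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["the", "cat", "sat", "on", "mat"]
    probs = np.array([0.40, 0.25, 0.15, 0.12, 0.08])

    def top_p_sample(probs, p):
        # Sort words by probability (descending) and keep the smallest set
        # whose cumulative probability exceeds the threshold p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, p) + 1   # number of words kept
        nucleus = order[:cutoff]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()
        return rng.choice(nucleus, p=nucleus_probs)

    next_word = vocab[top_p_sample(probs, p=0.7)]
    print(next_word)  # sampled from "the", "cat" or "sat" (cumulative 0.80 > 0.7)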

    Temperature Scaling SoftMax probabilities are adjusted before sampling to control the randomness of the generated text. Lower temperatures lead to more deterministic outputs; higher temperatures promote more randomness.
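
    A minimal sketch of temperature scaling applied to a few raw scores (the logits are made up for illustration):

    import numpy as np

    def softmax_with_temperature(logits, temperature):
        # Divide raw scores by the temperature before applying SoftMax.
        # temperature < 1 sharpens the distribution (more deterministic);
        # temperature > 1 flattens it (more random).
        scaled = np.array(logits) / temperature
        exp = np.exp(scaled - np.max(scaled))   # subtract the max for numerical stability
        return exp / exp.sum()

    logits = [2.0, 1.0, 0.5]
    print(softmax_with_temperature(logits, temperature=0.5))  # peaked distribution
    print(softmax_with_temperature(logits, temperature=2.0))  # flatter distribution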

    These techniques achieve a balance between coherent responses and diversity in generated text.

  • Optimization of an LLM

    A large language model’s efficiency, performance and scalability can be improved by using a suitable combination of the following strategies.

    1. Algorithmic improvements One can research and implement novel algorithms specially customized for optimizing LLMs.
    2. Architecture optimization A model’s architecture should be refined from time to time to improve its performance and efficiency: experiment with different architectures, layer configurations, activation functions, etc.
    3. Hardware optimization Either use customized hardware or specialized hardware architectures which are optimized for deep learning tasks.
    4. Parameter tuning Parameters such as the learning rate, batch size and optimizer choice can be fine-tuned. This improves training efficiency and convergence speed.
    5. Quantization One can reduce the precision of the model’s weights and activations so as to decrease memory usage and speed up inference without sacrificing much performance (a minimal sketch appears after this list).
    6. Data augmentation A model can use synthetic training data. Or else one can apply techniques like dropout and regularization. This prevents overfitting and improves generalization.
    7. Knowledge distillation A larger model is used to distill knowledge into a smaller model, which reduces the computational complexity.
    8. Pruning One can reduce redundant or less important connections in the model to shrink its size and computational cost, while preserving its performance.
    9. Parallelization Distributed computing frameworks are leveraged. Hardware accelerators such as GPUs and TPUs are used. It parallelizes training and inference tasks. It reduces execution time.
    10. Model compression Several techniques such as low rank factorization, weight sharing, or parameter tying are used to compress the model’s parameters and reduce its memory footprint.
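
    As a rough sketch of item 5, here is post-training dynamic quantization in PyTorch applied to a small illustrative model. The model and layer sizes are made up, and the exact quantization API and supported layers depend on the PyTorch version.

    import torch
    import torch.nn as nn

    # Illustrative model; in practice this would be a trained LLM or one of its sub-modules.
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

    # Dynamic quantization: weights of Linear layers are stored as 8-bit integers,
    # shrinking memory use and often speeding up inference on CPU.
    quantized_model = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )

    print(quantized_model)
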
  • Google in the AI Race

    As we know, ChatGPT was launched by OpenAI in late November 2022. It was a pathbreaking event. Google had already been testing generative AI for several months by then, and that work led to various models emerging from different divisions within Google. None was good enough to surpass GPT-4. Google deferred its plans to launch a rival model while sorting out the scattered research work. In the meanwhile, it released a chatbot called Bard, which was considered less sophisticated than ChatGPT.

    A year later, Gemini was ready, but some flaws were detected in its image generation. The release was delayed, and the company could not seize the opportunity to be a leader in generative AI, a technology that took shape in Google’s own labs (ref. “Attention Is All You Need” by Vaswani et al., 2017).

    Google enjoyed a leadership position in the internet revolution with its search engine, Google Search (late 1990s and early 2000s). Later, Google diversified into mapping, email and more, becoming the most valuable company in 2016.

    ChatGPT arrived about 25 years after Google’s launch. It was a tool to navigate online information more creatively. Microsoft was determined to take advantage while Google stumbled. Microsoft tied up with OpenAI, funded its research and embedded AI into its existing products. By doing so, it has become the most valuable company in the world.

    After initial hiccups, Google is steadying itself. Gemini has become acceptable in tech circles. Google is considering adding paid generative AI services to its search engine. However, Google has so far earned most of its revenues through advertising, and it is struggling to succeed in the generative AI space.

    There are several flaws in Google’s strategy: a lack of a corporate plan for rolling out generative AI, a fragmented organization structure, and possibly simmering inter-departmental tensions. Google has to master the execution of its strategies. These are, however, early days, and Google is well-positioned to move ahead.

    At present, the formidable image Google has in the search engine space brings outsized attention to its minor flaws. Google is addressing cultural and organizational issues. Its sheer size also causes certain problems.

    Google has merged two research divisions, DeepMind and Google Brain, into Google DeepMind. There should now be better coordination between the research teams.

    The new technology could cannibalize Google’s traditional search business: generative AI gives direct answers, whereas the links displayed by search require further effort from the user. Google has to protect its cash-cow product.

    Google cannot afford to ignore generative AI. The longer it takes to adopt it fully, the greater is the risk of consumers switching over to rival companies.

  • SoftMax

    SoftMax is a mathematical function that converts a vector of real numbers into a probability distribution. It is frequently used as the output layer activation function in neural networks for classification tasks.

    The SoftMax function takes the exponential of each element of the input vector and then normalizes them by dividing by the sum of the exponentials. The result is a probability distribution over multiple classes, which makes it useful for determining the likelihood of each class being the correct one.

    SoftMax transforms a vector of arbitrary real-valued scores into a probability distribution over classes.

    The formula, applied to the vector \(\mathbf{z} = (z_1, z_2, \ldots, z_n)\), is:

    \[ \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \quad i = 1, \ldots, n \]

    SoftMax ensures that the output values are non-negative and sum to 1 (making the output a valid probability distribution). It exponentiates the input scores, which amplifies the differences between them, and it is differentiable, which makes it suitable for gradient-based training.
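
    A minimal NumPy sketch of the formula above (the input scores are made-up values for illustration):

    import numpy as np

    def softmax(z):
        # Exponentiate each score and normalize by the sum of the exponentials.
        # Subtracting the maximum first avoids numerical overflow; it does not
        # change the result because SoftMax is invariant to shifting all scores.
        exp_z = np.exp(z - np.max(z))
        return exp_z / exp_z.sum()

    scores = np.array([2.0, 1.0, 0.1])   # raw scores (logits) for three classes
    probs = softmax(scores)
    print(probs)        # approximately [0.66, 0.24, 0.10]
    print(probs.sum())  # 1.0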

    It transforms raw scores into probabilities and is useful in various ML algorithms. The raw scores at the output layer are the activations produced by the neurons in the output layer before any activation function is applied. They are often represented as a vector of real numbers, where each element corresponds to a different class in the classification task.

    In an image classification network with 10 classes, the output layer may have 10 neurons. Each produces a raw score representing the network’s confidence or likelihood that the input image belongs to a particular class. These raw scores are passed through the SoftMax function to convert them into probabilities.

    SoftMax is used during the training of LLMs and other neural networks. It is a common activation function in the output layer of a neural network for classification tasks.

    In LLMs, SoftMax is used in the output layer to compute a probability distribution over the vocabulary of words. During training, the model learns to predict the next word in a sequence based on the input context. SoftMax helps normalize the model’s output into a probability distribution.

    When LLMs generate text, however, SoftMax itself does not pick the next word. A sampling strategy selects the next word based on the probabilities predicted by the model, using techniques like greedy sampling, beam search or nucleus sampling. During training, SoftMax is used to teach the model to produce these probabilities; during generation, the model uses the learned probabilities to guide the sampling process.

  • PyTorch

    PyTorch is an open-source ML framework used for building deep learning models. It provides a flexible, dynamic computational graph, which makes it suitable for research, experimentation and production deployment.

    In LLMs, PyTorch is used in GPT-like pre-trained transformers for two purposes. The first is model development: PyTorch is used to design and implement the architecture of an LLM. It provides a high-level interface that facilitates the inclusion of components such as layers, activation functions and optimization algorithms. Since it is dynamic, it allows easy experimentation and prototyping. The second purpose is to train the model and fine-tune it.

    LLMs are trained and fine-tuned on vast amounts of data. The models are pre-trained (using unsupervised pre-training and self-supervised learning). PyTorch facilitates the implementation of training algorithms, supports distributed computing and enables evaluation of model performance during training.

    Let us see how PyTorch is used to build a simple language model: an RNN that generates text character by character.

    First of all, install PyTorch. Then we write a Python script to define, train and use a simple character-level language model. It trains the model on dummy data and then generates characters using the trained model.

    import torch
    import torch.nn as nn
    import torch.optim as optim

    # Define the RNN model
    class CharRNN(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super(CharRNN, self).__init__()
            self.hidden_size = hidden_size
            self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, output_size)

        def forward(self, x, hidden=None):
            out, hidden = self.rnn(x, hidden)   # out: (batch, seq_len, hidden_size)
            out = self.fc(out)                  # out: (batch, seq_len, output_size)
            return out, hidden

    # Define some hyperparameters
    input_size = 100    # Size of input vocabulary
    hidden_size = 128   # Number of hidden units
    output_size = 100   # Size of output vocabulary
    seq_length = 20     # Length of input sequences
    num_epochs = 100    # Number of training epochs

    # Create an instance of the model
    model = CharRNN(input_size, hidden_size, output_size)

    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Dummy training data: random inputs and random target character indices
    data = torch.randn(100, seq_length, input_size)
    labels = torch.randint(0, output_size, (100, seq_length))

    # Training loop
    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()

        # Forward pass
        outputs, _ = model(data)
        outputs = outputs.view(-1, output_size)
        loss = criterion(outputs, labels.view(-1))

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item():.4f}")

    # Generating text
    def generate_text(model, start_char='A', length=100):
        model.eval()
        with torch.no_grad():
            # One-hot encode the starting character (assumes its code point < input_size)
            input_char = torch.zeros(1, 1, input_size)
            input_char[0, 0, ord(start_char)] = 1
            hidden = None
            output_text = start_char
            for _ in range(length):
                output, hidden = model(input_char, hidden)
                output_probs = torch.softmax(output[0, -1], dim=-1)
                predicted_char = torch.argmax(output_probs).item()
                output_text += chr(predicted_char)
                # Feed the predicted character back in as the next input
                input_char.fill_(0)
                input_char[0, 0, predicted_char] = 1
        return output_text

    # Generate text using the trained model
    generated_text = generate_text(model, start_char='A', length=200)
    print("Generated Text:")
    print(generated_text)

    An alternative to PyTorch for designing neural networks is TensorFlow, which offers similar capabilities and is widely used in both research and industry.

    Prior to the advent of PyTorch and TensorFlow, neural networks were designed using lower-level libraries and frameworks such as Theano and Caffe. These provided the building blocks for constructing neural networks but often required manual coding and were less flexible than modern frameworks. Researchers and developers had to implement neural network architectures and algorithms from scratch, which was not only time consuming but also error prone. Besides, it was necessary to use specialized hardware and software optimization to achieve acceptable performance (for training and inference). Of course, neural networks could be designed in the absence of PyTorch and TensorFlow, but the process was cumbersome and less accessible to a wider audience.

    PyTorch emerged in 2016; it was developed by Facebook’s AI Research lab. TensorFlow was released by Google in 2015. The two arrived within about a year of each other and have been used widely for deep learning tasks. They can be used by anyone, as they are open-source frameworks with extensive documentation, tutorials and community support.