Blog

  • Seamless: Speech Translator Model

    Facebook has released Seamless, an AI-powered language translation model. It converts speech in one language into another while maintaining the tone and emotion of the original speech.

    Facebook came close to building a universal language translator in 2023 when it released SeamlessM4T, which can translate audio or text in 100 languages into text in any of those languages, or into speech in 36 languages. Much water has flowed down the Ganga since then.

    Seamless is the foundational model. In the future, it could facilitate a world where everyone is understood.

    The updated version is called SeamlessM4T v2. It performs automatic speech recognition and handles speech-to-speech, speech-to-text and text-to-speech translation.

    The new AI models built on it are SeamlessStreaming and SeamlessExpressive.

    A typical translator waits until the speaker finishes a sentence before translating, in order to deal with differing language structures: the subject-verb-object order of the syntax may differ from language to language. This leads to delays in translation, and the conversations feel less natural.

    SeamlessStreaming commences translation as soon as the speaker starts speaking. The listener hears the translation with a latency of just one to two seconds.

    SeamlessExpressive focuses not just on the content but on its tenor. The translation should maintain the emotion, style and rhythm of the original speech.

    Seamless is a multi-task, multilingual model.

    There is an expressivity encoder and an expressive unit-to-speech generator conditioned on the source speech.

    It has been made open source and is available on GitHub.
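
    As a rough illustration, here is a minimal sketch of using SeamlessM4T v2 through the Hugging Face transformers integration with the facebook/seamless-m4t-v2-large checkpoint; the class names, language codes and call pattern follow that library's documentation, and should be treated as indicative rather than definitive.

    ```python
    # Minimal sketch, assuming the Hugging Face transformers integration of
    # SeamlessM4T v2 and the facebook/seamless-m4t-v2-large checkpoint.
    import torch
    from transformers import AutoProcessor, SeamlessM4Tv2Model

    processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
    model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

    # English text in, French speech out (text-to-speech translation).
    inputs = processor(text="Hello, how are you today?", src_lang="eng", return_tensors="pt")
    with torch.no_grad():
        waveform = model.generate(**inputs, tgt_lang="fra")[0]  # audio tensor

    # The same call with generate_speech=False returns translated text tokens instead.
    with torch.no_grad():
        tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
    print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))
    ```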

  • Rohan Murty

    Rohan Murty’s parents are the celebrated Narayana Murthy, the founder of Infosys, and Sudha Murty, a well-known author and philanthropist. Akshata Murty, Rohan’s sister, is the wife of the British Prime Minister Rishi Sunak.

    Rohan was interested in programming and invention from his student days. He was schooled in Bangalore, graduated from Cornell in Computer Science, and completed his doctorate in computer engineering at Harvard. His doctoral specialization was opportunistic wireless networks.

    He founded a company of his own, Soroco, in 2014, with Arjun Narayan and George Nychis as co-founders. The company facilitates the digital transformation of businesses using AI-powered technology. It is rapidly scaling up and earned revenues worth $18 million in 2022.

    Rohan’s stake in Infosys is 1.67 per cent, which in value terms is Rs 5.55 lakh crore. He received Rs 100 crore plus in dividend income.

  • Tokenization and Vectorization

    Tokenization is breaking the input text into individual units called tokens — words, sub-words, characters, punctuation marks. You can choose a suitable tokenization method. Each token is assigned a unique numerical ID.
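
    A toy, word-level sketch of this step; the vocabulary and IDs below are made up purely for illustration.

    ```python
    # Toy word-level tokenizer: split text into tokens, then map each token to a
    # unique numerical ID. Real systems usually use sub-word methods such as BPE.
    text = "the cat sat on the mat ."
    tokens = text.split()                       # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

    vocab = {}                                  # token -> unique ID
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)

    token_ids = [vocab[tok] for tok in tokens]
    print(token_ids)                            # [0, 1, 2, 3, 0, 4, 5]; 'the' reuses ID 0
    ```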

    Then the process of vectorization starts. An embedding matrix is created: each row corresponds to a unique token ID and contains a vector of numerical values, typically floating-point numbers.

    When a token appears in the text, its ID is used to retrieve the corresponding vector from the embedding matrix. This vector is the token's machine-readable representation.
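
    A minimal PyTorch sketch of the lookup; the sizes and the random initial values are purely illustrative.

    ```python
    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 6, 4                      # tiny, illustrative sizes
    embedding = nn.Embedding(vocab_size, embed_dim)   # the embedding matrix; rows start random

    token_ids = torch.tensor([0, 1, 2, 3, 0, 4, 5])   # IDs from the tokenization step
    vectors = embedding(token_ids)                    # retrieve one row per token ID
    print(vectors.shape)                              # torch.Size([7, 4])
    ```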

    Vectors in the embedding matrix are initialized with random values. As the model is trained on a corpus of data, it learns a meaningful representation of each token: the vectors are adjusted to capture semantic relationships, similarities and patterns.

    The aim is for tokens with similar meanings to have similar vector representations. This enables models to generalize and make predictions for new text inputs based on learned patterns.

    The number of values in each vector is its dimensionality. It is a hyperparameter and can be adjusted depending on the task and model architecture.

    There are pre-trained embeddings which can be used instead of training embeddings from scratch. It is time- and cost-effective to do so.
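
    For example, PyTorch can load such embeddings directly; here a random tensor stands in for real pre-trained weights such as GloVe or fastText.

    ```python
    import torch
    import torch.nn as nn

    # 'pretrained' stands in for a weight matrix loaded from GloVe, fastText, etc.
    pretrained = torch.randn(6, 4)
    embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)  # reuse, keep fixed
    print(embedding(torch.tensor([2])))   # the pre-trained vector for token ID 2
    ```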

    Positional encoding is used in Transformers, which, unlike RNNs, process all tokens in parallel and have no inherent notion of order. It incorporates information about the position of a token within a sequence, letting the model learn word order and sentence structure.
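
    A common choice is the sinusoidal encoding from the original Transformer paper; a minimal sketch:

    ```python
    import math
    import torch

    def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
        """Sinusoidal positional encoding: even dimensions use sine, odd use cosine."""
        positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(positions * div_term)
        pe[:, 1::2] = torch.cos(positions * div_term)
        return pe

    # Added to the token embeddings so the model can distinguish positions.
    token_vectors = torch.randn(7, 4)                  # e.g. from the embedding lookup above
    inputs = token_vectors + sinusoidal_positional_encoding(7, 4)
    ```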

  • Wish You All a Happy and Prosperous New Year: AI Talent Hunt

    Admittedly, there is a global race for AI dominance. And the winner in this race will be decided by the quality of AI talent.

    How does India fare in this race? Does it have an ecosystem to attract, nurture and retain its AI talent? Are Indian universities equipped with cutting-edge research capabilities and thought leadership? India’s need for AI talent is far greater than what a few outlier universities with such capacities can offer.

    It is estimated by NASSCOM that India accounts for 16 per cent of the global AI talent pool. India has more than 2 million STEM graduates, but what matters is the quality of this manpower. It is a big challenge for Indian industry to convert this manpower into industry-ready professionals by training them.

    Even academia lags behind current trends, languishing in obsolescence in the STEM domain and producing only a sprinkling of cutting-edge work. In fact, this is an injustice to young India. There should be enough universities dedicated to AI research.

    India has many data centers and digital public pipelines, which are repositories of accessible datasets. Many Indian citizens understand more than one language, and some speak a mix of two, say Hinglish. This sets India apart: it has a distinctive cultural-linguistic nuance not found elsewhere.

    AI has so far remained a darling of the media. It has to travel to research labs, and then to industry. There should be tie-ups across sectors, curated catalogues of AI datasets, testbeds, educational resources and metadata.

    AI’s commercialization will receive a boost at the hands of local industry. India is being leveraged by foreign tech firms for its cost-effective talent and backend development, and they keep their own commercial interests in mind. India must realize this and cultivate its own ecosystem. It must learn to retain the AI talent that can push it onto the global stage; otherwise, that talent will gravitate to global ecosystems.

    At present, even the Indian tech industry is under stress due to global factors. Hiring has slowed down except in AI and related domains, where AI is a key driver of innovation. India should adapt to client needs for digital transformation and AI solutions. There is promise in tie-ups of global firms such as Nvidia with Indian firms. India’s IT majors must recalibrate their AI offerings with the necessary skilled talent.

    In AI, the US maintains an edge. It fosters AI unicorns. China too is making substantial investments in AI, and the UK and Germany follow suit. The US attracts talent: in STEM fields, half the master’s and doctoral graduates come from US universities and remain in the US after their education, and 60 per cent of AI researchers are affiliated with US institutions.

    Governments have recognized AI as an agent of change. At the same time, they are concerned about the risks posed by AI. It is a policy dichotomy: governments have to play a dual role, harnessing AI for its positives while simultaneously addressing the risks associated with it.

    In the AI race, make no mistake, the winner takes it all.

  • Cached Transformer with a GRC

    Here a transformer incorporates a memory cache to enhance its capacity to tackle long-range dependencies in sequences. Traditional transformers struggle to capture relationships between distant elements in long sequences.

    The key component of a cached transformer is the GRC, or Gated Recurrent Cache. This component dynamically stores token embeddings based on their relevance and historical significance.

    It serves as a differentiable memory cache. The model can attend to both current and previously seen information.

    Tokenized embeddings are converted into numerical vectors, which the GRC processes, storing relevant information in its cache. The transformer’s self-attention mechanism can now attend both to the current input tokens and to the cached historical information: it performs standard self-attention on the combined input and cached representations. This processed information is further refined through normalization and feed-forward layers. The GRC constantly updates its cache based on the current input tokens, ensuring that the stored information remains relevant.
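
    A simplified PyTorch sketch of the idea follows. This is not the authors' implementation: the gating rule, cache size and layer names are illustrative assumptions, meant only to show a gated cache being updated and attended to alongside the current tokens.

    ```python
    import torch
    import torch.nn as nn

    class GatedRecurrentCache(nn.Module):
        """Toy GRC: a learned gate blends the old cache with a summary of the
        current tokens. Illustrative only, not the paper's actual mechanism."""
        def __init__(self, d_model: int, cache_len: int):
            super().__init__()
            self.cache_len = cache_len
            self.gate = nn.Linear(2 * d_model, d_model)

        def forward(self, cache: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
            # cache: (batch, cache_len, d_model); x: (batch, seq_len, d_model)
            summary = x.mean(dim=1, keepdim=True).expand_as(cache)  # summarize current tokens
            g = torch.sigmoid(self.gate(torch.cat([cache, summary], dim=-1)))
            return g * cache + (1 - g) * summary                    # gated cache update

    class CachedTransformerLayer(nn.Module):
        def __init__(self, d_model: int = 64, n_heads: int = 4, cache_len: int = 16):
            super().__init__()
            self.grc = GatedRecurrentCache(d_model, cache_len)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
            self.register_buffer("cache", torch.zeros(1, cache_len, d_model))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Update the cache from the current tokens, then attend over cache + tokens.
            # (A full implementation would persist the updated cache across steps.)
            cache = self.grc(self.cache.expand(x.size(0), -1, -1), x)
            context = torch.cat([cache, x], dim=1)
            attn_out, _ = self.attn(query=x, key=context, value=context)
            x = self.norm1(x + attn_out)
            return self.norm2(x + self.ff(x))

    layer = CachedTransformerLayer()
    print(layer(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
    ```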

    Advantages:

    There is improved handling of long-range dependencies: the model captures relationships between distant elements in long sequences, which enhances its performance in tasks such as language modelling, machine translation, image classification and instance segmentation.

    It also reduces computation cost compared to traditional transformers. Researchers from the Chinese University of Hong Kong, the University of Hong Kong and Tencent Inc. propose this innovative approach, called Cached Transformers with a GRC. It is a promising advance in transformer architecture, and further research is expected to explore its full capabilities and potential impact on language and vision processing.

  • NYT Suit against OpenAI and Microsoft

    The basic idea behind copyright is to ensure that creators have an incentive to do new work. It also leaves some space for derivative work, say fair use for criticism, comment, reporting, teaching, scholarship or research, among others, where a small sample of copyrighted work can be duplicated.

    Of late, AI is testing the boundaries of copyright. It generates music, visuals, lyrics and scripts after ingesting the previous work of creative people. On receiving a prompt, AI processes the ingested material and delivers the outcome.

    The New York Times has filed a suit against OpenAI and Microsoft, alleging that both companies have used LLMs that were trained on copyrighted articles from the NYT. This deprives the NYT of the audience that could have reached it; the attempt substitutes the content without permission or payment. According to the NYT, this is not fair use, since these models compete with the NYT and closely mimic it, using its content for training.

    The lawsuit cites examples of articles reproduced from the NYT word for word. This bypasses the subscription paywall, which is critical for its survival.

    Even Microsoft’s Bing search engine generates detailed summaries and excerpts from the articles. That goes far beyond fair use.

    The NYT demands not only compensation and restrictions on use but also the destruction of all tools and models that incorporate its work.

    A question that can be raised is why the NYT did not block access to its content. The answer is that ChatGPT went live in November 2022, and by that time it had already been trained on 175 billion parameters and about 45 terabytes of data from various datasets. By the time this was realized, the model had already ingested the data.

    The NYT points out that the information is public. However, it does not mean it is free to copy.

    Apple, by contrast, is negotiating with publishers and offering them monetary compensation for licensing the content for training its AI tools. If this transaction happens, it will strengthen the NYT case.

    The NYT case was filed three months after the Authors Guild went to court against OpenAI.

    Content created by someone is being used without either acknowledgement or payment. Both the Authors Guild and the NYT have accused the tech companies of freeriding.

    Traditional AI used data for pattern recognition; those models were mostly predictive. Generative AI creates or generates content, taking the technology to another level. To do this, generative AI uses extraordinarily large datasets: ChatGPT-like apps use about 45 terabytes. It is trained on content created by others, its answers closely resemble the original content, and it substitutes the original.

    The issue is not stifling the innovation. It is the use of data without express permission and payment. The principle is pay for what you use.

    This is an untested legal area.

    The vernacular models being developed in India must respect copyrights. The government can frame laws to prevent freeriding. Original content creators must not be taken for a ride.

  • Controlling Superhuman Intelligence

    Mankind faces problems, and innovations too pose challenges. It is paradoxical: advanced mathematics is built on imaginary numbers; black holes have validated many laws and still remain inscrutable. Similarly, AI too poses certain challenges. The basic challenge is how a superintelligent system of the future would be controlled.

    OpenAI laid down its Preparedness Framework in December 2023. It intends to adopt a scientific approach to assessing the catastrophic risks of any advanced AI system. The document describes processes to track, evaluate, forecast and protect against such risks.

    If AGI is realized, it will require oversight.

    The AI sector has developed the concept of superalignment to deal with AGI. It is a holistic approach that goes beyond technical specifications and considers societal impact and ethical issues.

    So far, alignment was restricted to aligning AI systems with human values during the training phase. Superalignment refers to continuous alignment throughout the life cycle of AI systems, including deployment, adaptation and evolution.

    OpenAI suggests that a less potent LLM should serve as a proxy for human oversight of the more potent superintelligent AI.

    OpenAI forecasts superintelligence could be a reality in the next 10 years.

    OpenAI looked at how GPT-2, developed five years ago, could supervise GPT-4, its latest LLM.

    AI could be an existential threat to mankind. That is a doomsday scenario, and it distracts from the short-term risks of present-day AI systems, such as misinformation, bias, copyright violations and expensive compute. Industry should not be fixated upon doomsday scenarios; all such talk is highly hypothetical. The issue is how to deal with the technology that currently exists. Of course, future possibilities cannot be ignored. But as Andrew Ng puts it: ‘Is there any engineering discipline where much attention is on hypothetical problems, rather than actual problems?’

    AI is transformative. It has the potential to do much good.

  • Popular AI Tools

    The year 2023 was a significant year for AI tools, though some of them were released in 2022. Generative AI demonstrated its ability to perform tasks once thought impossible for computer algorithms, from engaging in conversations to creating mind-blowing images.

    OpenAI has become a torchbearer of AI; ChatGPT and DALL-E 2 are being replicated by others. In April 2022, the company released DALL-E 2, its image-generating tool. Later, in November 2022, it released its chatbot ChatGPT, powered by the GPT-3.5 large language model. Within five days of its release, it acquired one million users, and it became the fastest-growing app by garnering 100 million users by January 2023.

    Let us acquaint ourselves with the most popular AI tools.

    ChatGPT: It stands for Chat Generative Pre-Trained Transformer. It is a chatbot which converses with us and generates natural language text in response to a user’s prompt.

    It is a tool for content creation, an alternative search engine, and a coding aid. There is also an OpenAI API which businesses can use for various tasks; a brief sketch of an API call appears at the end of this list.

    Midjourney: It generates images from natural language descriptions called prompts. It has been created by a research lab called Midjourney Inc of the USA. The ‘Imagine’ command is used by the users to generate images based on their imagination. It is of great help to graphic and visual artists.

    Bing: Bing is Microsoft’s AI-powered search engine. The AI-enhanced version was launched along with the Edge browser in February 2023, and the integration has enhanced its functionality. It uses key features of ChatGPT and GPT-3.5. Bing’s search rankings have also been made more relevant.

    Notion AI: It is a writing assistant, useful for editing, summarizing and brainstorming, and it performs grammar checks too. It helps marketers write brand-new pages by extracting insights from databases and existing pages.

    Runway Gen-2: It is a leading tool for video generation. It generates videos based on text prompts and has incorporated other features, such as a motion brush for animating specific parts of an image and camera movement controls. It is useful for short films, music videos, animation, ad design and so on.

    GitHub Copilot: It is a tool from GitHub, developed in collaboration with OpenAI and based on the OpenAI Codex model. It integrates with code editors such as Visual Studio Code. Developers get entire lines or blocks of code suggested as they type. It speeds up development, streamlines repetitive tasks and provides assistance with syntax.

    MusicLM: It is a Google product that brings a musical idea to life with AI. The algorithm creates two versions of a song based on the inputs.

    ElevenLabs: It is a text-to-speech AI tool. It is handy for video creators, podcasters and businesses.

    Framer: It is used for website creation and publication.
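
    As promised under the ChatGPT entry above, here is a minimal sketch of a call to the OpenAI API using the official Python client. The model name and prompt are placeholders, and an API key is assumed to be set in the environment.

    ```python
    # Minimal sketch of a chat completion with the official OpenAI Python client.
    # Assumes `pip install openai` and an OPENAI_API_KEY environment variable;
    # the model name below is only an example.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Draft a two-line product description for a notebook."},
        ],
    )
    print(response.choices[0].message.content)
    ```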

  • Chandamama Kathalu: Telugu SLM

    Chandamama was a popular magazine that told stories to children. Swecha, a non-profit organization, in collaboration with Ozonetel, decided to retell those stories by developing a small language model (SLM) in Telugu. The SLM will be launched in January 2024.

    Let us understand the concept of an SLM. Its genesis lies in a paper by Microsoft research scientists titled TinyStories. SLMs are built on the same methodology as any LLM, but their neural networks are smaller, they have fewer parameters, and they are trained on a smaller corpus of data.

    Ozonetel, which collaborated with Swecha, decided to develop a Telugu SLM, assisted by IIIT Hyderabad. There was a dataset of Telugu stories, some 40,000 pages, preprocessed by some 8,000 students. The idea was to give children access to the kind of stories that used to appear in the Chandamama Kathalu magazine. Chandamama was in print till 2012 and was available in Indian homes from the 1940s onward; it published long-running mythological and magical Indian stories.

    After building the dataset, the team is assessing how the data is to be tokenized. Tokens, as we know, are the basic units of text or code that a language model uses to process and generate language. Tokens could be words, parts of words, characters or other segments of text or code.
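
    One plausible route, offered here only as an assumption and not as Swecha's published pipeline, is to train a sub-word (BPE) tokenizer on the story corpus with the Hugging Face tokenizers library:

    ```python
    # Hypothetical sketch: train a byte-pair-encoding (BPE) tokenizer on the Telugu
    # story corpus with the Hugging Face `tokenizers` library. The file name, vocab
    # size and special tokens are illustrative, not the project's actual choices.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.BpeTrainer(vocab_size=16000,
                                  special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
    tokenizer.train(files=["telugu_stories.txt"], trainer=trainer)  # hypothetical corpus file
    tokenizer.save("telugu_bpe.json")

    print(tokenizer.encode("చందమామ కథలు పిల్లలకు ఇష్టం").ids)  # sample sentence -> token IDs
    ```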

    Soon after releasing the TinyStories paper, Microsoft developed an SLM using 21 million stories. This SLM was capable of generating coherent text, which gave Swecha a lot of hope.

    Swecha also worked on optical character recognition (OCR), the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text. They used an open-source OCR tool and converted 70 per cent of the text; the remaining 30 per cent was typed out by students. They had a corpus of 45,000 stories. Big stories were added too, and they generated half a million lines of text.
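
    For the OCR step, one open-source option (an assumption on our part; the post does not name the tool) is Tesseract with its Telugu language pack, driven from Python via pytesseract:

    ```python
    # Hypothetical sketch: OCR a scanned magazine page with Tesseract's Telugu model.
    # Requires the tesseract binary with `tel` language data installed, plus
    # `pip install pytesseract pillow`. The file name is illustrative.
    from PIL import Image
    import pytesseract

    page = Image.open("chandamama_page_001.png")
    text = pytesseract.image_to_string(page, lang="tel")  # 'tel' = Telugu traineddata
    print(text[:200])
    ```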

    The corpus was uploaded to Hugging Face so that other companies can use this dataset; they wanted to open it up.

    They are now researching, in consultation with IIIT, what kind of tokenization would be needed if an LLM is to be built. It will take four or five months before they have their LLM.