Blog

  • Sora: Text-to-video

    On February 15, 2024, Microsoft-backed OpenAI released a generative AI model that can convert a text prompt into a minute-long video. The model is called Sora. It is currently available for red teaming (so as to identify its flaws).

    Sora is capable of creating complex scenes with multiple characters, with accurate details of the subject and background. The software understands how objects exist in the physical world and can interpret props. The characters it creates express vibrant emotions.

    OpenAI illustrates how it works on its blog as well as on X (formerly Twitter). The prompt is ‘Beautiful snowy Tokyo city is bustling. The camera moves through a bustling street. It follows several people enjoying snowy weather. They are shopping at nearby stalls. Sakura petals are flying through the wind along with the snowflakes.’

    The model has a deep understanding of language and interprets prompts faithfully. It creates characters expressing emotions, and it can generate a single video with multiple shots in which the characters and visual style persist.

    OpenAI has, however, cautioned that the model is far from perfect and may struggle with complex prompts. The company is testing it with feedback from visual artists, designers and filmmakers so as to advance the model. The current model has weaknesses, say in the physics of a complex scene or in instances of cause and effect: a person might take a bite out of a cookie, but afterwards the cookie may not have a bite mark.

    The model may also confuse the spatial details of a prompt, mixing up left and right. It may struggle with events taking place over a period of time, say following a specific camera trajectory.

    Some safety steps may be necessary. There are classifiers that review the frames of every video generated to ensure that they comply with the usage policy. The system should not generate misinformation or hateful content.

    Generative AI has made text-to-video generation significantly better over the past few years. This is an area that had lagged behind, as it has its own unique set of challenges.

    Apart from OpenAI, other companies too have ventured into this field. Google’s Lumiere can create five-second videos from a given prompt. Runway and Pika too have good text-to-video models.

    The video generation software follows OpenAI’s ChatGPT, which was released in late 2022 and created a buzz around generative AI with its content generation capability.

    Meta (Facebook) strengthened its image generation model Emu in 2023 to add AI-based features that can edit and generate videos from text prompts. Meta too is trying to compete with Google, OpenAI and Amazon in the rapidly transforming generative AI landscape.

  • Text Embeddings

    While we human beings think in words or text, computers think in numbers, or vectors of numbers. Text was first made machine-understandable through encodings such as ASCII, but this approach lacked the meaning of words. In search engines, there was keyword search: specific words, or N-grams, were searched for in the documents.

    Later, embeddings emerged. Embeddings are vectors of numbers that can represent words, sentences, or even images. They do capture meaning. Thus, they can be used in semantic search, and they can work with documents in different languages.

    Text representation is done through embeddings. The most fundamental approach to converting texts into vectors is the bag of words. The text is split into words or tokens, which are then reduced to their base forms (say, ‘running’ is converted to ‘run’). A list of base forms is compiled for all the words, and we then calculate their frequencies to create a vector.
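    A minimal sketch of the idea in Python is shown below; the toy stemming rule and the small vocabulary are illustrative assumptions, not something from the original text.

    ```python
    from collections import Counter

    def bag_of_words(text, vocabulary):
        # Tokenize, then crudely reduce words to a base form ("running" -> "run").
        tokens = [w.lower().strip(".,") for w in text.split()]
        bases = [t[:-4] if t.endswith("ning") else t for t in tokens]  # toy stemmer
        counts = Counter(bases)
        # One slot per vocabulary entry, holding that word's frequency in the text.
        return [counts.get(word, 0) for word in vocabulary]

    vocab = ["the", "girl", "run", "is", "fast", "physics"]
    print(bag_of_words("The girl is running fast", vocab))  # [1, 1, 1, 1, 1, 0]
    ```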

    The vector conversion takes into account not only the words that appear in the text, but the whole vocabulary. And because each word, such as ‘I’, ‘you’ or ‘study’, gets its own independent slot, sentences like ‘The girl is studying physics’ and ‘The woman is studying AI’ do not come out close to each other. Bag of words is improved by TF-IDF (Term Frequency-Inverse Document Frequency), which is a multiplication of two metrics.

    TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

    Term Frequency shows the frequency of the word in a document.

    TF(t, d) = (number of times t appears in document d) / (number of terms in document d)

    Inverse Document Frequency denotes how much information the word provides: articles and certain pronouns do not give any additional information about the topic, while words such as AI or LLM define the tenor. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the word.

    IDF(t, D) = log((total number of documents in corpus D) / (number of documents containing term t))

    The closer the IDF is to 0, the more common the word is and the less information it provides.
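    The two formulas above can be sketched directly in Python; the three-document corpus is a made-up illustration.

    ```python
    import math

    def tf(term, doc):
        # TF(t, d) = (times t appears in d) / (number of terms in d)
        words = doc.lower().split()
        return words.count(term) / len(words)

    def idf(term, corpus):
        # IDF(t, D) = log(|D| / number of documents containing t)
        containing = sum(1 for doc in corpus if term in doc.lower().split())
        return math.log(len(corpus) / containing)

    def tf_idf(term, doc, corpus):
        return tf(term, doc) * idf(term, corpus)

    corpus = ["the girl is studying physics",
              "the woman is studying ai",
              "ai and llm define the tenor"]

    print(round(tf_idf("ai", corpus[1], corpus), 3))   # rarer word, non-zero weight
    print(round(tf_idf("the", corpus[0], corpus), 3))  # in every document: IDF = log(1) = 0
    ```

    Note that ‘the’ appears in every document, so its IDF is log(1) = 0 and its TF-IDF weight vanishes, which is exactly the behaviour described above.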

    The result is vectors in which common words carry lower weights and rare words carry higher weights. This improves the results. Still, this cannot capture the semantic meaning.

    This approach produces sparse vectors. The length of each vector is equal to the vocabulary size of the corpus. There are about 470k unique words in English, which makes for huge vectors. A sentence will not have more than 50 unique words, so 99.99 per cent of the values in a vector will be 0, encoding no information.

    To overcome these limitations, researchers started looking at dense vector representation.

    Word2Vec

    Google’s 2013 paper Efficient Estimation of Word Representations in Vector Space by Mikolov et al. is one of the most well-known approaches to dense representation.

    In this paper, there are two different approaches: Continuous Bag of Words (CBOW), where a word is predicted from its surrounding words, and Skip-gram, where the surrounding context is predicted from the word itself.

    The Mikolov approach trains two models, an encoder and a decoder. In the skip-gram setting, if we pass ‘Diwali’ to the encoder, it produces an embedding from which the decoder predicts context words such as ‘happy’, ‘to’ and ‘you’.

    Input → Encoder → Embedding → Decoder → Output
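    A minimal skip-gram sketch using the gensim library (assuming gensim is installed; the toy corpus and parameter values are illustrative, not from the original post):

    ```python
    from gensim.models import Word2Vec

    # Toy corpus: each sentence is a list of tokens.
    sentences = [["happy", "diwali", "to", "you"],
                 ["happy", "new", "year", "to", "you"],
                 ["wishing", "you", "a", "happy", "diwali"]]

    # sg=1 selects skip-gram (predict the context from the word);
    # sg=0 would select CBOW (predict the word from its context).
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    vector = model.wv["diwali"]                      # dense embedding for one word
    print(model.wv.most_similar("diwali", topn=3))   # nearest words by cosine
    ```

    With such a tiny corpus the nearest neighbours are noise; real training needs millions of sentences.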

    It captures meaning, since it is trained on the context of words. However, it ignores morphology, the information carried by word parts (say, the suffix ‘-less’ indicating the lack of something). This limitation was later addressed by subword-based models such as FastText.

    Though Word2Vec is well suited to working with words, we would often like to encode whole sentences. We shall now examine transformers.

    Transformers and Sentence Embeddings

    The ‘Attention Is All You Need’ (2017) paper by Vaswani et al. led to transformers. With them emerged information-rich dense vectors, and transformers became the principal technology behind modern Large Language Models (LLMs).

    Transformers are pre-trained: a core transformer model is taken and fine-tuned for specific purposes.

    BERT, or Bidirectional Encoder Representations from Transformers, from Google AI is one such early model. To begin with, it operated at the token level, just like Word2Vec; a sentence embedding was taken as the average of all token embeddings. It was not an efficient performer.

    In 2019, Sentence-BERT was released. It was good at semantic textual similarity and enabled the calculation of sentence embeddings.
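    A minimal sketch with the sentence-transformers library (assuming it is installed; all-MiniLM-L6-v2 is one commonly used public checkpoint, not one named in this post):

    ```python
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = ["The girl is studying physics", "The woman is studying AI"]
    embeddings = model.encode(sentences)  # one dense vector per sentence

    # Unlike bag of words, these vectors place the two sentences close together.
    print(util.cos_sim(embeddings[0], embeddings[1]))
    ```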

    OpenAI’s current embedding models are text-embedding-3-small and text-embedding-3-large. They are among the best performing embedding models.
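    A minimal sketch of calling the OpenAI embeddings endpoint (assuming the openai Python package is installed and an API key is set in the environment):

    ```python
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.embeddings.create(
        model="text-embedding-3-small",
        input="Beautiful snowy Tokyo city is bustling.",
    )
    embedding = response.data[0].embedding  # a plain list of floats
    print(len(embedding))                   # 1536 for text-embedding-3-small
    ```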

    Distance Between Vectors

    Embeddings being vectors, if we want to understand how close two sentences are to each other, we can calculate the distance between vectors. A smaller distance indicates closer semantic meaning.

    The metrics used to measure distance are Euclidean (L2), Manhattan (L1), dot product and cosine distance. For NLP tasks, the best practice is to use cosine similarity.

    As OpenAI embeddings are already normalized, dot product and cosine similarity are equal here.
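    The measures mentioned above can be sketched with NumPy; the two small vectors are arbitrary examples:

    ```python
    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 2.0, 1.0])

    euclidean = np.linalg.norm(a - b)                        # L2 distance
    manhattan = np.abs(a - b).sum()                          # L1 distance
    dot = np.dot(a, b)                                       # dot product
    cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity

    print(euclidean, manhattan, dot, cosine)
    # For unit-norm vectors (such as OpenAI embeddings), dot product == cosine.
    ```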

    Cosine is also less affected by the curse of dimensionality: the higher the dimension, the narrower the distribution of distances between vectors.

    The most basic dimensionality reduction technique is PCA, or Principal Component Analysis. Since PCA is a linear algorithm, t-SNE, a non-linear technique, is used to separate clusters that PCA cannot.
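    A minimal sketch with scikit-learn; the random matrix stands in for real embeddings:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    embeddings = np.random.rand(100, 384)  # stand-in for 100 sentence embeddings

    pca_2d = PCA(n_components=2).fit_transform(embeddings)                   # linear
    tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)  # non-linear

    print(pca_2d.shape, tsne_2d.shape)  # (100, 2) (100, 2)
    ```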

    Vectors are used for clustering, classification, finding anomalies, and RAG (retrieval-augmented generation).
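    As a closing sketch, here is a toy semantic-search step of the kind RAG relies on; the embed() function is a hypothetical stand-in for any of the embedding models above:

    ```python
    import numpy as np

    def embed(text):
        # Hypothetical placeholder: a real system would call one of the
        # embedding models above (Sentence-BERT, text-embedding-3-small, ...).
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.random(8)
        return v / np.linalg.norm(v)

    docs = ["Sora generates video from text",
            "TF-IDF weights rare words higher"]
    doc_vecs = np.stack([embed(d) for d in docs])

    query_vec = embed("text-to-video models")
    scores = doc_vecs @ query_vec             # dot == cosine for unit-norm vectors
    print(docs[int(np.argmax(scores))])       # the retrieved document
    ```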

  • AI Dangers

    AI is advancing faster than the world expects. Though one cannot go overboard and imagine killer robots roaming the streets, one can have concerns about misalignments between AI and society. Sam Altman too is concerned about such misalignment, which can be intentional or unintentional. It is the right time for a debate. It is time to consider an international agency on the lines of the International Atomic Energy Agency.

    Sam Altman has been the public face of generative AI’s rapid commercialization. He feels that the current technology is nascent: it is like a primitive feature phone with a black-and-white screen. In the years to come, the technology will become better than it is at present. A decade hence, it will be pretty remarkable.

  • Chip Factories

    Sam Altman toys with the idea of setting up chip factories to power AI. He has already met several officials from the UAE to pitch his plan to build chip factories. The capital outlay expected to realize this dream touches $5 trillion to $7 trillion. As we know, the US economy is $23 trillion, and hence $7 trillion is a lot of money. It is way more than the US spent on building its highway network.

    Altman is hoping to partner with investors and other chip makers. There would be foundries that could be utilized by the existing chip makers, and OpenAI and other companies could be their customers. The capital raised would be a mix of debt and equity. The talks are in the preliminary stages, and there is no clear idea yet about the potential investors.

    President Biden recently signed the CHIPS Act, which earmarks $52 billion in subsidies to build factories in the USA.

    Altman’s plan is more ambitious. There are just a few things in the world that cost trillions.

  • Direct-to-mobile Technology

    In direct-to-mobile (D2M) technology, television content is streamed directly to mobile phones without an internet connection. The challenge is to integrate support for digital terrestrial TV on mobile devices (DTT2M) into smartphones.

    There should be a nationwide network providing indoor coverage for quality service.

    The implementation means higher costs for smartphone makers; the handsets will be costlier.

    There should be a standard compatible with existing mobile handsets. Smartphones can then be designed and manufactured ready to receive broadcast signals directly.

    This technology seeks the convergence of broadband and broadcasting services. It enables users to receive terrestrial digital TV directly on their handsets, similar in concept to FM radio on handsets.

    The service can be used to spread citizen-centric information, say emergency alerts, public safety messages, social services etc.

    The technology is in its early stages.

  • Online Gaming

    Once, there was a distinction between online gaming and gambling platforms. Of late, the two have converged, and a substantial portion of India’s online gaming involves gambling. There is a campaign to call these games of skill, whereas gambling is a game of chance; these days every game is being played for stakes. Self-regulation has been proposed but there has not been any progress here. The crux of the matter is what is going to be regulated and how.

    There could be regulation of gambling, sporting, or entertainment companies. Or will it just be a regulation of internet intermediaries? Internet companies tend to avoid regulation, e.g. e-commerce, social media, communication apps, super apps. The first step of regulation could be licensing these companies.

    There are unregulated pool or purse collections, and GST has been levied on them.

    China has been strict with such companies.

    The medical profession so far acknowledged only substance-abuse addiction. It has now acknowledged non-substance behavioural addictions, including online gaming and gambling. South Korea has identified online gaming as its largest health problem.

  • AI in Coding

    Many have been laid off at tech companies. At the same time, tech businesses have made significant investments in AI. The model that is emerging is one of shedding headcount and hiring talent with AI skills. It is estimated that in the next five years, 30 per cent of entry-level coding jobs will be replaced by AI.

    AI and ML have changed coding, which has become speedier. Software development uses AI in development, testing and maintenance. Programmers should use AI tools to their advantage: a programmer can complement his own skills and overcome his weaknesses by leveraging AI. AI can enhance the capabilities of programmers.

    AI-generated content is not always accurate, and hence manual monitoring to check the code is essential. There should be sufficient safeguards before proprietary code is shared with generative AI tools.

    Human ingenuity is still required in system design and software architecture. It is necessary to know where to use AI and where human intervention is needed.

  • AI Hopes

    Investor expectations from AI have reached sky-high levels, and they are hard to meet. Microsoft, Google and AMD are working hard to infuse AI into their products; still, their stocks decline in value.

    AMD expects its new processors to capture the market. Microsoft declares that users are taking to its AI assistants. Google expects AI to enhance its search and cloud computing services.

    Investors took these shares to dizzy heights since they expected AI to be magical. The results do not satisfy them.

    Microsoft, in fact, is doing very well. AI products help drive the adoption of its data center services. Revenue from Azure cloud has shown a 30 per cent jump.

    Wall Street wants clarity about what AI will contribute to financial performance. Investors want companies to quantify the AI potential over the next couple of years.

  • AI: Neural Reasoning Engine

    While dealing with computers for the last 70 years, the dream, according to Satya Nadella, has always been about making computers that we can understand versus making computers that can understand us. He was addressing a crowd of developers in Bangalore on Thursday, February 8, 2024. The theme of the event was the democratization of artificial intelligence (AI), so that AI becomes an accessible and transformative power for society.

    Developers have been inclined to digitize people, places and things. AI changes this. There are data models that we can query, and they act as neural reasoning engines: they can find patterns, and that gives us predictive power.

    There is a platform shift, and that is the key to driving economic growth. It can contribute 10 per cent to India’s GDP, empowering people at large.

    India is second only to the US in terms of the total number of developers on GitHub, and by 2027, India may overtake the US to emerge as the hub with the greatest number of developers. India is also next to the US in generative AI projects on GitHub. India has achieved creditable momentum here.

    There are AI transformation opportunities in India. AI can improve employee experience, customer engagement and business processes, and can bend the curve on innovation.

    Microsoft has 60 Azure regions and 800 data centers. There can be breakthroughs through AI in science. Microsoft has a global AI footprint and believes in equitable AI.

  • AI and India

    Microsoft’s CEO Satya Nadella, in a talk with Indian CEOs in February 2024, said that for the first time India’s fast growth in AI is bridging the gap between India and the rest of the world.

    Since Nadella joined the industry, he has seen four big shifts in technology: PCs and client-server, the web, mobile, and cloud. AI is the first revolution where India is keeping pace with the world; there is no gap. India is not just talking about AI, it is scaling AI.

    He calls this ADVANTA(I)GE INDIA. He calls the AI engineers in India second only to the US. A technology is real if it can have a real impact on the growth of an economy. In a growing economy like India’s, a significant percentage of growth is going to be driven by AI. According to Nadella, AI’s contribution to GDP would be worth $500 billion by 2025, when India’s GDP is expected to reach $5 trillion.

    It is necessary for India and the US to cooperate on the norms and regulations, instead of fracturing them.

    If AI diffuses faster, it could be an ‘equal distributor of growth.’

    Microsoft intends to train 5 lakh students and job seekers in AI skills, and will provide AI technical skills to 1 lakh young women. Microsoft will help equip 2 million Indians with skills in AI by 2025.