Despite Limitations, Sora Is a Game-Changer

Sora is based on a diffusion model architecture. According to LeCun, Facebook’s latest V-JEPA (Video Joint Embedding Predictive Architecture) is a model that analyzes interactions between objects in videos. It is not generative and makes predictions in representation space. LeCun wants to impress upon us that their self-supervised model is superior to Sora’s diffusion transformer model.

LeCun’s point is that a true world model must go beyond LLMs and diffusion models. Elon Musk also feels that Tesla’s video-generation capabilities are superior to OpenAI’s Sora with respect to predicting accurate physics.

Sora uses a transformer architecture similar to that of the GPT models. OpenAI positions it as a foundation model that will understand and simulate the real world. It may have been trained partly on data generated with Unreal Engine 5. Jim Fan points out that Sora’s learning is encoded in neural parameters through gradient descent over massive amounts of video.

A common criticism is that Sora does not learn physics but merely manipulates pixels in 2D. Fan considers this a reductionist view; it is like saying GPT has not learnt coding but merely learns to sample strings.

By the same logic, transformers merely manipulate sequences of integers (token IDs), and neural networks merely manipulate floating-point numbers. Fan does not agree with such a reductionist view.

Sora may not be able to simulate the physics of a complex scene. It may not grasp cause and effect. It can get confused with spatial details of a prompt.

Fan describes heavy prompting for Sora as babysitting.

Of course, there are limitations, but these do not dim the outstanding video quality from Sora. Sora has the potential to disrupt the video game industry.

CUDA Libraries in GPUs

CUDA libraries facilitate the harnessing of GPUs for various computing tasks. They provide optimized implementations of common algorithms and functions, enabling developers to write high-performance applications that leverage the parallel processing capabilities of GPUs.

The various CUDA libraries are:

cuBLAS provides optimized routines for basic linear algebra operations (matrix multiplication, vector addition and so on).

cuFFT accelerates Fast Fourier Transforms (FFTs). It is crucial for signal processing and image analysis.

cuSPARSE handles sparse matrix computations. It is useful for scientific simulations and ML.

cuDNN is designed for deep learning. It offers high-performance implementations of essential neural network primitives.

In addition to these libraries, there is the CUDA-X suite. It empowers developers to create applications that run faster, unlocking potential in various fields such as AI, graphics and scientific computing.

Linear algebra is covered by cuBLAS, cuSOLVER and cuSPARSE. Deep learning is covered by cuDNN, which accelerates frameworks such as TensorFlow and PyTorch. Data science uses cuFFT and cuRAND for data analysis and ML, while computer vision uses cuFFT and cuBLAS to accelerate image and video processing. Other domains are covered by nvJPEG, NCCL and NPP.
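In practice, these libraries are often reached through higher-level wrappers rather than raw CUDA C. The following is a minimal sketch assuming the CuPy package is installed on a CUDA-capable machine; CuPy’s array operations dispatch to cuBLAS, cuFFT and cuRAND under the hood.

import cupy as cp

# Random matrices generated on the GPU (cuRAND)
a = cp.random.rand(1024, 1024, dtype=cp.float32)
b = cp.random.rand(1024, 1024, dtype=cp.float32)

c = a @ b                  # matrix multiplication routed to cuBLAS
spectrum = cp.fft.fft(a)   # Fast Fourier Transform routed to cuFFT

print(c.shape, spectrum.dtype)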

These libraries bridge the gap between the hardware capabilities of GPUs and the software applications that leverage them, delivering significant performance gains.

OpenAI’s Sora: How Open?

On February 15, OpenAI announced the red teaming of its text-to-video platform called Sora. It can create high-quality videos of up to a minute in length. It has caused concern among stock video producers, startup founders, actors and filmmakers.

OpenAI has not revealed the data used to train Sora. When Facebook released its text-to-video model in 2022, it used 10.7 million Shutterstock videos and 3.3 million YouTube videos for training. Such information enables researchers to check for bias, and creators to know if their work is being exploited.

Some gaming and AI experts speculate that Sora could have been trained on data from the physics engines underlying computer games. This cannot be confirmed, since OpenAI will not disclose the information, just as it did not for its other AI models.

GPT-4 was tested for about six months before its release, and Sora could take a similar amount of time. It could then be released in August 2024, just three months before the elections in the US.

Deepfakes of politicians generated by AI could affect the elections. OpenAI uses safety filters to keep its models away from violence, sexual content and hateful imagery. Still, it is impossible to know whether such AI systems will be misused until they are in the market. Sora is likely to make a bigger impact, judging from the use of ChatGPT by millions of people, because it will put video generation capabilities into the hands of millions.

It is obvious that the secrecy OpenAI maintains about its new products is meant to keep it ahead of its competitors. OpenAI is also enhancing the computing power used to train its models, and this strategy seems to have worked. This is why Sam Altman is seeking trillions of dollars for a chip-making venture.

OpenAI’s stated goal is to attain AI that surpasses our own capabilities. It releases products for the public to try out transformative technology on the way to that goal. That is the open part of OpenAI, while the tactics remain closed.

AI Chip Venture

SoftBank Group founder Masayoshi Son is interested in setting up a chip venture that can compete with Nvidia. The unit will make chips essential for AI.

SoftBank’s capital investment will be $30 billion, and another $70 billion will possibly be raised from institutions in the Middle East. The total project cost would be $100 billion, one of the largest investments in the AI arena since the advent of ChatGPT, dwarfing Microsoft’s contribution of $10 billion to OpenAI.

The project has been code-named Izanagi, after the Japanese god of creation and life, partly because the name contains the letters AGI, which stand for artificial general intelligence.

It is not clear which company or companies will play a role in developing technology that can challenge Nvidia, the leader in high-end AI accelerators. There could be collaboration between SoftBank and Arm Holdings, the chip design firm. Arm’s CEO Rene Haas is a member of the board of directors of SoftBank, is a technology expert, and has been advising Son on the project. They would like to focus on compute, power efficiency and energy to develop AGI.

Son has a history of changing his mind abruptly, and he keeps throwing out ideas and technologies in his meetings. He is, however, unwavering in his enthusiasm for AGI, and is convinced that AGI will be real in 10 years.

AI Challenge for India

India too wants to make its place in this age of generative AI. However, there are two formidable challenges: a lack of hardware accelerators suited to AI requirements, and a shortage of talent.

LLM training is very capital intensive, and the shortage of talent is a big issue here. There are only a few people in the world who really know how to train LLMs, how to curate data, how to run evaluation metrics carefully, and how to ensure that the models generalize. Most of them are US-based and work in a handful of companies: OpenAI, Facebook, Anthropic, DeepMind and Mistral. The knowledge of training a model with GPT-4-level capability is thus concentrated in terms of both individuals and geography.

Computing capacity (compute) is another challenge in building a large AI system. Then there are issues of algorithmic innovations and datasets.

AI accelerators are specialized data processing systems. They accelerate computing applications, especially artificial neural networks, machine vision and ML.

India has to set up hardware accelerators and then train models on them, which is a difficult task. Alternatively, India can think in terms of ‘inference hardware’. Inference is the process of running live data through a trained AI model. AI hardware is coupled with software: Nvidia’s GPUs are coupled with the CUDA libraries needed to make good use of the hardware. This is a big advantage for Nvidia.

India can use open-source models such as Llama from Facebook. It can take these base models, build on top of them and bootstrap off them.

This summarizes the thinking of Aravind Srinivas, the CEO of Perplexity AI, who is making waves in Silicon Valley.

Amazon’s AI Model with Emergent Abilities

Amazon’s new AI model is showing linguistic abilities for which it has not been trained, the type of naturalness associated with human-level AI or artificial general intelligence (AGI). The paper describing it has not yet been peer reviewed.

The model meets all the criteria set up by an expert linguist. It takes the leaps that a human learner naturally takes but that are difficult for a model.

This model is called Big Adaptive Streamable TTS with Emergent Abilities, or BASE TTS. The initial model was trained on 100,000 hours of speech data, of which 90 per cent was in English. The Amazon AGI team also trained two smaller models, one on 1,000 hours of speech and another on 10,000 hours, to find out which of the three models showed the type of naturalness they were looking for. In fact, they were looking for emergent abilities, or abilities the models were not trained on.

The model trained on 10,000 hours of speech scored the highest on the list of emergent abilities, which included the ability to handle punctuation, non-English words and emotions. It blurts out sounds that are natural for human readers, such as ‘shh’, a non-word. It also used internet jargon such as ASAP (as soon as possible). The model was never told to come up with such outputs. It produces emotional or whispered speech and pronounces foreign words correctly. It has not been trained for any of this. This may not strictly constitute AGI, but it is a step on the path to that goal, especially as the model got there without huge training data.

The model’s evaluation and testing should continue in order to know its true capabilities and generalizability. It is too early to infer anthropomorphization; the model’s outputs are based on statistical patterns, not on genuine understanding or sentience.

Sora: Text-to-video

On February 15, 2024, Microsoft-backed OpenAI released a generative AI model that can convert a prompt into a minute-long video. The model is called Sora. It is currently available for red teaming (so as to identify its flaws).

Sora is capable of creating complex scenes with multiple characters. There are accurate details of the subject and background. The software understands how objects exist in the physical world. It can interpret props. The characters created express vibrant emotions.

OpenAI in its blog as well as on X (formerly Twitter) illustrates how it works. The prompt is ‘Beautiful snowy Tokyo city is bustling. The camera moves through a bustling street. It follows several people enjoying snowy weather. They are shopping at nearby stalls. Sakura petals are flying through the wind along the snowflakes.’

The model has a deep understanding of the language, and interprets the prompts. It creates characters expressing emotions. It generates a single video with multiple shots. The characters and visual style persist.

OpenAI has, however, cautioned that the model is far from perfect and may struggle with complex prompts. The company is testing it with feedback from visual artists, designers and filmmakers so as to advance the model. The current model has weaknesses, for instance with the physics of a complex scene or with cause and effect: a person might take a bite out of a cookie, but afterwards the cookie may not have a bite mark.

The model may confuse the spatial details of a prompt and mix up left and right. It may struggle with events that unfold over a period of time, such as following a specific camera trajectory.

Some safety steps may be necessary. Classifiers review the frames of every generated video to ensure compliance with the usage policy, so that the system does not generate misinformation or hateful content.

Generative AI has made text-to-video generation significantly better over the past few years. This is an area that lagged behind. It has its unique set of challenges.

Apart from OpenAI, other companies too have ventured into this field. Google’s Lumiere can create five-second videos on a given prompt. Runway and Pika too have good text-to-video models.

The video generation software follows OpenAI’s ChatGPT, which was released in late 2022 and created a buzz around generative AI’s content generation capability.

Facebook strengthened its image generation model Emu in 2023 to add AI-based features that can edit and generate videos from text prompts. Facebook too is trying to compete with Google, OpenAI and Amazon in the rapidly transforming generative AI landscape.

Text Embeddings

While we human beings think in words or text, computers think in numbers, or vectors of numbers. Machine-understandable text was first formulated as ASCII, an approach that encoded characters but not the meaning of words. Search engines relied on keyword search, looking for specific words or N-grams in documents.

Later, embeddings emerged. Embeddings can represent words, sentences, or even images. They are vectors of numbers that do capture meaning. Thus, they can be used in semantic search, and they can work with documents in different languages.

Text representation is done through embeddings. The most fundamental approach to converting text into vectors is the bag of words. The text is split into words or tokens, which are then reduced to their base forms (say, ‘running’ is converted to ‘run’). A vocabulary of base forms is built for all the words, and the frequency of each is counted to create a vector, as in the sketch below.
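A minimal sketch of a bag-of-words representation, assuming scikit-learn is installed; the two sentences are illustrative examples only. Note that CountVectorizer only lowercases and tokenizes, so reducing words to their base forms would need a separate lemmatization step.

from sklearn.feature_extraction.text import CountVectorizer

texts = ["The girl is studying physics", "The woman is studying AI"]

# Build the vocabulary and count word frequencies per text
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())   # the vocabulary
print(vectors.toarray())                    # frequency vector for each text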

The vector takes into account not only the words that appear in a given text but the whole vocabulary, with entries for words such as ‘I’, ‘you’ and ‘study’. Because only exact word overlaps count, sentences such as ‘The girl is studying physics’ and ‘The woman is studying AI’ do not come out as close to each other. Bag of words is improved by TF-IDF (Term Frequency-Inverse Document Frequency), which is a multiplication of two metrics:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

Term Frequency shows how frequent the word is in a document:

TF(t, d) = (number of times t appears in document d) / (number of terms in document d)

Inverse Document Frequency denotes how much information the word provides: articles and certain pronouns do not give any additional information about the topic, while words such as ‘AI’ or ‘LLM’ define the tenor. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the word:

IDF(t, D) = log((total number of documents in corpus D) / (number of documents containing term t))

The closer the IDF is to 0, the more common the word is and the less information it provides.

The result is vectors in which common words carry lower weights and the rarer words in the documents carry higher weights. This improves the results, but it still cannot capture semantic meaning.

This approach also produces sparse vectors. The length of a vector equals the vocabulary size: there are roughly 470,000 unique words in English, so the vectors are huge. A sentence will rarely contain more than 50 unique words, which means that 99.99 per cent of the values in a vector will be 0 and encode no information. A small sketch follows.
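A minimal TF-IDF sketch, again assuming scikit-learn is installed; scikit-learn uses a smoothed variant of the IDF formula given above, but the idea is the same.

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["The girl is studying physics", "The woman is studying AI"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)   # a sparse matrix of TF-IDF weights

print(tfidf.shape)       # (2, vocabulary size); real corpora give far wider, sparser matrices
print(tfidf.toarray())   # words shared by both texts get lower weights than rare ones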

To overcome these limitations, researchers started looking at dense vector representation.

Word2Vec

Google’s 2013 paper ‘Efficient Estimation of Word Representations in Vector Space’ by Mikolov et al. describes Word2Vec, one of the most well-known approaches to dense representation.

The paper presents two different approaches: Continuous Bag of Words (CBOW), where a word is predicted from its surrounding words, and Skip-gram, where the surrounding context words are predicted from the word itself.

The approach trains two components, an encoder and a decoder. In the skip-gram model, if we pass ‘Diwali’ to the encoder, it produces an embedding from which the decoder predicts context words such as ‘happy’, ‘to’ and ‘you’.

Input -> Encoder -> Embedding -> Decoder -> Output
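A minimal sketch of training skip-gram embeddings with the gensim library (assuming it is installed); the tiny corpus below is purely illustrative.

from gensim.models import Word2Vec

corpus = [
    ["happy", "diwali", "to", "you"],
    ["happy", "new", "year", "to", "you"],
]

# sg=1 selects the skip-gram objective; vector_size is the embedding dimension
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["diwali"][:5])            # first few dimensions of one embedding
print(model.wv.most_similar("happy"))    # nearest neighbours within this corpus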

It does capture meaning, since it is trained on the context of words. However, it ignores morphology, the information contained in word parts (for example, that ‘-less’ signals the lack of something). This limitation was addressed later by subword-based models such as FastText.

Though Word2Vec is well suited to working with words, we would often like to encode whole sentences. We shall now examine transformers.

Transformers and Sentence Embeddings

The ‘Attention Is All You Need’ (2017) paper by Vaswani et al. led to transformers. They produce information-rich dense vectors, and became the principal technology behind modern large language models (LLMs).

Transformers are pre-trained; the core model is then fine-tuned for specific purposes.

BERT, or Bidirectional Encoder Representations from Transformers, from Google AI is one such early model. To begin with, it operated at the token level, just like Word2Vec, and a sentence embedding was obtained by averaging all the token embeddings. This did not perform well.

In 2019, Sentence-BERT was released. It was good at semantic textual similarity and enabled the calculation of sentence embeddings, along the lines of the sketch below.
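A minimal sketch using the sentence-transformers library (assuming it is installed); the model name ‘all-MiniLM-L6-v2’ is an illustrative choice of checkpoint, not necessarily the original Sentence-BERT model.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The girl is studying physics", "The woman is studying AI"]
embeddings = model.encode(sentences)   # one dense vector per sentence

print(embeddings.shape)                # e.g. (2, 384) for this checkpoint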

OpenAI’s models are text-embedding-3-small and text-embedding-3-large. They are among the best performing embedding models.
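A minimal sketch of calling the OpenAI embeddings endpoint, assuming the openai Python package (v1.x) is installed and an OPENAI_API_KEY is set in the environment.

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The girl is studying physics", "The woman is studying AI"],
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))   # 2 vectors of 1536 dimensions each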

Distance Between Vectors

Embeddings being vectors, if we want to understand how close two sentences are to each other, we can calculate the distance between vectors. A smaller distance indicates closer semantic meaning.

The metrics used to measure distance are Euclidean (L2), Manhattan (L1), Dot product and Cosine distance. For NLP tasks, the best practice is to use cosine similarity.

As OpenAI embeddings are already normalized to unit length, the dot product and cosine similarity are equal here.
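A minimal sketch of cosine similarity with NumPy; the three-dimensional vectors are toy stand-ins for real embeddings.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.1, 0.8, 0.3])
b = np.array([0.2, 0.7, 0.4])

print(cosine_similarity(a, b))
# For unit-norm vectors, such as OpenAI embeddings, this reduces to the plain dot product.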

Cosine similarity is less affected by the curse of dimensionality. The higher the dimension, the narrower the distribution of the distances between vectors.

The most basic dimensionality reduction technique is PCA, or Principal Component Analysis. Since PCA is a linear algorithm, t-SNE is often used instead to separate clusters, as it can capture non-linearity. A sketch of both is given below.
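A minimal sketch of reducing embeddings to two dimensions for visualization, assuming scikit-learn is installed; the random matrix stands in for real sentence embeddings.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.rand(100, 384)   # placeholder for 100 real sentence embeddings

pca_2d = PCA(n_components=2).fit_transform(embeddings)                    # linear projection
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)   # non-linear embedding

print(pca_2d.shape, tsne_2d.shape)   # (100, 2) each, ready for a scatter plot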

Vectors are used for clustering, classification, finding anomalies and RAG.

AI Dangers

AI is advancing faster than the world expects. Though one cannot go overboard and imagine killer robots roaming the streets, one can have concerns about misalignment between AI and society. Sam Altman too is concerned about such misalignment, which can be intentional or unintentional. It is the right time for a debate, and time to consider an international agency on the lines of the International Atomic Energy Agency.

Sam Altman has been the public face of generative AI’s rapid commercialization. He feels that the technology we have at present is nascent, like an early feature phone with a black-and-white screen. In the years to come, the technology will become much better than it is today; a decade hence, it will be pretty remarkable.

Chip Factories

Sam Altman is toying with the idea of setting up chip factories to power AI. He has already met several officials from the UAE to pitch his plan. The capital outlay expected to realize this dream touches $5 trillion to $7 trillion. The US economy is around $23 trillion, so $7 trillion is a lot of money; it is far more than the US spent on building its highway network.

Altman is hoping to partner with investors and other chip makers. There would be foundries that could be utilized by existing chip makers, and OpenAI and other companies would be their customers. The capital raised would be a mix of debt and equity. The talks are at a preliminary stage, and there is no clear idea yet about the potential investors.

President Biden recently signed the CHIPS Act, which earmarks $52 billion in subsidies to build chip factories in the USA.

Altman’s plan is far more ambitious. There are only a few things in the world that cost trillions of dollars.