On February 15, 2024, Microsoft-backed OpenAI unveiled Sora, a generative AI model that can convert a text prompt into a minute-long video. The model is currently available to red teamers, who will probe it to identify flaws.
Sora is capable of creating complex scenes with multiple characters, rendering accurate details of both subject and background. The software understands how objects exist in the physical world, can interpret props, and generates characters that express vibrant emotions.
In its blog and on X (formerly Twitter), OpenAI illustrates how the model works with sample prompts such as: ‘Beautiful snowy Tokyo city is bustling. The camera moves through a bustling street. It follows several people enjoying snowy weather. They are shopping at nearby stalls. Sakura petals are flying through the wind along the snowflakes.’
The model has a deep understanding of language, interpreting prompts to create characters that express emotions. It can generate a single video containing multiple shots in which the characters and visual style persist throughout.
OpenAI has, however, cautioned that the model is far from perfect and may struggle with complex prompts. The company is testing it with feedback from visual artists, designers and filmmakers to advance the model. The current model has weaknesses, particularly with the physics of a complex scene and instances of cause and effect: a person might take a bite out of a cookie, yet the cookie may later show no bite mark.
The model may also confuse the spatial details of a prompt, mixing up left and right, and may struggle with events that unfold over a period of time, such as following a specific camera trajectory.
OpenAI is also taking safety steps: classifiers review the frames of every generated video to ensure compliance with the usage policy, so that the system does not generate misinformation or hateful content.
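OpenAI has not published the details of this moderation pipeline, but the basic idea of a per-frame policy check can be sketched in a few lines. The sketch below is purely hypothetical: the `classify_frame` stub stands in for a real image-classification model, and the "odd byte sum" rule is an arbitrary placeholder, not an actual policy check.

```python
from dataclasses import dataclass

# Hypothetical labels a per-frame safety classifier might emit.
SAFE = "safe"
FLAGGED = "flagged"

@dataclass
class Frame:
    index: int
    pixels: bytes  # placeholder for decoded image data

def classify_frame(frame: Frame) -> str:
    """Stand-in for a real image classifier: here, frames whose
    byte sum is odd are arbitrarily marked as policy violations."""
    return FLAGGED if sum(frame.pixels) % 2 else SAFE

def review_video(frames: list[Frame]) -> bool:
    """Approve the video only if every single frame passes the check,
    mirroring the 'review every frame' approach described above."""
    return all(classify_frame(f) == SAFE for f in frames)
```

Checking every frame rather than a sample means a single non-compliant frame is enough to block the whole video, at the cost of running the classifier once per frame.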
Generative AI has made text-to-video generation significantly better over the past few years, an area that had lagged behind text and image generation and poses its own unique set of challenges.
Apart from OpenAI, other companies have also ventured into this field: Google’s Lumiere can create five-second videos from a given prompt, while Runway and Pika also offer capable text-to-video models.
The video-generation software follows OpenAI’s ChatGPT, which was released in late 2022 and created a buzz around generative AI with its content-generation capability.
In 2023, Facebook strengthened its image generation model Emu with AI-based features that can edit and generate videos from text prompts, as it too tries to compete with Google, OpenAI and Amazon in the rapidly transforming generative AI landscape.