Sora is built on a diffusion model architecture. According to Yann LeCun, Meta’s latest V-JEPA (Video Joint Embedding Predictive Architecture) is a model that analyzes interactions between objects in videos. It is not generative; it makes its predictions in representation space rather than in pixel space. LeCun argues that this self-supervised approach is superior to Sora’s diffusion transformer.
In this view, a world model must go beyond LLMs or diffusion models (DMs). Elon Musk has likewise claimed that Tesla’s video-generation capabilities are superior to OpenAI’s Sora when it comes to predicting accurate physics.
Sora uses a transformer architecture similar to that of the GPT models. OpenAI presents it as a foundation for models that can understand and simulate the real world. It may have been trained in part on data generated with Unreal Engine 5. Jim Fan points out that Sora learns in its neural parameters through gradient descent on massive amounts of video.
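To make "learning in neural parameters through gradient descent" concrete, here is a minimal sketch in plain Python. It is only an illustration, not Sora's training procedure: the one-parameter model, the function name `fit_slope`, the toy data, and the learning rate are all assumptions chosen for brevity.

```python
def fit_slope(xs, ys, lr=0.01, steps=500):
    """Fit y = w * x by gradient descent on the mean squared error.

    Hypothetical toy example: one parameter stands in for the
    billions of parameters a real video model would adjust.
    """
    w = 0.0  # start from an uninformed guess
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # step against the gradient
    return w


# Toy "observations" generated by the rule y = 3x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

w = fit_slope(xs, ys)
print(round(w, 3))  # should be very close to 3.0
```

The rule y = 3x is never written into the model; it is recovered only from examples, which is the sense in which Fan says Sora's knowledge lives in its parameters.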
Some critics object that Sora does not learn physics but merely manipulates pixels in 2D. Fan considers this a reductionist view. It is like saying that GPT has not learned to code but only samples strings.
By the same logic, all transformers do is manipulate sequences of integers (token IDs), and all neural networks do is manipulate floating-point numbers. Fan does not accept such a reductionist argument.
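The "sequence of integers" and "floating numbers" in this argument can be shown with a small sketch. Everything here is a hypothetical toy: the whitespace tokenizer, the random embedding table, and the helper names `build_vocab`, `encode`, and `make_embeddings` are assumptions for illustration and are not how GPT or Sora tokenize their inputs.

```python
import random


def build_vocab(text):
    """Assign an integer ID to each distinct word (toy tokenizer)."""
    words = sorted(set(text.split()))
    return {word: i for i, word in enumerate(words)}


def encode(text, vocab):
    """Turn text into the sequence of token IDs a transformer would see."""
    return [vocab[w] for w in text.split()]


def make_embeddings(vocab_size, dim, seed=0):
    """Create a random table of floats, one row per token ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


text = "the ball falls and the ball bounces"
vocab = build_vocab(text)
token_ids = encode(text, vocab)          # integers
table = make_embeddings(len(vocab), dim=4)
vectors = [table[i] for i in token_ids]  # floating-point numbers

print(token_ids)
print(vectors[0])
```

At this level the description is accurate but uninformative: the integers and floats say nothing about what the model has learned, which is the point of Fan's objection.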
Sora may not be able to simulate the physics of a complex scene accurately. It may not grasp specific instances of cause and effect. It can also confuse the spatial details of a prompt.
Fan describes heavy prompting for Sora as babysitting.
Of course, there are limitations, but they do not diminish Sora’s outstanding video quality. Sora has the potential to disrupt the video game industry.