After DALL-E 3 and GPT-4, OpenAI has stepped into the realm of AI video generation with Sora. Let’s take a look at what we know about the platform so far and what it has to offer.
What is Sora?
Sora is a new generative text-to-video model that can create videos up to a minute long from a textual prompt. It converts the text of a prompt into complex, detailed visual scenes, drawing on its understanding of language and of how objects exist and move in the physical world. The model can also give its visual characters expressive emotions.
Source: OpenAI
The above video was generated by using the following textual prompt on Sora:
Several giant wooly mammoths approach, treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds; and a sun high in the distance creates a warm glow, The low camera view is stunning, capturing the large furry mammal with beautiful photography, depth of field.
Although Sora is primarily a text-to-video model, OpenAI highlights that it can work with a diverse range of prompts, including existing images and videos. This flexibility lets the model perform a variety of image and video editing tasks: it can create perfectly looping videos, extend videos forward or backward in time, and animate static images.
The model also supports image generation and interpolation between two different videos, producing smooth transitions between otherwise unrelated scenes.
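To build some intuition for what interpolating between two clips might involve, here is a minimal, purely illustrative Python sketch. It assumes both videos have already been encoded into latent tensors of the same shape and simply blends the two latents with a time-varying weight; the tensor shapes and the blending approach are assumptions for illustration, not OpenAI's published method.

```python
import numpy as np

def interpolate_latents(latent_a: np.ndarray,
                        latent_b: np.ndarray,
                        num_steps: int) -> list:
    """Linearly blend two latent video representations.

    latent_a, latent_b: latent tensors of identical shape, e.g.
    (time, height, width, channels), produced by a hypothetical video
    encoder. Returns a list of blended latents that a decoder could
    turn back into frames for a smooth scene-to-scene transition.
    """
    blended = []
    for step in range(num_steps):
        w = step / (num_steps - 1)  # weight ramps from 0.0 to 1.0
        blended.append((1.0 - w) * latent_a + w * latent_b)
    return blended

# Random tensors standing in for two encoded videos.
latent_a = np.random.randn(16, 32, 32, 4)  # 16 latent "frames"
latent_b = np.random.randn(16, 32, 32, 4)
transition = interpolate_latents(latent_a, latent_b, num_steps=8)
print(len(transition), transition[0].shape)
```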
Explore AI tools for art generation in our detailed guide here
How to Use Sora AI
Getting started with Sora AI is easy and intuitive, even if you’re new to generative models. This powerful tool allows you to transform your ideas into captivating videos with just a few simple steps. Whether you’re looking to create a video from scratch using text, enhance existing visuals, or experiment with creative animations, Sora AI has you covered. Here’s how you can begin:
- Access the Platform: Start by logging into the Sora AI platform from your device. If you’re a first-time user, you’ll need to sign up for an account, which only takes a few minutes.
- Choose Your Prompt Type: Decide what kind of input you want to use—text, an image, or an existing video. Sora is flexible, allowing you to explore various creative avenues depending on your project needs.
- Enter Your Prompt: For text-to-video generation, type in a detailed description of the scene you want to create. The more specific your prompt, the better the output. If you’re working with images or videos, simply upload your file.
- Customize Settings: Tailor your project by adjusting video length, adding looping effects, or extending clips. Sora’s user-friendly interface makes it easy to fine-tune these settings to suit your vision.
- Generate and Review: Once your input is ready, hit the generate button. Sora will process your prompt and create the video. Review the output and make any necessary tweaks by refining your prompt or adjusting settings.
- Download and Share: When you’re happy with the result, download the video or share it directly from the platform. Sora makes it simple to distribute your creation for various purposes, from social media to professional projects.
Another interesting read: AI Video Faceoff: Sora vs. Movie Gen
By following these steps, you’ll quickly master this new AI model and bring your creative ideas to life with stunning, dynamic videos.
What is the Current State of Sora?
Currently, OpenAI has only made Sora available to a limited group, primarily graphic designers, filmmakers, and visual artists. The goal is to have people outside the organization use the model and provide feedback. This feedback from early users will be crucial for improving the model’s overall performance.
OpenAI has also highlighted that the current model has some weaknesses. It makes errors when simulating the physics of complex scenes, can confuse spatial details of a prompt, and has trouble modeling cause and effect within a video.
Now that we have an introduction to OpenAI’s new text-to-video model, let’s dig deeper into how it works.
Learn how to prompt AI video generators effectively in our guide here
OpenAI’s Methodology for Training Generative Video Models
As explained in a research article by OpenAI, its generative video models take inspiration from large language models (LLMs), specifically from the way LLMs unify diverse kinds of textual data, such as code, math, and multiple natural languages, under a single token representation.
Where LLMs operate on text tokens, Sora operates on visual patches: small, scalable units of representation that allow a generative model to be trained effectively on a wide variety of videos and images.
Compression of Visual Data to Create Patches
To understand the visual patches Sora relies on, we first need to look at how they are created. OpenAI trains a neural network to reduce the dimensionality of visual data: a raw video input is compressed into a lower-dimensional latent space.
This yields a latent representation that is compressed both temporally and spatially. Sora is trained on, and generates videos within, this compressed latent space, and OpenAI simultaneously trains a decoder model that maps generated latent representations back to pixel space.
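OpenAI does not publish the architecture of this compression network, but conceptually it behaves like a video autoencoder. The PyTorch sketch below is a minimal illustration under that assumption: a small encoder shrinks a video both temporally and spatially, and a decoder maps the latent back to pixel space. The layer sizes and strides are invented for illustration.

```python
import torch
import torch.nn as nn

class TinyVideoAutoencoder(nn.Module):
    """Illustrative video compressor, not Sora's actual network.

    Input video tensor: (batch, channels, time, height, width).
    The encoder halves the temporal axis and quarters each spatial
    axis; the decoder maps the latent back to pixel space.
    """
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(32, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 32, kernel_size=4, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(32, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):
        latent = self.encoder(video)           # compressed in time and space
        reconstruction = self.decoder(latent)  # mapped back to pixel space
        return latent, reconstruction

# 8 RGB frames at 64x64 resolution.
video = torch.randn(1, 3, 8, 64, 64)
latent, recon = TinyVideoAutoencoder()(video)
print(latent.shape, recon.shape)
```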
Generation of Spacetime Latent Patches
Given a compressed video input, the model extracts a sequence of spacetime patches from the latent representation. These patches act as transformer tokens, giving Sora a patch-based representation that lets it train on videos and images of different resolutions, durations, and aspect ratios. The same representation also gives control over the size of generated videos, simply by arranging the patches in an appropriately sized grid at sampling time.
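As an intuition for how a latent video becomes a token sequence, here is a small sketch that cuts a latent tensor into non-overlapping spacetime patches and flattens each one into a vector. The patch size and tensor shapes are illustrative assumptions, not values from OpenAI’s report.

```python
import numpy as np

def to_spacetime_patches(latent, patch_t, patch_h, patch_w):
    """Cut a latent video (time, height, width, channels) into
    non-overlapping spacetime patches and flatten each into a token."""
    t, h, w, c = latent.shape
    assert t % patch_t == 0 and h % patch_h == 0 and w % patch_w == 0
    # Split each axis into (number of patches, patch size), then regroup.
    patches = latent.reshape(t // patch_t, patch_t,
                             h // patch_h, patch_h,
                             w // patch_w, patch_w, c)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = patches.reshape(-1, patch_t * patch_h * patch_w * c)
    return tokens  # shape: (num_tokens, token_dim)

# A 4-frame, 16x16 latent with 8 channels becomes a short token sequence.
latent = np.random.randn(4, 16, 16, 8)
tokens = to_spacetime_patches(latent, patch_t=2, patch_h=4, patch_w=4)
print(tokens.shape)  # (32, 256)
```

Because the token sequence length simply follows from how many patches fit in the latent grid, the same model can consume short clips, long clips, portrait images, and widescreen videos without architectural changes.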
What is Sora, Architecturally?
Sora is a diffusion transformer: given noisy patches (along with conditioning information such as the text prompt), it is trained to predict the original, clean patches. Like diffusion transformers in other domains, it scales effectively, and sample quality improves as training compute increases.
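The denoising objective can be sketched in a few lines. The example below is a generic illustration of that idea rather than Sora’s exact formulation: a placeholder MLP stands in for the transformer, the noise schedule is simplistic, and all shapes are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder denoiser: in Sora this would be a transformer over
# spacetime patch tokens; a small MLP keeps the sketch runnable.
token_dim = 256
model = nn.Sequential(nn.Linear(token_dim + 1, 512), nn.SiLU(), nn.Linear(512, token_dim))

def diffusion_training_step(clean_tokens):
    """One simplified denoising step: corrupt clean patch tokens with
    noise at a random level, then train the model to recover them."""
    noise = torch.randn_like(clean_tokens)
    noise_level = torch.rand(clean_tokens.shape[0], 1)  # one level per token, for simplicity
    noisy_tokens = (1 - noise_level) * clean_tokens + noise_level * noise
    # Condition the denoiser on the noise level by appending it as a feature.
    prediction = model(torch.cat([noisy_tokens, noise_level], dim=-1))
    return F.mse_loss(prediction, clean_tokens)

# Random tokens standing in for encoded spacetime patches.
tokens = torch.randn(32, token_dim)
loss = diffusion_training_step(tokens)
loss.backward()
print(loss.item())
```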
Below is an example from OpenAI’s research article that illustrates how output quality depends on training compute.
Source: OpenAI
This is the output produced with the base amount of compute. As you can see, the result is neither coherent nor well defined.
Let’s take a look at the same video generated with more compute.
Source: OpenAI
With 4x the compute, the same prompt produces a markedly improved result: the characters hold their shape, their movements are far less fuzzy, and the scene contains noticeably more detail.
What happens when the training compute is increased even further?
Source: OpenAI
The result above was produced with 16x the base compute. The video is sharper still: the background and characters carry more detail, and the characters’ movements are better defined.
This progression shows how Sora, as a diffusion transformer, delivers higher-quality results as training compute increases.
The Future Holds…
Sora is a clear step forward for video generation models. While it still exhibits some inconsistencies, the capabilities it demonstrates point toward rapid further progress, and OpenAI frames it as a step toward simulating the physical and digital worlds. For now, we must wait and see how Sora develops in the coming days of generative AI.