


The world's first open-source Sora-like reproduction solution is here! Full disclosure of all training details and model weights
The world's first open-source video generation model with a Sora-like architecture is here!
The entire training process is open: data processing, all training details, and model weights.
This is the newly released Open-Sora 1.0.
Here is what it can actually do, such as generating bustling traffic in a city night scene.
It can also take an aerial perspective to show a cliff coastline with waves lapping against the rocks.
Or a vast starry sky captured in time-lapse.
Since Sora's release, its stunning results and the scarcity of technical details have made deciphering and reproducing it one of the hottest topics in the developer community. For example, the Colossal-AI team previously launched a Sora training and inference reproduction pipeline that can reduce costs by 46%.
Just two weeks later, the team has released its latest progress: a Sora-like reproduction solution, with the technical approach and detailed tutorials open-sourced for free on GitHub.
So the question is: how do you reproduce Sora?
Open-Sora open source address: https://github.com/hpcaitech/Open-Sora
A comprehensive look at the Sora reproduction solution
The Sora reproduction solution covers four aspects:
- Model architecture design
- Training reproduction plan
- Data preprocessing
- Efficient training optimization strategy
Model Architecture Design
The model adopts the Diffusion Transformer (DiT), the same architecture family as Sora.
It builds on PixArt-α, a high-quality open-source text-to-image model with a DiT architecture, and introduces a temporal attention layer on top to extend it to video data.
Specifically, the entire architecture includes a pre-trained VAE, a text encoder, and an STDiT (Spatial Temporal Diffusion Transformer) model that uses a spatial-temporal attention mechanism.
Among them, the structure of each layer of STDiT is shown in the figure below.
It stacks a one-dimensional temporal attention module on top of a two-dimensional spatial attention module in series to model temporal relationships. After the temporal attention module, a cross-attention module aligns the semantics of the text.
Compared with a full attention mechanism, this structure greatly reduces training and inference overhead.
Compared with the Latte model, which also uses a spatial-temporal attention mechanism, STDiT can better utilize the weights of a pre-trained image DiT and continue training on video data.
△STDiT structure diagram
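To make the serial structure concrete, below is a minimal PyTorch sketch of one such block. It is an illustrative reconstruction of the description above, not the official Open-Sora code; the hidden size, head count, and exact module layout are assumptions.

```python
# Illustrative STDiT-style block (an assumption-based sketch, not Open-Sora code).
import torch
import torch.nn as nn
from einops import rearrange

class STDiTBlockSketch(nn.Module):
    """Serial: 2D spatial attention -> 1D temporal attention -> text cross-attention -> MLP."""
    def __init__(self, dim=1152, num_heads=16):  # sizes are placeholder guesses
        super().__init__()
        self.norm_s, self.norm_t = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_c, self.norm_m = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text):
        # x: (batch, frames, spatial_tokens, dim); text: (batch, text_tokens, dim)
        b, t, s, d = x.shape
        # 2D spatial attention: tokens attend within their own frame.
        h = rearrange(x, "b t s d -> (b t) s d")
        n = self.norm_s(h)
        h = h + self.spatial_attn(n, n, n)[0]
        # 1D temporal attention: each spatial location attends across frames.
        h = rearrange(h, "(b t) s d -> (b s) t d", b=b)
        n = self.norm_t(h)
        h = h + self.temporal_attn(n, n, n)[0]
        # Cross-attention with text embeddings to align semantics.
        h = rearrange(h, "(b s) t d -> b (t s) d", b=b)
        n = self.norm_c(h)
        h = h + self.cross_attn(n, text, text)[0]
        # Position-wise feed-forward network.
        h = h + self.mlp(self.norm_m(h))
        return rearrange(h, "b (t s) d -> b t s d", t=t)
```

Because each attention call operates on a short sequence (one frame's tokens, or one location's frames) rather than all frames and positions at once, the per-layer cost stays far below that of full 3D attention.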
The training and inference process of the entire model is as follows.
In the training stage, the pre-trained Variational Autoencoder (VAE) encoder first compresses the video data, and the STDiT diffusion model is then trained together with text embeddings in the compressed latent space.
In the inference stage, Gaussian noise is randomly sampled from the VAE's latent space and fed into STDiT together with the prompt embedding; the denoised features are finally passed to the VAE decoder to produce the video.
△Model training process
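In code, the two processes can be summarized roughly as follows. This is a schematic sketch under assumed interfaces: `vae`, `text_encoder`, `stdit`, and `scheduler` stand in for the pre-trained VAE, the text encoder, the STDiT model, and a standard diffusion noise scheduler; none of these names are the actual Open-Sora API.

```python
# Schematic training/inference flow for a latent video diffusion model.
# All module interfaces below are assumed placeholders, not Open-Sora's API.
import torch
import torch.nn.functional as F

def training_step(video, prompt, vae, text_encoder, stdit, scheduler):
    with torch.no_grad():
        z = vae.encode(video)            # compress video into the latent space
        text_emb = text_encoder(prompt)  # text embeddings for conditioning
    t = torch.randint(0, scheduler.num_steps, (z.shape[0],), device=z.device)
    noise = torch.randn_like(z)
    z_noisy = scheduler.add_noise(z, noise, t)  # forward diffusion
    pred = stdit(z_noisy, t, text_emb)          # predict the added noise
    return F.mse_loss(pred, noise)

@torch.no_grad()
def sample(prompt, latent_shape, vae, text_encoder, stdit, scheduler):
    text_emb = text_encoder(prompt)
    z = torch.randn(latent_shape)        # start from Gaussian noise in latent space
    for t in scheduler.timesteps:        # iterative denoising
        pred = stdit(z, t, text_emb)
        z = scheduler.step(pred, t, z)
    return vae.decode(z)                 # decode latents back to video frames
```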
Training reproduction plan
For the training reproduction, Open-Sora draws on Stable Video Diffusion (SVD).
It is divided into three stages:
- Large-scale image pre-training.
- Large-scale video pre-training.
- Fine-tuning of high-quality video data.
Each stage continues training from the weights of the previous stage.
Compared with single-stage training from scratch, multi-stage training reaches high-quality video generation more efficiently by gradually scaling up the data (see the configuration sketch below).
△Three phases of training plan
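The staged recipe can be condensed into a configuration sketch. The entries below paraphrase the description that follows; apart from the 256x256 stage-2 resolution mentioned later, any concrete value is a placeholder.

```python
# Descriptive sketch of the three-stage schedule, not a real Open-Sora config file.
STAGES = [
    {"stage": 1, "task": "large-scale image pre-training",
     "data": "web-scale image data", "init": "open text-to-image weights"},
    {"stage": 2, "task": "large-scale video pre-training",
     "data": "large, diverse video corpus", "resolution": "256x256",
     "init": "stage-1 weights + zero-initialized temporal attention"},
    {"stage": 3, "task": "high-quality video fine-tuning",
     "data": "an order of magnitude less data; longer, higher-resolution clips",
     "init": "stage-2 weights"},
]
```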
The first phase is large-scale image pre-training.
The team used the abundant image data on the Internet and mature text-to-image techniques to first train a high-quality text-to-image model, and used this model as the initialization weights for the next stage of video pre-training.
At the same time, since there is currently no high-quality spatio-temporal VAE, they use the image VAE pre-trained by Stable Diffusion.
This not only ensures the superior performance of the initial model, but also significantly reduces the overall cost of video pre-training.
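For reference, loading a pre-trained Stable Diffusion image VAE takes only a few lines with the Hugging Face diffusers library. The sketch below uses the diffusers-format counterpart of the weights in reference [5]; whether Open-Sora loads the VAE exactly this way is an assumption.

```python
# Sketch: reuse a pre-trained Stable Diffusion image VAE on video frames.
import torch
from diffusers import AutoencoderKL

# Diffusers-format release of the sd-vae-ft-mse weights (see reference [5]).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

frames = torch.randn(4, 3, 256, 256)  # a batch of frames in [-1, 1]
with torch.no_grad():
    # Encode frame by frame; 0.18215 is the standard SD latent scaling factor.
    latents = vae.encode(frames).latent_dist.sample() * 0.18215
print(latents.shape)  # torch.Size([4, 4, 32, 32]): 8x spatial downsampling
```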
The second stage is large-scale video pre-training.
This stage mainly improves the model's generalization ability and its grasp of temporal correlations in video.
It requires a large amount of video data for training, with diverse video material.
The second-stage model adds a temporal attention module on top of the first-stage text-to-image model to learn temporal relationships in video. The remaining modules stay consistent with the first stage and load the first-stage weights as initialization. The output of the temporal attention module is initialized to zero, which yields more efficient and faster convergence.
The Colossal-AI team used PixArt-α's open-source weights to initialize the second-stage STDiT model, and the T5 model as the text encoder. They pre-trained at a small 256x256 resolution, which further sped up convergence and reduced training costs.
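The zero-initialization trick is simple to express in code: zero the output projection of each newly added temporal attention layer so that, at initialization, every block behaves exactly like the pre-trained image model. A minimal sketch, assuming the block layout of the earlier STDiT sketch:

```python
# Zero-init the temporal branch so it contributes nothing at the start of training.
import torch.nn as nn

def zero_init_temporal(attn: nn.MultiheadAttention) -> None:
    nn.init.zeros_(attn.out_proj.weight)  # output projection -> 0
    nn.init.zeros_(attn.out_proj.bias)    # temporal residual branch starts inert

# Hypothetical usage, given a model built from the earlier sketch blocks:
# for block in stdit.blocks:
#     zero_init_temporal(block.temporal_attn)
```

Since the temporal branch initially adds nothing to the residual stream, training starts from the stage-1 behavior and only gradually learns temporal structure, which is why convergence is faster.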
△Open-Sora generation effect (prompt: a shot of the underwater world, a turtle swimming leisurely among the coral reefs)
The third stage is fine-tuning of high-quality video data.
According to the team, this stage significantly improves the quality of the model's generations. The data used is an order of magnitude smaller than in the previous stage, but the videos are longer, higher-resolution, and higher-quality.
Fine-tuning in this way efficiently extends video generation from short to long, from low to high resolution, and from low to high fidelity.
It is worth mentioning that Colossal-AI also disclosed the resource usage of each stage in detail.
In the Open-Sora reproduction, they trained on 64 H800 GPUs. The second stage took 2,808 GPU hours, roughly US$7,000, and the third stage took 1,920 GPU hours, roughly US$4,500. By their preliminary estimate, the full training plan keeps the cost of the Open-Sora reproduction at about US$10,000.
Data Preprocessing
To further lower the threshold and complexity of reproducing Sora, the Colossal-AI team also provides a convenient video data preprocessing script in the code repository, so anyone can easily start Sora reproduction pre-training.
This covers downloading public video datasets, segmenting long videos into short clips based on shot continuity, and using the open-source large multimodal model LLaVA to generate precise captions.
The batch video captioning code they provide can annotate one video in 3 seconds using two GPUs, with quality close to GPT-4V.
The resulting video/text pairs can be used directly for training. With the open-source code provided on GitHub, you can easily and quickly generate the video/text pairs required for training on your own dataset, significantly lowering the technical threshold and upfront preparation for starting a Sora reproduction project.
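As one possible implementation of the shot-based splitting step, here is a sketch using the open-source PySceneDetect library; the actual Open-Sora scripts may use different tools, and the detector threshold below is a guess.

```python
# Shot-based video splitting sketch (pip install scenedetect[opencv]).
# Not the Open-Sora script itself; detector choice and threshold are assumptions.
from scenedetect import ContentDetector, detect, split_video_ffmpeg

def split_by_shots(video_path: str):
    # Detect cuts from frame-to-frame content changes (threshold is a guess).
    scenes = detect(video_path, ContentDetector(threshold=27.0))
    # Write one clip per detected shot via ffmpeg.
    split_video_ffmpeg(video_path, scenes)
    return scenes

# Each resulting clip would then be captioned with a vision-language model such
# as LLaVA (reference [7]) to produce the video/text pairs used for training.
```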
Efficient training support
In addition, the Colossal-AI team also provides a training acceleration solution.
Through efficient training strategies such as operator optimization and hybrid parallelism, they achieved a 1.55x speedup when training on 64-frame, 512x512-resolution video.
At the same time, thanks to Colossal-AI's heterogeneous memory management system, a 1-minute 1080p HD video training task can run unimpeded on a single server (8x H800).
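For orientation, Colossal-AI exposes its heterogeneous memory management through the Gemini plugin of its Booster API. The snippet below is a generic usage sketch of that API with a toy model standing in for STDiT, not the actual Open-Sora training script, and it assumes a Colossal-AI version whose `launch_from_torch` still takes a `config` argument.

```python
# Generic Colossal-AI Booster + Gemini usage sketch; not the Open-Sora script.
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})   # initialize the distributed environment

model = nn.Linear(1024, 1024)             # toy stand-in for the STDiT model
optimizer = HybridAdam(model.parameters(), lr=1e-4)

# GeminiPlugin manages parameters/optimizer states across GPU and CPU memory.
booster = Booster(plugin=GeminiPlugin())
model, optimizer, *_ = booster.boost(model, optimizer)
```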
The team also found that the STDiT model architecture is remarkably efficient during training.
Compared with a DiT using full attention, STDiT achieves a speedup of up to 5x as the number of frames increases, which is particularly critical in real-world tasks such as processing long video sequences.
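The scaling behind that speedup is easy to sanity-check. For T frames with S spatial tokens each, full attention compares all T·S tokens against each other, costing on the order of (T·S)² token pairs, while the factorized design costs roughly T·S² for spatial plus S·T² for temporal attention. A back-of-the-envelope check (illustrative arithmetic only; real kernels, constants, and non-attention compute shrink the end-to-end gap):

```python
# Token-pair counts for full vs. factorized spatial-temporal attention.
# Illustrative arithmetic; S (spatial tokens per frame) is an assumed example.
def full_pairs(T, S):
    return (T * S) ** 2            # every token attends to every other token

def factorized_pairs(T, S):
    return T * S**2 + S * T**2     # per-frame spatial + per-location temporal

S = 1024                           # e.g. a 32x32 latent token grid
for T in (16, 64):
    ratio = full_pairs(T, S) / factorized_pairs(T, S)
    print(f"frames={T}: full attention computes ~{ratio:.0f}x more token pairs")
```

The attention-only gap widens as the number of frames grows, matching the trend reported above; the measured end-to-end speedup is smaller because attention is only part of the total compute.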
Finally, the team also released more Open-Sora generation effects.
△More Open-Sora generation effects (video, duration 00:25)
The team told the tech outlet QbitAI that it will continue to update and optimize Open-Sora's solutions and progress over the long term; in the future it will use more video training data to generate higher-quality, longer video content, and will support multi-resolution features.
In terms of practical applications, the team said it will push for adoption in movies, games, advertising, and other fields.
Interested developers can visit the GitHub project to learn more.
Open-Sora open source address: https://github.com/hpcaitech/Open-Sora
Reference links:
[1] Scalable Diffusion Models with Transformers: https://arxiv.org/abs/2212.09748
[2] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis: https://arxiv.org/abs/2310.00426
[3] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets: https://arxiv.org/abs/2311.15127
[4] Latte: Latent Diffusion Transformer for Video Generation: https://arxiv.org/abs/2401.03048
[5] Stable Diffusion VAE weights (sd-vae-ft-mse-original): https://huggingface.co/stabilityai/sd-vae-ft-mse-original
[6] T5 (text-to-text-transfer-transformer): https://github.com/google-research/text-to-text-transfer-transformer
[7] LLaVA: https://github.com/haotian-liu/LLaVA
[8] Open-Sora v1.0 blog: https://hpc-ai.com/blog/open-sora-v1.0