Open-Sora Plan
This project aims to create a simple and scalable repo to reproduce Sora (from OpenAI, though we prefer to call it "ClosedAI"). It was jointly initiated by the PKU-Rabbitpre AIGC Joint Lab. The current version is still far from that goal and needs continued improvement and rapid iteration, so we hope the open-source community will contribute. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system, and models trained on Huawei Ascend can output video quality comparable to industry standards.
News
COMING SOON
For large-model parallel training, TP & SP and more strategies are coming. A Huawei Ascend multimodal MindSpeed-MM branch will be added soon, using the MindSpeed-MM suite to support scaling up Open-Sora Plan's parameter count and to provide TP, SP, and other distributed training capabilities for training larger models.
[2024.10.16] We released version 1.3.0, featuring: WFVAE, prompt refiner, data filtering strategy, sparse attention, and bucket training strategy. We also support 93x480p within 24G VRAM. More details can be found in our latest report.
[2024.08.13] We are launching the Open-Sora Plan v1.2.0 I2V model, which is based on Open-Sora Plan v1.2.0. The current version supports image-to-video generation and transition generation (video generation conditioned on the starting and ending frames). Check out the Image-to-Video section in this report.
[2024.07.24] v1.2.0 is here! It uses a 3D full attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p. Check out our latest report.
[2024.05.27] We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video's capability to annotate long videos.
[2024.04.09] Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos.
[2024.04.07] Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.
[2024.03.27] We released the report on VideoCausalVAE, which supports both images and videos. We present our reconstructed videos in this demonstration. The text-to-video model is on the way.
[2024.03.01] We launched a plan to reproduce Sora, called Open-Sora Plan! You are welcome to watch this repository for the latest updates.
Gallery
Text & Image to Video Generation.
Highlights
Open-Sora Plan shows excellent performance in video generation.
High-performance CausalVideoVAE with lower training cost
- High compression ratio with excellent performance, capable of compressing videos by 256 times (4×8×8). Causal convolution supports simultaneous inference of images and videos, and training needs only 1 node.
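The 256× figure follows from the 4×8×8 strides: 4 along time and 8 along each spatial axis. Below is a minimal sketch of that arithmetic. The function name `latent_shape` is hypothetical and not part of this repository, and the latent frame count assumes the common causal-VAE convention that the first frame is encoded on its own, so 4n+1 input frames map to n+1 latent frames.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Return (latent_frames, latent_height, latent_width) for a video.

    Assumption (not stated in the text): the first frame is encoded on its
    own, so 4n+1 input frames become n+1 latent frames.
    """
    if (frames - 1) % t_stride != 0:
        raise ValueError("frames must be of the form t_stride * n + 1")
    if height % s_stride or width % s_stride:
        raise ValueError("height and width must be multiples of s_stride")
    return (frames - 1) // t_stride + 1, height // s_stride, width // s_stride


# Nominal compression ratio quoted in the text: 4 x 8 x 8
COMPRESSION_RATIO = 4 * 8 * 8

print(COMPRESSION_RATIO)            # 256
print(latent_shape(93, 640, 640))   # (24, 80, 80)
```

A single image is the `frames=1` case, which is why the same encoder can handle both images and videos under this convention.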
Video Diffusion Model based on 3D attention, joint learning of spatiotemporal features.
- With a new sparse attention architecture instead of a 2+1D model, 3D attention can better capture joint spatial and temporal features.
Demo
Gradio Web UI
We highly recommend trying our web demo with the following command.
python -m opensora.serve.gradio_web_server --model_path "path/to/model" \
--ae WFVAEModel_D8_4x8x8 --ae_path "path/to/vae" \
--caption_refiner "path/to/refiner" \
--text_encoder_name_1 "path/to/text_enc" --rescale_betas_zero_snr
ComfyUI
Coming soon...
Resource
Version | Architecture | Diffusion Model | CausalVideoVAE | Data | Prompt Refiner |
---|---|---|---|---|---|
v1.3.0 | 3D | Anysize in 93x640x640[3], more checkpoints are coming soon | Anysize | prompt_refiner | checkpoint |
v1.2.0 | 3D | 93x720p, 29x720p[1], 93x480p[1,2], 29x480p, 1x480p, 93x480p_i2v | Anysize | Annotations | - |
v1.1.0 | 2+1D | 221x512x512, 65x512x512 | Anysize | Data and Annotations | - |
v1.0.0 | 2+1D | 65x512x512, 65x256x256, 17x256x256 | Anysize | Data and Annotations | - |
[1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.
[2] We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.
[3] The model is trained on arbitrary sizes with stride=32, so keep the inference resolution a multiple of 32. The number of frames needs to be 4n+1, e.g. 93, 77, 61, 45, 29, 1 (image).
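The constraints in note [3] can be checked before launching inference. The sketch below is illustrative only: `check_inference_size` and `snap_down` are hypothetical helpers, not functions from this repository, and rounding down is just one possible way to pick a nearby valid size.

```python
def check_inference_size(frames, height, width, stride=32):
    """Return a list of problems; an empty list means the request is valid."""
    problems = []
    if frames < 1 or (frames - 1) % 4 != 0:
        problems.append(f"frames={frames} is not of the form 4n+1")
    for name, value in (("height", height), ("width", width)):
        if value < stride or value % stride != 0:
            problems.append(f"{name}={value} is not a positive multiple of {stride}")
    return problems


def snap_down(frames, height, width, stride=32):
    """Round each value down to the nearest valid one (hypothetical helper)."""
    frames = max(1, (frames - 1) // 4 * 4 + 1)
    height = max(stride, height // stride * stride)
    width = max(stride, width // stride * stride)
    return frames, height, width


print(check_inference_size(93, 640, 640))   # []
print(snap_down(100, 720, 1280))            # (97, 704, 1280)
```

Note that 720 is not a multiple of 32, so under this rule a 720-pixel side would be rounded to 704.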
For version 1.2.0, we no longer support 2+1D models.
Requirements and Installation
- Clone this repository and navigate to the Open-Sora-Plan folder
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
- Install required packages. We recommend the following requirements.
- Python >= 3.8
- PyTorch >= 2.1.0
- CUDA Version >= 11.7
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
- Install optional requirements such as static type checking:
pip install -e '.[dev]'
Training & Inferencing
CausalVideoVAE
The data preparation, training, inferencing and evaluation can be found here
Prompt Refiner
The data preparation, training, inferencing can be found here
Text-to-Video
The data preparation, training and inferencing can be found here
Image-to-Video
The data preparation, training and inferencing can be found here
Extra Save Memory
Training
During training, the entire EMA model remains in VRAM. To reduce memory usage, you can enable --offload_ema or disable --use_ema. Additionally, VAE tiling is disabled by default, but you can pass --enable_tiling or disable --vae_fp32. Finally, a temporary but extreme memory-saving option is to enable --extra_save_mem, which offloads the text encoder and VAE to the CPU when not in use, though this will significantly slow down training.
We currently have two plans: one is to continue with the DeepSpeed/FSDP approach, sharding the EMA and text encoder across ranks with ZeRO-3, which is sufficient for training 10-15B models. The other is to adopt MindSpeed for various parallel strategies, enabling us to scale the model up to 30B.
24G VRAM Inferencing
Please first make sure you understand how to run inference. Refer to the inference instructions in Text-to-Video.
Simply specify --save_memory, and during inference, enable_model_cpu_offload(), enable_sequential_cpu_offload(), and vae.vae.enable_tiling() will be automatically activated.
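As a rough illustration of what the flag does, the sketch below applies the three calls named above to a pipeline-like object. `apply_save_memory` is a hypothetical helper, not the repository's code. The call order is an assumption, and a stand-in object is used so the example runs without loading any model.

```python
from types import SimpleNamespace


def apply_save_memory(pipeline):
    """Apply the three memory-saving calls the text associates with --save_memory.

    Assumption: the calls are made in the order they are listed in the text;
    the repository's actual implementation may differ.
    """
    pipeline.enable_model_cpu_offload()
    pipeline.enable_sequential_cpu_offload()
    pipeline.vae.vae.enable_tiling()
    return pipeline


# Stand-in pipeline that only records which methods were called.
calls = []
fake_pipeline = SimpleNamespace(
    enable_model_cpu_offload=lambda: calls.append("model_cpu_offload"),
    enable_sequential_cpu_offload=lambda: calls.append("sequential_cpu_offload"),
    vae=SimpleNamespace(
        vae=SimpleNamespace(enable_tiling=lambda: calls.append("vae_tiling"))
    ),
)

apply_save_memory(fake_pipeline)
print(calls)  # ['model_cpu_offload', 'sequential_cpu_offload', 'vae_tiling']
```

With a real pipeline, these options trade inference speed for lower peak VRAM, which is the point of the 24G setting.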
How to Contribute
We greatly appreciate your contributions to the Open-Sora Plan open-source community and your help in making it even better!
For more details, please refer to the Contribution Guidelines.
Acknowledgement
- Latte: A wonderful 2+1D video generation model.
- PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
- VideoGPT: Video Generation using VQ-VAE and Transformers.
- DiT: Scalable Diffusion Models with Transformers.
- FiT: Flexible Vision Transformer for Diffusion Model.
- Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.
License
- See LICENSE for details.
Star History
Citing
BibTeX
@misc{lin2024opensoraplanopensourcelarge,
title={Open-Sora Plan: Open-Source Large Video Generation Model},
author={Bin Lin and Yunyang Ge and Xinhua Cheng and Zongjian Li and Bin Zhu and Shaodong Wang and Xianyi He and Yang Ye and Shenghai Yuan and Liuhan Chen and Tanghui Jia and Junwu Zhang and Zhenyu Tang and Yatian Pang and Bin She and Cen Yan and Zhiheng Hu and Xiaoyi Dong and Lin Chen and Zhang Pan and Xing Zhou and Shaoling Dong and Yonghong Tian and Li Yuan},
year={2024},
eprint={2412.00131},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.00131},
}
Latest DOI
Community contributors