Open-Sora Plan: Open-Source Large Video Generation Model
Pith reviewed 2026-05-23 08:38 UTC · model grok-4.3
The pith
Open-Sora Plan presents an open-source large video generation model that combines a Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, and multi-dimensional data curation to achieve high-quality video outputs with public code and weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Benefiting from efficient thoughts, our Open-Sora Plan achieves impressive video generation results in both qualitative and quantitative evaluations.
Load-bearing premise
The assumption that the specific combination of Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, condition controllers, and the proposed data curation pipeline will reliably produce high-quality long-duration videos, as the abstract provides no metrics, baselines, or ablation details to support this.
Figures
read the original abstract
We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for generating desired high-resolution videos with long durations based on various user inputs. Our project comprises multiple components for the entire video generation process, including a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and various condition controllers. Moreover, many assistant strategies for efficient training and inference are designed, and a multi-dimensional data curation pipeline is proposed for obtaining desired high-quality data. Benefiting from efficient thoughts, our Open-Sora Plan achieves impressive video generation results in both qualitative and quantitative evaluations. We hope our careful design and practical experience can inspire the video generation research community. All our codes and model weights are publicly available at \url{https://github.com/PKU-YuanGroup/Open-Sora-Plan}.
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 37 Pith papers
-
ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration
ORBIS uses output-guided token reduction and DATM to achieve 2x higher token reduction than AsymRnR, with up to 4.5x speedup and 79.3% energy savings versus A100 GPU for video DiT models.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering
Attention sparsity in video DiTs is an input-stable layer-wise property, enabling offline profiling and online bidirectional QK co-clustering for up to 1.93x speedup with PSNR up to 29 dB.
-
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
-
MultiAnimate: Pose-Guided Image Animation Made Extensible
MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.
-
GenHSI: Controllable Generation of Human-Scene Interaction Videos
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D i...
-
History-Guided Video Diffusion
DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.
-
SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models
SCOPE adds per-pixel action conditioning to pretrained video diffusion models and releases the CrossFPS multi-game dataset to support cross-game FPS world model simulation with zero-shot transfer.
-
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
DyMoS rebalances self-attention from generated frames to the reference frame in initial denoising steps of image-to-video models to reduce reference dominance and improve motion without training or fidelity loss.
-
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
DyMoS rebalances reference-frame dominance in self-attention of I2V diffusion models during initial denoising to improve motion dynamics without retraining or input changes.
-
AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling
AtlasVid proposes a decoupled global-local diffusion framework that trains at low resolution with LoRA and generalizes to ultra-high-resolution long video synthesis via semantic proxy guidance and locality-preserving ...
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing improves temporal coherence in interactive autoregressive video generation by estimating transition consistency from teacher-generator latent deltas and balancing it against a monotonic continuity objective.
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories to balance teacher supervision against a monotonic continuity objective in autoregressive video generation.
-
Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation
Delta Forcing uses latent trajectory deltas to adaptively limit unreliable teacher guidance while enforcing monotonic continuity, improving temporal consistency in interactive autoregressive video generation.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
-
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...
-
TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
Causal Forcing uses an autoregressive teacher for ODE initialization in diffusion distillation to close the causal attention gap and deliver better real-time video generation than Self Forcing.
-
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
Causal Forcing initializes autoregressive diffusion students from AR teachers to recover flow maps that bidirectional teachers cannot provide, delivering 19%+ gains over Self Forcing on dynamic degree and related metrics.
-
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.
-
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.
-
Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis
Latent Wavelet Diffusion uses wavelet energy map masking and a scale-consistent VAE to improve detail fidelity in 2K-4K image generation without extra inference overhead.
-
ImgEdit: A Unified Image Editing Dataset and Benchmark
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
-
Latte: Latent Diffusion Transformer for Video Generation
Latte achieves state-of-the-art video generation on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD by using a latent diffusion transformer with four efficient spatial-temporal decomposition variants and best-pract...
-
GeoWorld-VLM: Geometry from World Models for Vision-Language Models
GeoWorld-VLM distills geometric structure from camera-conditioned world models into VLMs by aligning visual features, improving spatial reasoning by about 4% on What'sUp and VSR benchmarks across two architectures whi...
-
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights...
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Wan: Open and Advanced Large-Scale Video Generative Models
Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
-
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
Image-to-Video Diffusion: From Foundations to Open Frontiers
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Reference graph
Works this paper leans on
-
[1]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gul Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021a. Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of...
-
[2]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023a. 1Core contributors with equal contributions 23 Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhong...
-
[4]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Image quality metrics: Psnr vs
Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th International Conference on Pattern Recognition,
work page 2010
-
[7]
24 Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Con- trolnet++: Improving conditional controls with efficient consistency feedback. In European Conference on Computer Vision (ECCV), 2024a. Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. Wf-vae: Enhancing video vae by wavele...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024a. Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of...
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[11]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Playground v3: Improving text-to-image alignment with deep-fusion large language models
Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024a. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruct...
-
[13]
Fit: Flexible vision transformer for diffusion model
Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, and Lei Bai. Fit: Flexible vision transformer for diffusion model. arXiv preprint arXiv:2402.12376,
-
[14]
Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, and Jiaya Jia. Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070,
-
[15]
Movie Gen: A Cast of Media Foundation Models
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[17]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[18]
Anytext: Multilingual visual text generation and editing
Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing. arXiv preprint arXiv:2311.03054,
-
[19]
Tarsier: Recipes for training and evaluating large video description models
Jiawei Wang, Liping Yuan, and Yuchen Zhang. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634, 2024a. Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequenc...
-
[20]
Fitv2: Scalable and improved flexible vision transformer for diffusion model
ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, et al. Fitv2: Scalable and improved flexible vision transformer for diffusion model. arXiv preprint arXiv:2410.13925, 2024c. Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative ...
-
[21]
Easyanimate: A high-performance long video generation method based on transformer architecture
Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, and Jun Huang. Easyanimate: A high-performance long video generation method based on transformer architecture. arXiv preprint arXiv:2405.18991, 2024a. Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, and Jun Huang. Easyanimate: A high-performance...
-
[22]
mt5: A massively multilingual pre-trained text-to-text transformer
L Xue. mt5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934,
-
[23]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024a. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video di...
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Yi: Open Foundation Models by 01.AI
ai. arXiv preprint arXiv:2403.04652,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation
Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, and Li Yuan. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation. arXiv preprint arXiv:2406.18522,
-
[27]
Efros, Eli Shechtman, and Oliver Wang
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
work page 2018
-
[28]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Allegro: Open the black box of commercial-level video generation model
Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458,
-
[30]
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852,
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.