{"paper":{"title":"MagicVideo: Efficient Video Generation With Latent Diffusion Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MagicVideo generates 256x256 text-to-video clips on a single GPU using 64 times fewer computations than prior video diffusion models.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Daquan Zhou, Hanshu Yan, Jiashi Feng, Weimin Wang, Weiwei Lv, Yizhe Zhu","submitted_at":"2022-11-20T16:40:31Z","abstract_excerpt":"We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are concordant with the given text descriptions. Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, which takes around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs. In specific, unlike existing works that directly train video models in the RGB space, we use a"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, which takes around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a pre-trained VAE maps video clips into a low-dimensional latent space while preserving enough spatial and temporal information for the diffusion model to produce high-fidelity, temporally coherent output without major artifacts.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times 
lower compute than prior video diffusion models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MagicVideo generates 256x256 text-to-video clips on a single GPU using 64 times fewer computations than prior video diffusion models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"518ce05e33b54bd6960846e91be43bdf866c909ec8b57ec93e12ed2b961cef5a"},"source":{"id":"2211.11018","kind":"arxiv","version":2},"verdict":{"id":"937aebc7-f53b-44b4-8b1c-7446604ed466","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T18:43:06.063034Z","strongest_claim":"MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, which takes around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs.","one_line_summary":"MagicVideo generates 256x256 text-conditioned video clips via latent diffusion with a custom 3D U-Net, achieving roughly 64 times lower compute than prior video diffusion models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a pre-trained VAE maps video clips into a low-dimensional latent space while preserving enough spatial and temporal information for the diffusion model to produce high-fidelity, temporally coherent output without major artifacts.","pith_extraction_headline":"MagicVideo generates 256x256 text-to-video clips on a single GPU using 64 times fewer computations than prior video diffusion models."},"references":{"count":52,"sample":[{"doi":"","year":2016,"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","ref_index":1,"cited_arxiv_id":"1607.06450","is_internal_anchor":true},{"doi":"","year":2021,"title":"Fitvid: Overfitting in pixel-level video 
prediction","work_id":"98b75ffa-1d61-4641-a59f-5967267b7d2c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Frozen in time: A joint video and image encoder for end-to-end retrieval","work_id":"17fcb6c4-ac33-4b0a-adc1-250adad835cd","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Multimodal datasets: misogyny, pornography, and malignant stereotypes","work_id":"fa09cccc-4dae-4631-ad3d-0b4c145a5b44","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Quo vadis, action recognition? a new model and the kinetics dataset","work_id":"72464151-385b-4866-a5ed-ec27a2e568f3","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":52,"snapshot_sha256":"d39e22f98167478e2f98d004eb0624e7a8c36febd85d1668d23ea1b510fe3b9b","internal_anchors":14},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2102607c4e20570c501022de525389400902c3a8200d0e4c28a28aff5399e960"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}