pith. machine review for the scientific record

arxiv: 2401.03048 · v3 · submitted 2024-01-05 · 💻 cs.CV

Recognition: 1 theorem link

Latte: Latent Diffusion Transformer for Video Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords: video generation · latent diffusion · transformer · spatio-temporal tokens · diffusion models · text-to-video · UCF101 · state-of-the-art

The pith

Latte generates higher-quality videos by running a transformer on latent spatio-temporal tokens with decomposed dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latte as a diffusion model that first compresses videos into latent representations, extracts tokens carrying both spatial and temporal information, and then processes those tokens with transformer blocks. To keep computation feasible as token counts grow, it offers four variants that separate the spatial and temporal axes at different stages. The authors run systematic ablations to settle on the strongest choices for video patch embedding, timestep-class injection, temporal positional embeddings, and training strategies. When these pieces are combined, the resulting model produces videos that surpass previous methods on four established benchmarks covering faces, time-lapse scenes, human actions, and Tai Chi motions, and it also performs competitively when extended to text-conditioned generation.
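To ground that pipeline, here is a minimal sketch of one latent-diffusion training step in this style. The names (`vae.encode`, `latte`, `scheduler.add_noise`, `scheduler.num_steps`) are illustrative assumptions, not the paper's code; the loss is the standard epsilon-prediction objective.

```python
import torch
import torch.nn.functional as F

def training_step(vae, latte, video, scheduler):
    """One denoising-diffusion training step in latent space (sketch).

    Assumed interfaces: vae.encode maps video (B, F, 3, H, W) to latents,
    latte is a token transformer predicting noise, and scheduler.add_noise
    implements the usual forward process q(z_t | z_0). None of these
    names come from the paper's released code.
    """
    with torch.no_grad():
        z0 = vae.encode(video)                     # frozen VAE: video -> latents
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.num_steps, (z0.shape[0],), device=z0.device)
    zt = scheduler.add_noise(z0, noise, t)         # diffuse latents to step t
    pred = latte(zt, t)                            # transformer predicts the noise
    return F.mse_loss(pred, noise)                 # epsilon-prediction loss
```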

Core claim

Latte extracts spatio-temporal tokens from input videos and models their distribution in latent space using a series of transformer blocks. Four efficient variants are introduced by decomposing the spatial and temporal dimensions of the tokens. Rigorous experiments identify the best practices for video clip patch embedding, model architecture choice, timestep-class injection, temporal positional embedding, and learning strategies, enabling state-of-the-art performance on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, along with competitive results on text-to-video tasks.
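As a concrete (and hedged) reading of the tokenization step, the sketch below splits each latent frame into non-overlapping patches and projects them to tokens, giving F * (H/p) * (W/p) spatio-temporal tokens per clip. Channel counts, patch size, and embedding dimension are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpatioTemporalPatchEmbed(nn.Module):
    """Project latent video frames into a spatio-temporal token sequence.

    Sketch only: the defaults below are illustrative assumptions.
    """
    def __init__(self, in_channels: int = 4, patch: int = 2, dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch, stride=patch)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: VAE latents of shape (B, F, C, H, W)
        b, f, c, h, w = z.shape
        x = self.proj(z.reshape(b * f, c, h, w))  # (B*F, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B*F, N, D): N spatial tokens per frame
        return x.reshape(b, -1, x.shape[-1])      # (B, F*N, D) token sequence
```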

What carries the argument

Transformer blocks applied to spatio-temporal tokens in latent space, with four decomposition variants that separate spatial and temporal processing to manage token volume efficiently.
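One way to picture the decomposition is the interleaved variant sketched below: spatial attention lets the N tokens within each frame attend to one another, then temporal attention lets tokens at the same spatial location attend across the F frames, so no attention ever runs over all F*N tokens at once. The module layout is an assumption in the spirit of the paper's variants, not a reproduction of any one of them.

```python
import torch
import torch.nn as nn

class DecomposedSTBlock(nn.Module):
    """Interleaved spatial/temporal attention over (B, F*N, D) tokens (sketch)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, f: int) -> torch.Tensor:
        b, fn, d = x.shape
        n = fn // f  # spatial tokens per frame

        # Spatial attention: cost scales with N^2 per frame, not (F*N)^2.
        xs = self.norm1(x).reshape(b * f, n, d)
        xs, _ = self.spatial(xs, xs, xs, need_weights=False)
        x = x + xs.reshape(b, fn, d)

        # Temporal attention: each spatial location attends across frames;
        # a temporal positional embedding would typically be added here.
        xt = self.norm2(x).reshape(b, f, n, d).transpose(1, 2).reshape(b * n, f, d)
        xt, _ = self.temporal(xt, xt, xt, need_weights=False)
        x = x + xt.reshape(b, n, f, d).transpose(1, 2).reshape(b, fn, d)
        return x
```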

If this is right

  • Video diffusion models can handle larger token counts without proportional compute increases by using dimension decomposition.
  • Careful design of timestep injection and temporal positional embeddings measurably improves sample quality in transformer-based diffusion (see the AdaLN sketch after this list).
  • The same latent-token transformer backbone supports both unconditional and text-conditioned video generation.
  • Insights from the ablation study on embedding and learning strategies can be reused in other diffusion transformer architectures.
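The timestep-injection point above can be made concrete with an adaptive layer-norm sketch: a timestep (and class) embedding regresses the per-channel scale and shift applied after normalization. The paper's appendix describes an S-AdaLN variant whose exact form is not reproduced here; treat this generic AdaLN as an assumption.

```python
import torch
import torch.nn as nn

class AdaLayerNorm(nn.Module):
    """LayerNorm whose scale/shift are predicted from a conditioning vector
    (e.g., timestep + class embedding). Generic AdaLN sketch, not the
    paper's exact S-AdaLN."""

    def __init__(self, dim: int = 768, cond_dim: int = 768):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) tokens; cond: (B, cond_dim) conditioning embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```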

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decomposition approach may extend to longer or higher-resolution videos by further factoring the temporal axis.
  • Combining the latent transformer with external control signals could enable finer-grained editing of motion and appearance.
  • The efficiency gains suggest similar token-decomposition patterns could help diffusion models on other high-dimensional sequences such as 3D point clouds.
  • If the best-practice findings generalize, future work could standardize a small set of transformer blocks for video diffusion rather than designing new ones from scratch.

Load-bearing premise

The performance improvements come from the proposed transformer architecture and chosen practices rather than from dataset-specific tuning or differences in experimental setup.

What would settle it

Reproducing the exact training protocol and baselines on one of the four datasets, with data splits and hyperparameters held identical, and then checking whether the reported gains in generation quality metrics (e.g., FVD) survive; finding no gain under matched conditions would undercut the claim.
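The metric such a replication would most likely compare is FVD: the Fréchet distance between Gaussians fitted to features of real and generated clips (conventionally from a pretrained I3D network). A minimal sketch, assuming the feature extraction has already happened upstream:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (num_clips, dim)
    feature arrays; computed on I3D video features this is FVD."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```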

read the original abstract

We propose Latte, a novel Latent Diffusion Transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results that are competitive with recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Latte, a latent diffusion model for video generation that extracts spatio-temporal tokens from input videos and processes them with a series of Transformer blocks in latent space. It introduces four efficient architectural variants based on different decompositions of spatial and temporal dimensions, selects best practices for patch embedding, timestep-class injection, temporal positional embeddings, and learning strategies via experimental analysis, and reports state-of-the-art results on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. The work also extends the model to text-to-video generation with competitive performance.

Significance. If the reported performance gains are shown to arise from the proposed token decomposition and Transformer blocks under matched experimental conditions, the paper would supply concrete evidence that Transformer-based diffusion models can scale effectively to video by handling large numbers of spatio-temporal tokens, offering practical design guidelines for future video generation architectures.

major comments (2)
  1. [§4] §4 (Experimental Setup and Results): The SOTA claims on the four datasets rest on comparisons whose validity depends on whether the baselines (prior diffusion and transformer video models) were re-implemented and re-tuned with the same hyperparameter search, data splits, and augmentations used for the four Latte variants. The text states that best practices were chosen via 'rigorous experimental analysis' for Latte, but does not explicitly confirm equivalent optimization for baselines; this asymmetry would prevent attribution of gains to the spatio-temporal decomposition.
  2. [§3.2] §3.2 (Model Variants): The four efficient variants are motivated by decomposing spatial and temporal dimensions, yet the results section provides no per-variant ablation isolating which decomposition (e.g., spatial-first vs. temporal-first) drives the reported metric improvements. Without these controls, it is unclear whether any single architectural change is load-bearing for the central performance claim.
minor comments (2)
  1. [§3] Notation for the four variants is introduced without a compact summary table; adding one would improve readability when comparing their token counts and FLOPs.
  2. [§5] The text-to-video extension is described only briefly; a short paragraph or table contrasting the T2V metrics with the most recent published numbers would strengthen the claim of competitiveness.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and detailed feedback. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: §4 (Experimental Setup and Results): The SOTA claims on the four datasets rest on comparisons whose validity depends on whether the baselines (prior diffusion and transformer video models) were re-implemented and re-tuned with the same hyperparameter search, data splits, and augmentations used for the four Latte variants. The text states that best practices were chosen via 'rigorous experimental analysis' for Latte, but does not explicitly confirm equivalent optimization for baselines; this asymmetry would prevent attribution of gains to the spatio-temporal decomposition.

    Authors: We agree that explicit documentation of matched conditions is necessary for clear attribution. All models were trained on the same data splits with identical augmentations. Baselines followed the hyperparameter settings from their original papers for reproducibility, while Latte incorporated additional tuning from our best-practice experiments. We will revise §4 to include an explicit statement confirming the shared setup and add a supplementary table summarizing configurations across methods. This addresses the concern directly. revision: yes

  2. Referee: §3.2 (Model Variants): The four efficient variants are motivated by decomposing spatial and temporal dimensions, yet the results section provides no per-variant ablation isolating which decomposition (e.g., spatial-first vs. temporal-first) drives the reported metric improvements. Without these controls, it is unclear whether any single architectural change is load-bearing for the central performance claim.

    Authors: We appreciate this observation. The manuscript reports results for the best variant after evaluating all four during development. To isolate contributions, we will add a dedicated ablation table in the revised results section reporting FVD and other metrics for each of the four variants on the primary datasets. This will clarify the relative impact of the different decomposition strategies. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with SOTA claims resting on dataset evaluations, not derivations or self-referential fits

full rationale

The paper introduces Latte as a latent diffusion transformer, describes four efficient variants for spatio-temporal token decomposition, selects best practices via experimental analysis, and reports SOTA results on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD plus competitive T2V extension. No equations, first-principles derivations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear in the provided text. All load-bearing claims are empirical comparisons; no step reduces by construction to its own inputs or prior self-citations. The derivation chain is self-contained as standard model design plus benchmarking.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The work rests on standard assumptions of latent diffusion models and vision transformers; no new physical entities are postulated. Free parameters include the usual collection of model sizes, learning rates, and embedding dimensions that are tuned during training.

free parameters (1)
  • model hyperparameters and embedding dimensions
    Standard training-time choices that control capacity and are fitted to the video datasets.
axioms (2)
  • domain assumption: Latent diffusion models can faithfully model video distributions when tokens are extracted from input videos
    Core premise of the latent-space approach stated in the abstract.
  • domain assumption: Transformer blocks can effectively capture spatio-temporal dependencies once tokens are properly embedded
    Relies on prior success of transformers in vision and sequence modeling.

pith-pipeline@v0.9.0 · 5505 in / 1316 out tokens · 63365 ms · 2026-05-13T21:41:19.568483+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

    cs.CV 2026-05 unverdicted novelty 7.0

    HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

  2. ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

    cs.LG 2026-04 unverdicted novelty 7.0

    ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.

  3. MultiAnimate: Pose-Guided Image Animation Made Extensible

    cs.CV 2026-02 unverdicted novelty 7.0

    MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.

  4. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    cs.CV 2024-07 unverdicted novelty 7.0

    OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.

  5. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  6. FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

    cs.CV 2026-05 unverdicted novelty 6.0

    FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.

  7. DiffATS: Diffusion in Aligned Tensor Space

    cs.LG 2026-05 unverdicted novelty 6.0

    DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...

  8. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.

  9. TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.

  10. AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

  11. VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.

  12. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  13. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    cs.CV 2024-10 unverdicted novelty 6.0

    Aligning noisy hidden states in diffusion transformers to clean features from pretrained visual encoders speeds up training over 17x and reaches FID 1.42.

  14. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  15. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  16. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  17. Video Generation with Predictive Latents

    cs.CV 2026-05 unverdicted novelty 5.0

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  18. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

  19. Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

  20. Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity

    cs.LG 2026-04 unverdicted novelty 5.0

    Local optimization on token windows plus a continuity loss lets autoregressive video models train on fewer frames with less error accumulation, cutting training cost in half while matching baseline quality.

  21. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks.

  22. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  23. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  24. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

  25. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

  26. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 25 Pith papers · 8 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  3. [3]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

  4. [4]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.

  5. [5]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  6. [6]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  7. [7]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-Sora Plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

  8. [8]

    Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Cinemo: Consistent and controllable image animation with motion diffusion models. arXiv preprint arXiv:2407.15642, 2024.

  9. [9]

    On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in FID calculation. arXiv preprint arXiv:2104.11222, 2021.

  10. [10]

    Zero-Shot Image-to-Image Translation

    Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM Special Interest Group on Graphics and Interactive Techniques Conference, pp. 1–11, 2023.

  11. [11]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.

  12. [12]

    FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces

    Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018.

  13. [13]

    First Order Motion Model for Image Animation

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. In Neural Information Processing Systems, 32, 2019.

  14. [14]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

  15. [15]

    G3an: Disentangling appearance and motion for video generation

    Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. G3AN: Disentangling appearance and motion for video generation. In Computer Vision and Pattern Recognition, pp. 5264–5273, 2020.

  16. [16]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021.

  17. [17]

    Internal anchor: Appendix

    A.1 provides sampled video frames from the compared methods (Fig. 9), A.2 shows the structure of S-AdaLN (Fig. 10), and A.3 discusses the difference from concurrent works, noting that a similar idea was explored in VDT (Lu et al., 2024) and others.

  18. [18]

    Internal anchor: discussion of concurrent works

    VDT primarily focuses on various video tasks, including image-to-video generation and unconditional video generation, using a mask learning strategy. GenTron and W.A.L.T mainly focus on general purposes, i.e., text-to-video and text-to-image generation. Open-Sora Plan and HunyuanVideo focus on large-scale, open-source video generation.