
arxiv: 2401.03048 · v3 · submitted 2024-01-05 · 💻 cs.CV

Latte: Latent Diffusion Transformer for Video Generation

Pith reviewed 2026-05-13 21:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · latent diffusion · transformer · spatio-temporal tokens · diffusion models · text-to-video · UCF101 · state-of-the-art

The pith

Latte generates higher-quality videos by running a transformer on latent spatio-temporal tokens with decomposed dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Latte as a diffusion model that first compresses videos into latent representations, extracts tokens carrying both spatial and temporal information, and then processes those tokens with transformer blocks. To keep the computation feasible when token counts grow large, it offers four variants that separate the spatial and temporal axes at different stages. The authors run systematic ablations to settle on the strongest choices for patch embedding, model variant, timestep-class injection, temporal positional embedding, and learning strategies. When these pieces are combined, the resulting model produces videos that surpass previous methods on four established benchmarks covering faces, time-lapse scenes, human actions, and Tai Chi motions, and it also performs competitively when extended to text-conditioned generation.

Core claim

Latte extracts spatio-temporal tokens from input videos and models their distribution in latent space using a series of transformer blocks. Four efficient variants are introduced by decomposing the spatial and temporal dimensions of the tokens. Rigorous experiments identify the best practices for video clip patch embedding, model architecture choice, timestep-class injection, temporal positional embedding, and learning strategies, enabling state-of-the-art performance on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, along with competitive results on text-to-video tasks.
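
To make the token bookkeeping behind this claim concrete, here is a minimal sketch. It is not the authors' code. It assumes a latent clip of 16 frames on a 32 x 32 grid with non-overlapping 2 x 2 patches per frame, and the helper names (`token_grid`, `spatial_groups`, `temporal_groups`) are hypothetical. It shows how one set of spatio-temporal tokens can be regrouped so that attention runs either within a frame or across frames at one patch location.

```python
# Illustrative sketch only: the sizes, patch scheme, and function names are
# assumptions made for this example, not values or code from the paper.

def token_grid(frames, height, width, patch):
    """Return a (frame, row, col) id for every token of a latent clip."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide the latent height and width")
    rows, cols = height // patch, width // patch
    return [(t, r, c) for t in range(frames) for r in range(rows) for c in range(cols)]

def spatial_groups(tokens):
    """Group tokens by frame: each group is one sequence for spatial attention."""
    groups = {}
    for t, r, c in tokens:
        groups.setdefault(t, []).append((t, r, c))
    return list(groups.values())

def temporal_groups(tokens):
    """Group tokens by patch location: each group is one sequence for temporal attention."""
    groups = {}
    for t, r, c in tokens:
        groups.setdefault((r, c), []).append((t, r, c))
    return list(groups.values())

tokens = token_grid(frames=16, height=32, width=32, patch=2)
spatial = spatial_groups(tokens)
temporal = temporal_groups(tokens)

# Total tokens, then (number of sequences, sequence length) for each grouping.
print(len(tokens), (len(spatial), len(spatial[0])), (len(temporal), len(temporal[0])))
```

Under these assumed sizes the clip yields 4096 tokens, which regroup into 16 spatial sequences of 256 tokens or 256 temporal sequences of 16 tokens. Both groupings cover exactly the same tokens; only the sequences over which attention is computed differ.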

What carries the argument

Transformer blocks applied to spatio-temporal tokens in latent space, with four decomposition variants that separate spatial and temporal processing to manage token volume efficiently.
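
A rough worked example of why separating the axes helps with token volume: the sketch below counts query-key pairs in self-attention only, under assumed sizes (16 frames, 256 tokens per frame). The function names are hypothetical and the numbers are illustrative, not measurements from the paper. It compares two blocks of full attention over all tokens with one spatial block followed by one temporal block.

```python
# Worked arithmetic under assumed sizes; it counts attention pairs only and
# ignores projections and MLP layers, whose cost is linear in the token count.

def joint_pairs(frames, tokens_per_frame, blocks=2):
    """Query-key pairs when every token attends to every token in each block."""
    n = frames * tokens_per_frame
    return blocks * n * n

def factorized_pairs(frames, tokens_per_frame):
    """Query-key pairs for one spatial block plus one temporal block."""
    spatial = frames * tokens_per_frame ** 2    # per-frame attention, once per frame
    temporal = tokens_per_frame * frames ** 2   # per-location attention, once per location
    return spatial + temporal

frames, tokens_per_frame = 16, 256
joint = joint_pairs(frames, tokens_per_frame)
factored = factorized_pairs(frames, tokens_per_frame)
print(joint, factored, round(joint / factored, 1))
```

With these assumed sizes the factorized pair of blocks evaluates roughly 30 times fewer attention pairs than two joint blocks. The gap widens as either the frame count or the per-frame token count grows, which is the sense in which decomposition keeps large token volumes manageable.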

If this is right

  • Video diffusion models can handle larger token counts at lower attention cost by decomposing the spatial and temporal dimensions of the tokens.
  • Careful design of timestep injection and temporal positional embeddings measurably improves sample quality in transformer-based diffusion.
  • The same latent-token transformer backbone supports both unconditional and text-conditioned video generation.
  • Insights from the ablation study on embedding and learning strategies can be reused in other diffusion transformer architectures.
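
The timestep-injection point in the list above can be illustrated with a generic adaptive layer norm, a common way to condition diffusion transformers. This is a sketch under stated assumptions: it shows the general scale-and-shift modulation idea, not the paper's specific design, and the function names, weights, and the two-dimensional conditioning vector are all hypothetical.

```python
import math

# Generic adaptive-layer-norm modulation, written in plain Python for clarity.
# Illustrative only: this is not the paper's injection mechanism or code.

def layer_norm(values, eps=1e-6):
    """Normalize one token vector to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def linear(weights, bias, cond):
    """Map the conditioning vector to one value per token channel."""
    return [sum(w * c for w, c in zip(row, cond)) + b for row, b in zip(weights, bias)]

def adaptive_layer_norm(token, cond, w_scale, b_scale, w_shift, b_shift):
    """Normalize the token, then scale and shift it using the conditioning."""
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(token)
    return [n * (1.0 + s) + h for n, s, h in zip(normed, scale, shift)]

token = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]  # stand-in for a combined timestep and class embedding
zeros = [[0.0, 0.0] for _ in token]

# With zero weights and biases the layer reduces to a plain layer norm.
unmodulated = adaptive_layer_norm(token, cond, zeros, [0.0] * 4, zeros, [0.0] * 4)
# A shift bias of 1.0 moves every normalized channel up by one.
shifted = adaptive_layer_norm(token, cond, zeros, [0.0] * 4, zeros, [1.0] * 4)
print(unmodulated)
print(shifted)
```

The design point is that the conditioning signal changes the statistics of every token instead of being appended as an extra token, so it adds no sequence length. Whether this matches the injection method the paper selects is not stated in this summary.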

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decomposition approach may extend to longer or higher-resolution videos by further factoring the temporal axis.
  • Combining the latent transformer with external control signals could enable finer-grained editing of motion and appearance.
  • The efficiency gains suggest similar token-decomposition patterns could help diffusion models on other high-dimensional sequences such as 3D point clouds.
  • If the best-practice findings generalize, future work could standardize a small set of transformer blocks for video diffusion rather than designing new ones from scratch.

Load-bearing premise

The performance improvements come from the proposed transformer architecture and chosen practices rather than from dataset-specific tuning or differences in experimental setup.

What would settle it

Reproducing the exact training protocol and baselines on one of the four datasets while keeping data splits and hyperparameters identical, then measuring no gain in generation quality metrics.

Original abstract

We propose Latte, a novel Latent Diffusion Transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results that are competitive with recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Latte, a latent diffusion model for video generation that extracts spatio-temporal tokens from input videos and processes them with a series of Transformer blocks in latent space. It introduces four efficient architectural variants based on different decompositions of spatial and temporal dimensions, selects best practices for patch embedding, timestep-class injection, temporal positional embeddings, and learning strategies via experimental analysis, and reports state-of-the-art results on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. The work also extends the model to text-to-video generation with competitive performance.

Significance. If the reported performance gains are shown to arise from the proposed token decomposition and Transformer blocks under matched experimental conditions, the paper would supply concrete evidence that Transformer-based diffusion models can scale effectively to video by handling large numbers of spatio-temporal tokens, offering practical design guidelines for future video generation architectures.

major comments (2)
  1. [§4] §4 (Experimental Setup and Results): The SOTA claims on the four datasets rest on comparisons whose validity depends on whether the baselines (prior diffusion and transformer video models) were re-implemented and re-tuned with the same hyperparameter search, data splits, and augmentations used for the four Latte variants. The text states that best practices were chosen via 'rigorous experimental analysis' for Latte, but does not explicitly confirm equivalent optimization for baselines; this asymmetry would prevent attribution of gains to the spatio-temporal decomposition.
  2. [§3.2] §3.2 (Model Variants): The four efficient variants are motivated by decomposing spatial and temporal dimensions, yet the results section provides no per-variant ablation isolating which decomposition (e.g., spatial-first vs. temporal-first) drives the reported metric improvements. Without these controls, it is unclear whether any single architectural change is load-bearing for the central performance claim.
minor comments (2)
  1. [§3] Notation for the four variants is introduced without a compact summary table; adding one would improve readability when comparing their token counts and FLOPs.
  2. [§5] The text-to-video extension is described only briefly; a short paragraph or table contrasting the T2V metrics with the most recent published numbers would strengthen the claim of competitiveness.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and detailed feedback. We address each major comment point by point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: §4 (Experimental Setup and Results): The SOTA claims on the four datasets rest on comparisons whose validity depends on whether the baselines (prior diffusion and transformer video models) were re-implemented and re-tuned with the same hyperparameter search, data splits, and augmentations used for the four Latte variants. The text states that best practices were chosen via 'rigorous experimental analysis' for Latte, but does not explicitly confirm equivalent optimization for baselines; this asymmetry would prevent attribution of gains to the spatio-temporal decomposition.

    Authors: We agree that explicit documentation of matched conditions is necessary for clear attribution. All models were trained on the same data splits with identical augmentations. Baselines followed the hyperparameter settings from their original papers for reproducibility, while Latte incorporated additional tuning from our best-practice experiments. We will revise §4 to include an explicit statement confirming the shared setup and add a supplementary table summarizing configurations across methods. This addresses the concern directly. revision: yes

  2. Referee: §3.2 (Model Variants): The four efficient variants are motivated by decomposing spatial and temporal dimensions, yet the results section provides no per-variant ablation isolating which decomposition (e.g., spatial-first vs. temporal-first) drives the reported metric improvements. Without these controls, it is unclear whether any single architectural change is load-bearing for the central performance claim.

    Authors: We appreciate this observation. The manuscript reports results for the best variant after evaluating all four during development. To isolate contributions, we will add a dedicated ablation table in the revised results section reporting FVD and other metrics for each of the four variants on the primary datasets. This will clarify the relative impact of the different decomposition strategies. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with SOTA claims resting on dataset evaluations, not derivations or self-referential fits

Full rationale

The paper introduces Latte as a latent diffusion transformer, describes four efficient variants for spatio-temporal token decomposition, selects best practices via experimental analysis, and reports SOTA results on FaceForensics, SkyTimelapse, UCF101, and Taichi-HD plus competitive T2V extension. No equations, first-principles derivations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear in the provided text. All load-bearing claims are empirical comparisons; no step reduces by construction to its own inputs or prior self-citations. The derivation chain is self-contained as standard model design plus benchmarking.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The work rests on standard assumptions of latent diffusion models and vision transformers; no new physical entities are postulated. Free parameters include the usual collection of model sizes, learning rates, and embedding dimensions that are tuned during training.

free parameters (1)
  • model hyperparameters and embedding dimensions
    Standard training-time choices that control capacity and are fitted to the video datasets.
axioms (2)
  • domain assumption Latent diffusion models can faithfully model video distributions when tokens are extracted from input videos
    Core premise of the latent-space approach stated in the abstract.
  • domain assumption Transformer blocks can effectively capture spatio-temporal dependencies once tokens are properly embedded
    Relies on prior success of transformers in vision and sequence modeling.

pith-pipeline@v0.9.0 · 5505 in / 1316 out tokens · 63365 ms · 2026-05-13T21:41:19.568483+00:00 · methodology


Forward citations

Cited by 48 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

  2. iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.

  3. Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

    cs.CV 2026-05 unverdicted novelty 7.0

    Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...

  4. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...

  5. StreamingEffect: Real-Time Human-Centric Video Effect Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.

  6. Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video ...

  7. HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

    cs.CV 2026-05 unverdicted novelty 7.0

    HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

  8. ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

    cs.LG 2026-04 unverdicted novelty 7.0

    ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.

  9. MultiAnimate: Pose-Guided Image Animation Made Extensible

    cs.CV 2026-02 unverdicted novelty 7.0

    MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.

  10. VABench: A Comprehensive Benchmark for Audio-Video Generation

    cs.CV 2025-12 unverdicted novelty 7.0

    VABench is a new multi-dimensional benchmark for evaluating synchronous audio-video generation across text-to-AV, image-to-AV, and stereo tasks.

  11. Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models

    cs.CV 2025-10 unverdicted novelty 7.0

    Introduces noise aggregation analysis with single-step small-noise injection to enable efficient and accurate membership inference attacks on diffusion models.

  12. History-Guided Video Diffusion

    cs.LG 2025-02 unverdicted novelty 7.0

    DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.

  13. Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

    cs.CV 2024-11 unverdicted novelty 7.0

    VideoRepair detects text-video misalignments via MLLM-generated questions and performs localized, region-preserving refinement to improve alignment in existing T2V diffusion models.

  14. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    cs.CV 2024-07 unverdicted novelty 7.0

    OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.

  15. SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public...

  16. Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    Flash-GRPO introduces iso-temporal grouping and temporal gradient rectification to enable single-step GRPO training that outperforms full-trajectory methods on video diffusion alignment under low compute budgets.

  17. ReactiveGWM: Steering NPC in Reactive Game World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.

  18. Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

    cs.CV 2026-05 unverdicted novelty 6.0

    Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

  19. FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

    cs.CV 2026-05 unverdicted novelty 6.0

    FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.

  20. DiffATS: Diffusion in Aligned Tensor Space

    cs.LG 2026-05 unverdicted novelty 6.0

    DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...

  21. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.

  22. TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.

  23. AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

  24. VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.

  25. Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

    cs.CV 2026-02 unverdicted novelty 6.0

    Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

  26. RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

    cs.CV 2025-10 unverdicted novelty 6.0

    RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

  27. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  28. Long-Context Autoregressive Video Modeling with Next-Frame Prediction

    cs.CV 2025-03 unverdicted novelty 6.0

    FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.

  29. Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

    cs.CV 2024-11 unverdicted novelty 6.0

    LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duratio...

  30. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    cs.CV 2024-10 unverdicted novelty 6.0

    Aligning noisy hidden states in diffusion transformers to clean features from pretrained visual encoders speeds up training over 17x and reaches FID 1.42.

  31. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  32. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  33. CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

    cs.CV 2024-06 unverdicted novelty 6.0

    CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.

  34. PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

    cs.CV 2024-05 unverdicted novelty 6.0

    PipeFusion applies patch partitioning and pipeline parallelism with one-step stale feature reuse to reduce communication overhead in DiT inference, reporting SOTA results on 8x L40 GPUs for Pixart, SD3, and Flux.1.

  35. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  36. Video Generation with Predictive Latents

    cs.CV 2026-05 unverdicted novelty 5.0

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  37. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

  38. Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

  39. Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity

    cs.LG 2026-04 unverdicted novelty 5.0

    Local optimization on token windows plus a continuity loss lets autoregressive video models train on fewer frames with less error accumulation, cutting training cost in half while matching baseline quality.

  40. Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

    cs.CV 2026-02 unverdicted novelty 5.0

    Prompt reinjection alleviates progressive forgetting of text prompt semantics in MMDiT text branches, producing consistent improvements in text-to-image instruction following on GenEval, DPG, and T2I-CompBench++.

  41. DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment

    cs.RO 2025-04 unverdicted novelty 5.0

    DriVerse is a generative model that simulates driving scenes from an image and trajectory using multimodal prompting and motion alignment, achieving better performance on nuScenes and Waymo datasets with minimal training.

  42. Wan: Open and Advanced Large-Scale Video Generative Models

    cs.CV 2025-03 unverdicted novelty 5.0

    Wan releases open 1.3B and 14B video diffusion models claiming superior performance over open-source and commercial baselines across multiple tasks with consumer-grade efficiency.

  43. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...

  44. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  45. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  46. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

  47. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

  48. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 47 Pith papers · 8 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575,

  2. [2]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a. Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja...

  3. [3]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Published in Transactions on Machine Learning Research (03/2025) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on ...

  4. [4]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221,

  5. [5]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  6. [6]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603,

  7. [7]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131,

  8. [8]

    Cinemo: Consistent and controllable image animation with motion diffusion models

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Cinemo: Consistent and controllable image animation with motion diffusion models. arXiv preprint arXiv:2407.15642, 2024a. Ziyu Ma, Shutao Li, Bin Sun, Jianfei Cai, Zuxiang Long, and Fuyan Ma. Gerea: Question-aware prompt captions for knowledge-based visual question ...

  9. [9]

    On buggy resizing libraries and surprising subtleties in FID calculation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222, 5:14,

  10. [10]

    Zero-shot image-to-image translation

    Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM Special Interest Group on Graphics and Interactive Techniques Conference, pp. 1–11,

  11. [11]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3,

  12. [12]

    FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces

    Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179,

  13. [13]

    First order motion model for image animation

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Neural Information Processing Systems, 32,

  14. [14]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717,

  15. [15]

    G3an: Disentangling appearance and motion for video generation

    Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. G3an: Disentangling appearance and motion for video generation. In Computer Vision and Pattern Recognition, pp. 5264–5273, 2020a. Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. Imaginator: Conditional spatio-temporal gan for video generation. In Winter Conference o...

  16. [16]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157,

  17. [17]

    9 of Appendix

    A Appendix A.1 The sampled video frames We provide the sampled video frames of different methods as shown in Fig. 9 of Appendix. A.2 The structure of S-AdaLN In Fig. 10 of Appendix, we show the structure of S-AdaLN. A.3 Discussion about the difference from concurrent works A similar idea has been explored in recent concurrent work VDT Lu et al. (2024), Ge...

  18. [18]

    GenTron and W.A.L.T mainly focus on general purposes, i.e., text-to-video generation and text-to-image generation

    VDT primarily focuses on generating various video tasks, including image-to-video generation and unconditional video generation, utilizing a mask learning strategy. GenTron and W.A.L.T mainly focus on general purposes, i.e., text-to-video generation and text-to-image generation. Open-Sora Plan and HunyuanVideo focus on large-scale, open-source video gen...