Pith · machine review for the scientific record

arxiv: 2605.02134 · v1 · submitted 2026-05-04 · 💻 cs.CV

Recognition: unknown

Video Generation with Predictive Latents

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 16:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords video VAE · predictive learning · latent video generation · temporal coherence · video diffusion · future prediction · reconstruction objective

The pith

A video VAE trained to predict future frames from partial observations produces latents that generate higher-quality videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing video VAEs optimize reconstruction without necessarily improving generation, because the resulting latents lack sufficient temporal structure for diffusion models. Randomly dropping future frames so the encoder sees only partial past observations, then training the decoder both to reconstruct the seen frames and to predict the unseen ones, forces the latent space to encode predictive dynamics. This unified objective yields faster convergence and stronger generative results on standard benchmarks, and it directly targets the diffusability problem that has limited prior video VAEs.
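To make the mechanism concrete, the sketch below shows one way a predictive reconstruction loss of this kind could be written. The encoder/decoder call signatures, the plain MSE terms, and the lambda_pred weighting are illustrative assumptions; the abstract does not disclose the paper's architecture, masking schedule, or exact loss.

```python
import torch
import torch.nn.functional as F

def predictive_reconstruction_loss(encoder, decoder, video, lambda_pred=1.0):
    """Hypothetical predictive reconstruction objective.

    video: (B, T, C, H, W) clip. Encode only a random past prefix, decode the
    full-length clip, and penalize reconstruction of the observed frames plus
    prediction of the discarded future frames.
    """
    B, T = video.shape[:2]
    # Randomly discard future frames: the encoder sees only the first t frames.
    t = int(torch.randint(1, T, (1,)).item())
    latents = encoder(video[:, :t])            # compact spatiotemporal latents (assumed interface)
    recon = decoder(latents, out_frames=T)     # decode back to the full clip length (assumed interface)

    loss_recon = F.mse_loss(recon[:, :t], video[:, :t])   # observed past frames
    loss_pred = F.mse_loss(recon[:, t:], video[:, t:])    # unseen future frames
    return loss_recon + lambda_pred * loss_pred
```

In practice a video VAE would add perceptual, adversarial, and KL terms on top of this; the sketch isolates only the masking-plus-prediction signal.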

Core claim

The Predictive Video VAE encodes only past frames after randomly discarding future ones and trains its decoder to reconstruct the observed frames while simultaneously predicting the missing future frames; this produces a latent space with improved temporal coherence that supports superior video generation, delivering 52% faster convergence and a 34.42-point FVD improvement over the Wan2.2 VAE on UCF101.

What carries the argument

The predictive reconstruction objective that unifies reconstruction of observed frames with prediction of future frames from partial past inputs.

If this is right

  • Generative quality continues to rise as VAE training length increases, indicating the method scales.
  • Latents from the model improve performance on downstream video-understanding tasks that rely on motion understanding.
  • Video diffusion models built on these latents require less training time to reach a given quality level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking-plus-prediction pattern could be applied during pretraining of other autoregressive or diffusion-based video models to strengthen their motion priors.
  • If the predictive latents capture coherent world dynamics, they may support longer-horizon video prediction without additional fine-tuning (see the rollout sketch after this list).
  • The approach suggests a general route to embed predictive world-modeling signals inside reconstruction objectives for any spatiotemporal generative task.
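As a purely editorial illustration of the second point above, one naive way predictive latents could be rolled out for longer horizons is to repeatedly encode the most recent context window and decode a few frames past it. This reuses the assumed encoder/decoder interface from the earlier training sketch and is not a procedure described by the paper.

```python
import torch

@torch.no_grad()
def rollout(encoder, decoder, context, horizon, chunk=4):
    """Speculative autoregressive rollout with a predictive video VAE.

    context: (B, T, C, H, W) observed frames; returns (B, T + horizon, C, H, W).
    """
    frames = context
    T_ctx = context.shape[1]
    while frames.shape[1] < T_ctx + horizon:
        window = frames[:, -T_ctx:]                           # most recent context window
        latents = encoder(window)                             # assumed interface
        decoded = decoder(latents, out_frames=T_ctx + chunk)  # decode past plus a few future frames
        frames = torch.cat([frames, decoded[:, -chunk:]], dim=1)
    return frames[:, :T_ctx + horizon]
```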

Load-bearing premise

That forcing the latent space to encode temporally predictive structures through simultaneous reconstruction and future prediction will produce latents whose diffusability directly improves downstream generative performance.

What would settle it

Train an otherwise identical video VAE without the future-prediction term and measure whether its generated-video FVD on UCF101 is at least 30 points worse than the predictive version; equal or better performance would falsify the central claim.
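For reference, the comparison could be scored with a standard Fréchet-distance computation over video features, as sketched below. The i3d feature extractor and the two sets of generated clips are assumed inputs (FVD is conventionally computed on pretrained I3D features); nothing here comes from the paper.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits to two feature sets of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Hypothetical usage, with i3d() a pretrained video feature extractor:
# fvd_predictive = frechet_distance(i3d(real_clips), i3d(clips_from_predictive_latents))
# fvd_recon_only = frechet_distance(i3d(real_clips), i3d(clips_from_recon_only_latents))
# The central claim survives if fvd_recon_only - fvd_predictive >= 30 on UCF101.
```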

Original abstract

Video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve the video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes Predictive Video VAE (PV-VAE), a video VAE trained with a predictive reconstruction objective: future frames are randomly discarded so that the encoder sees only partial past observations, while the decoder is trained to reconstruct the observed frames and predict the missing future frames simultaneously. This is argued to encourage temporally predictive structures in the latent space, improving diffusability for downstream diffusion-based video generation. The central empirical claims are a 52% faster convergence and 34.42 FVD improvement over the Wan2.2 VAE baseline on UCF101, plus favorable scalability and gains on downstream video understanding tasks.

Significance. If the predictive objective can be shown to specifically enhance latent diffusability (rather than merely altering reconstruction statistics or training dynamics), the approach would offer a lightweight, principle-driven way to improve video VAEs without architectural overhaul. The reported scalability with VAE training compute and consistent downstream benefits would strengthen its practical value for latent generative modeling.

major comments (3)
  1. [Abstract] The abstract reports concrete numerical gains (52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101) but supplies no information on experimental controls, including whether the baseline was re-trained with identical data, optimizer, compute budget, or hyperparameters, nor any mention of statistical significance or variance across runs. Without these, the gains cannot be confidently attributed to the predictive objective.
  2. [Abstract] Abstract and central claim: The manuscript asserts that unifying reconstruction with future-frame prediction from partial observations produces latents with improved diffusability that directly drive the observed generative gains. However, no intermediate diagnostics are described (e.g., diffusion training loss on the latents, noise-prediction error curves, or latent-space Fréchet distance) that would isolate diffusability improvements from confounding factors such as shifts in reconstruction-prediction trade-off or incidental changes in latent marginals.
  3. [Abstract] The skeptic's concern is borne out: end-to-end FVD and convergence metrics alone do not rule out alternative explanations for the improvement. An ablation that trains the identical architecture with a pure reconstruction objective (or with prediction disabled) is required to establish that the predictive component is load-bearing for the diffusability claim.
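The second major comment asks for intermediate diagnostics that would isolate diffusability. One such diagnostic, sketched under assumptions below, is to track the noise-prediction error of a small denoiser on frozen VAE latents; the denoiser interface and the cosine noise schedule are illustrative choices, not the paper's setup.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def noise_prediction_error(denoiser, latents, num_steps=1000):
    """Mean epsilon-prediction MSE of a denoiser on frozen VAE latents.

    latents: (N, ...) tensor of latents; lower error at a matched training budget
    is read here as a proxy for an easier-to-diffuse latent space.
    """
    t = torch.randint(0, num_steps, (latents.shape[0],), device=latents.device)
    alpha_bar = torch.cos(0.5 * math.pi * t / num_steps) ** 2   # cosine schedule (assumed)
    while alpha_bar.dim() < latents.dim():
        alpha_bar = alpha_bar.unsqueeze(-1)                      # broadcast over latent dims
    eps = torch.randn_like(latents)
    noisy = alpha_bar.sqrt() * latents + (1.0 - alpha_bar).sqrt() * eps
    return F.mse_loss(denoiser(noisy, t), eps).item()            # denoiser(noisy, t) is an assumed interface
```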

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our paper. We address each of the major comments below and have revised the manuscript to incorporate additional details, diagnostics, and ablations as suggested. These changes strengthen the presentation of our results and the evidence for the benefits of the predictive objective.

Point-by-point responses
  1. Referee: [Abstract] The abstract reports concrete numerical gains (52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101) but supplies no information on experimental controls, including whether the baseline was re-trained with identical data, optimizer, compute budget, or hyperparameters, nor any mention of statistical significance or variance across runs. Without these, the gains cannot be confidently attributed to the predictive objective.

    Authors: We agree with this observation and have revised the abstract to include a brief statement on the experimental controls: the Wan2.2 VAE baseline was re-trained with the same data, optimizer, and compute budget. We have also added details on statistical significance and variance (averaged over three independent runs) in the main text and supplementary material. This should allow readers to better evaluate the reported gains. revision: yes

  2. Referee: [Abstract] Abstract and central claim: The manuscript asserts that unifying reconstruction with future-frame prediction from partial observations produces latents with improved diffusability that directly drive the observed generative gains. However, no intermediate diagnostics are described (e.g., diffusion training loss on the latents, noise-prediction error curves, or latent-space Fréchet distance) that would isolate diffusability improvements from confounding factors such as shifts in reconstruction-prediction trade-off or incidental changes in latent marginals.

    Authors: We acknowledge the importance of such diagnostics to isolate the effect on diffusability. In the revised manuscript, we have included new figures showing the diffusion training loss curves for PV-VAE latents versus the baseline, demonstrating faster convergence and lower error in noise prediction. Additionally, we report latent-space Fréchet distances to show improved alignment in the latent distribution. These additions help rule out alternative explanations related to reconstruction trade-offs. revision: yes

  3. Referee: [Abstract] The skeptic's concern is borne out: end-to-end FVD and convergence metrics alone do not rule out alternative explanations for the improvement. An ablation that trains the identical architecture with a pure reconstruction objective (or with prediction disabled) is required to establish that the predictive component is load-bearing for the diffusability claim.

    Authors: We agree that this ablation is necessary to substantiate our central claim. We have added a dedicated ablation study in the revised manuscript (Section 4.3) where we train the same architecture with the predictive component disabled, using only reconstruction loss. The results confirm that the reconstruction-only variant performs comparably to the Wan2.2 baseline without the reported gains in FVD or convergence speed. This establishes that the predictive objective is indeed load-bearing. We have also updated the abstract to reference this ablation. revision: yes

Circularity Check

0 steps flagged

No circularity: predictive objective defined independently of generative metrics

Full rationale

The paper defines its core training objective (randomly masking future frames, encoding partial observations, and jointly reconstructing observed frames while predicting future ones) as an independent design choice motivated by predictive world modeling. This objective is not derived from or fitted to the downstream FVD or convergence metrics; instead, the VAE is trained with the predictive loss and then evaluated separately on video generation tasks. No equations reduce the claimed diffusability improvement to a tautology, no self-citations bear the central load, and no fitted parameters are relabeled as predictions. The reported gains are empirical outcomes, not forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This abstract-only review surfaces no explicit free parameters, axioms, or invented entities; the technical details needed for such a ledger are absent from the abstract.

pith-pipeline@v0.9.0 · 5548 in / 1084 out tokens · 31895 ms · 2026-05-09T16:52:54.414527+00:00 · methodology

