pith. machine review for the scientific record.

arxiv: 2604.16479 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

Latent-Compressed Variational Autoencoder for Video Diffusion Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video VAE · latent compression · high-frequency removal · video diffusion models · reconstruction quality · latent diffusion · compression ratio · generative video

The pith

Removing high-frequency components from video latent representations improves reconstruction quality in variational autoencoders at fixed compression ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video variational autoencoders compress input videos into lower-dimensional latent spaces so that diffusion models can generate new videos more efficiently. Standard designs increase the number of latent channels to reach acceptable reconstruction fidelity, yet this excess dimensionality slows the diffusion model's training convergence and degrades its final output quality. The paper instead applies compression by stripping high-frequency content from the existing latent representations. This keeps the total compression ratio unchanged while delivering higher-fidelity video reconstructions than channel-reduction baselines. Readers should care because the method removes a practical obstacle between faithful encoding and effective generative modeling of video.

Core claim

The authors establish that a latent compression method which removes high-frequency components in video latent representations, rather than directly reducing the number of channels, achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio. This directly tackles the observed conflict where high channel counts support good VAE reconstruction yet impair downstream diffusion performance.

What carries the argument

High-frequency removal applied directly to video latent representations, which discards selected frequency components to compress the latent tensor without lowering its channel dimension.
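
To make this concrete, the following is a minimal sketch of frequency-selective latent compression, assuming a single-level 3D Haar transform in which the three subband letters index the temporal, height, and width axes. The paper's Multi-WT (Figure 9) adds further temporal decomposition stages and its letters index those stages instead, so this illustrates the idea rather than the authors' code; shapes and function names are assumptions.

    import numpy as np

    def haar_split(x, axis):
        """One Haar analysis step along `axis`: returns (low, high) halves."""
        even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
        odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
        return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

    def wt3d(z):
        """3D Haar transform of a latent (C, T, H, W) into eight subbands
        keyed 'LLL'..'HHH', letters ordered (temporal, height, width)."""
        bands = {"": z}
        for axis in (1, 2, 3):
            split = {}
            for key, v in bands.items():
                lo, hi = haar_split(v, axis)
                split[key + "L"], split[key + "H"] = lo, hi
            bands = split
        return bands

    def compress_latent(z, keep=("LLL", "LLH", "LHL", "HLL")):
        """Zero every subband outside `keep`. The four retained subbands
        hold half the coefficients, so the effective latent size halves
        while the channel count of z is untouched."""
        return {k: (v if k in keep else np.zeros_like(v))
                for k, v in wt3d(z).items()}

    # Toy latent: 16 channels, 8 frames, 32x32 spatial (all dims even).
    subbands = compress_latent(np.random.randn(16, 8, 32, 32))

Diffusion would then operate only on the retained subbands; at sampling time the discarded subbands are restored as zeros and the inverse transform reconstructs the full latent (the zero-padding step in Figure 1).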

If this is right

  • Latent diffusion models receive higher-quality inputs and therefore train to stronger generative performance.
  • Video reconstruction remains more accurate at any given compression ratio.
  • The same memory and compute budget for the diffusion stage can be retained without sacrificing encoding fidelity.
  • Downstream tasks that depend on the latent space inherit the improved reconstruction without additional channel overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The frequency-selective approach may transfer to image or audio latent models facing similar channel-count versus fidelity trade-offs.
  • Combining high-frequency removal with existing quantization or pruning steps could produce further compression gains.
  • Empirical tests across motion-heavy versus static video datasets would clarify whether the benefit depends on content statistics.
  • The result suggests that latent-space dimensionality is less critical for perceptual quality than the distribution of energy across frequencies.

Load-bearing premise

High-frequency components in the latent space can be removed without losing the information required for high-fidelity video reconstruction or introducing artifacts that degrade the diffusion model's performance.

What would settle it

A controlled reconstruction experiment on held-out video sequences in which the proposed compressed latents yield lower PSNR or higher perceptual distortion than a baseline VAE that simply uses fewer channels at the identical compression ratio.
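
PSNR, the metric this test leans on, is straightforward to compute. A minimal sketch follows; decode_lc_vae and decode_fewer_channels are hypothetical stand-ins for the two decoders at a matched compression ratio, not functions from the paper:

    import numpy as np

    def psnr(reference, reconstruction, peak=1.0):
        """Peak signal-to-noise ratio in dB for videos scaled to [0, peak]."""
        ref = np.asarray(reference, dtype=np.float64)
        rec = np.asarray(reconstruction, dtype=np.float64)
        mse = np.mean((ref - rec) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

    # Hypothetical held-out comparison; the paper's claim fails if the
    # first score is consistently lower than the second:
    # psnr(clip, decode_lc_vae(clip)) vs. psnr(clip, decode_fewer_channels(clip))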

Figures

Figures reproduced from arXiv: 2604.16479 by Arno Solin, Jiarui Guan, Juho Kannala, Wenshuai Zhao, Zhengtao Zou.

Figure 1: Comparison between the schemes of video VAEs with and without the proposed latent compression. Our method performs frequency-aware latent compression for video generation. An input video is encoded and decomposed by multi-level 3D wavelet transforms (Multi-WT); low-frequency channels are retained as compact latent representations where diffusion operates. After denoising, the latent is zero-padded, process…
Figure 2: Validation PSNR curves of WF-VAE [27] for 4, 8, 16, and 32 latent channels. Increasing the number of channels yields only marginal PSNR gains, indicating substantial redundancy in the latent representation.
Figure 3: Energy and correlation distribution across frequencies. We visualize heatmaps of the normalized energy (left) and per-channel lag-1 temporal autocorrelation [40] (right) obtained by applying a 3D Haar wavelet transform to video latent representations encoded by WF-VAE [27]. Columns correspond to latent channels, and rows represent different frequency subbands. The visualization reveals that low-frequency…
Figure 4: Overview of our framework. The model first applies a multi-level wavelet transform (Multi-WT) to the latent features produced by the encoder. Low-frequency channels are then selected to retain compact yet informative representations in the wavelet domain, while the high-frequency subbands are zeroed out. During generation, diffusion operates within this favorable and compressed subspace. The sampled repres…
Figure 5: Validation performance during training. Across different compression ratios (Chn. = 4, 8, 16), our method consistently achieves higher PSNR than the baseline.
Figure 6: Generated videos using LC-VAE with Latte [31] on SkyTimelapse (top) and UCF-101 (bottom) datasets.
Figure 7: Qualitative comparison of reconstruction performance between LC-VAE and WF-VAE under the same compression ratios (equivalent channels).
Figure 8: Qualitative comparison between LC-VAE and WF-VAE (PTLC) at the same compression ratios (equivalent channels). WF-VAE (PTLC) exhibits noticeable artifacts, whereas LC-VAE trained with latent compression reconstructs videos accurately, highlighting the importance of integrating latent compression during autoencoder training.
Figure 9: Illustration of the proposed Multi-WT. A 3D WT is first applied to the latent z to obtain eight subbands; two successive Temporal WT stages then further decompose them. In the Multi-WT representation the three letters (e.g., LHL) index temporal decomposition stages rather than spatial axes as in Eq. (5). We retain only the low-frequency-dominant subbands (LLL, LLH, LHL, HLL) and zero out the rest; ⊕ d…
Figure 10: Overall energy distribution across wavelet subbands (WebVid-10M). Low-frequency subbands dominate, accounting for ∼85% of total energy.
Figure 12: …after 20k steps this scheme converges to the same subbands (LLL, LLH, LHL, HLL) as our fixed design, with a 98% channel overlap, empirically validating that our fixed mask closely approximates the data-driven optimum.
Figure 13: Visualization of low-frequency wavelet subbands. Low-frequency components exhibit smooth spatial variations and clear structural patterns, encoding the majority of semantic content. Diverse per-channel activation patterns suggest that each channel captures distinct semantic factors.
Figure 14: Visualization of high-frequency wavelet subbands. High-frequency components contain rapid local fluctuations with little channel-wise variation, resembling noise-like textures and contributing minimal semantic information.
Figure 15: Non-curated reconstruction on OpenVid-1M. LC-VAE (left) vs. WF-VAE (right).
Figure 16: Non-curated video generation on SkyTimelapse. Latte [31] under guidance-free sampling, trained with LC-VAE (left) vs. WF-VAE (right).
read the original abstract

Video variational autoencoders (VAEs) used in latent diffusion models typically require a sufficiently large number of latent channels to ensure high-quality video reconstruction. However, recent studies have revealed that an excessive number of latent channels can impede the convergence of latent diffusion models and deteriorate their generative performance, even when reconstruction quality remains high. We propose a latent compression method that removes high-frequency components in video latent representations rather than directly reducing the number of channels, which often compromises reconstruction fidelity. Experimental results demonstrate that the proposed method achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a latent compression technique for video variational autoencoders (VAEs) used in latent diffusion models. Instead of directly reducing the number of latent channels (which can degrade reconstruction fidelity), the method removes high-frequency components from the latent representations while preserving the overall compression ratio. The central empirical claim is that this yields superior video reconstruction quality relative to strong baselines.

Significance. If the reported experimental gains hold under rigorous controls, the work would offer a practical alternative for balancing latent-space compression against reconstruction fidelity in video diffusion pipelines. This addresses a documented tension between VAE capacity and downstream generative training stability, and could inform latent design choices in future video generation systems.

minor comments (2)
  1. [Abstract] The claim of 'superior video reconstruction quality' is stated without numerical metrics, error bars, dataset names, or baseline identifiers. Adding a single sentence with key quantitative results (e.g., PSNR/SSIM deltas and the exact compression ratio) would strengthen the abstract.
  2. The manuscript should explicitly state the precise definition of 'overall compression ratio' (bits per pixel, channel reduction factor, or latent dimensionality) and confirm that it is matched exactly between the proposed method and all baselines; one candidate definition is sketched below.
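
The second point matters because 'overall compression ratio' admits several readings. A sketch of one common reading, total input elements over retained latent elements; the shapes are illustrative assumptions, not values from the paper:

    import math

    def compression_ratio(video_shape, latent_shape):
        """Elements in the input video (T, H, W, 3) divided by elements in
        the retained latent (C, t, h, w). Under the proposed scheme the
        zeroed high-frequency subbands are excluded from the count, which
        is what makes the ratio comparable to a fewer-channel baseline."""
        return math.prod(video_shape) / math.prod(latent_shape)

    # e.g. a 16-frame 256x256 RGB clip into an 8-channel 4x32x32 latent:
    print(compression_ratio((16, 256, 256, 3), (8, 4, 32, 32)))  # 96.0

A bits-per-pixel definition would assign a different number to the same latent, which is exactly why the manuscript should pin the definition down.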

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The summary correctly identifies the core contribution: a frequency-based latent compression technique for video VAEs that preserves reconstruction quality better than channel-reduction baselines at equivalent compression ratios. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper advances an empirical method for latent compression in video VAEs by removing high-frequency components rather than reducing channel count, with the central claim resting on experimental comparisons of reconstruction quality against baselines at fixed compression ratios. No derivation chain, first-principles prediction, or uniqueness theorem is asserted; the approach is presented as a practical alternative motivated by observed trade-offs in prior work, without any step that reduces, by construction, to fitted inputs, self-citations, or renamed empirical patterns. The argument is self-contained as an experimental proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated assumption that high-frequency latent components are dispensable for reconstruction fidelity and that the compression does not interact negatively with the diffusion process.

axioms (1)
  • domain assumption High-frequency components in latent space can be removed without significant loss of reconstructible video information.
    Implicit in the choice of compression strategy; a quick empirical check is sketched below.
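
The assumption is empirically checkable by measuring how much latent energy the retained subbands carry; Figure 10 reports roughly 85% on WebVid-10M. A sketch, taking a subband dictionary such as the output of the wt3d helper sketched earlier; the threshold for 'dispensable' is a judgment call the paper makes implicitly:

    import numpy as np

    def retained_energy_fraction(bands, keep=("LLL", "LLH", "LHL", "HLL")):
        """Fraction of total subband energy held by the retained
        low-frequency subbands; values near 1.0 support the removability
        assumption. `bands` maps subband names to arrays, e.g. the
        output of the wt3d sketch above."""
        total = sum(float(np.sum(v ** 2)) for v in bands.values())
        return sum(float(np.sum(bands[k] ** 2)) for k in keep) / total

    # On encoder outputs the paper reports ~0.85 (Figure 10); on pure
    # white noise the value sits near 0.5, since the Haar transform is
    # orthonormal and energy spreads evenly across the eight subbands.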

pith-pipeline@v0.9.0 · 5402 in / 1022 out tokens · 29727 ms · 2026-05-10T15:18:08.099428+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 30 canonical work pages · 18 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  4. [4]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

  5. [5]

    Deep compression autoencoder for efficient high-resolution diffusion models

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024.

  6. [6]

    Dc-videogen: Efficient video generation with deep compression video autoencoder

    Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, et al. Dc-videogen: Efficient video generation with deep compression video autoencoder. arXiv preprint arXiv:2509.25182, 2025.

  7. [7]

    Dc-ae 1.5: Accelerating diffusion model convergence with structured latent space

    Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, and Han Cai. Dc-ae 1.5: Accelerating diffusion model convergence with structured latent space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19628–19637, 2025.

  8. [8]

    Od-vae: An omni-dimensional video compressor for improving latent video diffusion model

    Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, and Li Yuan. Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025.

  9. [9]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024.

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.

  11. [11]

    Video generation arena leaderboard

    Hugging Face. Video generation arena leaderboard. https://huggingface.co/spaces/ArtificialAnalysis/Video-Generation-Arena-Leaderboard, 2025. Accessed: 2025-11-11.

  12. [12]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.

  13. [13]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

  14. [14]

    An introduction to wavelets

    Amara Graps. An introduction to wavelets. IEEE Computational Science and Engineering, 2(2):50–61, 1995.

  15. [15]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.

  16. [16]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

  17. [17]

    Learnings from scaling visual tokenizers for reconstruction and generation

    Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755, 2025.

  18. [18]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.

  19. [19]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In Proceedings of the International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.

  20. [20]

    Image quality metrics: PSNR vs. SSIM

    Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In Proceedings of the International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.

  21. [21]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

  22. [22]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  23. [23]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

  24. [24]

    Videopoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.

  25. [25]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  26. [26]

    Video autoencoder: self-supervised disentanglement of static 3d structure and motion

    Zihang Lai, Sifei Liu, Alexei A Efros, and Xiaolong Wang. Video autoencoder: self-supervised disentanglement of static 3d structure and motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9730–9740, 2021.

  27. [27]

    Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model

    Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17778–17788, 2025.

  28. [28]

    Open-sora plan: Open-source large video generation model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

  29. [29]

    Hi-vae: Efficient video autoencoding with global and detailed motion

    Huaize Liu, Wenzhang Sun, Qiyuan Zhang, Donglin Di, Biao Gong, Hao Li, Chen Wei, and Changqing Zou. Hi-vae: Efficient video autoencoding with global and detailed motion. arXiv preprint arXiv:2506.07136, 2025.

  30. [30]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  31. [31]

    Latte: Latent diffusion transformer for video generation

    Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. Transactions on Machine Learning Research, 2025.

  32. [32]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.

  33. [33]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

  34. [34]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  35. [35]

    Temporal generative adversarial nets with singular value clipping

    Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, pages 2830–2839, 2017.

  36. [36]

    The JPEG 2000 still image compression standard

    Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5):36–58, 2002.

  37. [37]

    Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2

    Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3626–3636, 2022.

  38. [38]

    Improving the diffusability of autoencoders

    Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. arXiv preprint arXiv:2502.14831, 2025.

  39. [39]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

  40. [40]

    Adapting LLMs to time series forecasting via temporal heterogeneity modeling and semantic alignment

    Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, and Min Wu. Adapting LLMs to time series forecasting via temporal heterogeneity modeling and semantic alignment. arXiv preprint arXiv:2508.07195, 2025.

  41. [41]

    Haar wavelet based approach for image compression and quality assessment of compressed image

    Kamrul Hasan Talukder and Koichi Harada. Haar wavelet based approach for image compression and quality assessment of compressed image. arXiv preprint arXiv:1010.4084, 2010.

  42. [42]

    FVD: A new metric for video generation

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019.

  43. [43]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  44. [44]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  45. [45]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  46. [46]

    Improved video VAE for latent video diffusion model

    Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, and Zheng-Jun Zha. Improved video VAE for latent video diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18124–18133, 2025.

  47. [47]

    H3ae: High compression, high speed, and high quality autoencoder for video diffusion models

    Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, and Sergey Tulyakov. H3ae: High compression, high speed, and high quality autoencoder for video diffusion models. arXiv preprint arXiv:2504.10567, 2025.

  48. [48]

    Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks

    Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2364–2373, 2018.

  49. [49]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

  50. [50]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion -- tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.

  51. [51]

    Efficient video diffusion models via content-frame motion-latent decomposition

    Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, and Anima Anandkumar. Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:2403.14148, 2024.

  52. [52]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  53. [53]

    A survey on perceptually optimized video coding

    Yun Zhang, Linwei Zhu, Gangyi Jiang, Sam Kwong, and C.-C. Jay Kuo. A survey on perceptually optimized video coding. ACM Computing Surveys, 55(12):1–37, 2023.

  54. [54]

    Cv-vae: A compatible video vae for latent generative video models

    Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, and Ying Shan. Cv-vae: A compatible video vae for latent generative video models. Advances in Neural Information Processing Systems, 37:12847–12871, 2024.

  55. [55]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

  56. [56]

    Allegro: Open the black box of commercial-level video generation model

    Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458, 2024.