pith. machine review for the scientific record.

arxiv: 2604.16479 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

Latent-Compressed Variational Autoencoder for Video Diffusion Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video VAE · latent compression · high-frequency removal · video diffusion models · reconstruction quality · latent diffusion · compression ratio · generative video

The pith

Removing high-frequency components from video latent representations improves reconstruction quality in variational autoencoders at fixed compression ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video variational autoencoders compress input videos into lower-dimensional latent spaces so that diffusion models can generate new videos more efficiently. Standard designs increase the number of latent channels to reach acceptable reconstruction fidelity, yet this excess dimensionality slows the diffusion model's training convergence and degrades its final output quality. The paper instead applies compression by stripping high-frequency content from the existing latent representations. This keeps the total compression ratio unchanged while delivering higher-fidelity video reconstructions than channel-reduction baselines. Readers should care because the method removes a practical obstacle between faithful encoding and effective generative modeling of video.

Core claim

The authors establish that a latent compression method which removes high-frequency components in video latent representations, rather than directly reducing the number of channels, achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio. This directly tackles the observed conflict where high channel counts support good VAE reconstruction yet impair downstream diffusion performance.

What carries the argument

High-frequency removal applied directly to video latent representations, which discards selected frequency components to compress the latent tensor without lowering its channel dimension.
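
To make this concrete, the following is a minimal sketch of frequency-selective latent compression, assuming a single-level 3D Haar transform in which the three subband letters index the temporal, height, and width axes. The paper's Multi-WT (Figure 9) adds further temporal decomposition stages and its letters index those stages instead, so this illustrates the idea rather than the authors' code; shapes and function names are assumptions.

    import numpy as np

    def haar_split(x, axis):
        """One Haar analysis step along `axis`: returns (low, high) halves."""
        even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
        odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
        return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

    def wt3d(z):
        """3D Haar transform of a latent (C, T, H, W) into eight subbands
        keyed 'LLL'..'HHH', letters ordered (temporal, height, width)."""
        bands = {"": z}
        for axis in (1, 2, 3):
            split = {}
            for key, v in bands.items():
                lo, hi = haar_split(v, axis)
                split[key + "L"], split[key + "H"] = lo, hi
            bands = split
        return bands

    def compress_latent(z, keep=("LLL", "LLH", "LHL", "HLL")):
        """Zero every subband outside `keep`. The four retained subbands
        hold half the coefficients, so the effective latent size halves
        while the channel count of z is untouched."""
        return {k: (v if k in keep else np.zeros_like(v))
                for k, v in wt3d(z).items()}

    # Toy latent: 16 channels, 8 frames, 32x32 spatial (all dims even).
    subbands = compress_latent(np.random.randn(16, 8, 32, 32))

Diffusion would then operate only on the retained subbands; at sampling time the discarded subbands are restored as zeros and the inverse transform reconstructs the full latent (the zero-padding step in Figure 1).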

If this is right

  • Latent diffusion models receive higher-quality inputs and therefore train to stronger generative performance.
  • Video reconstruction remains more accurate at any given compression ratio.
  • The same memory and compute budget for the diffusion stage can be retained without sacrificing encoding fidelity.
  • Downstream tasks that depend on the latent space inherit the improved reconstruction without additional channel overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The frequency-selective approach may transfer to image or audio latent models facing similar channel-count versus fidelity trade-offs.
  • Combining high-frequency removal with existing quantization or pruning steps could produce further compression gains.
  • Empirical tests across motion-heavy versus static video datasets would clarify whether the benefit depends on content statistics.
  • The result suggests that latent-space dimensionality is less critical for perceptual quality than the distribution of energy across frequencies.

Load-bearing premise

High-frequency components in the latent space can be removed without losing the information required for high-fidelity video reconstruction or introducing artifacts that degrade the diffusion model's performance.

What would settle it

A controlled reconstruction experiment on held-out video sequences in which the proposed compressed latents yield lower PSNR or higher perceptual distortion than a baseline VAE that simply uses fewer channels at the identical compression ratio.
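
PSNR, the metric this test leans on, is straightforward to compute. A minimal sketch follows; decode_lc_vae and decode_fewer_channels are hypothetical stand-ins for the two decoders at a matched compression ratio, not functions from the paper:

    import numpy as np

    def psnr(reference, reconstruction, peak=1.0):
        """Peak signal-to-noise ratio in dB for videos scaled to [0, peak]."""
        ref = np.asarray(reference, dtype=np.float64)
        rec = np.asarray(reconstruction, dtype=np.float64)
        mse = np.mean((ref - rec) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

    # Hypothetical held-out comparison; the paper's claim fails if the
    # first score is consistently lower than the second:
    # psnr(clip, decode_lc_vae(clip)) vs. psnr(clip, decode_fewer_channels(clip))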

Figures

Figures reproduced from arXiv: 2604.16479 by Arno Solin, Jiarui Guan, Juho Kannala, Wenshuai Zhao, Zhengtao Zou.

Figure 1: Comparison between the schemes of video VAEs with and without the proposed latent compression. Our method performs frequency-aware latent compression for video generation. An input video is encoded and decomposed by multi-level 3D wavelet transforms (Multi-WT); low-frequency channels are retained as compact latent representations where diffusion operates. After denoising, the latent is zero-padded, process…
Figure 2: Validation PSNR curves of WF-VAE [27] for 4, 8, 16, and 32 latent channels. Increasing the number of channels yields only marginal PSNR gains, indicating substantial redundancy in the latent representation.
Figure 3: Energy and correlation distribution across frequencies. We visualize heatmaps of the normalized energy (left) and per-channel lag-1 temporal autocorrelation [40] (right) obtained by applying a 3D Haar wavelet transform to video latent representations encoded by WF-VAE [27]. Columns correspond to latent channels, and rows represent different frequency subbands. The visualization reveals that low-frequency…
Figure 4: Overview of our framework. The model first applies a multi-level wavelet transform (Multi-WT) to the latent features produced by the encoder. Low-frequency channels are then selected to retain compact yet informative representations in the wavelet domain, while the high-frequency subbands are zeroed out. During generation, diffusion operates within this favorable and compressed subspace. The sampled repres…
Figure 5: Validation performance during training. Across different compression ratios (Chn. = 4, 8, 16), our method consistently achieves higher PSNR than the baseline.
Figure 6: Generated videos using LC-VAE with Latte [31] on SkyTimelapse (top) and UCF-101 (bottom) datasets.
Figure 7: Qualitative comparison of reconstruction performance between LC-VAE and WF-VAE under the same compression ratios (equivalent channels).
Figure 8: Qualitative comparison between LC-VAE and WF-VAE (PTLC) at the same compression ratios (equivalent channels). WF-VAE (PTLC) exhibits noticeable artifacts, whereas LC-VAE trained with latent compression reconstructs videos accurately, highlighting the importance of integrating latent compression during autoencoder training.
Figure 9: Illustration of the proposed Multi-WT. A 3D WT is first applied to the latent z to obtain eight subbands; two successive Temporal WT stages then further decompose them. In the Multi-WT representation the three letters (e.g., LHL) index temporal decomposition stages rather than spatial axes as in Eq. (5). We retain only the low-frequency-dominant subbands (LLL, LLH, LHL, HLL) and zero out the rest; ⊕ d…
Figure 10: Overall energy distribution across wavelet subbands (WebVid-10M). Low-frequency subbands dominate, accounting for ∼85% of total energy.
Figure 12: …after 20k steps this scheme converges to the same subbands (LLL, LLH, LHL, HLL) as our fixed design, with a 98% channel overlap, empirically validating that our fixed mask closely approximates the data-driven optimum.
Figure 13: Visualization of low-frequency wavelet subbands. Low-frequency components exhibit smooth spatial variations and clear structural patterns, encoding the majority of semantic content. Diverse per-channel activation patterns suggest that each channel captures distinct semantic factors.
Figure 14: Visualization of high-frequency wavelet subbands. High-frequency components contain rapid local fluctuations with little channel-wise variation, resembling noise-like textures and contributing minimal semantic information.
Figure 15: Non-curated reconstruction on OpenVid-1M. LC-VAE (left) vs. WF-VAE (right).
Figure 16: Non-curated video generation on SkyTimelapse. Latte [31] under guidance-free sampling, trained with LC-VAE (left) vs. WF-VAE (right).
read the original abstract

Video variational autoencoders (VAEs) used in latent diffusion models typically require a sufficiently large number of latent channels to ensure high-quality video reconstruction. However, recent studies have revealed that an excessive number of latent channels can impede the convergence of latent diffusion models and deteriorate their generative performance, even when reconstruction quality remains high. We propose a latent compression method that removes high-frequency components in video latent representations rather than directly reducing the number of channels, which often compromises reconstruction fidelity. Experimental results demonstrate that the proposed method achieves superior video reconstruction quality compared to strong baselines while maintaining the same overall compression ratio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a latent compression technique for video variational autoencoders (VAEs) used in latent diffusion models. Instead of directly reducing the number of latent channels (which can degrade reconstruction fidelity), the method removes high-frequency components from the latent representations while preserving the overall compression ratio. The central empirical claim is that this yields superior video reconstruction quality relative to strong baselines.

Significance. If the reported experimental gains hold under rigorous controls, the work would offer a practical alternative for balancing latent-space compression against reconstruction fidelity in video diffusion pipelines. This addresses a documented tension between VAE capacity and downstream generative training stability, and could inform latent design choices in future video generation systems.

minor comments (2)
  1. [Abstract] The claim of 'superior video reconstruction quality' is stated without numerical metrics, error bars, dataset names, or baseline identifiers. Adding a single sentence with key quantitative results (e.g., PSNR/SSIM deltas and the exact compression ratio) would strengthen the abstract.
  2. The manuscript should explicitly state the precise definition of 'overall compression ratio' (bits per pixel, channel reduction factor, or latent dimensionality) and confirm that it is matched exactly between the proposed method and all baselines; one candidate definition is sketched below.
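
The second point matters because 'overall compression ratio' admits several readings. A sketch of one common reading, total input elements over retained latent elements; the shapes are illustrative assumptions, not values from the paper:

    import math

    def compression_ratio(video_shape, latent_shape):
        """Elements in the input video (T, H, W, 3) divided by elements in
        the retained latent (C, t, h, w). Under the proposed scheme the
        zeroed high-frequency subbands are excluded from the count, which
        is what makes the ratio comparable to a fewer-channel baseline."""
        return math.prod(video_shape) / math.prod(latent_shape)

    # e.g. a 16-frame 256x256 RGB clip into an 8-channel 4x32x32 latent:
    print(compression_ratio((16, 256, 256, 3), (8, 4, 32, 32)))  # 96.0

A bits-per-pixel definition would assign a different number to the same latent, which is exactly why the manuscript should pin the definition down.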

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The summary correctly identifies the core contribution: a frequency-based latent compression technique for video VAEs that preserves reconstruction quality better than channel-reduction baselines at equivalent compression ratios. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper advances an empirical method for latent compression in video VAEs by removing high-frequency components rather than reducing channel count, with the central claim resting on experimental comparisons of reconstruction quality against baselines at fixed compression ratios. No derivation chain, first-principles prediction, or uniqueness theorem is asserted; the approach is presented as a practical alternative motivated by observed trade-offs in prior work, without any step that reduces, by construction, to fitted inputs, self-citations, or renamed empirical patterns. The argument is self-contained as an experimental proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated assumption that high-frequency latent components are dispensable for reconstruction fidelity and that the compression does not interact negatively with the diffusion process.

axioms (1)
  • domain assumption High-frequency components in latent space can be removed without significant loss of reconstructible video information.
    Implicit in the choice of compression strategy; a quick empirical check is sketched below.
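
The assumption is empirically checkable by measuring how much latent energy the retained subbands carry; Figure 10 reports roughly 85% on WebVid-10M. A sketch, taking a subband dictionary such as the output of the wt3d helper sketched earlier; the threshold for 'dispensable' is a judgment call the paper makes implicitly:

    import numpy as np

    def retained_energy_fraction(bands, keep=("LLL", "LLH", "LHL", "HLL")):
        """Fraction of total subband energy held by the retained
        low-frequency subbands; values near 1.0 support the removability
        assumption. `bands` maps subband names to arrays, e.g. the
        output of the wt3d sketch above."""
        total = sum(float(np.sum(v ** 2)) for v in bands.values())
        return sum(float(np.sum(bands[k] ** 2)) for k in keep) / total

    # On encoder outputs the paper reports ~0.85 (Figure 10); on pure
    # white noise the value sits near 0.5, since the Haar transform is
    # orthonormal and energy spreads evenly across the eight subbands.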

pith-pipeline@v0.9.0 · 5402 in / 1022 out tokens · 29727 ms · 2026-05-10T15:18:08.099428+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 30 canonical work pages · 18 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  4. [4]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

  5. [5]

    Deep compression autoencoder for efficient high-resolution diffusion models

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024.

  6. [6]

    Dc-videogen: Efficient video generation with deep compression video autoencoder

    Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, et al. Dc-videogen: Efficient video generation with deep compression video autoencoder. arXiv preprint arXiv:2509.25182, 2025.

  7. [7]

    Dc-ae 1.5: Accelerating diffusion model convergence with structured latent space

    Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, and Han Cai. Dc-ae 1.5: Accelerating diffusion model convergence with structured latent space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19628–19637, 2025.

  8. [8]

    Od-vae: An omni-dimensional video compressor for improving latent video diffusion model

    Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, and Li Yuan. Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025.

  9. [9]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024.

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.

  11. [11]

    Video generation arena leaderboard

    Hugging Face. Video generation arena leaderboard. https://huggingface.co/spaces/ArtificialAnalysis/Video-Generation-Arena-Leaderboard, 2025. Accessed: 2025-11-11.

  12. [12]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.

  13. [13]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

  14. [14]

    An introduction to wavelets

    Amara Graps. An introduction to wavelets. IEEE Computational Science and Engineering, 2(2):50–61, 1995.

  15. [15]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.

  16. [16]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

  17. [17]

    Learnings from scaling visual tokenizers for reconstruction and generation

    Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755, 2025.

  18. [18]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.

  19. [19]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In Proceedings of the International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.

  20. [20]

    Image quality metrics: PSNR vs. SSIM

    Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In Proceedings of the International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.

  21. [21]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

  22. [22]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  23. [23]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

  24. [24]

    Videopoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.

  25. [25]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  26. [26]

    Video autoencoder: self-supervised disentanglement of static 3d structure and motion

    Zihang Lai, Sifei Liu, Alexei A Efros, and Xiaolong Wang. Video autoencoder: self-supervised disentanglement of static 3d structure and motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9730–9740, 2021.

  27. [27]

    Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model

    Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17778–17788, 2025.

  28. [28]

    Open-sora plan: Open-source large video generation model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, et al. Open-sora plan: Open-source large video generation model. arXiv preprint arXiv:2412.00131, 2024.

  29. [29]

    Hi-vae: Efficient video autoencoding with global and detailed motion

    Huaize Liu, Wenzhang Sun, Qiyuan Zhang, Donglin Di, Biao Gong, Hao Li, Chen Wei, and Changqing Zou. Hi-vae: Efficient video autoencoding with global and detailed motion. arXiv preprint arXiv:2506.07136, 2025.

  30. [30]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  31. [31]

    Latte: Latent diffusion transformer for video generation

    Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. Transactions on Machine Learning Research, 2025.

  32. [32]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371, 2024.

  33. [33]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

  34. [34]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  35. [35]

    Temporal generative adversarial nets with singular value clipping

    Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, pages 2830–2839, 2017.

  36. [36]

    The JPEG 2000 still image compression standard

    Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5):36–58, 2002.

  37. [37]

    Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2

    Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3626–3636, 2022.

  38. [38]

    Improving the diffusability of autoencoders

    Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. arXiv preprint arXiv:2502.14831, 2025.

  39. [39]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

  40. [40]

    Adapting LLMs to time series forecasting via temporal heterogeneity modeling and semantic alignment

    Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, and Min Wu. Adapting LLMs to time series forecasting via temporal heterogeneity modeling and semantic alignment. arXiv preprint arXiv:2508.07195, 2025.

  41. [41]

    Haar wavelet based approach for image compression and quality assessment of compressed image

    Kamrul Hasan Talukder and Koichi Harada. Haar wavelet based approach for image compression and quality assessment of compressed image. arXiv preprint arXiv:1010.4084, 2010.

  42. [42]

    FVD: A new metric for video generation

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2019.

  43. [43]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  44. [44]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  45. [45]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  46. [46]

    Improved video VAE for latent video diffusion model

    Pingyu Wu, Kai Zhu, Yu Liu, Liming Zhao, Wei Zhai, Yang Cao, and Zheng-Jun Zha. Improved video VAE for latent video diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18124–18133, 2025.

  47. [47]

    H3ae: High compression, high speed, and high quality autoencoder for video diffusion models

    Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, and Sergey Tulyakov. H3ae: High compression, high speed, and high quality autoencoder for video diffusion models. arXiv preprint arXiv:2504.10567, 2025.

  48. [48]

    Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks

    Wei Xiong, Wenhan Luo, Lin Ma, Wei Liu, and Jiebo Luo. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2364–2373, 2018.

  49. [49]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

  50. [50]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion -- tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023.

  51. [51]

    Efficient video diffusion models via content-frame motion-latent decomposition

    Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, and Anima Anandkumar. Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:2403.14148, 2024.

  52. [52]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

  53. [53]

    A survey on perceptually optimized video coding

    Yun Zhang, Linwei Zhu, Gangyi Jiang, Sam Kwong, and C.-C. Jay Kuo. A survey on perceptually optimized video coding. ACM Computing Surveys, 55(12):1–37, 2023.

  54. [54]

    Cv-vae: A compatible video vae for latent generative video models

    Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, Wenbo Hu, and Ying Shan. Cv-vae: A compatible video vae for latent generative video models. Advances in Neural Information Processing Systems, 37:12847–12871, 2024.

  55. [55]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

  56. [56]

    Allegro: Open the black box of commercial-level video generation model

    Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model. arXiv preprint arXiv:2410.15458, 2024.