pith. machine review for the scientific record. sign in

arxiv: 2512.00408 · v2 · submitted 2025-11-29 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

Authors on Pith no claims yet

Pith reviewed 2026-05-17 03:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video compressionsemantic codingdiffusion modelslow bitrategenerative reconstructionperceptual qualitymultimodal compression
0
0 comments X

The pith

DiSCo transmits only semantic text, degraded frames, and motion cues then uses a conditional diffusion model to reconstruct high-quality video at low bitrates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional video codecs collapse at ultra-low bitrates because they chase pixel accuracy instead of human perception. The paper decomposes each video into three compact modalities: a textual description, a spatiotemporally degraded version, and optional sketches or poses. These are sent with modality-specific codecs and a conditional video diffusion model reconstructs the full sequence. Experiments report 2-10X gains on perceptual metrics over both traditional and semantic baselines. The approach matters for any setting where bandwidth is severely limited yet recognizable video is still required.

Core claim

By sending only a textual semantic description, a spatiotemporally degraded video, and optional sketches or poses, a conditional video diffusion model can synthesize temporally coherent high-quality output that outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.

What carries the argument

The conditional video diffusion model that reconstructs the video from the three compact multimodal inputs of text semantics, degraded appearance, and motion cues.

If this is right

  • Usable video can be delivered at bitrates where pixel-based codecs produce only artifacts.
  • Compression can prioritize semantic and motion cues over exact pixel fidelity.
  • Multimodal token interleaving and temporal forward filling become practical tools for maintaining coherence.
  • The same compact representation supports both compression and downstream generative editing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to live streaming if the diffusion model is made fast enough for real-time inference.
  • Similar multimodal conditioning might apply to audio or 3D scene compression where generative priors are available.
  • Evaluation protocols will need new metrics that detect semantic hallucinations in addition to traditional distortion measures.

Load-bearing premise

The diffusion model must produce temporally coherent video without hallucinations or artifacts that lower perceived quality when given only the compact multimodal inputs.

What would settle it

A controlled perceptual study in which viewers rate DiSCo reconstructions against traditional codec outputs at the same low bitrate and flag any temporal inconsistencies or invented content.

Figures

Figures reproduced from arXiv: 2512.00408 by Divya Kothandaraman, Guan-Ming Su, Lingdong Wang, Mohammad Hajiesmaili, Ramesh K. Sitaraman, Tsung-Wei Huang.

Figure 1
Figure 1. Figure 1: Overview of proposed method. Red means trainable module, blue means frozen module, yellow means non-learning operations. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Conditioning on sketch/pose modality at 0.005 BPP. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of token interleaving. rendered as skeleton images, and then encoded via VAE. We introduce a token interleaving strategy to condition the diffusion model on multiple modalities. As shown in [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Workflow of the degraded video modality. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Modality mixture caused by frame interleaving. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of compression results at 0.005 bits per pixel (BPP). Zoom in for details. [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance comparison on HEVC-B dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance comparison on UVG dataset. Baseline semantic codecs outperform traditional codecs, as evidenced by their position in the bottom-left region of the [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ultra-low bitrate compression at 0.00045 BPP. [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
read the original abstract

Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes DiSCo, a semantic video compression framework that decomposes input video into three compact modalities (textual description, spatiotemporally degraded video, and optional sketches/poses) to capture semantic, appearance, and motion cues. These are transmitted at low bitrates and used to condition a video diffusion model for reconstructing high-quality, temporally coherent output. The authors introduce temporal forward filling and token interleaving to improve multimodal generation, along with modality-specific codecs. Experiments claim 2-10X gains on perceptual metrics versus baseline semantic and traditional codecs at low bitrates.

Significance. If the reported gains hold under rigorous evaluation, the work offers a meaningful shift from pixel-fidelity optimization to semantic-plus-generative reconstruction for ultra-low-bitrate video. The multimodal conditioning strategy and the specific generation techniques (temporal forward filling, token interleaving) are concrete contributions that could inform future perceptual codecs. The empirical comparisons and ablations provide initial support for the central claim.

minor comments (3)
  1. Section 4.1 and Table 2: the exact perceptual metrics (e.g., LPIPS, DISTS, or user-study scores), bitrate operating points, and test sequences should be stated explicitly in the caption or main text so readers can reproduce the 2-10X claim without consulting supplementary material.
  2. Figure 4: the visual comparison panels would benefit from zoomed insets or difference maps to highlight artifact reduction at the lowest bitrates; current resolution makes it hard to verify the claimed perceptual superiority.
  3. Section 3.2: the token-interleaving procedure is described in prose; a small pseudocode block or diagram would clarify the ordering of text, video, and sketch tokens across diffusion steps.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of DiSCo, the recognition of its potential impact on perceptual video compression, and the recommendation for minor revision. We are pleased that the multimodal conditioning approach and techniques such as temporal forward filling and token interleaving are viewed as concrete contributions.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is an empirical proposal for semantic video compression via conditional diffusion. It defines a multimodal input decomposition (text + degraded video + optional sketches/poses) and proposes engineering components (temporal forward filling, token interleaving, modality-specific codecs) whose value is demonstrated through quantitative comparisons and ablations against external baselines. No equations, parameter fittings, or derivations appear that reduce by construction to the paper's own inputs or prior self-citations. The central performance claims rest on externally measurable perceptual metrics rather than self-referential normalization or uniqueness theorems, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that pretrained video diffusion models contain sufficiently rich generative priors to synthesize plausible details from semantic and motion cues; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Pretrained conditional video diffusion models can generate temporally coherent output from compact semantic, appearance, and motion inputs.
    Invoked implicitly when stating that the diffusion model reconstructs high-quality videos from the three modalities.

pith-pipeline@v0.9.0 · 5458 in / 1153 out tokens · 31203 ms · 2026-05-17T03:30:03.913142+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning

    eess.IV 2026-04 unverdicted novelty 7.0

    NeuralLVC achieves better lossless compression than H.264 and H.265 on video sequences by combining masked diffusion with temporal conditioning on frame differences.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper

  1. [1]

    Variational image compression with a scale hyperprior, 2018

    Johannes Ball ´e, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior, 2018. 1, 2

  2. [2]

    Calculation of average PSNR differences between RD-curves

    Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. Technical Report VCEG-M33, ITU-T SG16/Q6, Austin, TX, USA, April 2001. 5

  3. [3]

    Sullivan, and Jens-Rainer Ohm

    Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm. Overview of the versatile video coding (vvc) standard and its applica- tions.IEEE Transactions on Circuits and Systems for Video Technology, 31(10):3736–3764, 2021. 1, 2, 5

  4. [4]

    Kevin Cai, Chonghua Liu, and David M. Chan. Anim-400k: A large-scale dataset for automated end-to-end dubbing of video, 2024. 5

  5. [5]

    A computational approach to edge detection

    John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, PAMI-8(6):679–698, 1986. 3

  6. [6]

    Learning to generate line drawings that convey geometry and semantics

    Caroline Chan, Fr ´edo Durand, and Phillip Isola. Learning to generate line drawings that convey geometry and semantics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7915–7925, June 2022. 2, 3

  7. [7]

    Time-adaptive video frame interpolation based on residual diffusion, 2025

    Victor Fonte Chavez, Claudia Esteves, and Jean-Bernard Hayet. Time-adaptive video frame interpolation based on residual diffusion, 2025. 2

  8. [8]

    Nerv: Neural representations for videos

    Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser Nam Lim, and Abhinav Shrivastava. Nerv: Neural representations for videos. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 21557– 21568. Curran Associates, Inc., 2021. 1, 2

  9. [9]

    Unire- store: Unified perceptual and task-oriented image restoration model using diffusion prior

    I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Yuan-Chun Chiang, Sy-Yen Kuo, and Ming-Hsuan Yang. Unire- store: Unified perceptual and task-oriented image restoration model using diffusion prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17969–17979, June 2025. 2

  10. [10]

    Stable diffusion is a natural cross-modal decoder for layered ai-generated im- age compression

    Ruijie Chen, Qi Mao, and Zhengxue Cheng. Stable diffusion is a natural cross-modal decoder for layered ai-generated im- age compression. In2025 Data Compression Conference (DCC), pages 361–361, 2025. 1, 3

  11. [11]

    Ultra-low bitrate predictive portrait video com- pression with diffusion models.Symmetry, 17(6), 2025

    Xinyi Chen, Weimin Lei, Wei Zhang, Yanwen Wang, and Mingxin Liu. Ultra-low bitrate predictive portrait video com- pression with diffusion models.Symmetry, 17(6), 2025. 3

  12. [12]

    Dove: Efficient one- step diffusion model for real-world video super-resolution,

    Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one- step diffusion model for real-world video super-resolution,

  13. [13]

    Learned image compression with discretized gaussian mixture likelihoods and attention modules

    Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned image compression with discretized gaussian mixture likelihoods and attention modules. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 1, 2

  14. [14]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

    Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 8

  15. [15]

    Ldmvfi: Video frame interpolation with latent diffusion models.Pro- ceedings of the AAAI Conference on Artificial Intelligence, 38(2):1472–1480, Mar

    Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models.Pro- ceedings of the AAAI Conference on Artificial Intelligence, 38(2):1472–1480, Mar. 2024. 2

  16. [16]

    Simoncelli

    Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(5):2567–2581, 2022. 5

  17. [17]

    Cgvc-t: Contextual generative video compression with transformers.IEEE Jour- nal on Emerging and Selected Topics in Circuits and Sys- tems, 14(2):209–223, 2024

    Pengli Du, Ying Liu, and Nam Ling. Cgvc-t: Contextual generative video compression with transformers.IEEE Jour- nal on Emerging and Selected Topics in Circuits and Sys- tems, 14(2):209–223, 2024. 3

  18. [18]

    Selic: Semantic-enhanced learned image compression via high-level textual guidance, 2025

    Haisheng Fu, Jie Liang, Zhenman Fang, and Jingning Han. Selic: Semantic-enhanced learned image compression via high-level textual guidance, 2025. 1, 3

  19. [19]

    Unimic: To- wards universal multi-modality perceptual image compres- sion, 2024

    Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Zongyu Guo, Yiting Lu, Yulin Ren, and Zhibo Chen. Unimic: To- wards universal multi-modality perceptual image compres- sion, 2024. 1, 3

  20. [20]

    Lei Guo, Wei Chen, Yuxuan Sun, Bo Ai, Nikolaos Pap- pas, and Tony Q. S. Quek. Diffusion-driven semantic communication for generative models with bandwidth con- straints.IEEE Transactions on Wireless Communications, 24(8):6490–6503, 2025. 3

  21. [21]

    Generative latent video compression, 2025

    Zongyu Guo, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Bin Li, and Yan Lu. Generative latent video compression, 2025. 3, 5

  22. [22]

    Ltx-video: Realtime video latent diffusion, 2024

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion, 2024. 2, 4, 5

  23. [23]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, vol- ume 30. Curran As...

  24. [24]

    In-context lora for diffusion transformers, 2024

    Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jin- gren Zhou. In-context lora for diffusion transformers, 2024. 2, 4

  25. [25]

    Towards practical real-time neural video compression

    Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, and Yan Lu. Towards practical real-time neural video compression. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 12543–12552, June 2025. 1, 2, 3, 5

  26. [26]

    Generative latent coding for ultra-low bitrate image com- pression

    Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Generative latent coding for ultra-low bitrate image com- pression. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26088–26098, June 2024. 3

  27. [27]

    C3: High-performance and low-complexity neural compression from a single image or video

    Hyunjik Kim, Matthias Bauer, Lucas Theis, Jonathan Richard Schwarz, and Emilien Dupont. C3: High-performance and low-complexity neural compression from a single image or video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9347–9358, June 2024. 1, 2

  28. [28]

    Nvrc: Neural video representation compression

    Ho Man Kwan, Ge Gao, Fan Zhang, Andrew Gower, and 9 David Bull. Nvrc: Neural video representation compression. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 132440– 132462. Curran Associates, Inc., 2024. 1, 2

  29. [29]

    Misc: Ultra-low bitrate image semantic compres- sion driven by large multimodal model.IEEE Transactions on Image Processing, 34:335–349, 2025

    Chunyi Li, Guo Lu, Donghui Feng, Haoning Wu, Zicheng Zhang, Xiaohong Liu, Guangtao Zhai, Weisi Lin, and Wen- jun Zhang. Misc: Ultra-low bitrate image semantic compres- sion driven by large multimodal model.IEEE Transactions on Image Processing, 34:335–349, 2025. 3

  30. [30]

    Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

    Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, and Siyu Zhu. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 7752–7762, June 2025. 5

  31. [31]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Glober- son, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. 1, 3, 8

  32. [32]

    Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 5

  33. [33]

    Uvg dataset: 50/120fps 4k sequences for video codec analysis and development

    Alexandre Mercat, Marko Viitanen, and Jarno Vanne. Uvg dataset: 50/120fps 4k sequences for video codec analysis and development. InProceedings of the 11th ACM Multimedia Systems Conference, pages 331–336, 2020. 5

  34. [34]

    Lmm-driven se- mantic image-text coding for ultra low-bitrate learned image compression

    Shimon Murai, Heming Sun, and Jiro Katto. Lmm-driven se- mantic image-text coding for ultra low-bitrate learned image compression. In2024 IEEE International Conference on Vi- sual Communications and Image Processing (VCIP), pages 1–5, 2024. 1, 3

  35. [35]

    Openvid-1m: A large-scale high-quality dataset for text-to- video generation, 2025

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhen- heng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to- video generation, 2025. 5

  36. [36]

    Comparison of the coding efficiency of video coding standards—including high efficiency video coding (hevc)

    Jens-Rainer Ohm, Gary J Sullivan, Heiko Schwarz, Thiow Keng Tan, and Thomas Wiegand. Comparison of the coding efficiency of video coding standards—including high efficiency video coding (hevc). InIEEE Transactions on Cir- cuits and Systems for Video Technology, volume 22, pages 1669–1684. IEEE, 2012. 5

  37. [37]

    Lightweight diffusion models for resource-constrained semantic communication, 2024

    Giovanni Pignata, Eleonora Grassucci, Giordano Cicchetti, and Danilo Comminiello. Lightweight diffusion models for resource-constrained semantic communication, 2024. 3

  38. [38]

    Generative latent coding for ultra-low bi- trate image and video compression.IEEE Transactions on Circuits and Systems for Video Technology, 35(10):10500– 10515, 2025

    Linfeng Qi, Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Generative latent coding for ultra-low bi- trate image and video compression.IEEE Transactions on Circuits and Systems for Video Technology, 35(10):10500– 10515, 2025. 3, 5

  39. [39]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learn- ing Research, 21(140):1–67, 2020. 4

  40. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022. 2

  41. [41]

    Extremely low-bitrate image compression semantically disentangled by lmms from a human perception perspective, 2025

    Juan Song, Lijie Yang, and Mingtao Feng. Extremely low-bitrate image compression semantically disentangled by lmms from a human perception perspective, 2025. 3

  42. [42]

    Converting video formats with ffmpeg

    Suramya Tomar. Converting video formats with ffmpeg. Linux J., 2006(146):10, June 2006. 1, 5

  43. [43]

    To- wards accurate generative models of video: A new metric & challenges, 2019

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges, 2019. 5

  44. [44]

    M3-cvc: Control- lable video compression with multimodal generative mod- els

    Rui Wan, Qi Zheng, and Yibo Fan. M3-cvc: Control- lable video compression with multimodal generative mod- els. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. 1, 3

  45. [45]

    Mcl-jcv: A jnd-based h.264/avc video quality assessment dataset.IEEE Interna- tional Conference on Image Processing (ICIP), pages 1509– 1513, 2016

    Heng Wang, Ioannis Katsavounidis, Jiantong Zhou, Jonghoon Park, Shawmin Lei, Xin Zhou, Man-On Pun, Xin Jin, Ronggang Wang, Xin Wang, et al. Mcl-jcv: A jnd-based h.264/avc video quality assessment dataset.IEEE Interna- tional Conference on Image Processing (ICIP), pages 1509– 1513, 2016. 5

  46. [46]

    Liftvsr: Lifting image diffusion to video super-resolution via hybrid temporal modeling with only 4×rtx 4090s, 2025

    Xijun Wang, Xin Li, Bingchen Li, and Zhibo Chen. Liftvsr: Lifting image diffusion to video super-resolution via hybrid temporal modeling with only 4×rtx 4090s, 2025. 2

  47. [47]

    T-gvc: Trajectory-guided gen- erative video coding at ultra-low bitrates, 2025

    Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xi- aopeng Fan, and Debin Zhao. T-gvc: Trajectory-guided gen- erative video coding at ultra-low bitrates, 2025. 1, 3, 5

  48. [48]

    Exploring video quality assessment on user gen- erated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jing- wen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user gen- erated contents from aesthetic and technical perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20144–20154, October

  49. [49]

    Effec- tive whole-body pose estimation with two-stages distillation

    Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effec- tive whole-body pose estimation with two-stages distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 4210–4220, October 2023. 2, 3

  50. [50]

    Conditional video generation for high-efficiency video compression, 2025

    Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, and Xue- long Li. Conditional video generation for high-efficiency video compression, 2025. 1, 3, 4, 5

  51. [51]

    Generative video semantic communication via multimodal semantic fusion with large model, 2025

    Hang Yin, Li Qiao, Yu Ma, Shuo Sun, Kan Li, Zhen Gao, and Dusit Niyato. Generative video semantic communication via multimodal semantic fusion with large model, 2025. 3

  52. [52]

    Resshift: Efficient diffusion model for image super- resolution by residual shifting

    Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super- resolution by residual shifting. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, vol- ume 36, pages 13294–13307. Curran Associates, Inc., 2023. 2

  53. [53]

    Semantics-guided diffu- sion for deep joint source-channel coding in wireless image transmission, 2025

    Maojun Zhang, Haotian Wu, Guangxu Zhu, Richeng Jin, Xi- aoming Chen, and Deniz G ¨und¨uz. Semantics-guided diffu- sion for deep joint source-channel coding in wireless image transmission, 2025. 1, 3

  54. [54]

    When video coding meets multimodal large language models: A unified paradigm for video coding, 2025

    Pingping Zhang, Jinlong Li, Kecheng Chen, Meng Wang, 10 Long Xu, Haoliang Li, Nicu Sebe, Sam Kwong, and Shiqi Wang. When video coding meets multimodal large language models: A unified paradigm for video coding, 2025. 4

  55. [55]

    Efros, Eli Shecht- man, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), June 2018. 5

  56. [56]

    Ssp-ir: Semantic and structure priors for diffusion-based realistic image restora- tion.IEEE Transactions on Circuits and Systems for Video Technology, 35(7):6259–6272, 2025

    Yuhong Zhang, Hengsheng Zhang, Zhengxue Cheng, Rong Xie, Li Song, and Wenjun Zhang. Ssp-ir: Semantic and structure priors for diffusion-based realistic image restora- tion.IEEE Transactions on Circuits and Systems for Video Technology, 35(7):6259–6272, 2025. 2

  57. [57]

    Eden: Enhanced diffusion for high-quality large-motion video frame interpo- lation

    Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpo- lation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 2105– 2115, June 2025. 2 11 Low-Bitrate Video Compression through Semantic-Conditioned ...