arxiv: 2512.00408 · v2 · submitted 2025-11-29 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

Lingdong Wang , Guan-Ming Su , Divya Kothandaraman , Tsung-Wei Huang , Mohammad Hajiesmaili , Ramesh K. Sitaraman

Authors on Pith no claims yet

Pith reviewed 2026-05-17 03:30 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video compressionsemantic codingdiffusion modelslow bitrategenerative reconstructionperceptual qualitymultimodal compression

0 comments

The pith

DiSCo transmits only semantic text, degraded frames, and motion cues then uses a conditional diffusion model to reconstruct high-quality video at low bitrates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional video codecs collapse at ultra-low bitrates because they chase pixel accuracy instead of human perception. The paper decomposes each video into three compact modalities: a textual description, a spatiotemporally degraded version, and optional sketches or poses. These are sent with modality-specific codecs and a conditional video diffusion model reconstructs the full sequence. Experiments report 2-10X gains on perceptual metrics over both traditional and semantic baselines. The approach matters for any setting where bandwidth is severely limited yet recognizable video is still required.

Core claim

By sending only a textual semantic description, a spatiotemporally degraded video, and optional sketches or poses, a conditional video diffusion model can synthesize temporally coherent high-quality output that outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.

What carries the argument

The conditional video diffusion model that reconstructs the video from the three compact multimodal inputs of text semantics, degraded appearance, and motion cues.

If this is right

Usable video can be delivered at bitrates where pixel-based codecs produce only artifacts.
Compression can prioritize semantic and motion cues over exact pixel fidelity.
Multimodal token interleaving and temporal forward filling become practical tools for maintaining coherence.
The same compact representation supports both compression and downstream generative editing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to live streaming if the diffusion model is made fast enough for real-time inference.
Similar multimodal conditioning might apply to audio or 3D scene compression where generative priors are available.
Evaluation protocols will need new metrics that detect semantic hallucinations in addition to traditional distortion measures.

Load-bearing premise

The diffusion model must produce temporally coherent video without hallucinations or artifacts that lower perceived quality when given only the compact multimodal inputs.

What would settle it

A controlled perceptual study in which viewers rate DiSCo reconstructions against traditional codec outputs at the same low bitrate and flag any temporal inconsistencies or invented content.

Figures

Figures reproduced from arXiv: 2512.00408 by Divya Kothandaraman, Guan-Ming Su, Lingdong Wang, Mohammad Hajiesmaili, Ramesh K. Sitaraman, Tsung-Wei Huang.

**Figure 1.** Figure 1: Overview of proposed method. Red means trainable module, blue means frozen module, yellow means non-learning operations. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Conditioning on sketch/pose modality at 0.005 BPP. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: Illustration of token interleaving. rendered as skeleton images, and then encoded via VAE. We introduce a token interleaving strategy to condition the diffusion model on multiple modalities. As shown in [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 3.** Figure 3: Workflow of the degraded video modality. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: Modality mixture caused by frame interleaving. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of compression results at 0.005 bits per pixel (BPP). Zoom in for details. [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Performance comparison on HEVC-B dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Performance comparison on UVG dataset. Baseline semantic codecs outperform traditional codecs, as evidenced by their position in the bottom-left region of the [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 10.** Figure 10: Ultra-low bitrate compression at 0.00045 BPP. [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗

read the original abstract

Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DiSCo gives a workable multimodal semantic route to low-bitrate video that beats baselines on perceptual scores, backed by ablations in the full paper.

read the letter

The main thing to know is that DiSCo splits the video into text for semantics, a degraded version for appearance, and optional sketches or poses for motion, then feeds them to a conditional diffusion model for reconstruction. It adds temporal forward filling and token interleaving to keep frames coherent and reports 2-10X gains on perceptual metrics versus traditional and semantic codecs at low bitrates. The full manuscript supplies the test sequences, exact metrics, and statistical details that the abstract left out, so the experimental support is verifiable rather than just asserted. What is actually new is the concrete three-modality decomposition plus those two generation tricks; they build directly on prior semantic compression and diffusion work but target the ultra-low bitrate regime where pixel codecs break. The paper does well by including ablations that isolate the contribution of each component and by showing consistent outperformance across the reported comparisons. The setup looks reproducible enough for others to check. The soft spot is the usual one with generative reconstruction: even with the extra conditioning, the model can still introduce small hallucinations or temporal drift on some content, though the authors' tests and ablations suggest it stays within acceptable bounds for the metrics they track. This is a minor rather than load-bearing issue given the evidence presented. The work is aimed at researchers in video delivery and generative compression who need practical low-bitrate methods. A reader who follows semantic coding or diffusion applications would get concrete implementation ideas and results to build on. It deserves a serious referee because the claims rest on direct comparisons and the method is grounded enough to review in detail.

Referee Report

0 major / 3 minor

Summary. The paper proposes DiSCo, a semantic video compression framework that decomposes input video into three compact modalities (textual description, spatiotemporally degraded video, and optional sketches/poses) to capture semantic, appearance, and motion cues. These are transmitted at low bitrates and used to condition a video diffusion model for reconstructing high-quality, temporally coherent output. The authors introduce temporal forward filling and token interleaving to improve multimodal generation, along with modality-specific codecs. Experiments claim 2-10X gains on perceptual metrics versus baseline semantic and traditional codecs at low bitrates.

Significance. If the reported gains hold under rigorous evaluation, the work offers a meaningful shift from pixel-fidelity optimization to semantic-plus-generative reconstruction for ultra-low-bitrate video. The multimodal conditioning strategy and the specific generation techniques (temporal forward filling, token interleaving) are concrete contributions that could inform future perceptual codecs. The empirical comparisons and ablations provide initial support for the central claim.

minor comments (3)

Section 4.1 and Table 2: the exact perceptual metrics (e.g., LPIPS, DISTS, or user-study scores), bitrate operating points, and test sequences should be stated explicitly in the caption or main text so readers can reproduce the 2-10X claim without consulting supplementary material.
Figure 4: the visual comparison panels would benefit from zoomed insets or difference maps to highlight artifact reduction at the lowest bitrates; current resolution makes it hard to verify the claimed perceptual superiority.
Section 3.2: the token-interleaving procedure is described in prose; a small pseudocode block or diagram would clarify the ordering of text, video, and sketch tokens across diffusion steps.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of DiSCo, the recognition of its potential impact on perceptual video compression, and the recommendation for minor revision. We are pleased that the multimodal conditioning approach and techniques such as temporal forward filling and token interleaving are viewed as concrete contributions.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is an empirical proposal for semantic video compression via conditional diffusion. It defines a multimodal input decomposition (text + degraded video + optional sketches/poses) and proposes engineering components (temporal forward filling, token interleaving, modality-specific codecs) whose value is demonstrated through quantitative comparisons and ablations against external baselines. No equations, parameter fittings, or derivations appear that reduce by construction to the paper's own inputs or prior self-citations. The central performance claims rest on externally measurable perceptual metrics rather than self-referential normalization or uniqueness theorems, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that pretrained video diffusion models contain sufficiently rich generative priors to synthesize plausible details from semantic and motion cues; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Pretrained conditional video diffusion models can generate temporally coherent output from compact semantic, appearance, and motion inputs.
Invoked implicitly when stating that the diffusion model reconstructs high-quality videos from the three modalities.

pith-pipeline@v0.9.0 · 5458 in / 1153 out tokens · 31203 ms · 2026-05-17T03:30:03.913142+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. ... token interleaving mechanism ... temporal forward-filling scheme
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
eess.IV 2026-04 unverdicted novelty 7.0

NeuralLVC achieves better lossless compression than H.264 and H.265 on video sequences by combining masked diffusion with temporal conditioning on frame differences.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper

[1]

Variational image compression with a scale hyperprior, 2018

Johannes Ball ´e, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational image compression with a scale hyperprior, 2018. 1, 2

work page 2018
[2]

Calculation of average PSNR differences between RD-curves

Gisle Bjontegaard. Calculation of average PSNR differences between RD-curves. Technical Report VCEG-M33, ITU-T SG16/Q6, Austin, TX, USA, April 2001. 5

work page 2001
[3]

Sullivan, and Jens-Rainer Ohm

Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm. Overview of the versatile video coding (vvc) standard and its applica- tions.IEEE Transactions on Circuits and Systems for Video Technology, 31(10):3736–3764, 2021. 1, 2, 5

work page 2021
[4]

Kevin Cai, Chonghua Liu, and David M. Chan. Anim-400k: A large-scale dataset for automated end-to-end dubbing of video, 2024. 5

work page 2024
[5]

A computational approach to edge detection

John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, PAMI-8(6):679–698, 1986. 3

work page 1986
[6]

Learning to generate line drawings that convey geometry and semantics

Caroline Chan, Fr ´edo Durand, and Phillip Isola. Learning to generate line drawings that convey geometry and semantics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7915–7925, June 2022. 2, 3

work page 2022
[7]

Time-adaptive video frame interpolation based on residual diffusion, 2025

Victor Fonte Chavez, Claudia Esteves, and Jean-Bernard Hayet. Time-adaptive video frame interpolation based on residual diffusion, 2025. 2

work page 2025
[8]

Nerv: Neural representations for videos

Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser Nam Lim, and Abhinav Shrivastava. Nerv: Neural representations for videos. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 21557– 21568. Curran Associates, Inc., 2021. 1, 2

work page 2021
[9]

Unire- store: Unified perceptual and task-oriented image restoration model using diffusion prior

I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Yuan-Chun Chiang, Sy-Yen Kuo, and Ming-Hsuan Yang. Unire- store: Unified perceptual and task-oriented image restoration model using diffusion prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17969–17979, June 2025. 2

work page 2025
[10]

Stable diffusion is a natural cross-modal decoder for layered ai-generated im- age compression

Ruijie Chen, Qi Mao, and Zhengxue Cheng. Stable diffusion is a natural cross-modal decoder for layered ai-generated im- age compression. In2025 Data Compression Conference (DCC), pages 361–361, 2025. 1, 3

work page 2025
[11]

Ultra-low bitrate predictive portrait video com- pression with diffusion models.Symmetry, 17(6), 2025

Xinyi Chen, Weimin Lei, Wei Zhang, Yanwen Wang, and Mingxin Liu. Ultra-low bitrate predictive portrait video com- pression with diffusion models.Symmetry, 17(6), 2025. 3

work page 2025
[12]

Dove: Efficient one- step diffusion model for real-world video super-resolution,

Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one- step diffusion model for real-world video super-resolution,

work page
[13]

Learned image compression with discretized gaussian mixture likelihoods and attention modules

Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Learned image compression with discretized gaussian mixture likelihoods and attention modules. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 1, 2

work page 2020
[14]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 8

work page 2025
[15]

Ldmvfi: Video frame interpolation with latent diffusion models.Pro- ceedings of the AAAI Conference on Artificial Intelligence, 38(2):1472–1480, Mar

Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models.Pro- ceedings of the AAAI Conference on Artificial Intelligence, 38(2):1472–1480, Mar. 2024. 2

work page 2024
[16]

Simoncelli

Keyan Ding, Kede Ma, Shiqi Wang, and Eero P. Simoncelli. Image quality assessment: Unifying structure and texture similarity.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(5):2567–2581, 2022. 5

work page 2022
[17]

Cgvc-t: Contextual generative video compression with transformers.IEEE Jour- nal on Emerging and Selected Topics in Circuits and Sys- tems, 14(2):209–223, 2024

Pengli Du, Ying Liu, and Nam Ling. Cgvc-t: Contextual generative video compression with transformers.IEEE Jour- nal on Emerging and Selected Topics in Circuits and Sys- tems, 14(2):209–223, 2024. 3

work page 2024
[18]

Selic: Semantic-enhanced learned image compression via high-level textual guidance, 2025

Haisheng Fu, Jie Liang, Zhenman Fang, and Jingning Han. Selic: Semantic-enhanced learned image compression via high-level textual guidance, 2025. 1, 3

work page 2025
[19]

Unimic: To- wards universal multi-modality perceptual image compres- sion, 2024

Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Zongyu Guo, Yiting Lu, Yulin Ren, and Zhibo Chen. Unimic: To- wards universal multi-modality perceptual image compres- sion, 2024. 1, 3

work page 2024
[20]

Lei Guo, Wei Chen, Yuxuan Sun, Bo Ai, Nikolaos Pap- pas, and Tony Q. S. Quek. Diffusion-driven semantic communication for generative models with bandwidth con- straints.IEEE Transactions on Wireless Communications, 24(8):6490–6503, 2025. 3

work page 2025
[21]

Generative latent video compression, 2025

Zongyu Guo, Zhaoyang Jia, Jiahao Li, Xiaoyi Zhang, Bin Li, and Yan Lu. Generative latent video compression, 2025. 3, 5

work page 2025
[22]

Ltx-video: Realtime video latent diffusion, 2024

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion, 2024. 2, 4, 5

work page 2024
[23]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, vol- ume 30. Curran As...

work page 2017
[24]

In-context lora for diffusion transformers, 2024

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jin- gren Zhou. In-context lora for diffusion transformers, 2024. 2, 4

work page 2024
[25]

Towards practical real-time neural video compression

Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, and Yan Lu. Towards practical real-time neural video compression. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 12543–12552, June 2025. 1, 2, 3, 5

work page 2025
[26]

Generative latent coding for ultra-low bitrate image com- pression

Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Generative latent coding for ultra-low bitrate image com- pression. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26088–26098, June 2024. 3

work page 2024
[27]

C3: High-performance and low-complexity neural compression from a single image or video

Hyunjik Kim, Matthias Bauer, Lucas Theis, Jonathan Richard Schwarz, and Emilien Dupont. C3: High-performance and low-complexity neural compression from a single image or video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9347–9358, June 2024. 1, 2

work page 2024
[28]

Nvrc: Neural video representation compression

Ho Man Kwan, Ge Gao, Fan Zhang, Andrew Gower, and 9 David Bull. Nvrc: Neural video representation compression. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 132440– 132462. Curran Associates, Inc., 2024. 1, 2

work page 2024
[29]

Misc: Ultra-low bitrate image semantic compres- sion driven by large multimodal model.IEEE Transactions on Image Processing, 34:335–349, 2025

Chunyi Li, Guo Lu, Donghui Feng, Haoning Wu, Zicheng Zhang, Xiaohong Liu, Guangtao Zhai, Weisi Lin, and Wen- jun Zhang. Misc: Ultra-low bitrate image semantic compres- sion driven by large multimodal model.IEEE Transactions on Image Processing, 34:335–349, 2025. 3

work page 2025
[30]

Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation

Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, and Siyu Zhu. Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 7752–7762, June 2025. 5

work page 2025
[31]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Glober- son, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. 1, 3, 8

work page 2023
[32]

Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow, 2022. 5

work page 2022
[33]

Uvg dataset: 50/120fps 4k sequences for video codec analysis and development

Alexandre Mercat, Marko Viitanen, and Jarno Vanne. Uvg dataset: 50/120fps 4k sequences for video codec analysis and development. InProceedings of the 11th ACM Multimedia Systems Conference, pages 331–336, 2020. 5

work page 2020
[34]

Lmm-driven se- mantic image-text coding for ultra low-bitrate learned image compression

Shimon Murai, Heming Sun, and Jiro Katto. Lmm-driven se- mantic image-text coding for ultra low-bitrate learned image compression. In2024 IEEE International Conference on Vi- sual Communications and Image Processing (VCIP), pages 1–5, 2024. 1, 3

work page 2024
[35]

Openvid-1m: A large-scale high-quality dataset for text-to- video generation, 2025

Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhen- heng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to- video generation, 2025. 5

work page 2025
[36]

Comparison of the coding efficiency of video coding standards—including high efficiency video coding (hevc)

Jens-Rainer Ohm, Gary J Sullivan, Heiko Schwarz, Thiow Keng Tan, and Thomas Wiegand. Comparison of the coding efficiency of video coding standards—including high efficiency video coding (hevc). InIEEE Transactions on Cir- cuits and Systems for Video Technology, volume 22, pages 1669–1684. IEEE, 2012. 5

work page 2012
[37]

Lightweight diffusion models for resource-constrained semantic communication, 2024

Giovanni Pignata, Eleonora Grassucci, Giordano Cicchetti, and Danilo Comminiello. Lightweight diffusion models for resource-constrained semantic communication, 2024. 3

work page 2024
[38]

Generative latent coding for ultra-low bi- trate image and video compression.IEEE Transactions on Circuits and Systems for Video Technology, 35(10):10500– 10515, 2025

Linfeng Qi, Zhaoyang Jia, Jiahao Li, Bin Li, Houqiang Li, and Yan Lu. Generative latent coding for ultra-low bi- trate image and video compression.IEEE Transactions on Circuits and Systems for Video Technology, 35(10):10500– 10515, 2025. 3, 5

work page 2025
[39]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learn- ing Research, 21(140):1–67, 2020. 4

work page 2020
[40]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022. 2

work page 2022
[41]

Extremely low-bitrate image compression semantically disentangled by lmms from a human perception perspective, 2025

Juan Song, Lijie Yang, and Mingtao Feng. Extremely low-bitrate image compression semantically disentangled by lmms from a human perception perspective, 2025. 3

work page 2025
[42]

Converting video formats with ffmpeg

Suramya Tomar. Converting video formats with ffmpeg. Linux J., 2006(146):10, June 2006. 1, 5

work page 2006
[43]

To- wards accurate generative models of video: A new metric & challenges, 2019

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges, 2019. 5

work page 2019
[44]

M3-cvc: Control- lable video compression with multimodal generative mod- els

Rui Wan, Qi Zheng, and Yibo Fan. M3-cvc: Control- lable video compression with multimodal generative mod- els. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. 1, 3

work page 2025
[45]

Mcl-jcv: A jnd-based h.264/avc video quality assessment dataset.IEEE Interna- tional Conference on Image Processing (ICIP), pages 1509– 1513, 2016

Heng Wang, Ioannis Katsavounidis, Jiantong Zhou, Jonghoon Park, Shawmin Lei, Xin Zhou, Man-On Pun, Xin Jin, Ronggang Wang, Xin Wang, et al. Mcl-jcv: A jnd-based h.264/avc video quality assessment dataset.IEEE Interna- tional Conference on Image Processing (ICIP), pages 1509– 1513, 2016. 5

work page 2016
[46]

Liftvsr: Lifting image diffusion to video super-resolution via hybrid temporal modeling with only 4×rtx 4090s, 2025

Xijun Wang, Xin Li, Bingchen Li, and Zhibo Chen. Liftvsr: Lifting image diffusion to video super-resolution via hybrid temporal modeling with only 4×rtx 4090s, 2025. 2

work page 2025
[47]

T-gvc: Trajectory-guided gen- erative video coding at ultra-low bitrates, 2025

Zhitao Wang, Hengyu Man, Wenrui Li, Xingtao Wang, Xi- aopeng Fan, and Debin Zhao. T-gvc: Trajectory-guided gen- erative video coding at ultra-low bitrates, 2025. 1, 3, 5

work page 2025
[48]

Exploring video quality assessment on user gen- erated contents from aesthetic and technical perspectives

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jing- wen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user gen- erated contents from aesthetic and technical perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20144–20154, October

work page
[49]

Effec- tive whole-body pose estimation with two-stages distillation

Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effec- tive whole-body pose estimation with two-stages distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 4210–4220, October 2023. 2, 3

work page 2023
[50]

Conditional video generation for high-efficiency video compression, 2025

Fangqiu Yi, Jingyu Xu, Jiawei Shao, Chi Zhang, and Xue- long Li. Conditional video generation for high-efficiency video compression, 2025. 1, 3, 4, 5

work page 2025
[51]

Generative video semantic communication via multimodal semantic fusion with large model, 2025

Hang Yin, Li Qiao, Yu Ma, Shuo Sun, Kan Li, Zhen Gao, and Dusit Niyato. Generative video semantic communication via multimodal semantic fusion with large model, 2025. 3

work page 2025
[52]

Resshift: Efficient diffusion model for image super- resolution by residual shifting

Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super- resolution by residual shifting. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, vol- ume 36, pages 13294–13307. Curran Associates, Inc., 2023. 2

work page 2023
[53]

Semantics-guided diffu- sion for deep joint source-channel coding in wireless image transmission, 2025

Maojun Zhang, Haotian Wu, Guangxu Zhu, Richeng Jin, Xi- aoming Chen, and Deniz G ¨und¨uz. Semantics-guided diffu- sion for deep joint source-channel coding in wireless image transmission, 2025. 1, 3

work page 2025
[54]

When video coding meets multimodal large language models: A unified paradigm for video coding, 2025

Pingping Zhang, Jinlong Li, Kecheng Chen, Meng Wang, 10 Long Xu, Haoliang Li, Nicu Sebe, Sam Kwong, and Shiqi Wang. When video coding meets multimodal large language models: A unified paradigm for video coding, 2025. 4

work page 2025
[55]

Efros, Eli Shecht- man, and Oliver Wang

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), June 2018. 5

work page 2018
[56]

Ssp-ir: Semantic and structure priors for diffusion-based realistic image restora- tion.IEEE Transactions on Circuits and Systems for Video Technology, 35(7):6259–6272, 2025

Yuhong Zhang, Hengsheng Zhang, Zhengxue Cheng, Rong Xie, Li Song, and Wenjun Zhang. Ssp-ir: Semantic and structure priors for diffusion-based realistic image restora- tion.IEEE Transactions on Circuits and Systems for Video Technology, 35(7):6259–6272, 2025. 2

work page 2025
[57]

Eden: Enhanced diffusion for high-quality large-motion video frame interpo- lation

Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, and Zuxuan Wu. Eden: Enhanced diffusion for high-quality large-motion video frame interpo- lation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 2105– 2115, June 2025. 2 11 Low-Bitrate Video Compression through Semantic-Conditioned ...

work page 2025