Perception-Aware Video Semantic Communication

Yinhuan Huang; Zhijin Qin

arxiv: 2605.19397 · v1 · pith:R4NJQUHRnew · submitted 2026-05-19 · 📡 eess.IV · cs.MM

Perception-Aware Video Semantic Communication

Yinhuan Huang , Zhijin Qin This is my paper

Pith reviewed 2026-05-20 02:40 UTC · model grok-4.3

classification 📡 eess.IV cs.MM

keywords video semantic communicationperception-aware transmissionwireless videospatio-temporal featuresreal-time decodingbandwidth reductionLPIPSDISTS

0 comments

The pith

A perception-aware semantic communication system encodes video features for wireless transmission to cut bandwidth use while preserving human visual quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PVSC as a framework that replaces traditional separated source-channel coding with direct transmission of compact spatio-temporal features. It removes the need for explicit motion vectors and adds side-information formatting plus reference-buffer management to support stable decoding. Experiments show the approach maintains perceptual quality metrics across different datasets, resolutions, and channel conditions while using substantially less bandwidth than conventional baselines. This matters because rising ultra-high-resolution and immersive video traffic is pushing wireless links toward their limits, where pixel-exact methods often waste resources or fail under latency constraints. The design also allows a single model to adapt to varying bandwidth without retraining.

Core claim

PVSC generates channel-robust symbol streams through spatio-temporal feature coding without transmitting motion vectors, then uses specified side-information formatting, reference-buffer management, and lightweight rate control to achieve stable receiver reconstruction and bandwidth-adaptive inference from one learned model.

What carries the argument

The PVSC framework that combines perception-aware spatio-temporal feature coding with explicit side-information formatting and reference-buffer management to produce compact symbol streams.

If this is right

PVSC delivers comparable or better LPIPS and DISTS scores while using up to 75 percent and 87 percent less bandwidth than an engineered VTM plus 5G LDPC baseline.
The same model supports real-time inference on a single consumer GPU across varied resolutions and group-of-pictures lengths.
Performance remains superior under multiple channel conditions without requiring separate models for each bandwidth level.
Elimination of explicit motion-vector transmission reduces overhead and improves robustness in short-blocklength wireless settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend naturally to live streaming of 360-degree or volumetric video where motion compensation is especially costly.
If the rate-control logic generalizes, it may reduce the need for frequent model updates in deployed wireless video systems.
Similar feature-based semantic coding might apply to audio or sensor data streams facing the same bandwidth-latency trade-offs.

Load-bearing premise

The single learned model with the chosen side-information formatting, reference-buffer rules, and rate control will maintain stable reconstruction quality and correct bandwidth adaptation under every real-world wireless channel and every type of video content.

What would settle it

A controlled test on rapidly fading channels or high-motion video sequences that shows either a sharp rise in required bandwidth to meet target LPIPS/DISTS scores or visible reconstruction artifacts at the receiver.

Figures

Figures reproduced from arXiv: 2605.19397 by Yinhuan Huang, Zhijin Qin.

**Figure 2.** Figure 2: System model of the proposed PVSC. AE, AD, Q, PNG, and RM denote arithmetic encoding, arithmetic decoding, quantization, portable network graphics coding, and rate matching, respectively. PVSC models implicit spatio-temporal dependencies, where F tx/rx t−1 is used to generate C tx/rx e,t−1 and C tx/rx f,t−1 . C tx/rx e,t−1 , C tx/rx f,t−1 , and C tx/rx s,t−1 denote the temporal contexts used at time t for … view at source ↗

**Figure 3.** Figure 3: (a) Transmitter-side buffer update branch for local buffer updating and temporal-context generation [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: (a) Feature coding with ViT/contextual ViT blocks for spatial-temporal modeling and FC layers for complex [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Transmission and reception pipeline. Fext(·), and the generator G(·), which provides lightweight rate adaptation and improves the flexibility of PVSC [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Rate-perception-distortion performance on 1080p/2K video datasets over different channels. The channel bandwidth [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Visual comparison of different methods over an AWGN channel (SNR = 6 dB). Zoom in for a better view. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

read the original abstract

Ultra-high-resolution streaming and emerging immersive services are driving rapidly increasing wireless video traffic. However, perceptually pleasing video transmission over bandwidth-limited and latency-constrained wireless links remains challenging for conventional separated source-channel systems, which primarily target bit-level reliability and often suffer performance degradation under short-blocklength transmission. In addition, pixel-level distortion optimization does not necessarily align with human perception, while existing learned video codecs may incur high complexity and raise deployment issues. This paper proposes PVSC, a perception-aware video semantic communication framework for real-time wireless video transmission. PVSC eliminates explicit motion-vector transmission and exploits spatio-temporal feature coding to generate compact and channel-robust symbol streams. It also specifies side-information formatting, reference-buffer management, and lightweight rate control, enabling stable receiver-side reconstruction and bandwidth-adaptive inference with a single model. Extensive experiments demonstrate that PVSC achieves superior performance across diverse datasets, resolutions, GOP configurations, and channel conditions. Compared with the engineered ``VTM + 5G LDPC'' baseline, PVSC saves up to about 75% and 87% bandwidth at comparable LPIPS and DISTS, respectively, while enabling real-time inference on a single NVIDIA RTX 4090 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PVSC gives a workable semantic video framework that drops motion vectors and uses one model for perceptual quality plus bandwidth adaptation, but the robustness to real wireless channels is the part that still needs more evidence.

read the letter

The main takeaway is that this paper builds a perception-aware semantic communication system for video called PVSC. It skips sending motion vectors, codes spatio-temporal features into compact symbols, and adds side-information formatting, reference-buffer management, and lightweight rate control so one model handles different bandwidths at the receiver. The experiments report up to 75% bandwidth savings at matched LPIPS and 87% at matched DISTS versus a VTM plus 5G LDPC baseline, plus real-time inference on a single RTX 4090. That combination of concrete architecture choices and perceptual metrics is what stands out as useful progress in this area. The work does a solid job showing how semantic coding can align better with human perception than pixel-level targets, and the single-model adaptation claim is a practical plus for deployment. The tests span multiple datasets, resolutions, and GOP setups, which adds some breadth. The soft spot is the generalization to channel conditions. The headline savings rest on the model staying stable under varied wireless environments, but if training used simpler AWGN or block-fading while real links have Doppler, clustered delay, or bursty interference, the perceptual scores could degrade even at matched average SNR. The abstract says it works across channel conditions, yet without seeing explicit out-of-distribution tests or the exact training ensemble, that part feels less secured than the rest. This paper is for people working on learned communication systems for real-time video. A reader who wants architecture ideas for semantic feature coding and perceptual rate control would get value from the details and numbers. It deserves a serious referee because the claims are specific enough to evaluate and the topic is timely for wireless multimedia. I would recommend sending it for peer review, with the main request being clearer evidence on channel robustness.

Referee Report

2 major / 2 minor

Summary. The paper proposes PVSC, a perception-aware video semantic communication framework for real-time wireless video transmission over bandwidth-limited links. It eliminates explicit motion-vector transmission by exploiting spatio-temporal feature coding to produce compact, channel-robust symbol streams, and specifies side-information formatting, reference-buffer management, and lightweight rate control to support stable receiver-side reconstruction and bandwidth-adaptive inference using a single model. Extensive experiments are reported to demonstrate superior performance across diverse datasets, resolutions, GOP configurations, and channel conditions, with up to 75% and 87% bandwidth savings versus the VTM + 5G LDPC baseline at comparable LPIPS and DISTS, respectively, while enabling real-time inference on a single NVIDIA RTX 4090 GPU.

Significance. If the reported bandwidth savings and perceptual-quality results hold under realistic conditions, the work would offer a meaningful advance for semantic communication in wireless video, particularly by aligning transmission with human perception rather than pixel-level distortion and by achieving real-time operation on commodity hardware. The single-model adaptive inference via the described rate control and buffer management is a practical strength that could reduce deployment complexity compared with separate source-channel systems.

major comments (2)

[Abstract / Experimental evaluation] Abstract and experimental evaluation: the headline claim of up to 75% / 87% bandwidth reduction at matched LPIPS/DISTS 'across … channel conditions' rests on an untested distributional-robustness assumption. No explicit description is given of the training channel ensemble (e.g., AWGN or block-fading) versus the test conditions, nor are out-of-distribution evaluations (3GPP TR 38.901 clustered delay line, Doppler, bursty interference) reported. This directly affects the load-bearing assertion that the learned symbol mapping remains stable under the full range of real-world wireless statistics.
[Methods / Results] Methods and results sections: the abstract states concrete percentage savings and real-time performance, yet the manuscript provides insufficient detail on dataset splits, number of sequences, statistical significance tests, hyper-parameter selection, and any post-hoc choices. Without these, the degree to which the data support the central performance claim cannot be independently verified.

minor comments (2)

Notation for side-information formatting and reference-buffer management could be clarified with a small diagram or pseudocode to aid reproducibility.
Consider adding a short discussion of failure modes (e.g., high-motion content or very low SNR) to temper the generalization statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have helped us identify areas where additional clarity and transparency will strengthen the manuscript. We provide point-by-point responses below and commit to revisions that directly address the concerns raised.

read point-by-point responses

Referee: [Abstract / Experimental evaluation] Abstract and experimental evaluation: the headline claim of up to 75% / 87% bandwidth reduction at matched LPIPS/DISTS 'across … channel conditions' rests on an untested distributional-robustness assumption. No explicit description is given of the training channel ensemble (e.g., AWGN or block-fading) versus the test conditions, nor are out-of-distribution evaluations (3GPP TR 38.901 clustered delay line, Doppler, bursty interference) reported. This directly affects the load-bearing assertion that the learned symbol mapping remains stable under the full range of real-world wireless statistics.

Authors: We agree that an explicit description of the channel models is necessary to support the robustness claims. In the revised manuscript we will insert a dedicated paragraph in Section III-C (Channel Model) that specifies the training ensemble as AWGN (SNR uniformly sampled from 0–30 dB) together with block-fading channels whose coherence time is drawn from {10, 20, 50} ms. All quantitative results, including the reported bandwidth savings at matched LPIPS/DISTS, were generated under exactly these conditions. While we did not include the full 3GPP TR 38.901 clustered-delay-line or bursty-interference scenarios, the single-model rate-control mechanism already demonstrates stable reconstruction across the tested SNR and coherence-time range (see Figures 6–8 and the associated ablation). We will add a short limitations paragraph acknowledging that broader 3GPP-style evaluations remain future work, thereby avoiding any overstatement of distributional robustness. revision: yes
Referee: [Methods / Results] Methods and results sections: the abstract states concrete percentage savings and real-time performance, yet the manuscript provides insufficient detail on dataset splits, number of sequences, statistical significance tests, hyper-parameter selection, and any post-hoc choices. Without these, the degree to which the data support the central performance claim cannot be independently verified.

Authors: We accept that the current experimental description lacks sufficient granularity for independent verification. In the revised manuscript we will expand Section IV-A (Datasets and Implementation Details) to report: (i) explicit train/validation/test splits (80/10/10 per dataset), (ii) the precise number of sequences evaluated (UVG: 7 sequences; MCL-JCV: 30 sequences; etc.), (iii) results of paired t-tests confirming statistical significance (p < 0.05) for the reported LPIPS/DISTS savings, (iv) the hyper-parameter search procedure (grid search over learning rate, loss weights, and buffer size, with final values tabulated), and (v) an explicit statement that no post-hoc sequence selection occurred—all test sequences were included. These additions will be placed before the main results tables so that readers can fully assess the supporting evidence. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation or claims

full rationale

The paper proposes an empirical framework (PVSC) for semantic video transmission and reports experimental bandwidth savings versus an external engineered baseline (VTM + 5G LDPC). No equations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-citations, or renamed inputs. Performance claims rest on direct comparisons across datasets and conditions rather than any internal derivation chain, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Without the full manuscript, concrete free parameters, mathematical axioms, or newly postulated entities cannot be extracted; the ledger is therefore empty beyond the overall framework itself.

pith-pipeline@v0.9.0 · 5732 in / 1136 out tokens · 60871 ms · 2026-05-20T02:40:29.183088+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PVSC eliminates explicit motion-vector transmission and exploits spatio-temporal feature coding to generate compact and channel-robust symbol streams... lightweight rate control, enabling stable receiver-side reconstruction and bandwidth-adaptive inference with a single model.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Extensive experiments demonstrate that PVSC achieves superior performance across diverse datasets, resolutions, GOP configurations, and channel conditions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 1 internal anchor

[1]

Rate-efficient perception-oriented generative semantic video communication,

Y . Huang and Z. Qin, “Rate-efficient perception-oriented generative semantic video communication,” inProc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), 2026

work page 2026
[2]

Ericsson mobility report, june 2025,

“Ericsson mobility report, june 2025,” White Paper, Jun. 2025

work page 2025
[3]

Channel coding rate in the finite blocklength regime,

Y . Polyanskiy, H. V . Poor, and S. Verd ´u, “Channel coding rate in the finite blocklength regime,”IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, 2010

work page 2010
[4]

Toward massive, ultrareliable, and low-latency wireless communication with short packets,

G. Durisi, T. Koch, and P. Popovski, “Toward massive, ultrareliable, and low-latency wireless communication with short packets,”Proc. IEEE, vol. 104, no. 9, pp. 1711–1726, 2016

work page 2016
[5]

A mathematical theory of communication,

C. E. Shannon, “A mathematical theory of communication,”Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948

work page 1948
[6]

Deepwive: Deep-learning-aided wireless video transmission,

T.-Y . Tung and D. G ¨und¨uz, “Deepwive: Deep-learning-aided wireless video transmission,”IEEE J. Select. Areas Commun., vol. 40, no. 9, pp. 2570–2583, 2022. 13

work page 2022
[7]

Image quality assessment: From error visibility to structural similarity,

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,”IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004

work page 2004
[8]

The perception-distortion tradeoff,

Y . Blau and T. Michaeli, “The perception-distortion tradeoff,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2018

work page 2018
[9]

Overview of the versatile video coding (VVC) standard and its applications,

B. Bross, Y .-K. Wang, Y . Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, “Overview of the versatile video coding (VVC) standard and its applications,”IEEE Trans. Circuit Syst. Video Technol., vol. 31, no. 10, pp. 3736–3764, 2021

work page 2021
[10]

Deep contextual video compression,

J. Li, B. Li, and Y . Lu, “Deep contextual video compression,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, 2021, pp. 18 114– 18 125

work page 2021
[11]

Neural video compression with diverse contexts,

——, “Neural video compression with diverse contexts,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 22 616–22 626

work page 2023
[12]

Neural video compression with feature modulation,

——, “Neural video compression with feature modulation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2024, pp. 26 099– 26 108

work page 2024
[13]

Towards practical real-time neural video compression,

Z. Jia, B. Li, J. Li, W. Xie, L. Qi, H. Li, and Y . Lu, “Towards practical real-time neural video compression,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Nashville, TN, USA, Jun. 2025, pp. 11–25

work page 2025
[14]

The unreasonable effectiveness of deep features as a perceptual metric,

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2018

work page 2018
[15]

Image quality assessment: Unifying structure and texture similarity,

K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2567–2581, 2022

work page 2022
[16]

Perceptual learned video compression with recurrent conditional GAN,

R. Yang, R. Timofte, and L. Van Gool, “Perceptual learned video compression with recurrent conditional GAN,” inProc. Int. Joint Conf. Artif. Intell. (IJCAI), Jul. 2022, pp. 1537–1544

work page 2022
[17]

Generative latent coding for ultra-low bitrate image and video compression,

L. Qi, Z. Jia, J. Li, B. Li, H. Li, and Y . Lu, “Generative latent coding for ultra-low bitrate image and video compression,”IEEE Trans. Circuit Syst. Video Technol., vol. 35, no. 10, pp. 10 500–10 515, 2025

work page 2025
[18]

AI empowered wireless communications: From bits to semantics,

Z. Qin, L. Liang, Z. Wang, S. Jin, X. Tao, W. Tong, and G. Y . Li, “AI empowered wireless communications: From bits to semantics,”Proc. IEEE, vol. 112, no. 7, pp. 621–652, Jul. 2024

work page 2024
[19]

Task-oriented multi-user semantic communications,

H. Xie, Z. Qin, X. Tao, and K. B. Letaief, “Task-oriented multi-user semantic communications,”IEEE J. Select. Areas Commun., vol. 40, no. 9, pp. 2584–2597, Sept. 2022

work page 2022
[20]

Deep learning enabled semantic communication systems,

H. Xie, Z. Qin, G. Y . Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,”IEEE Trans. Signal Process., vol. 69, pp. 2663–2675, Apr. 2021

work page 2021
[21]

Robust semantic communications for speech transmission,

Z. Weng, Z. Qin, and G. Y . Li, “Robust semantic communications for speech transmission,” inProc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2025, pp. 1–5

work page 2025
[22]

Nonlinear transform source-channel coding for semantic communications,

J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, “Nonlinear transform source-channel coding for semantic communications,”IEEE J. Select. Areas Commun., vol. 40, no. 8, pp. 2300–2316, Aug. 2022

work page 2022
[23]

Progressive learned image transmission for semantic communication using hierarchical vae,

G. Zhang, H. Li, Y . Cai, Q. Hu, G. Yu, and Z. Qin, “Progressive learned image transmission for semantic communication using hierarchical vae,” IEEE Trans. Cognit. Comm. Netw., vol. 11, no. 6, pp. 3640–3654, 2025

work page 2025
[24]

Image semantic communication with quadtree partition-based coding,

Y . Huang and Z. Qin, “Image semantic communication with quadtree partition-based coding,”IEEE J. Select. Areas Commun., vol. 44, pp. 2765–2778, 2026

work page 2026
[25]

Joint semantic-channel coding and modulation for token communications,

J. Ying, Z. Qin, Y . Feng, L. Wang, and X. Tao, “Joint semantic-channel coding and modulation for token communications,”IEEE Trans. Wirel. Commun., vol. 25, pp. 8179–8193, 2026

work page 2026
[26]

Wireless deep video semantic transmission,

S. Wang, J. Dai, Z. Liang, K. Niu, Z. Si, C. Dong, X. Qin, and P. Zhang, “Wireless deep video semantic transmission,”IEEE J. Select. Areas Commun., vol. 41, no. 1, pp. 214–229, 2023

work page 2023
[27]

Deep learning enabled video semantic transmission against multi-dimensional noise,

H. Niu, L. Wang, Z. Lu, K. Du, and X. Wen, “Deep learning enabled video semantic transmission against multi-dimensional noise,” in2023 IEEE Globecom Workshops (GC Wkshps), 2023, pp. 1267–1272

work page 2023
[28]

Wireless video transmission with joint semantic- channel coding,

Y . Huang and Z. Qin, “Wireless video transmission with joint semantic- channel coding,” inProc. IEEE Globecom Workshops (GC Wkshps), 2024, pp. 1–6

work page 2024
[29]

Md- vsc—efficient wireless model division video semantic communication,

Z. Bao, H. Liang, C. Dong, C. Li, X. Xu, and P. Zhang, “Md- vsc—efficient wireless model division video semantic communication,” IEEE Internet Things J., vol. 12, no. 2, pp. 1109–1124, 2025

work page 2025
[30]

Vista: Video transmission over a semantic communication ap- proach,

C. Liang, X. Deng, Y . Sun, R. Cheng, L. Xia, D. Niyato, and M. A. Imran, “Vista: Video transmission over a semantic communication ap- proach,” inProc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), 2023, pp. 1777–1782

work page 2023
[31]

Bidirectional motion-enhanced semantic communication for wireless video transmission,

Z. Zhang, Q. Yang, S. He, and Z. Shi, “Bidirectional motion-enhanced semantic communication for wireless video transmission,”IEEE Internet Things J., vol. 13, no. 8, pp. 15 607–15 620, 2026

work page 2026
[32]

Goal-oriented semantic communication for wireless video transmission via generative ai,

N. Li, Y . Deng, and D. Niyato, “Goal-oriented semantic communication for wireless video transmission via generative ai,”IEEE Trans. Wirel. Commun., vol. 25, pp. 10 841–10 854, 2026

work page 2026
[33]

Object-attribute- relation representation-based video semantic communication,

Q. Du, Y . Duan, Q. Yang, X. Tao, and M. Debbah, “Object-attribute- relation representation-based video semantic communication,”IEEE J. Select. Areas Commun., vol. 43, no. 7, pp. 2446–2461, 2025

work page 2025
[34]

Wireless semantic communi- cations for video conferencing,

P. Jiang, C.-K. Wen, S. Jin, and G. Y . Li, “Wireless semantic communi- cations for video conferencing,”IEEE J. Select. Areas Commun., vol. 41, no. 1, pp. 230–244, 2023

work page 2023
[35]

Synchronous multi-modal semantic communication system with packet-level coding,

Y . Tian, J. Ying, Z. Qin, Y . Jin, and X. Tao, “Synchronous multi-modal semantic communication system with packet-level coding,”IEEE Trans. Wirel. Commun., vol. 24, no. 5, pp. 3684–3697, 2025

work page 2025
[36]

Agnolucci, L

L. Agnolucci, L. Galteri, M. Bertini, and A. D. Bimbo,IEEE Trans. Multimedia

work page
[37]

VideoQA-SC: Adaptive semantic communication for video question answering,

J. Guo, W. Chen, Y . Sun, J. Xu, and B. Ai, “VideoQA-SC: Adaptive semantic communication for video question answering,”IEEE J. Select. Areas Commun., vol. 43, no. 7, pp. 2462–2477, 2025

work page 2025
[38]

Massive MIMO networks: Spectral, energy, and hardware efficiency,

E. Bj ¨ornson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,”Found. Trends Signal Pro- cess., vol. 11, no. 3-4, pp. 154–655, 2017

work page 2017
[39]

Deep joint source- channel coding for wireless image transmission,

E. Bourtsoulatze, D. B. Kurka, and D. G ¨und¨uz, “Deep joint source- channel coding for wireless image transmission,”IEEE Trans. Cognit. Comm. Netw., vol. 5, no. 3, pp. 567–579, May. 2019

work page 2019
[40]

End-to-end optimized image compression,

J. Ball ´e, V . Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” inProc. Int. Conf. Learn. Represent. (ICLR), Toulon, France, Apr. 2017

work page 2017
[41]

Roelofs and R

G. Roelofs and R. Koman,PNG: The Definitive Guide. USA: O’Reilly & Associates, Inc., 1999

work page 1999
[42]

Taming transformers for high- resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- resolution image synthesis,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2021, pp. 12 873–12 883

work page 2021
[43]

Generative adversarial nets,

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 27, 2014

work page 2014
[44]

Checkerboard context model for efficient learned image compression,

D. He, Y . Zheng, B. Sun, Y . Wang, and H. Qin, “Checkerboard context model for efficient learned image compression,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2021, pp. 14 771–14 780

work page 2021
[45]

Robust semantic communications with masked VQ-V AE enabled codebook,

Q. Hu, G. Zhang, Z. Qin, Y . Cai, G. Yu, and G. Y . Li, “Robust semantic communications with masked VQ-V AE enabled codebook,”IEEE Trans. Wirel. Commun., vol. 22, no. 12, pp. 8707–8722, Dec. 2023

work page 2023
[46]

Video enhance- ment with task-oriented flow,

T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhance- ment with task-oriented flow,”Int. J. Comput. Vis., vol. 127, no. 8, pp. 1106–1125, 2019

work page 2019
[47]

BVI-DVC: A training database for deep video compression,

D. Ma, F. Zhang, and D. R. Bull, “BVI-DVC: A training database for deep video compression,”IEEE Trans. Multimedia, vol. 24, pp. 3847– 3858, 2021

work page 2021
[48]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[49]

Calculation of average psnr differences between rd- curves,

G. Bjontegaard, “Calculation of average psnr differences between rd- curves,”ITU SG16 Doc. VCEG-M33, 2001

work page 2001
[50]

Common test conditions and software reference configurations for HEVC range extensions,

D. Flynn, K. Sharman, and C. Rosewarne, “Common test conditions and software reference configurations for HEVC range extensions,”JCT-VC Doc. JCTVC-N1006, vol. 16, p. 6, 2013

work page 2013
[51]

UVG dataset: 50/120fps 4k sequences for video codec analysis and development,

A. Mercat, M. Viitanen, and J. Vanne, “UVG dataset: 50/120fps 4k sequences for video codec analysis and development,” inProc. ACM Multimedia Syst. Conf. (MMSys), Istanbul, Turkey, 2020, pp. 297–302

work page 2020
[52]

MCL-JCV: A JND- based H.264/A VC video quality assessment dataset,

H. Wang, W. Gan, S. Hu, J. Y . Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, “MCL-JCV: A JND- based H.264/A VC video quality assessment dataset,” inProc. IEEE Int. Conf. Image Process. (ICIP), 2016, pp. 1509–1513

work page 2016
[53]

FFmpeg reference software,

“FFmpeg reference software,” https://www.ffmpeg.org/, accessed: 2025- 04-13

work page 2025
[54]

HEVC official test model,

“HEVC official test model,” https://hevc.hhi.fraunhofer.de, accessed: 2025-04-13

work page 2025
[55]

VVC official test model,

“VVC official test model,” https://vcgit.hhi.fraunhofer.de/jvet/ VVCSoftware VTM, accessed: 2025-04-13

work page 2025
[56]

Design of low-density parity check codes for 5g new radio,

T. Richardson and S. Kudekar, “Design of low-density parity check codes for 5g new radio,”IEEE Commun. Mag., vol. 56, no. 3, pp. 28–34, Mar. 2018

work page 2018
[57]

3GPP TS 38.214 version 16.2.0 Release 16: 5G; NR; Physical layer procedures for data,

3GPP, “3GPP TS 38.214 version 16.2.0 Release 16: 5G; NR; Physical layer procedures for data,” https://www.etsi.org/deliver/etsi ts/138200 138299/138214/16.02.00 60/ts 138214v160200p.pdf, 2020, accessed: 2025-04-13

work page 2020
[58]

Hoydis, S

J. Hoydis, S. Cammerer, F. Ait Aoudia, M. Nimier-David, L. Maggi, G. Marcus, A. Vem, and A. Keller, “Sionna,” 2022, https://nvlabs.github.io/sionna/

work page 2022

[1] [1]

Rate-efficient perception-oriented generative semantic video communication,

Y . Huang and Z. Qin, “Rate-efficient perception-oriented generative semantic video communication,” inProc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), 2026

work page 2026

[2] [2]

Ericsson mobility report, june 2025,

“Ericsson mobility report, june 2025,” White Paper, Jun. 2025

work page 2025

[3] [3]

Channel coding rate in the finite blocklength regime,

Y . Polyanskiy, H. V . Poor, and S. Verd ´u, “Channel coding rate in the finite blocklength regime,”IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, 2010

work page 2010

[4] [4]

Toward massive, ultrareliable, and low-latency wireless communication with short packets,

G. Durisi, T. Koch, and P. Popovski, “Toward massive, ultrareliable, and low-latency wireless communication with short packets,”Proc. IEEE, vol. 104, no. 9, pp. 1711–1726, 2016

work page 2016

[5] [5]

A mathematical theory of communication,

C. E. Shannon, “A mathematical theory of communication,”Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948

work page 1948

[6] [6]

Deepwive: Deep-learning-aided wireless video transmission,

T.-Y . Tung and D. G ¨und¨uz, “Deepwive: Deep-learning-aided wireless video transmission,”IEEE J. Select. Areas Commun., vol. 40, no. 9, pp. 2570–2583, 2022. 13

work page 2022

[7] [7]

Image quality assessment: From error visibility to structural similarity,

Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,”IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004

work page 2004

[8] [8]

The perception-distortion tradeoff,

Y . Blau and T. Michaeli, “The perception-distortion tradeoff,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2018

work page 2018

[9] [9]

Overview of the versatile video coding (VVC) standard and its applications,

B. Bross, Y .-K. Wang, Y . Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, “Overview of the versatile video coding (VVC) standard and its applications,”IEEE Trans. Circuit Syst. Video Technol., vol. 31, no. 10, pp. 3736–3764, 2021

work page 2021

[10] [10]

Deep contextual video compression,

J. Li, B. Li, and Y . Lu, “Deep contextual video compression,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 34, 2021, pp. 18 114– 18 125

work page 2021

[11] [11]

Neural video compression with diverse contexts,

——, “Neural video compression with diverse contexts,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2023, pp. 22 616–22 626

work page 2023

[12] [12]

Neural video compression with feature modulation,

——, “Neural video compression with feature modulation,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2024, pp. 26 099– 26 108

work page 2024

[13] [13]

Towards practical real-time neural video compression,

Z. Jia, B. Li, J. Li, W. Xie, L. Qi, H. Li, and Y . Lu, “Towards practical real-time neural video compression,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Nashville, TN, USA, Jun. 2025, pp. 11–25

work page 2025

[14] [14]

The unreasonable effectiveness of deep features as a perceptual metric,

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2018

work page 2018

[15] [15]

Image quality assessment: Unifying structure and texture similarity,

K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 5, pp. 2567–2581, 2022

work page 2022

[16] [16]

Perceptual learned video compression with recurrent conditional GAN,

R. Yang, R. Timofte, and L. Van Gool, “Perceptual learned video compression with recurrent conditional GAN,” inProc. Int. Joint Conf. Artif. Intell. (IJCAI), Jul. 2022, pp. 1537–1544

work page 2022

[17] [17]

Generative latent coding for ultra-low bitrate image and video compression,

L. Qi, Z. Jia, J. Li, B. Li, H. Li, and Y . Lu, “Generative latent coding for ultra-low bitrate image and video compression,”IEEE Trans. Circuit Syst. Video Technol., vol. 35, no. 10, pp. 10 500–10 515, 2025

work page 2025

[18] [18]

AI empowered wireless communications: From bits to semantics,

Z. Qin, L. Liang, Z. Wang, S. Jin, X. Tao, W. Tong, and G. Y . Li, “AI empowered wireless communications: From bits to semantics,”Proc. IEEE, vol. 112, no. 7, pp. 621–652, Jul. 2024

work page 2024

[19] [19]

Task-oriented multi-user semantic communications,

H. Xie, Z. Qin, X. Tao, and K. B. Letaief, “Task-oriented multi-user semantic communications,”IEEE J. Select. Areas Commun., vol. 40, no. 9, pp. 2584–2597, Sept. 2022

work page 2022

[20] [20]

Deep learning enabled semantic communication systems,

H. Xie, Z. Qin, G. Y . Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,”IEEE Trans. Signal Process., vol. 69, pp. 2663–2675, Apr. 2021

work page 2021

[21] [21]

Robust semantic communications for speech transmission,

Z. Weng, Z. Qin, and G. Y . Li, “Robust semantic communications for speech transmission,” inProc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2025, pp. 1–5

work page 2025

[22] [22]

Nonlinear transform source-channel coding for semantic communications,

J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, “Nonlinear transform source-channel coding for semantic communications,”IEEE J. Select. Areas Commun., vol. 40, no. 8, pp. 2300–2316, Aug. 2022

work page 2022

[23] [23]

Progressive learned image transmission for semantic communication using hierarchical vae,

G. Zhang, H. Li, Y . Cai, Q. Hu, G. Yu, and Z. Qin, “Progressive learned image transmission for semantic communication using hierarchical vae,” IEEE Trans. Cognit. Comm. Netw., vol. 11, no. 6, pp. 3640–3654, 2025

work page 2025

[24] [24]

Image semantic communication with quadtree partition-based coding,

Y . Huang and Z. Qin, “Image semantic communication with quadtree partition-based coding,”IEEE J. Select. Areas Commun., vol. 44, pp. 2765–2778, 2026

work page 2026

[25] [25]

Joint semantic-channel coding and modulation for token communications,

J. Ying, Z. Qin, Y . Feng, L. Wang, and X. Tao, “Joint semantic-channel coding and modulation for token communications,”IEEE Trans. Wirel. Commun., vol. 25, pp. 8179–8193, 2026

work page 2026

[26] [26]

Wireless deep video semantic transmission,

S. Wang, J. Dai, Z. Liang, K. Niu, Z. Si, C. Dong, X. Qin, and P. Zhang, “Wireless deep video semantic transmission,”IEEE J. Select. Areas Commun., vol. 41, no. 1, pp. 214–229, 2023

work page 2023

[27] [27]

Deep learning enabled video semantic transmission against multi-dimensional noise,

H. Niu, L. Wang, Z. Lu, K. Du, and X. Wen, “Deep learning enabled video semantic transmission against multi-dimensional noise,” in2023 IEEE Globecom Workshops (GC Wkshps), 2023, pp. 1267–1272

work page 2023

[28] [28]

Wireless video transmission with joint semantic- channel coding,

Y . Huang and Z. Qin, “Wireless video transmission with joint semantic- channel coding,” inProc. IEEE Globecom Workshops (GC Wkshps), 2024, pp. 1–6

work page 2024

[29] [29]

Md- vsc—efficient wireless model division video semantic communication,

Z. Bao, H. Liang, C. Dong, C. Li, X. Xu, and P. Zhang, “Md- vsc—efficient wireless model division video semantic communication,” IEEE Internet Things J., vol. 12, no. 2, pp. 1109–1124, 2025

work page 2025

[30] [30]

Vista: Video transmission over a semantic communication ap- proach,

C. Liang, X. Deng, Y . Sun, R. Cheng, L. Xia, D. Niyato, and M. A. Imran, “Vista: Video transmission over a semantic communication ap- proach,” inProc. IEEE Int. Conf. Commun. Workshops (ICC Workshops), 2023, pp. 1777–1782

work page 2023

[31] [31]

Bidirectional motion-enhanced semantic communication for wireless video transmission,

Z. Zhang, Q. Yang, S. He, and Z. Shi, “Bidirectional motion-enhanced semantic communication for wireless video transmission,”IEEE Internet Things J., vol. 13, no. 8, pp. 15 607–15 620, 2026

work page 2026

[32] [32]

Goal-oriented semantic communication for wireless video transmission via generative ai,

N. Li, Y . Deng, and D. Niyato, “Goal-oriented semantic communication for wireless video transmission via generative ai,”IEEE Trans. Wirel. Commun., vol. 25, pp. 10 841–10 854, 2026

work page 2026

[33] [33]

Object-attribute- relation representation-based video semantic communication,

Q. Du, Y . Duan, Q. Yang, X. Tao, and M. Debbah, “Object-attribute- relation representation-based video semantic communication,”IEEE J. Select. Areas Commun., vol. 43, no. 7, pp. 2446–2461, 2025

work page 2025

[34] [34]

Wireless semantic communi- cations for video conferencing,

P. Jiang, C.-K. Wen, S. Jin, and G. Y . Li, “Wireless semantic communi- cations for video conferencing,”IEEE J. Select. Areas Commun., vol. 41, no. 1, pp. 230–244, 2023

work page 2023

[35] [35]

Synchronous multi-modal semantic communication system with packet-level coding,

Y . Tian, J. Ying, Z. Qin, Y . Jin, and X. Tao, “Synchronous multi-modal semantic communication system with packet-level coding,”IEEE Trans. Wirel. Commun., vol. 24, no. 5, pp. 3684–3697, 2025

work page 2025

[36] [36]

Agnolucci, L

L. Agnolucci, L. Galteri, M. Bertini, and A. D. Bimbo,IEEE Trans. Multimedia

work page

[37] [37]

VideoQA-SC: Adaptive semantic communication for video question answering,

J. Guo, W. Chen, Y . Sun, J. Xu, and B. Ai, “VideoQA-SC: Adaptive semantic communication for video question answering,”IEEE J. Select. Areas Commun., vol. 43, no. 7, pp. 2462–2477, 2025

work page 2025

[38] [38]

Massive MIMO networks: Spectral, energy, and hardware efficiency,

E. Bj ¨ornson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks: Spectral, energy, and hardware efficiency,”Found. Trends Signal Pro- cess., vol. 11, no. 3-4, pp. 154–655, 2017

work page 2017

[39] [39]

Deep joint source- channel coding for wireless image transmission,

E. Bourtsoulatze, D. B. Kurka, and D. G ¨und¨uz, “Deep joint source- channel coding for wireless image transmission,”IEEE Trans. Cognit. Comm. Netw., vol. 5, no. 3, pp. 567–579, May. 2019

work page 2019

[40] [40]

End-to-end optimized image compression,

J. Ball ´e, V . Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” inProc. Int. Conf. Learn. Represent. (ICLR), Toulon, France, Apr. 2017

work page 2017

[41] [41]

Roelofs and R

G. Roelofs and R. Koman,PNG: The Definitive Guide. USA: O’Reilly & Associates, Inc., 1999

work page 1999

[42] [42]

Taming transformers for high- resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- resolution image synthesis,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Jun. 2021, pp. 12 873–12 883

work page 2021

[43] [43]

Generative adversarial nets,

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 27, 2014

work page 2014

[44] [44]

Checkerboard context model for efficient learned image compression,

D. He, Y . Zheng, B. Sun, Y . Wang, and H. Qin, “Checkerboard context model for efficient learned image compression,” inProc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2021, pp. 14 771–14 780

work page 2021

[45] [45]

Robust semantic communications with masked VQ-V AE enabled codebook,

Q. Hu, G. Zhang, Z. Qin, Y . Cai, G. Yu, and G. Y . Li, “Robust semantic communications with masked VQ-V AE enabled codebook,”IEEE Trans. Wirel. Commun., vol. 22, no. 12, pp. 8707–8722, Dec. 2023

work page 2023

[46] [46]

Video enhance- ment with task-oriented flow,

T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman, “Video enhance- ment with task-oriented flow,”Int. J. Comput. Vis., vol. 127, no. 8, pp. 1106–1125, 2019

work page 2019

[47] [47]

BVI-DVC: A training database for deep video compression,

D. Ma, F. Zhang, and D. R. Bull, “BVI-DVC: A training database for deep video compression,”IEEE Trans. Multimedia, vol. 24, pp. 3847– 3858, 2021

work page 2021

[48] [48]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[49] [49]

Calculation of average psnr differences between rd- curves,

G. Bjontegaard, “Calculation of average psnr differences between rd- curves,”ITU SG16 Doc. VCEG-M33, 2001

work page 2001

[50] [50]

Common test conditions and software reference configurations for HEVC range extensions,

D. Flynn, K. Sharman, and C. Rosewarne, “Common test conditions and software reference configurations for HEVC range extensions,”JCT-VC Doc. JCTVC-N1006, vol. 16, p. 6, 2013

work page 2013

[51] [51]

UVG dataset: 50/120fps 4k sequences for video codec analysis and development,

A. Mercat, M. Viitanen, and J. Vanne, “UVG dataset: 50/120fps 4k sequences for video codec analysis and development,” inProc. ACM Multimedia Syst. Conf. (MMSys), Istanbul, Turkey, 2020, pp. 297–302

work page 2020

[52] [52]

MCL-JCV: A JND- based H.264/A VC video quality assessment dataset,

H. Wang, W. Gan, S. Hu, J. Y . Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo, “MCL-JCV: A JND- based H.264/A VC video quality assessment dataset,” inProc. IEEE Int. Conf. Image Process. (ICIP), 2016, pp. 1509–1513

work page 2016

[53] [53]

FFmpeg reference software,

“FFmpeg reference software,” https://www.ffmpeg.org/, accessed: 2025- 04-13

work page 2025

[54] [54]

HEVC official test model,

“HEVC official test model,” https://hevc.hhi.fraunhofer.de, accessed: 2025-04-13

work page 2025

[55] [55]

VVC official test model,

“VVC official test model,” https://vcgit.hhi.fraunhofer.de/jvet/ VVCSoftware VTM, accessed: 2025-04-13

work page 2025

[56] [56]

Design of low-density parity check codes for 5g new radio,

T. Richardson and S. Kudekar, “Design of low-density parity check codes for 5g new radio,”IEEE Commun. Mag., vol. 56, no. 3, pp. 28–34, Mar. 2018

work page 2018

[57] [57]

3GPP TS 38.214 version 16.2.0 Release 16: 5G; NR; Physical layer procedures for data,

3GPP, “3GPP TS 38.214 version 16.2.0 Release 16: 5G; NR; Physical layer procedures for data,” https://www.etsi.org/deliver/etsi ts/138200 138299/138214/16.02.00 60/ts 138214v160200p.pdf, 2020, accessed: 2025-04-13

work page 2020

[58] [58]

Hoydis, S

J. Hoydis, S. Cammerer, F. Ait Aoudia, M. Nimier-David, L. Maggi, G. Marcus, A. Vem, and A. Keller, “Sionna,” 2022, https://nvlabs.github.io/sionna/

work page 2022