Semantic-Aware Generative Image Transmission for Resource-Constrained Visual IoT Systems
Pith reviewed 2026-06-30 01:28 UTC · model grok-4.3
The pith
Semantic-aware token selection transmits images at 0.074 bpp using 44.6% of the bits of standard methods while achieving 29.9 dB PSNR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fusing prediction-entropy-based recoverability estimates with instance-segmentation semantic scores, a spatial dispersal sampler selects tokens for transmission such that MaskGIT reconstruction at the receiver yields 29.9 dB PSNR at 0.074 bpp, using only 44.6% of the bits needed by DeepJSCC/WITT at 0.167 bpp, and better preserves objects for downstream detection than random masking.
What carries the argument
The semantic-aware spatial dispersal sampler fuses recoverability from prediction entropy and local complexity with semantic importance from instance segmentation to choose which tokens to transmit under a bitrate budget.
Load-bearing premise
That combining prediction entropy with semantic scores from instance segmentation produces token selections MaskGIT can reliably reconstruct under the tested channels and scenes.
What would settle it
An experiment measuring PSNR and object detection accuracy when using the proposed masking versus random masking at identical bitrates on a held-out dataset with different noise conditions.
Figures
read the original abstract
Resource-constrained visual Internet of Things (IoT) systems, such as edge cameras, unmanned sensing platforms, industrial inspection nodes, and remote monitoring sensors, often need to transmit task-relevant visual evidence over low-rate wireless links to an edge/cloud service. Existing image communication methods usually compress or transmit complete global representations, leaving limited room to exploit receiver-side generative restoration. This paper proposes a semantic-aware generative image transmission framework for edge-assisted visual IoT. The image captured by an IoT visual sensor is encoded into a discrete token grid by a VQ encoder. At the IoT transmitter or nearby gateway, token recoverability, estimated from prediction entropy and local structure complexity, is fused with semantic importance obtained from instance segmentation and category-aware scoring. A spatial dispersal sampler then selects the tokens to be transmitted under a bitrate budget. The transmitter sends only the quantization indices of kept tokens and a binary mask map, while the edge/cloud receiver recovers masked tokens through MaskGIT with Halton sequence scheduling. Experiments on Kodak and VisDrone scenes under AWGN and Rayleigh channels show that the proposed method provides a flexible bitrate-quality tradeoff for narrowband visual IoT links. At 0.074 bpp, it uses 44.6% of the transmitted bits of the 0.167-bpp DeepJSCC/WITT reference while achieving 29.9 dB PSNR. A pseudo-GT downstream detection study on Kodak further shows that semantic-aware masking preserves task-relevant objects better than random masking at both 30% and 50% mask ratios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a semantic-aware generative image transmission framework for resource-constrained visual IoT. Images are encoded into discrete token grids using a VQ encoder. Tokens are selected for transmission via a spatial dispersal sampler that fuses token recoverability (from prediction entropy and local structure) with semantic importance (from instance segmentation). Only the indices of selected tokens and a binary mask are transmitted over the channel. The receiver uses MaskGIT with Halton sequence scheduling to inpaint the masked tokens. On Kodak and VisDrone datasets under AWGN and Rayleigh channels, the method claims to achieve 29.9 dB PSNR at 0.074 bpp, using 44.6% of the bits required by DeepJSCC/WITT at 0.167 bpp, while also showing better preservation of task-relevant objects in downstream detection compared to random masking.
Significance. If the performance claims hold under rigorous verification, the work could offer a valuable contribution to efficient visual data transmission in IoT by combining semantic awareness with generative inpainting to reduce bandwidth while maintaining quality. The approach addresses a practical need in edge-assisted systems, but its significance is tempered by the absence of detailed ablations and statistical validation in the reported results.
major comments (3)
- [Abstract] Abstract: The central performance claim (0.074 bpp achieving 29.9 dB PSNR while using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference) is presented without error bars, dataset split information, number of test images, or statistical significance testing, which is load-bearing for validating the reported efficiency and quality gains.
- [Abstract] Abstract: No ablation study, sensitivity analysis, or per-token recovery statistics are reported on the fusion rule between prediction-entropy recoverability and instance-segmentation semantic scores that drives the spatial dispersal sampler; this fusion is load-bearing for the claim that the selected token subsets remain reliably reconstructible by MaskGIT under the tested AWGN and Rayleigh channels.
- [Abstract] Abstract: The pseudo-GT downstream detection study asserts that semantic-aware masking preserves task-relevant objects better than random masking at 30% and 50% mask ratios, but provides no quantitative detection metrics (e.g., mAP), implementation details, or comparison baselines, weakening support for the semantic-preservation advantage.
minor comments (1)
- [Abstract] Abstract: The phrase 'pseudo-GT downstream detection study' is introduced without definition of how ground truth is approximated or how detection performance is quantified.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We provide point-by-point responses below and will revise the manuscript to address the concerns raised.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claim (0.074 bpp achieving 29.9 dB PSNR while using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference) is presented without error bars, dataset split information, number of test images, or statistical significance testing, which is load-bearing for validating the reported efficiency and quality gains.
Authors: The performance figures are computed as averages over the 24 images in the Kodak dataset, with details provided in Section IV of the manuscript. We will update the abstract to specify the dataset size and averaging process. For error bars and statistical testing, we will include standard deviations in the experimental results section of the revised manuscript. revision: partial
-
Referee: [Abstract] Abstract: No ablation study, sensitivity analysis, or per-token recovery statistics are reported on the fusion rule between prediction-entropy recoverability and instance-segmentation semantic scores that drives the spatial dispersal sampler; this fusion is load-bearing for the claim that the selected token subsets remain reliably reconstructible by MaskGIT under the tested AWGN and Rayleigh channels.
Authors: We agree that an ablation study on the fusion rule would strengthen the paper. The fusion is defined in Section III-B as a weighted combination of recoverability (from prediction entropy and local structure) and semantic importance (from instance segmentation). We will add an ablation study in the revised manuscript or supplementary material to show the contribution of each component and sensitivity to the weighting factor. revision: yes
-
Referee: [Abstract] Abstract: The pseudo-GT downstream detection study asserts that semantic-aware masking preserves task-relevant objects better than random masking at 30% and 50% mask ratios, but provides no quantitative detection metrics (e.g., mAP), implementation details, or comparison baselines, weakening support for the semantic-preservation advantage.
Authors: The study provides visual comparisons to illustrate the advantage. To address this, we will include quantitative mAP results using a pre-trained object detector in the revised version, along with implementation details such as the detector model used. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper describes a proposed framework that encodes images to tokens via VQ, fuses prediction-entropy recoverability with instance-segmentation scores for token selection under bitrate constraints, transmits indices plus mask, and recovers via MaskGIT. Reported results are empirical comparisons against external baselines (DeepJSCC/WITT) on Kodak/VisDrone under AWGN/Rayleigh, with no equations shown that reduce performance metrics to fitted inputs by construction, no self-citation load-bearing the central claims, and no renaming or ansatz smuggling. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- bitrate budget
- mask ratios for evaluation
axioms (1)
- domain assumption MaskGIT with Halton sequence scheduling recovers masked tokens with quality sufficient for the reported PSNR and detection preservation
Reference graph
Works this paper leans on
-
[1]
A mathematical theory of communication,
C. E. Shannon, “A mathematical theory of communication,”Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948
1948
-
[2]
Generative AI for physical layer communications: A survey,
N. V . Huynh, J. Wang, H. Du, D. T. Hoang, D. Niyato, D. N. Nguyen, D. I. Kim, and K. B. Letaief, “Generative AI for physical layer communications: A survey,”IEEE Trans. Cogn. Commun. Netw., vol. 10, no. 3, pp. 706–728, 2024
2024
-
[3]
Neural discrete rep- resentation learning,
A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete rep- resentation learning,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 6306–6315
2017
-
[4]
Taming transformers for high- resolution image synthesis,
P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- resolution image synthesis,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 12 873–12 883
2021
-
[5]
MaskGIT: Masked generative image transformer,
H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “MaskGIT: Masked generative image transformer,” inProc. IEEE/CVF Conf. Com- put. Vis. Pattern Recognit. (CVPR), 2022, pp. 11 315–11 325
2022
-
[6]
Semantic communication: A survey on research landscape, challenges, and future directions,
T. M. Getu, G. Kaddoum, and M. Bennis, “Semantic communication: A survey on research landscape, challenges, and future directions,”Proc. IEEE, vol. 112, no. 11, pp. 1–30, 2024
2024
-
[7]
A contemporary survey on semantic communications: Theory of mind, generative AI, and deep joint source-channel coding,
L. X. Nguyen, A. D. Raha, P. S. Aung, D. Niyato, Z. Han, and C. S. Hong, “A contemporary survey on semantic communications: Theory of mind, generative AI, and deep joint source-channel coding,”IEEE Commun. Surveys Tuts., 2025
2025
-
[8]
Semantic communication empowered 6G networks: Techniques, applications, and challenges,
Y . Wang, H. Han, Y . Feng, J. Zheng, and B. Zhang, “Semantic communication empowered 6G networks: Techniques, applications, and challenges,”IEEE Access, vol. 13, 2025
2025
-
[9]
Enabling distributed generative AI in 6G: Mobile edge generation,
R. Zhong, X. Mu, M. Jaber, and Y . Liu, “Enabling distributed generative AI in 6G: Mobile edge generation,”IEEE Internet Things J., vol. 12, no. 6, pp. 6607–6620, 2025
2025
-
[10]
Generative AI for secure physical layer communications: A survey,
C. Zhao, H. Du, D. Niyato, J. Kang, Z. Xiong, D. I. Kim, X. Shen, and K. B. Letaief, “Generative AI for secure physical layer communications: A survey,”IEEE Trans. Cogn. Commun. Netw., vol. 11, no. 1, pp. 3–26, 2025
2025
-
[11]
Generative AI-driven semantic communication networks: Architecture, technologies and applications,
C. Liang, H. Du, Y . Sun, D. Niyato, J. Kang, D. Zhao, and M. A. Imran, “Generative AI-driven semantic communication networks: Architecture, technologies and applications,”IEEE Trans. Cogn. Commun. Netw., vol. 10, no. 5, pp. 1911–1931, 2024
1911
-
[12]
SING: Semantic image communications using null-space and INN-guided dif- fusion models,
J. Chen, S. F. Yilmaz, D. You, P. L. Dragotti, and D. G ¨und¨uz, “SING: Semantic image communications using null-space and INN-guided dif- fusion models,” inProc. IEEE Int. Conf. Commun. (ICC), 2025
2025
-
[13]
Generative semantic communication for joint image transmission and segmentation,
X. Yuan, J. Ren, Y . Wang, Z. Wang, X. Feng, H. Kim, and C. Wu, “Generative semantic communication for joint image transmission and segmentation,” inProc. IEEE Int. Conf. Commun. (ICC), 2025
2025
-
[14]
Deep joint source- channel coding for wireless image transmission,
E. Bourtsoulatze, D. B. Kurka, and D. G ¨und¨uz, “Deep joint source- channel coding for wireless image transmission,”IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 3, pp. 567–579, 2019
2019
-
[15]
WITT: A wireless image transmis- sion transformer for semantic communications,
M. Yang, C. Bian, and H.-S. Kim, “WITT: A wireless image transmis- sion transformer for semantic communications,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2023, pp. 1–5
2023
-
[16]
SwinJSCC: Taming Swin transformer for deep joint source-channel coding,
K. Yang, S. Wang, J. Dai, X. Qin, K. Niu, and P. Zhang, “SwinJSCC: Taming Swin transformer for deep joint source-channel coding,”IEEE Trans. Cogn. Commun. Netw., vol. 11, no. 1, pp. 90–104, 2024
2024
-
[17]
MNTSCC: A VMamba- based nonlinear joint source-channel coding for semantic communica- tions,
C. Wang, C. Li, Y . Liao, C. Ding, and Z. Ye, “MNTSCC: A VMamba- based nonlinear joint source-channel coding for semantic communica- tions,”Comput. Mater. Continua, vol. 85, no. 2, 2025
2025
-
[18]
Process- and-forward: Deep joint source-channel coding over cooperative relay networks,
C. Bian, Y . Shao, H. Wu, E. Ozfatura, and D. G ¨und¨uz, “Process- and-forward: Deep joint source-channel coding over cooperative relay networks,”IEEE J. Sel. Areas Commun., 2024
2024
-
[19]
Unveiling the future of human and machine coding: A survey of end-to-end learned image compression,
C.-H. Huang and J.-L. Wu, “Unveiling the future of human and machine coding: A survey of end-to-end learned image compression,”Entropy, vol. 26, no. 5, p. 357, 2024
2024
-
[20]
Joint source-channel coding: Fundamentals and recent progress in practical designs,
D. G ¨und¨uz, M. A. Wigger, T. M. Getuet al., “Joint source-channel coding: Fundamentals and recent progress in practical designs,”arXiv preprint arXiv:2409.17557, 2024
arXiv 2024
-
[21]
TokenFlow: Unified image tokenizer for multimodal understanding and generation,
L. Qu, S. Liu, H. Zhang, X. Chen, X. Wang, and Y . Jiang, “TokenFlow: Unified image tokenizer for multimodal understanding and generation,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025
2025
-
[22]
UniTok: A unified tokenizer for visual generation and understanding,
M. Chen, S. Liu, J. Wanget al., “UniTok: A unified tokenizer for visual generation and understanding,”arXiv preprint arXiv:2502.20321, 2025
arXiv 2025
-
[23]
MaskBit: Embedding-free image generation via bit tokens,
M. Weber, L. Yu, Q. Yu, X. Deng, X. Shen, D. Cremers, and L.-C. Chen, “MaskBit: Embedding-free image generation via bit tokens,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025
2025
-
[24]
Discrete visual tokenization: A comprehensive survey of vector quantization for image generation,
J. Li, X. Wang, Y . Zhanget al., “Discrete visual tokenization: A comprehensive survey of vector quantization for image generation,” arXiv preprint arXiv:2504.14807, 2025
arXiv 2025
-
[25]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inProc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 5998–6008
2017
-
[26]
On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals,
J. H. Halton, “On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals,”Numer. Math., vol. 2, no. 1, pp. 84–90, 1960
1960
-
[27]
Halton scheduler for masked generative image transformer,
V . Besnier, M. Chen, D. Hurych, E. Valle, and M. Cord, “Halton scheduler for masked generative image transformer,” inProc. Int. Conf. Learn. Represent. (ICLR), 2025
2025
-
[28]
Dis- secting the effectiveness of deep features as metric of perceptual image quality,
P. Hern ´andez-C´amara, J. Vila-Tom ´as, V . Laparra, and J. Malo, “Dis- secting the effectiveness of deep features as metric of perceptual image quality,”Neural Netw., vol. 185, p. 107189, 2025
2025
-
[29]
MambaJSCC: Adaptive deep joint source-channel coding with gen- IEEE INTERNET OF THINGS JOURNAL 11 eralized state space model,
T. Wu, Z. Chen, M. Tao, Y . Sun, X. Xu, W. Zhang, and P. Zhang, “MambaJSCC: Adaptive deep joint source-channel coding with gen- IEEE INTERNET OF THINGS JOURNAL 11 eralized state space model,” inProc. IEEE Global Commun. Conf. (GLOBECOM), 2024
2024
-
[30]
Diffusion-aided joint source channel coding for high realism wireless image transmission,
S. F. Yilmaz, C. Karakus, and D. G ¨und¨uz, “Diffusion-aided joint source channel coding for high realism wireless image transmission,” inProc. IEEE Int. Conf. Commun. (ICC), 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.