
arxiv: 2606.28398 · v1 · pith:NPF4VQWOnew · submitted 2026-06-24 · 💻 cs.CV · eess.IV

Semantic-Aware Generative Image Transmission for Resource-Constrained Visual IoT Systems

Pith reviewed 2026-06-30 01:28 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords semantic-aware image transmission · generative reconstruction · visual IoT · token selection · MaskGIT · low bitrate · instance segmentation · AWGN Rayleigh channels

The pith

Semantic-aware token selection transmits images at 0.074 bpp, using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference, while achieving 29.9 dB PSNR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for transmitting images from resource-limited IoT devices over narrowband wireless links. It encodes each image into a grid of discrete tokens, scores every token by combining how easily the receiver could recover it with its semantic importance from instance segmentation, and sends only the selected tokens plus a binary mask. The receiver then uses the MaskGIT generative model to fill in the missing tokens. The approach is evaluated on the Kodak and VisDrone datasets under AWGN and Rayleigh channels, where it preserves task-relevant content better than random selection.

Core claim

By fusing prediction-entropy-based recoverability estimates with instance-segmentation semantic scores, a spatial dispersal sampler selects tokens for transmission such that MaskGIT reconstruction at the receiver yields 29.9 dB PSNR at 0.074 bpp, using only 44.6% of the bits needed by DeepJSCC/WITT at 0.167 bpp, and better preserves objects for downstream detection than random masking.
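The bit-share figure in the claim can be sanity-checked from the two reported bitrates. The sketch below is only arithmetic on the rounded values quoted above; the assumption that each bitrate was rounded to three decimals is ours, not the paper's.

```python
# Reported (rounded) bitrates from the core claim.
proposed_bpp = 0.074
reference_bpp = 0.167

# Ratio computed directly from the rounded figures.
ratio = proposed_bpp / reference_bpp

# Assumption: each figure was rounded to three decimals, so the
# unrounded value may differ by up to half a unit in the last place.
half_step = 0.0005
ratio_low = (proposed_bpp - half_step) / (reference_bpp + half_step)
ratio_high = (proposed_bpp + half_step) / (reference_bpp - half_step)

print(f"ratio of rounded figures: {ratio:.1%}")
print(f"range allowed by rounding: {ratio_low:.1%} to {ratio_high:.1%}")
```

The rounded figures give roughly 44.3%, and the reported 44.6% falls inside the range that three-decimal rounding allows, so the two numbers are consistent if the paper computed the share from unrounded bitrates.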

What carries the argument

The semantic-aware spatial dispersal sampler fuses recoverability from prediction entropy and local complexity with semantic importance from instance segmentation to choose which tokens to transmit under a bitrate budget.
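To make the idea concrete, here is a minimal sketch of one way such a sampler could work. It is not the paper's algorithm: the rebuttal further down only states that the fusion is a weighted combination, so the weight `alpha`, the min-max normalization, the greedy neighborhood damping (`suppress`, `radius`), and the function name `select_tokens` are all hypothetical. Local structure complexity is left out for brevity, and high prediction entropy is assumed to mean "hard to recover, so worth transmitting".

```python
import numpy as np

def select_tokens(entropy, semantic, keep, alpha=0.5, suppress=0.5, radius=1):
    """Greedy sketch: keep the `keep` tokens with the highest fused priority,
    damping the priority around each pick so picks spread out spatially."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    # Assumed fusion rule: weighted sum of normalized entropy and semantic score.
    priority = alpha * norm(entropy) + (1.0 - alpha) * norm(semantic)
    h, w = priority.shape
    if not 0 <= keep <= h * w:
        raise ValueError("keep must be between 0 and the number of tokens")

    keep_mask = np.zeros((h, w), dtype=bool)
    for _ in range(keep):
        # Already-kept positions are excluded from the argmax.
        candidates = np.where(keep_mask, -np.inf, priority)
        r, c = np.unravel_index(np.argmax(candidates), candidates.shape)
        keep_mask[r, c] = True
        # Dispersal step: lower the priority of the pick's neighborhood.
        r0, r1 = max(r - radius, 0), min(r + radius + 1, h)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, w)
        priority[r0:r1, c0:c1] *= suppress
    return keep_mask

# Toy inputs: random entropy map and one "object" region in the semantic map.
rng = np.random.default_rng(0)
entropy = rng.random((16, 16))
semantic = np.zeros((16, 16))
semantic[4:8, 4:8] = 1.0
mask = select_tokens(entropy, semantic, keep=64)
print(mask.sum(), "of", mask.size, "tokens kept")
```

Sweeping `alpha` from 0 to 1 in a sketch like this is also the shape of the sensitivity analysis the referee report asks for.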

Load-bearing premise

That combining prediction entropy with semantic scores from instance segmentation produces token selections MaskGIT can reliably reconstruct under the tested channels and scenes.

What would settle it

An experiment measuring PSNR and object detection accuracy when using the proposed masking versus random masking at identical bitrates on a held-out dataset with different noise conditions.
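A minimal harness for the reconstruction-quality half of that comparison might look like the sketch below. The PSNR definition is the standard one for 8-bit images; `random_keep_mask` and `summarize` are hypothetical helper names, and the actual encoder, channel, and MaskGIT reconstruction are outside the sketch.

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def random_keep_mask(shape, keep, rng):
    """Random baseline that keeps exactly `keep` tokens, so it spends the
    same index bits as any other mask with the same keep count."""
    flat = np.zeros(shape[0] * shape[1], dtype=bool)
    flat[rng.choice(flat.size, size=keep, replace=False)] = True
    return flat.reshape(shape)

def summarize(scores):
    """Mean and sample standard deviation of per-image scores."""
    scores = np.asarray(scores, dtype=np.float64)
    return float(scores.mean()), float(scores.std(ddof=1))
```

Matching the keep count between the two masks is what makes the comparison "at identical bitrates"; reporting `summarize` over per-image PSNR values would also supply the standard deviations the referee asks for.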

Figures

Figures reproduced from arXiv: 2606.28398 by Changwang Liu, Chenyang Zhang, Jia Guo, Jiayi Chang, Jinqi Zhu, Shuqing He, Yuxuan Wang.

Figure 1: Overall architecture of the proposed semantic-aware generative image transmission framework for resource-constrained visual IoT systems.
Figure 2: Rate-distortion performance comparison across methods (SNR = 13
Figure 3: PSNR versus SNR curves under AWGN channel at equal bitrate (bpp
Figure 4: PSNR versus SNR curves under Rayleigh fading channel (QPSK,
Figure 5: PSNR versus SNR curves on the VisDrone2019 dataset under AWGN
Figure 6: PSNR versus SNR curves on the VisDrone2019 dataset under Rayleigh
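The PSNR-versus-SNR figures rest on a simulated wireless link. As a rough illustration of what such a link involves, the sketch below pushes a bit stream through Gray-mapped QPSK over either an AWGN or a flat Rayleigh fading channel. Everything beyond "QPSK", "AWGN", and "Rayleigh" is an assumption: the per-symbol SNR definition, independent fading per symbol, perfect channel knowledge at the receiver, the absence of channel coding, and the function name `qpsk_channel`.

```python
import numpy as np

def qpsk_channel(bits, snr_db, fading=False, rng=None):
    """Send an even-length bit array through QPSK and return hard-decision bits."""
    rng = np.random.default_rng() if rng is None else rng
    bits = np.asarray(bits, dtype=np.uint8)
    if bits.size % 2:
        raise ValueError("QPSK carries two bits per symbol")
    pairs = bits.reshape(-1, 2)

    # Gray mapping with unit average symbol energy: bit 0 -> +1, bit 1 -> -1.
    symbols = ((1 - 2.0 * pairs[:, 0]) + 1j * (1 - 2.0 * pairs[:, 1])) / np.sqrt(2)
    n = symbols.size

    # Complex Gaussian noise whose variance gives the requested per-symbol SNR.
    noise_var = 10 ** (-snr_db / 10)
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

    # Rayleigh fading: unit-power complex Gaussian gain, independent per symbol.
    if fading:
        gain = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
    else:
        gain = np.ones(n, dtype=complex)

    # Assumed perfect channel state information at the receiver.
    equalized = (gain * symbols + noise) / gain

    decoded = np.empty_like(pairs)
    decoded[:, 0] = equalized.real < 0
    decoded[:, 1] = equalized.imag < 0
    return decoded.reshape(-1)

# Bit error rate at 5 dB on each channel, for a random test stream.
rng = np.random.default_rng(0)
tx = rng.integers(0, 2, size=20000)
for fading in (False, True):
    rx = qpsk_channel(tx, snr_db=5.0, fading=fading, rng=rng)
    print("Rayleigh" if fading else "AWGN", "BER:", np.mean(rx != tx))
```

In a token-based scheme a single bit error changes a codebook index, which is one reason PSNR can fall off quickly at low SNR; how the paper protects indices and the mask against such errors is not stated in the abstract.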
Original abstract

Resource-constrained visual Internet of Things (IoT) systems, such as edge cameras, unmanned sensing platforms, industrial inspection nodes, and remote monitoring sensors, often need to transmit task-relevant visual evidence over low-rate wireless links to an edge/cloud service. Existing image communication methods usually compress or transmit complete global representations, leaving limited room to exploit receiver-side generative restoration. This paper proposes a semantic-aware generative image transmission framework for edge-assisted visual IoT. The image captured by an IoT visual sensor is encoded into a discrete token grid by a VQ encoder. At the IoT transmitter or nearby gateway, token recoverability, estimated from prediction entropy and local structure complexity, is fused with semantic importance obtained from instance segmentation and category-aware scoring. A spatial dispersal sampler then selects the tokens to be transmitted under a bitrate budget. The transmitter sends only the quantization indices of kept tokens and a binary mask map, while the edge/cloud receiver recovers masked tokens through MaskGIT with Halton sequence scheduling. Experiments on Kodak and VisDrone scenes under AWGN and Rayleigh channels show that the proposed method provides a flexible bitrate-quality tradeoff for narrowband visual IoT links. At 0.074 bpp, it uses 44.6% of the transmitted bits of the 0.167-bpp DeepJSCC/WITT reference while achieving 29.9 dB PSNR. A pseudo-GT downstream detection study on Kodak further shows that semantic-aware masking preserves task-relevant objects better than random masking at both 30% and 50% mask ratios.
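The abstract says the transmitter sends the quantization indices of the kept tokens plus a binary mask map. Under that description the payload can be counted as in the sketch below. The downsampling factor, codebook size, fixed-length index coding, and uncompressed one-bit-per-position mask are illustrative assumptions, so the result is not expected to reproduce the paper's 0.074 bpp.

```python
import math

def transmitted_bpp(height, width, downsample, codebook_size, keep_ratio):
    """Bits per pixel when sending fixed-length indices for kept tokens
    plus an uncompressed one-bit-per-position mask."""
    tokens = (height // downsample) * (width // downsample)
    kept = round(tokens * keep_ratio)
    index_bits = kept * math.ceil(math.log2(codebook_size))
    mask_bits = tokens
    return (index_bits + mask_bits) / (height * width)

# Hypothetical setup: a 768x512 Kodak-sized image, 16x downsampling,
# and a 1024-entry codebook, at several keep ratios.
for keep_ratio in (1.0, 0.7, 0.5):
    bpp = transmitted_bpp(512, 768, 16, 1024, keep_ratio)
    print(f"keep {keep_ratio:.0%}: {bpp:.4f} bpp")
```

The mask costs the same number of bits regardless of how many tokens are kept, so its share of the payload grows as the keep ratio falls.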

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a semantic-aware generative image transmission framework for resource-constrained visual IoT. Images are encoded into discrete token grids using a VQ encoder. Tokens are selected for transmission via a spatial dispersal sampler that fuses token recoverability (from prediction entropy and local structure) with semantic importance (from instance segmentation). Only the indices of selected tokens and a binary mask are transmitted over the channel. The receiver uses MaskGIT with Halton sequence scheduling to inpaint the masked tokens. On Kodak and VisDrone datasets under AWGN and Rayleigh channels, the method claims to achieve 29.9 dB PSNR at 0.074 bpp, using 44.6% of the bits required by DeepJSCC/WITT at 0.167 bpp, while also showing better preservation of task-relevant objects in downstream detection compared to random masking.

Significance. If the performance claims hold under rigorous verification, the work could offer a valuable contribution to efficient visual data transmission in IoT by combining semantic awareness with generative inpainting to reduce bandwidth while maintaining quality. The approach addresses a practical need in edge-assisted systems, but its significance is tempered by the absence of detailed ablations and statistical validation in the reported results.

major comments (3)
  1. [Abstract] The central performance claim (0.074 bpp achieving 29.9 dB PSNR while using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference) is presented without error bars, dataset split information, number of test images, or statistical significance testing, which is load-bearing for validating the reported efficiency and quality gains.
  2. [Abstract] No ablation study, sensitivity analysis, or per-token recovery statistics are reported on the fusion rule between prediction-entropy recoverability and instance-segmentation semantic scores that drives the spatial dispersal sampler; this fusion is load-bearing for the claim that the selected token subsets remain reliably reconstructible by MaskGIT under the tested AWGN and Rayleigh channels.
  3. [Abstract] The pseudo-GT downstream detection study asserts that semantic-aware masking preserves task-relevant objects better than random masking at 30% and 50% mask ratios, but provides no quantitative detection metrics (e.g., mAP), implementation details, or comparison baselines, weakening support for the semantic-preservation advantage.
minor comments (1)
  1. [Abstract] The phrase 'pseudo-GT downstream detection study' is introduced without definition of how ground truth is approximated or how detection performance is quantified.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We provide point-by-point responses below and will revise the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The central performance claim (0.074 bpp achieving 29.9 dB PSNR while using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference) is presented without error bars, dataset split information, number of test images, or statistical significance testing, which is load-bearing for validating the reported efficiency and quality gains.

    Authors: The performance figures are computed as averages over the 24 images in the Kodak dataset, with details provided in Section IV of the manuscript. We will update the abstract to specify the dataset size and averaging process. For error bars and statistical testing, we will include standard deviations in the experimental results section of the revised manuscript. revision: partial

  2. Referee: [Abstract] No ablation study, sensitivity analysis, or per-token recovery statistics are reported on the fusion rule between prediction-entropy recoverability and instance-segmentation semantic scores that drives the spatial dispersal sampler; this fusion is load-bearing for the claim that the selected token subsets remain reliably reconstructible by MaskGIT under the tested AWGN and Rayleigh channels.

    Authors: We agree that an ablation study on the fusion rule would strengthen the paper. The fusion is defined in Section III-B as a weighted combination of recoverability (from prediction entropy and local structure) and semantic importance (from instance segmentation). We will add an ablation study in the revised manuscript or supplementary material to show the contribution of each component and sensitivity to the weighting factor. revision: yes

  3. Referee: [Abstract] The pseudo-GT downstream detection study asserts that semantic-aware masking preserves task-relevant objects better than random masking at 30% and 50% mask ratios, but provides no quantitative detection metrics (e.g., mAP), implementation details, or comparison baselines, weakening support for the semantic-preservation advantage.

    Authors: The study provides visual comparisons to illustrate the advantage. To address this, we will include quantitative mAP results using a pre-trained object detector in the revised version, along with implementation details such as the detector model used. revision: yes
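The page does not define how the pseudo-GT study scores detections. One plausible reading, sketched below purely as an assumption, is to run a detector on the original image, treat its outputs as pseudo ground truth, and count how many of those objects are still detected in the reconstruction. The sketch computes a simple class-aware recall with greedy IoU matching, not mAP, and the names `iou` and `pseudo_gt_recall` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pseudo_gt_recall(original_dets, reconstructed_dets, iou_threshold=0.5):
    """Fraction of detections on the original image (pseudo ground truth)
    matched by a same-label detection on the reconstruction.
    Each detection is a (label, box) pair; each reconstruction box is used once."""
    unused = list(reconstructed_dets)
    matched = 0
    for label, box in original_dets:
        best_index, best_iou = None, iou_threshold
        for i, (other_label, other_box) in enumerate(unused):
            score = iou(box, other_box)
            if other_label == label and score >= best_iou:
                best_index, best_iou = i, score
        if best_index is not None:
            unused.pop(best_index)
            matched += 1
    # Convention chosen here: with no pseudo-GT objects, nothing was lost.
    return matched / len(original_dets) if original_dets else 1.0

# Toy example: the car survives reconstruction, the person does not.
original = [("car", (0, 0, 10, 10)), ("person", (20, 20, 30, 30))]
reconstructed = [("car", (1, 1, 10, 10))]
print(pseudo_gt_recall(original, reconstructed))
```

Comparing this recall for semantic-aware and random masks at the same mask ratio would give a first quantitative version of the preservation claim, ahead of the mAP results the authors promise.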

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper describes a proposed framework that encodes images to tokens via VQ, fuses prediction-entropy recoverability with instance-segmentation scores for token selection under bitrate constraints, transmits indices plus mask, and recovers via MaskGIT. Reported results are empirical comparisons against external baselines (DeepJSCC/WITT) on Kodak/VisDrone under AWGN/Rayleigh, with no equations shown that reduce performance metrics to fitted inputs by construction, no self-citation load-bearing the central claims, and no renaming or ansatz smuggling. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Framework depends on the unstated assumption that MaskGIT recovery quality correlates with the paper's recoverability metric; no free parameters are explicitly named in the abstract, but bitrate budget and mask-ratio thresholds function as operational choices.

free parameters (2)
  • bitrate budget
    Controls which tokens are kept; value not stated in abstract.
  • mask ratios for evaluation
    30% and 50% used in downstream detection study.
axioms (1)
  • domain assumption: MaskGIT with Halton sequence scheduling recovers masked tokens with quality sufficient for the reported PSNR and detection preservation
    Central to the receiver-side restoration step.
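For readers unfamiliar with Halton sequences, the sketch below shows how a 2-D Halton sequence (bases 2 and 3) can impose a spatially spread-out visiting order on masked grid positions. It illustrates the low-discrepancy ordering only; how the paper or the cited Halton scheduler groups positions into MaskGIT decoding steps is not specified here, and `halton_order` is a hypothetical helper.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the base-`base` digits of
    `index` around the radix point, giving a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

def halton_order(height, width, masked):
    """Return the masked (row, col) positions in the order in which a
    2-D Halton sequence with bases 2 and 3 first lands on them."""
    remaining = set(masked)
    if any(not (0 <= r < height and 0 <= c < width) for r, c in remaining):
        raise ValueError("masked positions must lie inside the grid")
    order, i = [], 1
    while remaining:
        r = min(int(radical_inverse(i, 2) * height), height - 1)
        c = min(int(radical_inverse(i, 3) * width), width - 1)
        if (r, c) in remaining:
            remaining.remove((r, c))
            order.append((r, c))
        i += 1
    return order

# Toy example: order every position of a 4x4 token grid.
all_positions = [(r, c) for r in range(4) for c in range(4)]
print(halton_order(4, 4, all_positions)[:6])
```

Because consecutive Halton points land far apart, tokens decoded in the same early step tend not to be neighbors, which is the usual motivation for this kind of schedule.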


