Semantic-Aware Generative Image Transmission for Resource-Constrained Visual IoT Systems

Changwang Liu; Chenyang Zhang; Jia Guo; Jiayi Chang; Jinqi Zhu; Shuqing He; Yuxuan Wang

arxiv: 2606.28398 · v1 · pith:NPF4VQWOnew · submitted 2026-06-24 · 💻 cs.CV · eess.IV

Pith reviewed 2026-06-30 01:28 UTC · model grok-4.3

keywords: semantic-aware image transmission · generative reconstruction · visual IoT · token selection · MaskGIT · low bitrate · instance segmentation · AWGN Rayleigh channels

The pith

Semantic-aware token selection transmits images at 0.074 bpp, using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference, while achieving 29.9 dB PSNR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for transmitting images from resource-limited IoT devices over narrowband wireless links. It encodes each image into a grid of discrete tokens, selects a subset by combining how easily each token can be recovered with its semantic importance from instance segmentation, and sends only the indices of those tokens plus a binary mask. The receiver then uses MaskGIT, a masked generative transformer, to fill in the missing tokens. The abstract reports experiments on Kodak and VisDrone scenes under AWGN and Rayleigh channels, and a pseudo-GT detection study on Kodak in which semantic-aware masking preserves task-relevant objects better than random masking.
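The bit budget implied by this scheme can be sketched as follows. This is a minimal sketch, assuming a fixed-length index per kept token and an uncompressed one-bit-per-token mask; the function name, image size, patch size, and codebook size are hypothetical values chosen for illustration, not the paper's configuration.

```python
import math

def transmitted_bpp(height, width, patch, codebook_size, keep_ratio):
    """Bits per pixel when sending kept token indices plus a 1-bit-per-token mask."""
    num_tokens = (height // patch) * (width // patch)
    bits_per_index = math.ceil(math.log2(codebook_size))
    kept = round(keep_ratio * num_tokens)
    total_bits = kept * bits_per_index + num_tokens  # indices + binary mask
    return total_bits / (height * width)

# Hypothetical configuration (assumption, not from the paper).
full = transmitted_bpp(512, 768, patch=16, codebook_size=1024, keep_ratio=1.0)
half = transmitted_bpp(512, 768, patch=16, codebook_size=1024, keep_ratio=0.5)

# Ratio of the two rounded bitrates quoted in the abstract.
ratio = 0.074 / 0.167
```

Computed from the rounded figures alone, `ratio` is about 0.443, so the quoted 44.6% presumably reflects unrounded bitrates.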

Core claim

By fusing prediction-entropy-based recoverability estimates with instance-segmentation semantic scores, a spatial dispersal sampler selects tokens for transmission such that MaskGIT reconstruction at the receiver yields 29.9 dB PSNR at 0.074 bpp, using only 44.6% of the bits needed by DeepJSCC/WITT at 0.167 bpp, and better preserves objects for downstream detection than random masking.

What carries the argument

The semantic-aware spatial dispersal sampler fuses recoverability from prediction entropy and local complexity with semantic importance from instance segmentation to choose which tokens to transmit under a bitrate budget.
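One way such a sampler could work is sketched below. The abstract does not give the fusion rule or the dispersal mechanism, so everything here is an assumption: the linear weight `alpha`, the neighbor `penalty`, and the greedy loop in `select_tokens` are illustrative stand-ins, not the authors' method.

```python
def select_tokens(recoverability, semantic, budget, alpha=0.5, penalty=0.25):
    """Greedily keep `budget` grid cells.

    Cells that are hard to recover (low recoverability) or semantically
    important score higher. After each pick, the scores of its 4-neighbors
    are reduced so the kept tokens spread out over the grid.
    """
    rows, cols = len(recoverability), len(recoverability[0])
    score = [[alpha * (1.0 - recoverability[r][c]) + (1.0 - alpha) * semantic[r][c]
              for c in range(cols)] for r in range(rows)]
    keep = [[False] * cols for _ in range(rows)]
    for _ in range(min(budget, rows * cols)):
        candidates = ((r, c) for r in range(rows) for c in range(cols) if not keep[r][c])
        r, c = max(candidates, key=lambda rc: score[rc[0]][rc[1]])
        keep[r][c] = True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not keep[nr][nc]:
                score[nr][nc] -= penalty  # discourage clustering
    return keep

# Toy 3x3 grid: uniform recoverability, one semantically important center cell.
rec = [[0.5] * 3 for _ in range(3)]
sem = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
mask = select_tokens(rec, sem, budget=2)
```

In the toy grid the center cell is kept first, and the second pick lands on a corner rather than an adjacent cell because of the neighbor penalty.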

Load-bearing premise

That combining prediction entropy with semantic scores from instance segmentation produces token selections MaskGIT can reliably reconstruct under the tested channels and scenes.

What would settle it

An experiment measuring PSNR and object detection accuracy when using the proposed masking versus random masking at identical bitrates on a held-out dataset with different noise conditions.
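The PSNR side of such an experiment rests on the standard definition, sketched here for flattened 8-bit images. The function name and the list-based image representation are choices made for this illustration.

```python
import math

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy example: every pixel off by one gray level, so MSE = 1.
value = psnr([10, 20, 30, 40], [11, 21, 31, 41])
```

Running the proposed and random masking through the same function at matched bitrates gives the per-image pairs the comparison needs.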

Figures

Figures reproduced from arXiv: 2606.28398 by Changwang Liu, Chenyang Zhang, Jia Guo, Jiayi Chang, Jinqi Zhu, Shuqing He, Yuxuan Wang.

**Figure 1.** Overall architecture of the proposed semantic-aware generative image transmission framework for resource-constrained visual IoT systems.

**Figure 2.** Rate-distortion performance comparison across methods (SNR = 13 …

**Figure 3.** PSNR versus SNR curves under AWGN channel at equal bitrate (bpp …

**Figure 4.** PSNR versus SNR curves under Rayleigh fading channel (QPSK, …

**Figure 5.** PSNR versus SNR curves on the VisDrone2019 dataset under AWGN …

**Figure 6.** PSNR versus SNR curves on the VisDrone2019 dataset under Rayleigh …

Original abstract

Resource-constrained visual Internet of Things (IoT) systems, such as edge cameras, unmanned sensing platforms, industrial inspection nodes, and remote monitoring sensors, often need to transmit task-relevant visual evidence over low-rate wireless links to an edge/cloud service. Existing image communication methods usually compress or transmit complete global representations, leaving limited room to exploit receiver-side generative restoration. This paper proposes a semantic-aware generative image transmission framework for edge-assisted visual IoT. The image captured by an IoT visual sensor is encoded into a discrete token grid by a VQ encoder. At the IoT transmitter or nearby gateway, token recoverability, estimated from prediction entropy and local structure complexity, is fused with semantic importance obtained from instance segmentation and category-aware scoring. A spatial dispersal sampler then selects the tokens to be transmitted under a bitrate budget. The transmitter sends only the quantization indices of kept tokens and a binary mask map, while the edge/cloud receiver recovers masked tokens through MaskGIT with Halton sequence scheduling. Experiments on Kodak and VisDrone scenes under AWGN and Rayleigh channels show that the proposed method provides a flexible bitrate-quality tradeoff for narrowband visual IoT links. At 0.074 bpp, it uses 44.6% of the transmitted bits of the 0.167-bpp DeepJSCC/WITT reference while achieving 29.9 dB PSNR. A pseudo-GT downstream detection study on Kodak further shows that semantic-aware masking preserves task-relevant objects better than random masking at both 30% and 50% mask ratios.
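The two channel models named in the abstract can be simulated at the bit level as sketched below. This is a minimal sketch, assuming Gray-mapped QPSK (the modulation mentioned in the Figure 4 caption), unit symbol energy, flat Rayleigh fading, and perfect channel knowledge at the receiver; the function name and these modelling choices are assumptions, not details taken from the paper.

```python
import math
import random

def qpsk_channel(bits, snr_db, rayleigh=False, rng=None):
    """Send bits as QPSK symbols through AWGN or flat Rayleigh fading.

    Assumes an even number of bits; `snr_db` is Es/N0 in dB.
    Returns the hard-decision received bits.
    """
    rng = rng or random.Random(0)
    snr = 10.0 ** (snr_db / 10.0)
    sigma = math.sqrt(1.0 / (2.0 * snr))  # noise std per real dimension
    received = []
    for i in range(0, len(bits) - 1, 2):
        symbol = complex(1 - 2 * bits[i], 1 - 2 * bits[i + 1]) / math.sqrt(2.0)
        if rayleigh:
            # Complex Gaussian gain with unit average power.
            h = complex(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
        else:
            h = 1 + 0j
        y = h * symbol + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        z = y * h.conjugate()  # coherent detection with known h
        received += [int(z.real < 0), int(z.imag < 0)]
    return received

source = random.Random(1)
tx_bits = [source.randint(0, 1) for _ in range(20000)]
rx_awgn = qpsk_channel(tx_bits, 0.0, rng=random.Random(3))
rx_rayleigh = qpsk_channel(tx_bits, 0.0, rayleigh=True, rng=random.Random(4))
```

Because the transmitted payload is a set of token indices plus a mask, each bit error here would corrupt an index or a mask position rather than add smooth pixel noise, which is why channel robustness matters for this kind of scheme.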

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper fuses prediction-entropy recoverability with instance-segmentation scores to pick VQ tokens for MaskGIT reconstruction, claiming 0.074 bpp at 29.9 dB PSNR on Kodak/VisDrone, but the fusion rule lacks any ablation or sensitivity checks.

The main point is a selection pipeline that scores each VQ token by how hard it is to predict (via entropy and local complexity) and how semantically important it is (via instance segmentation and category weights), then keeps a subset under a bitrate budget and lets MaskGIT inpaint the rest with Halton scheduling. The transmitter only sends the kept indices plus a binary mask.
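The Halton ordering mentioned above can be illustrated with the standard radical-inverse construction. This is a simplified sketch of how a 2-D Halton sequence (bases 2 and 3) yields a spatially spread visiting order over a token grid; mapping points to cells and skipping repeats is an assumption made here, not the scheduler as implemented in the paper.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, frac = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * frac
        index //= base
        frac /= base
    return result

def halton_order(rows, cols):
    """Visit every grid cell once, in the order the Halton sequence first hits it."""
    order, seen, i = [], set(), 1
    while len(order) < rows * cols:
        r = min(int(radical_inverse(i, 2) * rows), rows - 1)
        c = min(int(radical_inverse(i, 3) * cols), cols - 1)
        if (r, c) not in seen:
            seen.add((r, c))
            order.append((r, c))
        i += 1
    return order

order = halton_order(4, 4)
```

Consecutive cells in `order` tend to be far apart, so tokens predicted in the same decoding step are less likely to be neighbors.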

This combination for generative IoT transmission is not in the cited baselines. The framing for narrowband visual sensors is clear. The headline numbers (at 0.074 bpp, 44.6% of the bits of a 0.167-bpp DeepJSCC/WITT reference while reaching 29.9 dB PSNR), together with the pseudo-GT detection check showing that semantic masking beats random masking at 30% and 50% mask ratios, give a concrete sense of the tradeoff on Kodak and VisDrone under AWGN and Rayleigh channels.

The soft spot is that those numbers rest on the fusion producing token sets MaskGIT can actually recover. The abstract supplies no ablation on the relative weighting of recoverability versus semantic score, no per-token recovery statistics, and no sensitivity to SNR or the exact entropy estimator. Without those, it is difficult to tell whether the reported PSNR holds when the fusion rule changes or when the scenes differ from the test set.

The work is aimed at people building edge visual links who already use discrete tokens and generative models. A reader looking for a practical way to cut bits while preserving task objects would get value from the idea and the baseline comparisons. The paper shows clear thinking on the problem and honest engagement with existing codecs, so it deserves a serious referee to check the missing implementation details and ablations.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a semantic-aware generative image transmission framework for resource-constrained visual IoT. Images are encoded into discrete token grids using a VQ encoder. Tokens are selected for transmission via a spatial dispersal sampler that fuses token recoverability (from prediction entropy and local structure) with semantic importance (from instance segmentation). Only the indices of selected tokens and a binary mask are transmitted over the channel. The receiver uses MaskGIT with Halton sequence scheduling to inpaint the masked tokens. On Kodak and VisDrone datasets under AWGN and Rayleigh channels, the method claims to achieve 29.9 dB PSNR at 0.074 bpp, using 44.6% of the bits required by DeepJSCC/WITT at 0.167 bpp, while also showing better preservation of task-relevant objects in downstream detection compared to random masking.
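The payload described in this summary, kept indices plus a binary mask, can be sketched as a pack and unpack pair. The raster-order layout and the `fill` placeholder for masked positions are assumptions for illustration; the paper's actual serialization is not stated in the abstract.

```python
def pack(tokens, keep):
    """Flatten a token grid into (mask_bits, kept_indices) in raster order."""
    mask_bits, kept = [], []
    for token_row, keep_row in zip(tokens, keep):
        for token, is_kept in zip(token_row, keep_row):
            mask_bits.append(1 if is_kept else 0)
            if is_kept:
                kept.append(token)
    return mask_bits, kept

def unpack(mask_bits, kept, cols, fill=None):
    """Rebuild the grid; masked cells hold `fill` until a generative model predicts them."""
    kept_iter = iter(kept)
    flat = [next(kept_iter) if bit else fill for bit in mask_bits]
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]

tokens = [[5, 9, 2], [7, 7, 1]]
keep = [[True, False, True], [False, True, False]]
mask_bits, kept = pack(tokens, keep)
grid = unpack(mask_bits, kept, cols=3)
```

The receiver never sees the masked indices, so any reconstruction quality at those positions comes entirely from the generative model.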

Significance. If the performance claims hold under rigorous verification, the work could offer a valuable contribution to efficient visual data transmission in IoT by combining semantic awareness with generative inpainting to reduce bandwidth while maintaining quality. The approach addresses a practical need in edge-assisted systems, but its significance is tempered by the absence of detailed ablations and statistical validation in the reported results.

major comments (3)

[Abstract] The central performance claim (0.074 bpp achieving 29.9 dB PSNR while using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference) is presented without error bars, dataset split information, number of test images, or statistical significance testing, which is load-bearing for validating the reported efficiency and quality gains.
[Abstract] No ablation study, sensitivity analysis, or per-token recovery statistics are reported on the fusion rule between prediction-entropy recoverability and instance-segmentation semantic scores that drives the spatial dispersal sampler; this fusion is load-bearing for the claim that the selected token subsets remain reliably reconstructible by MaskGIT under the tested AWGN and Rayleigh channels.
[Abstract] The pseudo-GT downstream detection study asserts that semantic-aware masking preserves task-relevant objects better than random masking at 30% and 50% mask ratios, but provides no quantitative detection metrics (e.g., mAP), implementation details, or comparison baselines, weakening support for the semantic-preservation advantage.
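The error bars requested in the first comment could be produced along these lines. This is a sketch using a normal-approximation 95% interval on per-image values and on paired differences; the function names are hypothetical and the numbers are placeholders invented for illustration, not results from the paper.

```python
import math
import statistics

def summarize(values):
    """Return (mean, sample std, half-width of a normal-approximation 95% CI)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    half_width = 1.96 * std / math.sqrt(len(values))
    return mean, std, half_width

def paired_gain(method_a, method_b):
    """Summarize per-image differences a - b for two methods run on the same images."""
    return summarize([a - b for a, b in zip(method_a, method_b)])

# Placeholder per-image PSNR values in dB (invented, illustration only).
semantic = [30.1, 29.4, 31.0, 28.7, 30.3]
random_mask = [29.0, 28.9, 30.2, 28.1, 29.1]
mean_gain, std_gain, ci_half_width = paired_gain(semantic, random_mask)
```

With a small test set such as the 24 Kodak images, a Student-t quantile in place of 1.96 would give a slightly wider and more appropriate interval.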

minor comments (1)

[Abstract] The phrase 'pseudo-GT downstream detection study' is introduced without definition of how ground truth is approximated or how detection performance is quantified.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We provide point-by-point responses below and will revise the manuscript to address the concerns raised.

Referee: [Abstract] The central performance claim (0.074 bpp achieving 29.9 dB PSNR while using 44.6% of the bits of the 0.167-bpp DeepJSCC/WITT reference) is presented without error bars, dataset split information, number of test images, or statistical significance testing, which is load-bearing for validating the reported efficiency and quality gains.

Authors: The performance figures are computed as averages over the 24 images in the Kodak dataset, with details provided in Section IV of the manuscript. We will update the abstract to specify the dataset size and averaging process. For error bars and statistical testing, we will include standard deviations in the experimental results section of the revised manuscript.

Revision: partial

Referee: [Abstract] No ablation study, sensitivity analysis, or per-token recovery statistics are reported on the fusion rule between prediction-entropy recoverability and instance-segmentation semantic scores that drives the spatial dispersal sampler; this fusion is load-bearing for the claim that the selected token subsets remain reliably reconstructible by MaskGIT under the tested AWGN and Rayleigh channels.

Authors: We agree that an ablation study on the fusion rule would strengthen the paper. The fusion is defined in Section III-B as a weighted combination of recoverability (from prediction entropy and local structure) and semantic importance (from instance segmentation). We will add an ablation study in the revised manuscript or supplementary material to show the contribution of each component and sensitivity to the weighting factor.

Revision: yes

Referee: [Abstract] The pseudo-GT downstream detection study asserts that semantic-aware masking preserves task-relevant objects better than random masking at 30% and 50% mask ratios, but provides no quantitative detection metrics (e.g., mAP), implementation details, or comparison baselines, weakening support for the semantic-preservation advantage.

Authors: The study provides visual comparisons to illustrate the advantage. To address this, we will include quantitative mAP results using a pre-trained object detector in the revised version, along with implementation details such as the detector model used.

Revision: yes
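A quantitative version of the detection check could start from box matching, as sketched below. The paper does not define "pseudo-GT" in the abstract; this sketch assumes one common reading, in which detections on the original image serve as reference boxes and detections on the reconstruction are matched to them by IoU. The function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pseudo_gt_recall(pseudo_gt, detections, threshold=0.5):
    """Fraction of reference boxes matched one-to-one by a detection with IoU >= threshold."""
    unused = list(detections)
    hits = 0
    for gt in pseudo_gt:
        best = max(unused, key=lambda d: iou(gt, d), default=None)
        if best is not None and iou(gt, best) >= threshold:
            unused.remove(best)
            hits += 1
    return hits / len(pseudo_gt) if pseudo_gt else 1.0

# Toy boxes: one reference object is recovered, the other is lost.
reference_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
reconstructed_boxes = [(1, 1, 10, 10)]
recall = pseudo_gt_recall(reference_boxes, reconstructed_boxes)
```

A full mAP evaluation would also rank detections by confidence and sweep thresholds; recall at a single IoU threshold is only the simplest summary.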

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes a proposed framework that encodes images to tokens via VQ, fuses prediction-entropy recoverability with instance-segmentation scores for token selection under bitrate constraints, transmits indices plus mask, and recovers via MaskGIT. Reported results are empirical comparisons against external baselines (DeepJSCC/WITT) on Kodak/VisDrone under AWGN/Rayleigh, with no equations shown that reduce performance metrics to fitted inputs by construction, no self-citation load-bearing the central claims, and no renaming or ansatz smuggling. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Framework depends on the unstated assumption that MaskGIT recovery quality correlates with the paper's recoverability metric; no free parameters are explicitly named in the abstract, but bitrate budget and mask-ratio thresholds function as operational choices.

free parameters (2)

bitrate budget
Controls which tokens are kept; value not stated in abstract.
mask ratios for evaluation
30% and 50% used in downstream detection study.

axioms (1)

Domain assumption: MaskGIT with Halton sequence scheduling recovers masked tokens with quality sufficient for the reported PSNR and detection preservation.
Central to the receiver-side restoration step.
