On the Role of ViT and CNN in Semantic Communications: Analysis and Prototype Validation
Pith reviewed 2026-05-24 08:01 UTC · model grok-4.3
The pith
A Vision Transformer model for semantic communications yields a 0.5 dB PSNR gain over CNN versions and enables the first hardware prototype.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a ViT-based semantic communications system achieves a peak signal-to-noise ratio gain of 0.5 dB compared to convolutional neural network variants. Novel measures of average cosine similarity and Fourier analysis are introduced to examine the inner workings of semantic communications systems. The work includes the first hardware implementation validated over a real wireless channel using software-defined radio, along with open-source code for reproducibility.
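To put the 0.5 dB figure in numerical terms, the following is a minimal sketch of the standard PSNR definition, assuming 8-bit images with a peak value of 255. The function names `psnr` and `mse_ratio_for_gain` are illustrative and are not taken from the paper's code.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE ratio (better model / worse model) implied by a PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)

# A 0.5 dB gain corresponds to an MSE ratio of about 0.891,
# i.e. roughly 11% lower mean squared error.
ratio = mse_ratio_for_gain(0.5)
```

Because PSNR is logarithmic in the mean squared error, a fixed dB gain corresponds to the same relative MSE reduction regardless of the absolute quality level.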
What carries the argument
The ViT-based encoder-decoder architecture for joint semantic source and channel coding, which leverages self-attention to handle image nuisances more robustly than local convolutional filters.
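For orientation, here is a hedged sketch of where the channel sits in a joint source-channel coding pipeline of this kind. It assumes an additive white Gaussian noise (AWGN) channel and a unit average power constraint, which are common choices in the literature but are not confirmed by this page as the paper's setup. The random `latent` vector is only a stand-in for an encoder output, and all names are hypothetical.

```python
import numpy as np

def normalize_power(symbols):
    """Scale complex symbols so that their average power equals 1."""
    power = np.mean(np.abs(symbols) ** 2)
    return symbols / np.sqrt(power)

def awgn_channel(symbols, snr_db, rng):
    """Add complex Gaussian noise for a target SNR, assuming unit signal power."""
    noise_power = 10.0 ** (-snr_db / 10.0)
    noise = rng.normal(size=symbols.shape) + 1j * rng.normal(size=symbols.shape)
    # Each of the real and imaginary parts carries half of the noise power.
    return symbols + np.sqrt(noise_power / 2.0) * noise

rng = np.random.default_rng(0)
latent = rng.normal(size=512) + 1j * rng.normal(size=512)  # stand-in for encoder output
tx = normalize_power(latent)              # what the transmitter would send
rx = awgn_channel(tx, snr_db=10.0, rng=rng)  # what the decoder would receive
```

In an end-to-end system, the decoder is trained on `rx` rather than `tx`, so the encoder learns representations that tolerate the noise.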
If this is right
- Semantic communications systems can achieve higher image reconstruction quality under channel noise using transformer architectures.
- Analysis via cosine similarity and Fourier methods allows identification of optimal operating points in the semantic communications pipeline.
- Real-world validation on SDR hardware shows that simulation gains translate to physical wireless channels.
- Providing open-source neural network and LabVIEW code enables other researchers to build upon the prototype.
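The page does not give the paper's formula for average cosine similarity. One plausible reading, stated here as an assumption, is the mean cosine similarity over all distinct pairs of feature vectors (for example, patch tokens at a given layer). A minimal sketch under that assumption, with a hypothetical function name:

```python
import numpy as np

def average_cosine_similarity(features, eps=1e-12):
    """Mean cosine similarity over all distinct ordered pairs of row vectors.

    features: array of shape (n_vectors, dim), with n_vectors >= 2.
    This pairwise definition is an assumption, not the paper's stated formula.
    """
    features = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)  # guard against zero vectors
    sim = unit @ unit.T                       # all pairwise cosine similarities
    n = features.shape[0]
    off_diagonal_sum = sim.sum() - np.trace(sim)  # drop self-similarities
    return off_diagonal_sum / (n * (n - 1))
```

Under this reading, values near 1 indicate that the feature vectors are nearly collinear, while values near 0 indicate that they point in largely unrelated directions.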
Where Pith is reading between the lines
- ViT's ability to capture long-range dependencies in images may explain its edge in preserving semantic content over noisy channels.
- The analysis tools could extend to evaluating semantic fidelity in other communication modalities beyond images.
- Scaling this approach might reduce the bandwidth needed for high-quality image transmission in future wireless networks.
Load-bearing premise
The 0.5 dB PSNR improvement results from the Vision Transformer architecture itself and not from differences in training, hyperparameters, or data handling between the models.
What would settle it
An ablation study that applies identical training procedures, hyperparameters, and preprocessing to both the ViT and CNN models, on the same dataset and under the same channel conditions, would show whether the PSNR difference persists.
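One way to make such a matched comparison auditable is to derive both configurations from a single base and check that only the backbone differs. The sketch below is illustrative: the field names and every default value are hypothetical placeholders, not settings from the paper.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class TrainConfig:
    # All defaults are hypothetical placeholders, not the paper's settings.
    backbone: str
    dataset: str = "example-dataset"
    epochs: int = 100
    learning_rate: float = 1e-4
    batch_size: int = 64
    train_snr_db: float = 10.0
    augmentation: str = "none"
    seed: int = 0

def differing_fields(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])

vit_config = TrainConfig(backbone="vit")
cnn_config = replace(vit_config, backbone="cnn")  # copy with one field changed

# The comparison is matched only if the backbone is the sole difference.
assert differing_fields(vit_config, cnn_config) == ["backbone"]
```

Publishing the two configurations alongside the results would let a reader confirm the isolation of the architecture choice without rerunning the experiments.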
Original abstract
Semantic communications have shown promising advancements by optimizing source and channel coding jointly. However, the dynamics of these systems remain understudied, limiting research and performance gains. Inspired by the robustness of Vision Transformers (ViTs) in handling image nuisances, we propose a ViT-based model for semantic communications. Our approach achieves a peak signal-to-noise ratio (PSNR) gain of +0.5 dB over convolutional neural network variants. We introduce novel measures, average cosine similarity and Fourier analysis, to analyze the inner workings of semantic communications and optimize the system's performance. We also validate our approach through a real wireless channel prototype using software-defined radio (SDR). To the best of our knowledge, this is the first investigation of the fundamental workings of a semantic communications system, accompanied by the pioneering hardware implementation. To facilitate reproducibility and encourage further research, we provide open-source code, including neural network implementations and LabVIEW codes for SDR-based wireless transmission systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a Vision Transformer (ViT)-based semantic communication system for joint source-channel coding. It reports a +0.5 dB PSNR gain relative to CNN variants, introduces average cosine similarity and Fourier analysis as tools to examine system internals, and validates the approach via a software-defined radio (SDR) prototype over a real wireless channel. The work claims to be the first fundamental investigation of semantic communications dynamics accompanied by hardware implementation and releases open-source code for the neural networks and LabVIEW SDR transmission.
Significance. If the reported PSNR gain can be shown to arise specifically from the ViT inductive bias rather than uncontrolled differences in training or data handling, and if the cosine-similarity and Fourier measures yield new, actionable insights into semantic feature transmission, the paper would usefully extend the literature on architecture choice in semantic communications. The SDR prototype and open-source release constitute concrete strengths for reproducibility and practical validation.
Major comments (2)
- [Abstract] The central claim of a +0.5 dB PSNR advantage over CNN variants is presented without any description of matched training schedules, identical loss weighting, shared data augmentation, or hyperparameter controls that would isolate the architecture choice as the sole differing factor. This isolation is required to attribute the numerical result to the ViT rather than to optimization or preprocessing differences.
- [Abstract / hardware validation section] The prototype results and the +0.5 dB numerical gain are reported without experimental details such as baseline model descriptions, error bars, number of trials, or statistical tests. This absence prevents verification of the empirical claims that underpin both the performance and the “pioneering hardware implementation” assertions.
Minor comments (2)
- [Introduction] The abstract’s phrasing “to the best of our knowledge, this is the first investigation” would be strengthened by an explicit literature-review paragraph in the introduction that cites and differentiates prior semantic-communications analysis papers.
- Notation for the new average cosine similarity and Fourier measures should be defined with explicit formulas (including any normalization) at their first appearance to aid reproducibility.
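As an illustration of the kind of explicit definition requested here, the sketch below computes one possible Fourier-domain measure: the fraction of a feature map's spectral energy above a radial frequency cutoff. This particular measure, the cutoff value, and the function names are assumptions made for illustration; the page does not state the paper's formula. A simple 3x3 mean filter is included to show how a low-pass operation lowers the measure.

```python
import numpy as np

def high_frequency_ratio(feature_map, cutoff=0.25):
    """Fraction of spectral energy at radial frequency above `cutoff`.

    feature_map: 2-D array (H, W). Frequencies are in cycles per sample,
    so each axis spans [-0.5, 0.5).
    """
    fmap = np.asarray(feature_map, dtype=np.float64)
    spectrum = np.abs(np.fft.fft2(fmap)) ** 2
    fy = np.fft.fftfreq(fmap.shape[0])[:, None]
    fx = np.fft.fftfreq(fmap.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = spectrum.sum()
    if total == 0.0:
        return 0.0
    return spectrum[radius > cutoff].sum() / total

def box_blur(feature_map):
    """3x3 circular mean filter, a simple low-pass operation."""
    fmap = np.asarray(feature_map, dtype=np.float64)
    out = np.zeros_like(fmap)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(fmap, shift=(dy, dx), axis=(0, 1))
    return out / 9.0

rng = np.random.default_rng(0)
noise_map = rng.normal(size=(32, 32))      # synthetic stand-in for a feature map
ratio_before = high_frequency_ratio(noise_map)
ratio_after = high_frequency_ratio(box_blur(noise_map))  # lower after blurring
```

Comparing such a ratio at the input and output of a layer gives a rough indication of whether the layer attenuates or amplifies high-frequency content.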
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The two major comments both highlight the need for greater experimental transparency in the abstract and validation sections. We address each point below and agree that revisions are warranted to strengthen the attribution of results to the ViT architecture and to improve reproducibility of the prototype claims.
Point-by-point responses
- Referee: [Abstract] The central claim of a +0.5 dB PSNR advantage over CNN variants is presented without any description of matched training schedules, identical loss weighting, shared data augmentation, or hyperparameter controls that would isolate the architecture choice as the sole differing factor. This isolation is required to attribute the numerical result to the ViT rather than to optimization or preprocessing differences.
  Authors: We agree that the abstract does not explicitly list these controls, which is a valid concern for isolating the effect of the ViT inductive bias. In the full manuscript (Section III), the ViT and CNN models share the same training schedule, loss function and weighting, data augmentation pipeline, optimizer settings, and dataset splits; only the backbone architecture differs. To make this isolation explicit, we will revise the abstract to include a concise statement confirming matched training conditions. This revision will directly address the referee's requirement to attribute the +0.5 dB gain to architecture rather than uncontrolled factors.
  Revision: yes
- Referee: [Abstract / hardware validation section] The prototype results and the +0.5 dB numerical gain are reported without experimental details such as baseline model descriptions, error bars, number of trials, or statistical tests. This absence prevents verification of the empirical claims that underpin both the performance and the “pioneering hardware implementation” assertions.
  Authors: We acknowledge that the abstract and summary statements lack these quantitative details. The full manuscript describes the CNN baselines in Section IV and the SDR prototype setup (including LabVIEW code and channel conditions) in Section V, with open-source release of both neural-network and transmission code. However, we did not report error bars, trial counts, or statistical tests in the abstract. We will add these elements to the revised abstract and hardware section (e.g., number of independent runs, standard deviation of PSNR, and any significance testing), thereby supporting verification of both the numerical gain and the hardware-validation claims.
  Revision: yes
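The statistics mentioned in this exchange (number of runs, standard deviation, significance testing) can be sketched as a paired comparison across matched runs. The function name is hypothetical, and the PSNR values below are synthetic placeholders chosen only to exercise the code; they are not results from the paper.

```python
import numpy as np

def paired_summary(psnr_a, psnr_b):
    """Mean, sample std, and paired t-statistic of per-run differences (a - b)."""
    diff = np.asarray(psnr_a, dtype=np.float64) - np.asarray(psnr_b, dtype=np.float64)
    n = diff.size
    mean = diff.mean()
    std = diff.std(ddof=1)              # sample standard deviation
    t_stat = mean / (std / np.sqrt(n))  # paired t-statistic with n - 1 degrees of freedom
    return mean, std, t_stat

# Synthetic placeholder values for illustration only; NOT results from the paper.
synthetic_vit_psnr = [30.6, 30.4, 30.7, 30.5, 30.5]
synthetic_cnn_psnr = [30.0, 30.1, 30.1, 29.9, 30.1]
mean_gain, std_gain, t_stat = paired_summary(synthetic_vit_psnr, synthetic_cnn_psnr)
```

The t-statistic is compared against the Student's t distribution with n - 1 degrees of freedom; with five paired runs, the two-sided 5% critical value is about 2.776. A library routine such as `scipy.stats.ttest_rel` returns the corresponding p-value directly.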
Circularity Check
No circularity in derivation chain; empirical results and prototype stand alone
The paper reports empirical PSNR gains, introduces analysis measures (cosine similarity, Fourier), and describes an SDR prototype. No equations, fitted parameters presented as predictions, or derivation steps appear in the abstract or described content. The +0.5 dB claim is an observed experimental outcome rather than a quantity forced by self-definition or self-citation. The 'first investigation' phrasing is a priority claim, not a load-bearing mathematical premise. No steps reduce by construction to inputs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We propose a ViT-based model for semantic communications. Our approach achieves a peak signal-to-noise ratio (PSNR) gain of +0.5 dB over convolutional neural network variants. We introduce novel measures, average cosine similarity and Fourier analysis..."
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Encoders are high-pass filters, while decoders are low-pass filters... ViTs behave like strong LPFs in the decoder."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.