Demo: Real-Time Semantic Communications with a Vision Transformer

Chan-Byoung Chae; Hanju Yoo; Linglong Dai; Songkuk Kim; Taehun Jung

arxiv: 2205.03886 · v1 · submitted 2022-05-08 · 📡 eess.SP · cs.AI

Pith reviewed 2026-05-24 12:21 UTC · model grok-4.3

classification 📡 eess.SP cs.AI

keywords: semantic communications, vision transformer, FPGA prototype, image transmission, wireless channel, CIFAR-10, real-time implementation

The pith

An FPGA prototype shows a vision transformer architecture can transmit images semantically over wireless channels in real time and outperform 256-QAM at low SNR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an end-to-end neural network that uses a vision transformer to extract semantic features from images, send them across a wireless link, and reconstruct the image at the receiver. The architecture is implemented on an FPGA to achieve real-time operation while modeling the physical channel. Experiments on the CIFAR-10 dataset indicate better reconstruction quality than a conventional 256-quadrature amplitude modulation system when the signal-to-noise ratio is low. The work presents this as the first hardware demonstration of real-time semantic communications that relies on a vision transformer.

Core claim

The authors implement and test a prototype in which an end-to-end trained vision transformer extracts semantic meaning from CIFAR-10 images, transmits the resulting features over a modeled wireless channel, and reconstructs the images at the receiver. Realized on an FPGA, the system runs in real time and produces higher-quality reconstructions than a traditional 256-QAM scheme specifically in the low signal-to-noise ratio regime.
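The channel stage of such an encoder-channel-decoder pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text here does not specify the channel model, so additive white Gaussian noise, real-valued symbols, and a unit-power constraint are assumptions, and the names `normalize_power` and `awgn_channel` are hypothetical. The random `features` list stands in for the encoder output.

```python
import math
import random

def normalize_power(symbols):
    """Scale real-valued symbols so their average power is 1 (assumed constraint)."""
    power = sum(s * s for s in symbols) / len(symbols)
    scale = 1.0 / math.sqrt(power)
    return [s * scale for s in symbols]

def awgn_channel(symbols, snr_db, rng):
    """Add white Gaussian noise to unit-power symbols at the given SNR in dB."""
    noise_std = math.sqrt(10 ** (-snr_db / 10))  # noise power = 1 / linear SNR
    return [s + rng.gauss(0.0, noise_std) for s in symbols]

rng = random.Random(0)
# Stand-in for the encoder's feature vector (hypothetical values).
features = [rng.uniform(-1.0, 1.0) for _ in range(4096)]

tx = normalize_power(features)
rx = awgn_channel(tx, snr_db=5.0, rng=rng)

# Empirical check: estimate the SNR actually experienced by the symbols.
noise_power = sum((r - t) ** 2 for r, t in zip(rx, tx)) / len(tx)
measured_snr_db = 10 * math.log10(1.0 / noise_power)
print(f"measured SNR: {measured_snr_db:.2f} dB")
```

In end-to-end training, a noise layer of this kind would sit between the encoder and decoder so that the decoder learns to reconstruct from corrupted features.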

What carries the argument

End-to-end trained vision transformer that extracts and reconstructs semantic image features for transmission over a wireless channel.

If this is right

Semantic communications can be realized as a real-time hardware system rather than remaining a simulation-only concept.
Performance advantages appear concentrated in low signal-to-noise ratio operating points.
Vision transformers can serve as the core network for joint source-channel coding of images in wireless settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture could be tested on higher-resolution image datasets to check whether the low-SNR gains scale.
Integration with existing wireless standards would require mapping the transformer output symbols onto existing modulation and coding schemes.
If the semantic features prove robust, bandwidth savings could appear in applications that tolerate approximate rather than pixel-perfect image delivery.

Load-bearing premise

The vision transformer model trained on simulated channels continues to extract and reconstruct semantic content correctly when real hardware distortions and channel effects appear, and the FPGA faithfully reproduces the simulated behavior without unmodeled impairments.

What would settle it

A direct over-the-air comparison in which the vision-transformer system produces worse image reconstructions than 256-QAM at the same low SNR values would falsify the performance claim.
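To see why low SNR is a demanding regime for the baseline, a small Monte Carlo sketch of uncoded 256-QAM over an assumed AWGN channel is given below. It is not the paper's baseline: the source and channel coding used there are not described in this text, so the sketch measures only the raw symbol error rate with hard decisions. The function names are hypothetical.

```python
import math
import random

# Square 256-QAM: 16 odd amplitude levels per axis (-15, -13, ..., 15).
LEVELS = [2 * k - 15 for k in range(16)]
# Average symbol energy over both axes (equals 170 for this constellation).
ES = 2 * sum(l * l for l in LEVELS) / 16

def nearest_level(x):
    """Hard decision: round to the nearest amplitude level, clamped to the grid."""
    k = round((x + 15) / 2)
    return LEVELS[min(15, max(0, k))]

def symbol_error_rate(snr_db, n_symbols, rng):
    """Monte Carlo symbol error rate at a given Es/N0 in dB (AWGN assumed)."""
    sigma = math.sqrt(ES / (2 * 10 ** (snr_db / 10)))  # noise std per axis
    errors = 0
    for _ in range(n_symbols):
        i, q = rng.choice(LEVELS), rng.choice(LEVELS)
        i_hat = nearest_level(i + rng.gauss(0.0, sigma))
        q_hat = nearest_level(q + rng.gauss(0.0, sigma))
        errors += (i_hat != i) or (q_hat != q)
    return errors / n_symbols

for snr_db in (5, 20, 35):
    ser = symbol_error_rate(snr_db, 5000, random.Random(snr_db))
    print(f"SNR {snr_db:>2} dB: symbol error rate {ser:.3f}")
```

With constellation points only two units apart against an average energy of 170, most symbols are decided wrongly at a few dB of SNR. A fair comparison would still have to account for whatever coding the baseline system uses.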

Figures

Figures reproduced from arXiv: 2205.03886 by Chan-Byoung Chae, Hanju Yoo, Linglong Dai, Songkuk Kim, Taehun Jung.

**Figure 2.** (a) Transmitted images and (b) structural similarity index measure (SSIM) results of the proposed and baseline systems.
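For readers unfamiliar with the metric in the figure, the SSIM formula can be sketched as below. This is a simplified, single-window version computed from global image statistics. The standard definition by Wang et al. averages the same expression over local windows, and this text does not say which variant the paper uses. The function name `global_ssim` and the random test image are illustrative assumptions.

```python
import random

def global_ssim(x, y, dynamic_range=255.0):
    """Simplified SSIM from global statistics of two equal-length pixel lists.

    The standard metric averages this expression over local windows;
    here a single window covers the whole image.
    """
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizer for the contrast/structure term
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = random.Random(0)
# Hypothetical 32x32 grayscale image (CIFAR-10 resolution), flattened.
reference = [float(rng.randrange(256)) for _ in range(32 * 32)]
# Degraded copy: additive noise, clipped to the valid pixel range.
degraded = [min(255.0, max(0.0, p + rng.gauss(0.0, 40.0))) for p in reference]

print("identical images:", global_ssim(reference, reference))
print("degraded image:  ", global_ssim(reference, degraded))
```

A score of 1 means the two images are identical, and lower scores indicate greater structural distortion.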

Original abstract

Semantic communications are expected to enable the more effective delivery of meaning rather than a precise transfer of symbols. In this paper, we propose an end-to-end deep neural network-based architecture for image transmission and demonstrate its feasibility in a real-time wireless channel by implementing a prototype based on a field-programmable gate array (FPGA). We demonstrate that this system outperforms the traditional 256-quadrature amplitude modulation system in the low signal-to-noise ratio regime with the popular CIFAR-10 dataset. To the best of our knowledge, this is the first work that implements and investigates real-time semantic communications with a vision transformer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a working FPGA prototype for real-time semantic image transmission with a vision transformer, but the low-SNR outperformance claim lacks the metrics needed to evaluate it properly.

The main thing to know is that this is a hardware demo paper: they built and ran a real-time FPGA prototype of an end-to-end vision-transformer model for semantic image transmission over a wireless channel, and they report it beats 256-QAM at low SNR on CIFAR-10. To their knowledge it is the first such implementation.

That engineering step from simulation to working hardware is the concrete new piece here. Getting a ViT to run in real time on FPGA for this task is non-trivial and worth noting as a practical advance. The paper does well by actually closing the loop with a physical wireless link instead of stopping at software results.

On the soft spots, the abstract supplies no numbers, error bars, dataset splits, channel model specifics, or ablation results, so the central performance claim cannot be checked from the given text. If the full paper does not add those details plus checks for quantization effects, DAC/ADC distortion, or other hardware impairments, the reported gains could be artifacts rather than evidence of semantic robustness. The stress-test concern about the trained model not transferring cleanly to the real FPGA link is load-bearing and not resolved by the abstract alone.

This paper is for people working on practical implementations of semantic communications who want to see what a hardware prototype actually looks like. A reader focused on 6G waveform ideas or FPGA deployment would find the demo useful even with the gaps. It deserves peer review because hardware feasibility results are scarce in this area and referees can push for the missing quantitative evidence.

Referee Report

2 major / 0 minor

Summary. The paper proposes an end-to-end deep neural network architecture based on a vision transformer for semantic image transmission over wireless channels. It reports an FPGA-based real-time prototype implementation and claims that this system outperforms a traditional 256-QAM baseline in the low-SNR regime when evaluated on the CIFAR-10 dataset. The work positions itself as the first demonstration of real-time semantic communications using a vision transformer.

Significance. If the reported outperformance and hardware fidelity hold under scrutiny, the result would be significant as the first documented real-time FPGA prototype of ViT-based semantic communications, providing concrete evidence that end-to-end semantic models can operate under actual wireless distortions rather than idealized simulations. The hardware demonstration itself is a strength, as it moves beyond simulation-only claims common in the semantic communications literature.

major comments (2)

[Abstract] The central claim that the system 'outperforms the traditional 256-quadrature amplitude modulation system in the low signal-to-noise ratio regime' is stated without any quantitative metrics, error bars, dataset splits, channel model parameters, or ablation results. This absence makes the performance assertion impossible to evaluate from the given text and directly undermines the load-bearing empirical claim.
The manuscript does not provide evidence that the FPGA prototype faithfully reproduces the modeled channel used during training. No characterization of fixed-point quantization effects in the ViT layers, DAC/ADC distortion, or real-time channel emulation fidelity is supplied, leaving open the possibility that observed gains are artifacts of unmodeled hardware impairments rather than semantic robustness.
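The fixed-point concern can be illustrated with a short sketch of how rounding weights to a signed fixed-point grid perturbs a single dot product. The word lengths (8 bits total, 6 fractional) and the function name `to_fixed_point` are hypothetical. The quantization scheme actually used on the FPGA is not described in this text.

```python
import random

def to_fixed_point(value, frac_bits, total_bits):
    """Round to a signed fixed-point grid and saturate to the representable range."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = max(q_min, min(q_max, round(value * scale)))
    return q / scale

rng = random.Random(0)
# Hypothetical layer weights and activations, not taken from the paper.
weights = [rng.uniform(-1.0, 1.0) for _ in range(256)]
activations = [rng.uniform(-1.0, 1.0) for _ in range(256)]

# Assumed format: 8-bit words with 6 fractional bits.
q_weights = [to_fixed_point(w, frac_bits=6, total_bits=8) for w in weights]

exact = sum(w * a for w, a in zip(weights, activations))
quantized = sum(w * a for w, a in zip(q_weights, activations))
max_weight_error = max(abs(w - q) for w, q in zip(weights, q_weights))

print(f"largest per-weight rounding error: {max_weight_error:.5f}")
print(f"dot-product error from quantization: {abs(exact - quantized):.5f}")
```

A characterization of the kind the referee asks for would repeat such a comparison layer by layer, and end to end on reconstruction quality, for the word lengths actually deployed.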

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and commit to revisions that will improve the clarity of the empirical claims and the documentation of the hardware implementation.

Referee: [Abstract] The central claim that the system 'outperforms the traditional 256-quadrature amplitude modulation system in the low signal-to-noise ratio regime' is stated without any quantitative metrics, error bars, dataset splits, channel model parameters, or ablation results. This absence makes the performance assertion impossible to evaluate from the given text and directly undermines the load-bearing empirical claim.

Authors: We agree that the abstract would benefit from quantitative support. In the revised manuscript we will update the abstract to include concrete metrics (e.g., SSIM at SNR = 0 dB and 5 dB) together with the CIFAR-10 dataset and AWGN channel parameters. The full results, including error bars from repeated trials, train/test splits, and ablation studies, are already reported in Sections IV and V; the abstract revision will make the central claim directly evaluable while remaining concise.

revision: yes

Referee: The manuscript does not provide evidence that the FPGA prototype faithfully reproduces the modeled channel used during training. No characterization of fixed-point quantization effects in the ViT layers, DAC/ADC distortion, or real-time channel emulation fidelity is supplied, leaving open the possibility that observed gains are artifacts of unmodeled hardware impairments rather than semantic robustness.

Authors: We acknowledge the value of explicit hardware-fidelity evidence. The original submission did not include such characterization. In the revision we will add a new subsection that reports the fixed-point quantization scheme applied to the ViT layers, measured DAC/ADC distortion levels, and side-by-side comparisons of FPGA output versus the simulated channel model under identical noise realizations. These additions will directly address the concern that gains may stem from unmodeled impairments.

revision: yes

Circularity Check

0 steps flagged

Empirical FPGA demo; no derivation chain present

full rationale

The paper describes an end-to-end trained ViT architecture for image transmission and its FPGA prototype implementation. Performance claims (outperformance vs 256-QAM on CIFAR-10 in low SNR) rest on hardware measurements, not on any sequence of equations, fitted parameters renamed as predictions, or self-citation chains that reduce to the paper's own inputs. No mathematical derivation is offered that could be circular; the work is a feasibility demonstration whose validity is assessed by external experimental reproduction rather than internal consistency of a proof.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient detail to enumerate specific free parameters, axioms, or invented entities; the neural-network weights and any channel-model parameters are implicitly fitted but not described.

pith-pipeline@v0.9.0 · 5634 in / 1005 out tokens · 22114 ms · 2026-05-24T12:21:54.608235+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 1 internal anchor

[3] K. Lu, R. Li, X. Chen, Z. Zhao, and H. Zhang, "Reinforcement learning-powered semantic communication via semantic similarity," arXiv preprint arXiv:2108.12121, 2021.

[4] Z. Weng and Z. Qin, "Semantic communication systems for speech transmission," IEEE Journal on Selected Areas in Communications, 2021.

[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. on Comp. Vision and Pattern Recog., 2016, pp. 770-778.

[6] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.

[7] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. Int. Conf. on Machine Learning, 2008, pp. 1096-1103.

[8] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Trans. on Image Proc., vol. 13, no. 4, pp. 600-612, 2004.
