Recursive Vision Transformer with Dynamic Depth and Width Adjustment for Resource-Efficient Image Semantic Communication

Changchuan Yin; Danpu Liu; Gongyu Jin; Sihua Wang; Xinhui Zhang; Zhilong Zhang

A recursive vision transformer with adaptive depth and width cuts parameters nearly in half for image semantic communication while raising reconstruction quality.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric. the ladder, T0–T4 →

Challenge this review Re-run · record.json Download PDF Read on arXiv ↗

T0 review · grok-4.3

2026-06-29 13:30 UTC pith:TFCST3V5

load-bearing objection The paper applies recursion and dynamic depth/width tuning to a ViT for semantic image comm and reports 48.7% parameter reduction in simulations, but the abstract leaves the overhead of the adaptation logic unaddressed. the 1 major comments →

arxiv 2606.00114 v1 pith:TFCST3V5 submitted 2026-05-27 cs.CV cs.ITmath.IT

Recursive Vision Transformer with Dynamic Depth and Width Adjustment for Resource-Efficient Image Semantic Communication

Zhilong Zhang , Xinhui Zhang , Gongyu Jin , Sihua Wang , Danpu Liu , Changchuan Yin This is my paper

classification cs.CV cs.ITmath.IT

keywords image semantic communicationvision transformerrecursive architecturedynamic depth adjustmentdynamic width adjustmentresource efficiencyparameter reductionsemantic feature refinement

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a vision transformer architecture for image semantic communication that uses a recursive loop to refine features and thereby shrink the total parameter count. Three separate dynamic mechanisms then adjust how many recursion layers run and which neurons and attention heads stay active, based on the input image and the channel state. Under the tested conditions these changes deliver higher image quality than prior systems at the same computational budget. A sympathetic reader would care because semantic communication systems must eventually run on edge devices with tight memory and power limits, and the reported reduction in size could make that deployment feasible.

Core claim

The proposed recursive ViT-based system, combined with the three dynamic adjustment strategies, reduces the parameter count by 48.7% and achieves higher reconstruction quality than existing baselines under comparable computational complexity.

What carries the argument

Recursive structure that iteratively refines semantic features together with dynamic depth adjustment (varying number of recursive modules), dynamic width adjustment (pruning neurons and heads), and joint width-depth optimization.

Load-bearing premise

The simulation results under the tested images and channel conditions will hold in real deployments without extra overhead from the adjustment logic or drops in untested conditions.

What would settle it

Running the system on actual edge hardware with live wireless channels and measuring whether parameter savings and quality gains remain within 5 percent of the reported simulation figures.

Watch this falsifier — get emailed when new claim-graph text bears on it.

If this is right

The architecture can fit inside the memory budget of typical mobile or IoT devices that current ViT semantic encoders exceed.
Computation can be scaled on the fly per image or per channel condition without retraining the entire model.
Joint width-depth control creates a continuous trade-off curve between latency and quality that system designers can tune at runtime.
Fewer parameters lower the energy cost of transmitting the semantic encoder itself over the network before inference begins.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the dynamic logic itself adds measurable latency on low-power chips, the net gain in end-to-end latency may shrink compared with the static baseline.
The same recursive-plus-pruning pattern could be applied to other transformer-based semantic tasks such as video or point-cloud transmission without starting from scratch.
A hardware implementation that exposes the width and depth controls to the MAC layer could close the loop between channel state and model size in real time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper applies recursion and dynamic depth/width tuning to a ViT for semantic image comm and reports 48.7% parameter reduction in simulations, but the abstract leaves the overhead of the adaptation logic unaddressed.

read the letter

This paper takes a recursive Vision Transformer and layers on dynamic depth, width, and joint adjustment strategies to cut resource use in image semantic communication. The central claim is a 48.7% parameter reduction plus higher reconstruction quality than baselines at comparable complexity.

The approach is a direct engineering response to the memory and compute problems that block edge deployment in 6G-style systems. Recursion lets the model refine features over multiple passes while sharing parameters, and the three adjustment methods let the system pick recursion count or prune heads and neurons based on image content and channel state. The joint optimization adds a practical knob for trading width against depth. These moves are not individually novel, but putting them together for this use case is a reasonable step.

The soft spot is the thin support for the numbers. The abstract states simulation outcomes without naming baselines, training details, or any ablation that isolates the cost of the dynamic decision logic. If that logic requires a separate network or adds runtime parameters and FLOPs that are not folded into the comparisons, the net savings shrink. The stress-test note flags exactly this risk, and nothing in the provided abstract rules it out.

This is for people working on efficient semantic communications or resource-aware vision models for wireless links. A reader focused on edge AI deployments could extract useful implementation ideas if the full experiments include proper overhead accounting and statistical checks.

I would bring it to a reading group to talk through the adaptation overhead question. It deserves peer review because the target problem is timely and the method is concrete, even though revisions will be needed to strengthen the evidence.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a recursive Vision Transformer architecture for image semantic communication, incorporating a recursive structure to iteratively refine semantic features and reduce parameters, along with three dynamic adjustment strategies (dynamic depth based on image content and channel conditions, dynamic width to preserve important neurons/heads, and joint width-depth optimization) to adaptively lower computational complexity. Simulation results are stated to show a 48.7% parameter reduction and higher reconstruction quality than baselines at comparable complexity.

Significance. If the reported simulation outcomes are robustly supported with proper accounting for adaptation overhead and standard baselines, the work could advance resource-efficient semantic communication systems suitable for constrained wireless devices. The combination of recursion with content- and channel-adaptive mechanisms offers a targeted approach to efficiency in ViT-based semantic codecs, though its impact hinges on reproducible experimental validation.

major comments (1)

[Abstract] Abstract: The central claim of a 48.7% parameter reduction with higher reconstruction quality is presented as a direct simulation outcome, but without any reference to the specific baselines, metrics (e.g., PSNR or SSIM), training details, statistical significance, or ablation isolating the overhead of the dynamic decision mechanisms from the recursive core. This directly affects the load-bearing claim that net savings are achieved under comparable complexity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim of a 48.7% parameter reduction with higher reconstruction quality is presented as a direct simulation outcome, but without any reference to the specific baselines, metrics (e.g., PSNR or SSIM), training details, statistical significance, or ablation isolating the overhead of the dynamic decision mechanisms from the recursive core. This directly affects the load-bearing claim that net savings are achieved under comparable complexity.

Authors: We agree the abstract is too terse on these points. The full manuscript specifies the baselines (non-recursive ViT semantic codecs), metrics (PSNR and SSIM), training details (dataset splits, optimizer, epochs), and Section 4.3 ablations that isolate dynamic overhead from the recursive core; the 48.7% reduction is reported relative to the baseline under matched channel SNR with complexity including decision costs. We will revise the abstract to add brief references to the metrics, the primary baseline, and a note that overhead is accounted for in the reported complexity. revision: yes

Circularity Check

0 steps flagged

No circularity; architecture and results are independently simulated

full rationale

The paper introduces a recursive ViT structure plus three dynamic adjustment strategies (depth, width, joint) for semantic communication, claiming 48.7% parameter reduction and improved reconstruction via simulation. No equations, fitted parameters, or self-citations are described that would make any 'prediction' or uniqueness claim reduce to its own inputs by construction. The central claims rest on direct empirical comparisons under stated conditions rather than self-referential definitions or imported ansatzes. This is a standard architectural proposal validated externally, warranting score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details remain unspecified.

pith-pipeline@v0.9.1-grok · 5721 in / 1039 out tokens · 40596 ms · 2026-06-29T13:30:40.104875+00:00 · methodology

0 comments

read the original abstract

Image semantic communication is a critical component in next-generation wireless communication systems. However, such systems typically suffer from large memory footprints and high computational complexity, making them difficult to deploy on resource-constrained devices. To address these challenges, we propose a vision transformer (ViT)-enabled image semantic communication system. In this system, a recursive structure is introduced to iteratively refine semantic features and reduce the parameter count. In addition, three dynamic adjustment strategies are designed to adaptively reduce computational complexity: dynamic depth adjustment, dynamic width adjustment, and joint width-depth optimization. Dynamic depth adjustment adaptively determines the number of recursive modules according to image content and channel conditions, while dynamic width adjustment selectively preserves important neurons and attention heads. The joint width-depth optimization further enables flexible computation configurations. Simulation results verify that the proposed recursive ViT-based system, combined with the three dynamic adjustment strategies, reduces the parameter count by 48.7% and achieves higher reconstruction quality than existing baselines under comparable computational complexity.

Figures

Figures reproduced from arXiv: 2606.00114 by Changchuan Yin, Danpu Liu, Gongyu Jin, Sihua Wang, Xinhui Zhang, Zhilong Zhang.

**Figure 2.** Figure 2: Visualization of pruning the fully connected layer in the Transformer [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 4.** Figure 4: SSIM performance varies with SNR under different channel types. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Average SSIM difference (relative to Standard ViT) vs. average FLOPs under the Rayleigh channel for the joint width–depth optimization strategy. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Average SSIM difference vs. average FLOPs for all considered algorithms under different channel types. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: SSIM performance as SNR increases at fixed computational complexity under different channel types. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: The number of layers and the corresponding FLOPs vary as SNR [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 10.** Figure 10: Distribution of pruning ratios for the RTUs under the AWGN channel [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 5 canonical work pages · 1 internal anchor

[1]

SCSC: A novel standards-compatible semantic communication frame- work for image transmission,

X. Han, Y . Wu, Z. Gao, B. Feng, Y . Shi, D. G ¨und¨uz, and W. Zhang, “SCSC: A novel standards-compatible semantic communication frame- work for image transmission,”IEEE Transactions on Communications, vol. 73, no. 8, pp. 5682–5698, Aug. 2025

2025
[2]

Enhancement and segmentation of high definition CT images in everything 6G medical IoT environment,

J. Liu, F. Yu, R. Li, X. Lyu, and S. Zheng, “Enhancement and segmentation of high definition CT images in everything 6G medical IoT environment,”IEEE Internet of Things Journal, pp. 1–1, Jun. 2025

2025
[3]

Semantic importance- aware image transmission in V2X networks,

A. Cai, L. Wang, Y . Lin, C. Liu, and P. Qian, “Semantic importance- aware image transmission in V2X networks,”IEEE Internet of Things Journal, vol. 12, no. 17, pp. 36 471–36 487, Sep. 2025

2025
[4]

Overview of AI and communication for 6G network: Fundamentals, challenges, and future research opportunities,

Q. Cui, X. You, N. Wei, G. Nan, X. Zhang, J. Zhanget al., “Overview of AI and communication for 6G network: Fundamentals, challenges, and future research opportunities,”Science China Information Sciences, vol. 68, no. 7, p. 171301, Apr. 2025

2025
[5]

When AI meets sustainable 6G,

X. You, Y . Huang, C. Zhang, J. Wang, H. Yin, and H. Wu, “When AI meets sustainable 6G,”Science China Information Sciences, vol. 68, no. 1, p. 110301, Dec 2024

2024
[6]

Advancing 6G: Survey for explainable AI on communications and network slicing,

H. Sun, Y . Liu, A. Al-Tahmeesschi, A. Nag, M. Soleimanpour, B. Can- berk, H. Arslan, and H. Ahmadi, “Advancing 6G: Survey for explainable AI on communications and network slicing,”IEEE Open Journal of the Communications Society, vol. 6, pp. 1372–1412, Jan. 2025

2025
[7]

Generative semantic communication for text-to-speech synthesis,

J. Zheng, J. Ren, P. Xu, Z. Yuan, J. Xu, F. Wang, G. Gui, and S. Cui, “Generative semantic communication for text-to-speech synthesis,” in IEEE Globecom Workshops (GC Wkshps), Cape Town, South Africa, Dec. 2024

2024
[8]

A survey on semantic communications: technologies, solutions, applications and challenges,

Y . Liu, X. Wang, Z. Ning, M. Zhou, L. Guo, and B. Jedari, “A survey on semantic communications: technologies, solutions, applications and challenges,”Digital Communications and Networks, vol. 10, no. 3, pp. 528–545, Jun. 2024

2024
[9]

Attention-based UNet enabled lightweight image semantic communication system over Internet of Things,

G. Ma, H. Tong, N. Yang, and C. Yin, “Attention-based UNet enabled lightweight image semantic communication system over Internet of Things,” inIEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, Apr. 2024

2024
[10]

A novel lightweight joint source- channel coding design in semantic communications,

X. Yu, D. Li, N. Zhang, and X. Shen, “A novel lightweight joint source- channel coding design in semantic communications,”IEEE Internet of Things Journal, vol. 12, no. 11, pp. 18 447–18 450, Jun. 2025

2025
[11]

Lightweight semantic communication model driven UA V for intelligent transmission,

Y . Liang, “Lightweight semantic communication model driven UA V for intelligent transmission,” inInternational Symposium on Computer Applications and Information Technology (ISCAIT), Xi’an, China, Mar. 2025

2025
[12]

Lightweight task- oriented semantic communication empowered by large-scale AI models,

C. Liu, C. Guo, Y . Yang, M. Chen, and T. Q. S. Quek, “Lightweight task- oriented semantic communication empowered by large-scale AI models,” IEEE Transactions on V ehicular Technology, vol. 74, no. 9, pp. 14 823– 14 827, Sep. 2025

2025
[13]

Dynamic neural networks: A survey,

Y . Han, G. Huang, S. Song, L. Yang, H. Wang, and Y . Wang, “Dynamic neural networks: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7436–7456, Nov. 2022

2022
[14]

Learning task-oriented communication for edge inference: An information bottleneck approach,

J. Shao, Y . Mao, and J. Zhang, “Learning task-oriented communication for edge inference: An information bottleneck approach,”IEEE Journal on Selected Areas in Communications, vol. 40, no. 1, pp. 197–211, Jan. 2022

2022
[15]

Semantic communications for image recovery and classification via deep joint source and channel coding,

Z. Lyu, G. Zhu, J. Xu, B. Ai, and S. Cui, “Semantic communications for image recovery and classification via deep joint source and channel coding,”IEEE Transactions on Wireless Communications, vol. 23, no. 8, pp. 8388–8404, Aug. 2024. 11

2024
[16]

Universal transformers,

M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser, “Universal transformers,” inInternational Conference on Learning Representations, New Orleans, Louisiana, USA, May 2019. [Online]. Available: https://openreview.net/forum?id=HyzdRiR9Y7

2019
[17]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inInternational Conference on Learning Representations, Vienna, Austria, Jan. 2021. [Online]. Available: https://openre...

2021
[18]

Vision Transformer based semantic communications for next generation wireless networks,

M. A. Mohsin, M. Jazib, Z. Alam, M. F. Khan, M. Saad, and M. A. Jamshed, “Vision Transformer based semantic communications for next generation wireless networks,” inIEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, Jun. 2025

2025
[19]

A robust image semantic communication system with multi-scale Vision Transformer,

X. Peng, Z. Qin, X. Tao, J. Lu, and K. B. Letaief, “A robust image semantic communication system with multi-scale Vision Transformer,” IEEE Journal on Selected Areas in Communications, vol. 43, no. 4, pp. 1278–1291, Apr. 2025

2025
[20]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems, Long Beach, CA, USA, Dec. 2017

2017
[21]

Leap: Learnable pruning for Transformer-based models,

Z. Yao, X. Wu, L. Ma, S. Shen, K. Keutzer, M. W. Mahoney, and Y . He, “Leap: Learnable pruning for Transformer-based models,” arXiv:2105.14636, May 2022

work page arXiv 2022
[22]

To prune, or not to prune: exploring the efficacy of pruning for model compression

M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,”arXiv:1710.01878, Oct. 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[23]

Big/little deep neural network for ultra low power inference,

E. Park, D. Kim, S. Kim, Y .-D. Kim, G. Kim, S. Yoon, and S. Yoo, “Big/little deep neural network for ultra low power inference,” in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Amsterdam, Netherlands, Oct. 2015

2015
[24]

Image quality assess- ment: from error visibility to structural similarity,

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assess- ment: from error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004

2004
[25]

Vision Transformer for adaptive image transmission over MIMO channels,

H. Wu, Y . Shao, C. Bian, K. Mikolajczyk, and D. G ¨und¨uz, “Vision Transformer for adaptive image transmission over MIMO channels,” in ICC-IEEE International Conference on Communications, Rome, Italy, May 2023

2023
[26]

Deep joint source- channel coding for wireless image transmission,

E. Bourtsoulatze, D. Burth Kurka, and D. G ¨und¨uz, “Deep joint source- channel coding for wireless image transmission,”IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, Sep. 2019

2019
[27]

Learning motion blur robust Vision Transformers with dynamic early exit for real-time UA V tracking,

Y . Wu, X. Wang, D. Zeng, H. Ye, X. Xie, Q. Zhao, and S. Li, “Learning motion blur robust Vision Transformers with dynamic early exit for real-time UA V tracking,”CoRR, vol. abs/2407.05383, Jul

work page arXiv
[28]

Learning motion blur robust Vision Transformers with dynamic early exit for real-time UA V tracking,

[Online]. Available: https://doi.org/10.48550/arXiv.2407.05383

work page doi:10.48550/arxiv.2407.05383
[29]

Vision Transformer pruning,

M. Zhu, Y . Tang, and K. Han, “Vision Transformer pruning,” arXiv:2104.08500, Apr. 2021

work page arXiv 2021
[30]

A flexible bert model enabling width- and depth-dynamic inference,

T. Hu, C. Meinel, and H. Yang, “A flexible bert model enabling width- and depth-dynamic inference,”Computer Speech & Language, vol. 87, p. 101646, Apr. 2024. Zhilong Zhangreceived the B.E. degree in com- munication engineering from the University of Sci- ence and Technology, Beijing, China in 2007, and the M.S. and the Ph.D. degrees in communication and i...

2024
[31]

degree with the Laboratory of Wireless Communication Systems and Networks, BUPT

He is currently pursuing the M.S. degree with the Laboratory of Wireless Communication Systems and Networks, BUPT. His main research interests focus on semantic communications. Gongyu Jinreceived the B.E. and M.S. degrees in Communication Engineering from Beijing Uni- versity of Posts and Telecommunications (BUPT), Beijing, China, in 2021 and 2024. Her re...

2021

[1] [1]

SCSC: A novel standards-compatible semantic communication frame- work for image transmission,

X. Han, Y . Wu, Z. Gao, B. Feng, Y . Shi, D. G ¨und¨uz, and W. Zhang, “SCSC: A novel standards-compatible semantic communication frame- work for image transmission,”IEEE Transactions on Communications, vol. 73, no. 8, pp. 5682–5698, Aug. 2025

2025

[2] [2]

Enhancement and segmentation of high definition CT images in everything 6G medical IoT environment,

J. Liu, F. Yu, R. Li, X. Lyu, and S. Zheng, “Enhancement and segmentation of high definition CT images in everything 6G medical IoT environment,”IEEE Internet of Things Journal, pp. 1–1, Jun. 2025

2025

[3] [3]

Semantic importance- aware image transmission in V2X networks,

A. Cai, L. Wang, Y . Lin, C. Liu, and P. Qian, “Semantic importance- aware image transmission in V2X networks,”IEEE Internet of Things Journal, vol. 12, no. 17, pp. 36 471–36 487, Sep. 2025

2025

[4] [4]

Overview of AI and communication for 6G network: Fundamentals, challenges, and future research opportunities,

Q. Cui, X. You, N. Wei, G. Nan, X. Zhang, J. Zhanget al., “Overview of AI and communication for 6G network: Fundamentals, challenges, and future research opportunities,”Science China Information Sciences, vol. 68, no. 7, p. 171301, Apr. 2025

2025

[5] [5]

When AI meets sustainable 6G,

X. You, Y . Huang, C. Zhang, J. Wang, H. Yin, and H. Wu, “When AI meets sustainable 6G,”Science China Information Sciences, vol. 68, no. 1, p. 110301, Dec 2024

2024

[6] [6]

Advancing 6G: Survey for explainable AI on communications and network slicing,

H. Sun, Y . Liu, A. Al-Tahmeesschi, A. Nag, M. Soleimanpour, B. Can- berk, H. Arslan, and H. Ahmadi, “Advancing 6G: Survey for explainable AI on communications and network slicing,”IEEE Open Journal of the Communications Society, vol. 6, pp. 1372–1412, Jan. 2025

2025

[7] [7]

Generative semantic communication for text-to-speech synthesis,

J. Zheng, J. Ren, P. Xu, Z. Yuan, J. Xu, F. Wang, G. Gui, and S. Cui, “Generative semantic communication for text-to-speech synthesis,” in IEEE Globecom Workshops (GC Wkshps), Cape Town, South Africa, Dec. 2024

2024

[8] [8]

A survey on semantic communications: technologies, solutions, applications and challenges,

Y . Liu, X. Wang, Z. Ning, M. Zhou, L. Guo, and B. Jedari, “A survey on semantic communications: technologies, solutions, applications and challenges,”Digital Communications and Networks, vol. 10, no. 3, pp. 528–545, Jun. 2024

2024

[9] [9]

Attention-based UNet enabled lightweight image semantic communication system over Internet of Things,

G. Ma, H. Tong, N. Yang, and C. Yin, “Attention-based UNet enabled lightweight image semantic communication system over Internet of Things,” inIEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, Apr. 2024

2024

[10] [10]

A novel lightweight joint source- channel coding design in semantic communications,

X. Yu, D. Li, N. Zhang, and X. Shen, “A novel lightweight joint source- channel coding design in semantic communications,”IEEE Internet of Things Journal, vol. 12, no. 11, pp. 18 447–18 450, Jun. 2025

2025

[11] [11]

Lightweight semantic communication model driven UA V for intelligent transmission,

Y . Liang, “Lightweight semantic communication model driven UA V for intelligent transmission,” inInternational Symposium on Computer Applications and Information Technology (ISCAIT), Xi’an, China, Mar. 2025

2025

[12] [12]

Lightweight task- oriented semantic communication empowered by large-scale AI models,

C. Liu, C. Guo, Y . Yang, M. Chen, and T. Q. S. Quek, “Lightweight task- oriented semantic communication empowered by large-scale AI models,” IEEE Transactions on V ehicular Technology, vol. 74, no. 9, pp. 14 823– 14 827, Sep. 2025

2025

[13] [13]

Dynamic neural networks: A survey,

Y . Han, G. Huang, S. Song, L. Yang, H. Wang, and Y . Wang, “Dynamic neural networks: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7436–7456, Nov. 2022

2022

[14] [14]

Learning task-oriented communication for edge inference: An information bottleneck approach,

J. Shao, Y . Mao, and J. Zhang, “Learning task-oriented communication for edge inference: An information bottleneck approach,”IEEE Journal on Selected Areas in Communications, vol. 40, no. 1, pp. 197–211, Jan. 2022

2022

[15] [15]

Semantic communications for image recovery and classification via deep joint source and channel coding,

Z. Lyu, G. Zhu, J. Xu, B. Ai, and S. Cui, “Semantic communications for image recovery and classification via deep joint source and channel coding,”IEEE Transactions on Wireless Communications, vol. 23, no. 8, pp. 8388–8404, Aug. 2024. 11

2024

[16] [16]

Universal transformers,

M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser, “Universal transformers,” inInternational Conference on Learning Representations, New Orleans, Louisiana, USA, May 2019. [Online]. Available: https://openreview.net/forum?id=HyzdRiR9Y7

2019

[17] [17]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inInternational Conference on Learning Representations, Vienna, Austria, Jan. 2021. [Online]. Available: https://openre...

2021

[18] [18]

Vision Transformer based semantic communications for next generation wireless networks,

M. A. Mohsin, M. Jazib, Z. Alam, M. F. Khan, M. Saad, and M. A. Jamshed, “Vision Transformer based semantic communications for next generation wireless networks,” inIEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, Jun. 2025

2025

[19] [19]

A robust image semantic communication system with multi-scale Vision Transformer,

X. Peng, Z. Qin, X. Tao, J. Lu, and K. B. Letaief, “A robust image semantic communication system with multi-scale Vision Transformer,” IEEE Journal on Selected Areas in Communications, vol. 43, no. 4, pp. 1278–1291, Apr. 2025

2025

[20] [20]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems, Long Beach, CA, USA, Dec. 2017

2017

[21] [21]

Leap: Learnable pruning for Transformer-based models,

Z. Yao, X. Wu, L. Ma, S. Shen, K. Keutzer, M. W. Mahoney, and Y . He, “Leap: Learnable pruning for Transformer-based models,” arXiv:2105.14636, May 2022

work page arXiv 2022

[22] [22]

To prune, or not to prune: exploring the efficacy of pruning for model compression

M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,”arXiv:1710.01878, Oct. 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[23] [23]

Big/little deep neural network for ultra low power inference,

E. Park, D. Kim, S. Kim, Y .-D. Kim, G. Kim, S. Yoon, and S. Yoo, “Big/little deep neural network for ultra low power inference,” in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Amsterdam, Netherlands, Oct. 2015

2015

[24] [24]

Image quality assess- ment: from error visibility to structural similarity,

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assess- ment: from error visibility to structural similarity,”IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004

2004

[25] [25]

Vision Transformer for adaptive image transmission over MIMO channels,

H. Wu, Y . Shao, C. Bian, K. Mikolajczyk, and D. G ¨und¨uz, “Vision Transformer for adaptive image transmission over MIMO channels,” in ICC-IEEE International Conference on Communications, Rome, Italy, May 2023

2023

[26] [26]

Deep joint source- channel coding for wireless image transmission,

E. Bourtsoulatze, D. Burth Kurka, and D. G ¨und¨uz, “Deep joint source- channel coding for wireless image transmission,”IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, Sep. 2019

2019

[27] [27]

Learning motion blur robust Vision Transformers with dynamic early exit for real-time UA V tracking,

Y . Wu, X. Wang, D. Zeng, H. Ye, X. Xie, Q. Zhao, and S. Li, “Learning motion blur robust Vision Transformers with dynamic early exit for real-time UA V tracking,”CoRR, vol. abs/2407.05383, Jul

work page arXiv

[28] [28]

Learning motion blur robust Vision Transformers with dynamic early exit for real-time UA V tracking,

[Online]. Available: https://doi.org/10.48550/arXiv.2407.05383

work page doi:10.48550/arxiv.2407.05383

[29] [29]

Vision Transformer pruning,

M. Zhu, Y . Tang, and K. Han, “Vision Transformer pruning,” arXiv:2104.08500, Apr. 2021

work page arXiv 2021

[30] [30]

A flexible bert model enabling width- and depth-dynamic inference,

T. Hu, C. Meinel, and H. Yang, “A flexible bert model enabling width- and depth-dynamic inference,”Computer Speech & Language, vol. 87, p. 101646, Apr. 2024. Zhilong Zhangreceived the B.E. degree in com- munication engineering from the University of Sci- ence and Technology, Beijing, China in 2007, and the M.S. and the Ph.D. degrees in communication and i...

2024

[31] [31]

degree with the Laboratory of Wireless Communication Systems and Networks, BUPT

He is currently pursuing the M.S. degree with the Laboratory of Wireless Communication Systems and Networks, BUPT. His main research interests focus on semantic communications. Gongyu Jinreceived the B.E. and M.S. degrees in Communication Engineering from Beijing Uni- versity of Posts and Telecommunications (BUPT), Beijing, China, in 2021 and 2024. Her re...

2021