Mixture-of-Experts Transformer for Automatic Modulation Recognition

Jiale Wang; Jingwei Zhang; Wupeng Xie; Xin Liu; Yaxin Mu; Zhilong Zhao

arxiv: 2606.09085 · v1 · pith:GG2ISNBPnew · submitted 2026-06-08 · 📡 eess.SP

Pith reviewed 2026-06-27 15:55 UTC · model grok-4.3

classification 📡 eess.SP

keywords automatic modulation recognition · mixture of experts · transformer · I/Q signals · temporal resampling · cognitive radio · deep learning

The pith

Mixture-of-experts transformer with input-dependent gating outperforms static multi-scale methods on I/Q modulation signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MoEformer to overcome the rigidity of existing deep-learning approaches to automatic modulation recognition. Those approaches rely on fixed multi-scale fusion that cannot adjust when modulation signals change their temporal behavior. MoEformer creates several expert views of the same I/Q waveform by resampling it at different rates, then lets an input-dependent gate decide which experts to combine for each sample. Rotary position embeddings inside the transformer layers track both short-range and long-range timing relations. The resulting model records higher average accuracy than prior baselines on three standard radio-signal collections while keeping model size practical.

Core claim

MoEformer is an adaptive Multi-Scale Mixture-of-Experts Transformer network that directly processes I/Q signals to preserve their temporal and phase structures. It constructs multi-scale expert views through temporal resampling, employs an input-dependent gating mechanism for dynamic expert fusion, and integrates Rotary Position Embeddings within Transformer encoders to capture both local and global temporal dependencies, achieving superior average recognition accuracies of 63.74 percent, 66.24 percent, and 64.22 percent on RadioML2016.10a, RadioML2016.10b, and RadioML2018.01A respectively.

What carries the argument

Input-dependent gating mechanism that dynamically selects and fuses multi-scale expert views produced by temporal resampling inside a RoPE-equipped Transformer encoder.
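
The mechanism described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the resampling here is simple average-pool decimation, the "experts" are hand-written summary statistics standing in for Transformer encoders, and the gate is a single random linear layer followed by a softmax. The names `resample`, `expert`, `gate`, and `moe_forward` are hypothetical.

```python
import math
import random


def resample(signal, factor):
    """Downsample (I, Q) pairs by averaging `factor` consecutive samples.

    Assumption: the paper only says "temporal resampling"; average pooling
    is one plausible choice, used here purely for illustration.
    """
    return [
        (sum(i for i, _ in signal[t:t + factor]) / factor,
         sum(q for _, q in signal[t:t + factor]) / factor)
        for t in range(0, len(signal) - factor + 1, factor)
    ]


def expert(view):
    """Stand-in expert: [mean power, mean |I|, mean |Q|] of one view."""
    n = len(view)
    return [
        sum(i * i + q * q for i, q in view) / n,
        sum(abs(i) for i, _ in view) / n,
        sum(abs(q) for _, q in view) / n,
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate(signal, gate_weights):
    """Input-dependent gate: one logit per expert, computed from the input."""
    summary = expert(signal)
    logits = [sum(w * x for w, x in zip(row, summary)) for row in gate_weights]
    return softmax(logits)


def moe_forward(signal, factors, gate_weights):
    """Build one view per resampling factor, then fuse the expert outputs."""
    features = [expert(resample(signal, f)) for f in factors]
    weights = gate(signal, gate_weights)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


rng = random.Random(0)
# Synthetic noisy complex tone as a placeholder I/Q frame (not RadioML data).
signal = [(math.cos(0.3 * t) + 0.1 * rng.gauss(0, 1),
           math.sin(0.3 * t) + 0.1 * rng.gauss(0, 1)) for t in range(128)]
factors = [1, 2, 4]
gate_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in factors]
fused, weights = moe_forward(signal, factors, gate_weights)
print("gate weights:", weights)
print("fused features:", fused)
```

Because the gate weights are computed from each input frame, two different frames can receive different mixtures of the same experts, which is the property the paper contrasts with fixed multi-scale fusion.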

If this is right

Higher recognition accuracy than competitive baselines on the three evaluated RadioML benchmarks.
Direct I/Q processing that retains original temporal and phase information without intermediate transformations.
Improved ability to handle dynamic temporal variations compared with static multi-scale fusion.
A practical balance between recognition performance and overall model complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same resampling-plus-gating pattern could be tested on other time-series classification problems that involve variable-scale patterns.
If the gate learns to ignore certain experts for particular signal classes, the architecture might be pruned for lower inference cost without retraining.
Deployment on edge devices would require measuring whether the added gating computation offsets the accuracy gain under strict latency constraints.

Load-bearing premise

Dynamic selection of experts according to each input will adapt more effectively to changing temporal patterns in modulation signals than any fixed multi-scale fusion rule.

What would settle it

A controlled experiment on the same three RadioML datasets in which the input-dependent gate is replaced by a static average of the same experts and accuracy does not decrease would show that the gating step is not required.
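
The control condition in that experiment amounts to swapping the per-sample gate weights for a constant uniform vector while leaving the experts untouched. A minimal sketch, assuming each expert emits a fixed-length feature vector; the feature values and the example gate output below are hypothetical placeholders, not numbers from the paper.

```python
def fuse(expert_features, weights):
    """Weighted sum of per-expert feature vectors."""
    dim = len(expert_features[0])
    return [sum(w * f[d] for w, f in zip(weights, expert_features))
            for d in range(dim)]


def static_average_weights(num_experts):
    """Ablation control: every expert gets the same weight for every input."""
    return [1.0 / num_experts] * num_experts


# Hypothetical outputs of three experts for a single input frame.
expert_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical gate output for this frame (would vary from frame to frame).
gated = fuse(expert_features, [0.7, 0.2, 0.1])

# Static control: identical weights regardless of the frame.
static = fuse(expert_features, static_average_weights(3))

print("gated fusion: ", gated)
print("static fusion:", static)
```

Keeping `fuse` identical in both arms means any accuracy difference can be attributed to how the weights are produced rather than to the fusion arithmetic.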

Figures

Figures reproduced from arXiv: 2606.09085 by Jiale Wang, Jingwei Zhang, Wupeng Xie, Xin Liu, Yaxin Mu, Zhilong Zhao.

**Figure 2.** Architecture of the Local Representation Block (LRB).

**Figure 4.** Adaptive gating network for dynamic expert routing.

**Figure 5.** Comparison of classification accuracy versus SNR across three benchmark datasets.

**Figure 6.** Performance of different modulation schemes on three

**Figure 7.** Confusion matrices for modulation classification across three benchmark RadioML datasets under varying SNR conditions.

**Figure 8.** t-SNE visualizations of learned modulation feature representations across three benchmark datasets at 0 dB and 12 dB

**Figure 9.** Performance comparison of different positional encoding

**Figure 10.** Distribution of gating weights across modulation types

Original abstract

Automatic Modulation Recognition (AMR) is a key enabling technology for cognitive radio and intelligent spectrum management in next-generation wireless systems. However, current deep learning-based AMR methods predominantly rely on static multi-scale fusion strategies, which lack the flexibility to adapt to the highly dynamic temporal variations of modulation signals. To address this limitation, we propose MoEformer, an adaptive Multi-Scale Mixture-of-Experts Transformer network that directly processes I/Q signals to preserve their temporal and phase structures. Specifically, MoEformer constructs multi scale expert views through temporal resampling, employs an input-dependent gating mechanism for dynamic expert fusion, and integrates Rotary Position Embeddings (RoPE) within Transformer encoders to capture both local and global tem poral dependencies. Comprehensive evaluations on three widely adopted benchmarks (RadioML2016.10a, RadioML2016.10b, and RadioML2018.01A) demonstrate that MoEformer outperforms the competitive baselines, achieving superior average recognition accuracies of 63.74%, 66.24%, and 64.22%, respectively. In addition, the proposed method strikes an optimal trade-off between recognition performance and model complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MoEformer applies input-dependent MoE gating plus temporal resampling and RoPE to raw I/Q for AMR and reports higher point-estimate accuracies than baselines, but those numbers lack any variance or significance measures.

The main thing to know is that this paper puts forward MoEformer, a transformer that builds multi-scale expert views of I/Q signals via temporal resampling, routes them with an input-dependent gate, and uses RoPE inside the encoders. It claims average accuracies of 63.74 percent, 66.24 percent, and 64.22 percent on the three RadioML sets, beating the listed baselines.

What is actually new is the particular stacking of those three pieces for the AMR task. Each element has appeared elsewhere, but the combination is presented as a direct response to the static-fusion limitation the authors identify. The architecture description is straightforward and keeps the input as raw I/Q rather than moving to spectrograms or other transforms.

The paper does a decent job laying out why an adaptive gate might handle varying temporal structure better than fixed multi-scale fusion. The motivation is domain-plausible for modulation signals.

The soft spot is exactly the one flagged in the stress-test note. The performance numbers are single scalars with no standard deviations, no count of independent runs, no confidence intervals, and no ablation results visible in the abstract. In this area, where seed and optimizer effects are known to matter, that leaves the claimed gains open to the possibility that they sit inside normal experimental variation. If the full paper supplies those details it would strengthen the case; from the given text the central empirical claim rests on thin evidence.
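
For concreteness, the missing reporting could look like the following sketch, which summarises per-seed accuracies as mean, sample standard deviation, and a 95% confidence interval based on Student's t. The accuracy values are made-up placeholders, not results from the paper, and the function name `summarize` is hypothetical. The critical values are the standard two-sided 95% t quantiles for the listed degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 9: 2.262}


def summarize(accuracies):
    """Return (mean, sample std, 95% CI) for accuracies from independent runs."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder accuracies (percent) from five hypothetical seeds.
runs = [60.0, 61.0, 62.0, 61.0, 61.0]
mean, std, (low, high) = summarize(runs)
print(f"mean={mean:.2f} std={std:.2f} 95% CI=[{low:.2f}, {high:.2f}]")
```

A margin between two methods that is smaller than the interval half-width would be hard to distinguish from seed-to-seed variation.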

This is for people already working on deep-learning methods for automatic modulation recognition or similar signal-classification problems in wireless systems. A reader who follows transformer or MoE work in communications could extract the design choices without much trouble.

The paper shows clear enough thinking on its own terms and engages the relevant literature at the level of the abstract. I would send it to peer review rather than desk-reject so the experimental reporting can be checked and, if needed, improved.

Referee Report

2 major / 2 minor

Summary. The paper proposes MoEformer, an adaptive Multi-Scale Mixture-of-Experts Transformer for Automatic Modulation Recognition that processes raw I/Q signals. It constructs multi-scale expert views via temporal resampling, uses an input-dependent gating network for dynamic expert selection, and incorporates Rotary Position Embeddings (RoPE) inside Transformer encoders to model local and global temporal dependencies. On the RadioML2016.10a, RadioML2016.10b, and RadioML2018.01A benchmarks the method is reported to achieve average accuracies of 63.74%, 66.24%, and 64.22% respectively, outperforming competitive baselines while maintaining a favorable accuracy-complexity trade-off.

Significance. If the empirical superiority can be shown to be statistically reliable, the combination of input-dependent MoE routing with explicit temporal resampling offers a principled way to handle non-stationary modulation signals without relying on hand-crafted multi-scale fusion. The preservation of raw I/Q phase structure and the use of RoPE are technically sound design choices that align with the physics of the problem.
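
As background on the RoPE component, the sketch below applies the rotary embedding from the RoFormer formulation to a single vector: consecutive coordinate pairs are rotated by an angle proportional to the token position. How MoEformer tokenises I/Q samples before applying RoPE is not stated in the abstract, so the vectors and positions here are arbitrary placeholders, and the helper names `rope` and `dot` are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i + 1]) by position * base ** (-i / d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.3, -0.1, 0.4]

# The attention score depends only on the relative offset (2 in both cases).
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 10), rope(k, 8))
print(near, far)
```

The two scores agree because rotating the query and key by their absolute positions leaves an inner product that is a function of the position difference alone, which is why RoPE is described as encoding relative timing.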

major comments (2)

[Abstract / Experiments] Abstract and experimental section: the central claim of outperformance rests on three scalar average accuracies (63.74%, 66.24%, 64.22%) reported without standard deviations, number of independent runs, confidence intervals, or hypothesis tests. In AMR, where training stochasticity and data ordering materially affect results, point estimates alone do not establish that the observed margins are outside normal experimental variation.
[§3 (Method)] Method description (gating and resampling): the paper asserts that the input-dependent gating plus temporal resampling adapts more effectively to dynamic temporal variations than static multi-scale strategies, yet provides no ablation that isolates the contribution of the gating network versus the resampling operation or versus a standard multi-head attention baseline with the same resampling.

minor comments (2)

[Abstract] Abstract contains the typographical error 'tem poral' (should be 'temporal').
[§4 (Experiments)] Training protocol, optimizer settings, learning-rate schedule, batch size, and exact baseline re-implementations are not described at a level that permits reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the changes planned for the revised manuscript.

Referee: [Abstract / Experiments] Abstract and experimental section: the central claim of outperformance rests on three scalar average accuracies (63.74%, 66.24%, 64.22%) reported without standard deviations, number of independent runs, confidence intervals, or hypothesis tests. In AMR, where training stochasticity and data ordering materially affect results, point estimates alone do not establish that the observed margins are outside normal experimental variation.

Authors: We agree that point estimates alone are insufficient to demonstrate statistical reliability. In the revised manuscript we will rerun all experiments with at least five independent random seeds, report mean accuracy together with standard deviation and 95% confidence intervals for each method, and include paired statistical tests (e.g., t-tests) to assess whether the observed margins exceed experimental variation.
Revision: yes
Referee: [§3 (Method)] Method description (gating and resampling): the paper asserts that the input-dependent gating plus temporal resampling adapts more effectively to dynamic temporal variations than static multi-scale strategies, yet provides no ablation that isolates the contribution of the gating network versus the resampling operation or versus a standard multi-head attention baseline with the same resampling.

Authors: We concur that targeted ablations would strengthen the methodological claims. The revised version will add an ablation table that compares (i) the full MoEformer, (ii) a fixed-gating variant, (iii) a single-scale (no-resampling) variant, and (iv) a standard Transformer encoder that receives the same resampled inputs but uses conventional multi-head attention. These results will isolate the incremental benefit of input-dependent routing and of the resampling step.
Revision: yes
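
The paired test promised in the first response can be sketched as follows, assuming both methods are trained with the same seeds so that runs pair up one-to-one. The per-seed accuracies are placeholders invented for illustration, not values from the paper, and `paired_t_statistic` is a hypothetical helper name. The same comparison applies to any pair of rows in the proposed ablation table.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-seed differences between two methods.

    Requires at least two pairs and non-identical differences, otherwise the
    sample standard deviation is undefined or zero.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per seed for each method")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Placeholder accuracies (percent) for five shared seeds.
full_model = [61.0, 62.0, 61.5, 62.5, 61.0]
ablated = [60.5, 61.0, 61.0, 61.5, 60.0]

t_stat = paired_t_statistic(full_model, ablated)
print(f"t = {t_stat:.3f}, significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

Pairing by seed removes the variance shared by both methods on a given run, so a small but consistent margin can still be detected with only five runs.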

Circularity Check

0 steps flagged

No significant circularity; empirical results only

The manuscript proposes MoEformer as an architecture for AMR and reports empirical accuracies on RadioML benchmarks. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Claims rest on experimental outcomes rather than fitted parameters renamed as predictions or self-citation load-bearing theorems. This is the normal case for an applied neural-network paper with no mathematical reduction steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the model name itself; the gating mechanism and expert views are described at a high level without numerical fitting details.

invented entities (1)

MoEformer (no independent evidence)
purpose: Adaptive multi-scale fusion for I/Q-based AMR via dynamic expert gating
New model name and architecture introduced to address static fusion limitations


[1] [1]

Deep learning at the physical layer: System challenges and applications to 5g and beyond,

F. Restuccia and T. Melodia, “Deep learning at the physical layer: System challenges and applications to 5g and beyond,”IEEE Communications Magazine, vol. 58, no. 10, pp. 58–64, 2020

2020

[2] [2]

Signal identification for multiple-antenna wireless systems: Achievements and challenges,

Y . A. Eldemerdash, O. A. Dobre, and M. ¨Oner, “Signal identification for multiple-antenna wireless systems: Achievements and challenges,”IEEE Commun. Surveys Tuts., vol. 18, no. 3, pp. 1524–1551, 2016

2016

[3] [3]

Shared spectrum monitoring using deep learning,

F. A. Bhatti, M. J. Khan, and A. Selim, “Shared spectrum monitoring using deep learning,”IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 4, pp. 1172–1185, 2021

2021

[4] [4]

End-to-end learning from spectrum data: A deep learning approach for wireless signal identification in spectrum monitoring applications,

M. Kulin, T. Kazaz, I. Moerman, and E. De Poorter, “End-to-end learning from spectrum data: A deep learning approach for wireless signal identification in spectrum monitoring applications,”IEEE Access, vol. 6, pp. 18 484–18 501, 2018

2018

[5] [5]

Survey of automatic modulation classification techniques: Classical approaches and new trends,

O. A. Dobre, A. Abdi, Y . Bar-Ness, and W. Su, “Survey of automatic modulation classification techniques: Classical approaches and new trends,”IET Commun., vol. 1, no. 2, pp. 137–156, 2007

2007

[6] [6]

Maximum-likelihood classification of digital amplitude-phase modulated signals in flat fading non-Gaussian channels,

V . G. Chavali and C. R. Da Silva, “Maximum-likelihood classification of digital amplitude-phase modulated signals in flat fading non-Gaussian channels,”IEEE Trans. Commun., vol. 59, no. 8, pp. 2051–2056, Aug. 2011

2051

[7] [7]

Algorithms for automatic modulation recognition of communication signals,

A. K. Nandi and E. E. Azzouz, “Algorithms for automatic modulation recognition of communication signals,”IEEE Trans. Commun., vol. 46, no. 4, pp. 431–436, Apr. 1998

1998

[8] [8]

Learning to short-time Fourier transform in spectrum sensing,

L. Zhou, Z. Sun, and W. Wang, “Learning to short-time Fourier transform in spectrum sensing,”Phys. Commun., vol. 25, pp. 420–425, 2017

2017

[9] [9]

Automatic modulation recognition using deep learning architectures,

M. Zhang, Y . Zeng, Z. Han, and Y . Gong, “Automatic modulation recognition using deep learning architectures,” inProc. IEEE 19th Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), 2018, pp. 1–5

2018

[10] [10]

Automatic modulation recognition of digital signals using wavelet features and SVM,

C. Park, J. Choi, S. Nah, W. Jang, and D. Y . Kim, “Automatic modulation recognition of digital signals using wavelet features and SVM,” inProc. 10th Int. Conf. Adv. Commun. Technol., 2008, pp. 387–390. 13

2008

[11] S. Peng, S. Sun, and Y.-D. Yao, “A survey of modulation classification using deep learning: Signal representation and data preprocessing,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 12, pp. 7020–7038, Dec. 2022.

[12] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.

[13] T. J. O’Shea and N. West, “Radio machine learning dataset generation with GNU Radio,” in Proc. GNU Radio Conf., 2016, pp. 1–6.

[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.

[15] N. E. West and T. J. O’Shea, “Deep architectures for modulation recognition,” in Proc. IEEE Int. Symp. Dyn. Spectr. Access Netw. (DySPAN), Baltimore, MD, USA, 2017, pp. 1–6.

[16] F. Meng, P. Chen, L. Wu, and X. Wang, “Automatic modulation classification: A deep learning enabled approach,” IEEE Trans. Veh. Technol., vol. 67, no. 11, pp. 10 760–10 772, Nov. 2018.

[17] T. J. O’Shea, T. Roy, and T. C. Clancy, “Over-the-air deep learning based radio signal classification,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 168–179, Feb. 2018.

[18] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Convolutional radio modulation recognition networks,” in Proc. Eng. Appl. Neural Netw. (EANN), Aberdeen, U.K., 2016, pp. 213–226.

[19] Y. Wang, M. Liu, J. Yang, and G. Gui, “Data-driven deep learning for automatic modulation recognition in cognitive radios,” IEEE Trans. Veh. Technol., vol. 68, no. 4, pp. 4074–4077, 2019.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2012, pp. 1097–1105.

[21] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 1–9.

[22] S. Peng et al., “Modulation classification based on signal constellation diagrams and deep learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 718–727, Mar. 2019.

[23] S. Lin, Y. Zeng, and Y. Gong, “Modulation recognition using signal enhancement and multistage attention mechanism,” IEEE Trans. Wireless Commun., vol. 21, no. 11, pp. 9921–9935, Nov. 2022.

[24] T. Chen, S. Zheng, K. Qiu, L. Zhang, Q. Xuan, and X. Yang, “Augmenting radio signals with wavelet transform for deep learning-based modulation recognition,” IEEE Trans. Cogn. Commun. Netw., vol. 10, no. 6, pp. 2029–2044, Dec. 2024.

[25] W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” in Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016, pp. 4905–4913.

[26] J. Xu, C. Luo, G. Parr, and Y. Luo, “A spatiotemporal multi-channel learning framework for automatic modulation recognition,” IEEE Wireless Commun. Lett., vol. 9, no. 10, pp. 1629–1632, Oct. 2020.

[27] D. Hong, Z. Zhang, and X. Xu, “Automatic modulation classification using recurrent neural networks,” in Proc. IEEE Int. Conf. Comput. Commun. (ICCC), 2017, pp. 695–700.

[28] S. Rajendran, W. Meert, D. Giustiniano, V. Lenders, and S. Pollin, “Deep learning models for wireless signal classification with distributed low-cost spectrum sensors,” IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 3, pp. 433–445, Sep. 2018.

[29] C. Zhou et al., “A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT,” 2023, arXiv:2302.09419.

[30] Y. S. Chuang, C. L. Liu, and H. Y. Lee, “SpeechBERT: Cross-modal pre-trained language model for end-to-end spoken question answering,” in Proc. Interspeech, 2020, pp. 4168–4172.

[31] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021, pp. 1–22.

[32] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 6000–6010.

[33] Z. Zhang, C. Wang, C. Gan, S. Sun, and M. Wang, “Automatic modulation classification using convolutional neural network with features fusion of SPWVD and BJD,” IEEE Trans. Signal Inf. Process. Netw., vol. 5, no. 3, pp. 469–478, Sep. 2019.

[34] X. Wang, D. Liu, Y. Zhang, Y. Li, and S. Wu, “A spatiotemporal multi-stream learning framework based on attention mechanism for automatic modulation recognition,” Digital Signal Process., vol. 130, p. 103703, 2022.

[35] S. Ansari, S. Mahmoud, S. Majzoub, E. Almajali, A. Jarndal, and T. Bonny, “A novel LSTM architecture for automatic modulation recognition: Comparative analysis with conventional machine learning and RNN-based approaches,” IEEE Access, vol. 13, pp. 72 526–72 543, 2025.

[36] S. Hamidi-Rad and S. Jain, “MCformer: A transformer based deep neural network for automatic modulation classification,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), Madrid, Spain, 2021, pp. 1–6.

[37] J. Zhang, S. An, F. Meng, and Q. Liu, “MST: A multi-scale transformer framework with cross-scale token fusion for automatic modulation recognition,” IEEE Wireless Commun. Lett., vol. 14, no. 12, pp. 4112–4116, 2025.

[38] Z. Zhang, H. Luo, C. Wang, C. Gan, and Y. Xiang, “Automatic modulation classification using CNN-LSTM based dual-stream structure,” IEEE Trans. Veh. Technol., vol. 69, no. 11, pp. 13 521–13 531, 2020.

[39] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 120, pp. 1–39, 2022.

[40] S. Mu and S. Lin, “A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications,” arXiv preprint arXiv:2503.07137, 2026.

[41] W. Cai, J. Jiang, F. Wang, J. Tang, S. Kim, and J. Huang, “A survey on mixture of experts in large language models,” IEEE Trans. Knowl. Data Eng., pp. 1–20, 2025.

[42] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in Proc. Int. Conf. Learn. Represent. (ICLR), Toulon, France, 2017.

[43] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” arXiv preprint arXiv:2104.09864, 2021.

[44] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, 2017, pp. 1251–1258.

[45] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int. Conf. Mach. Learn. (ICML), Lille, France, 2015, pp. 448–456.

[46] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.

[47] J. Zhang, T. Wang, Z. Feng, and S. Yang, “AMC-Net: An effective network for automatic modulation classification,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2023, pp. 1–5.

[48] Y. Chen, B. Dong, C. Liu, W. Xiong, and S. Li, “Abandon locality: Frame-wise embedding aided transformer for automatic modulation recognition,” IEEE Commun. Lett., vol. 27, no. 1, pp. 327–331, Jan. 2023.

[49] Y. Qu, Z. Lu, R. Zeng, J. Wang, and J. Wang, “Enhancing automatic modulation recognition through robust global feature extraction,” arXiv preprint arXiv:2401.01056, 2024.

[50] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 86, pp. 2579–2605, 2008.