pith. sign in

arxiv: 2509.11419 · v3 · pith:LTFA52XYnew · submitted 2025-09-14 · 📡 eess.SP

Knowledge Distillation for Sensing-Assisted Long-Term Beam Tracking in mmWave Communications

Pith reviewed 2026-05-21 22:23 UTC · model grok-4.3

classification 📡 eess.SP
keywords knowledge distillationbeam trackingmmWave communicationssensing-assistedneural networkslong-term predictionmodel compression
0
0 comments X

The pith

A distilled lightweight student neural network matches a large teacher's accuracy in predicting current and future mmWave beams from visual sensor data while using over 1670 percent fewer parameters and 60 percent shorter sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that knowledge distillation can transfer the ability to track optimal beams over multiple future time slots from a large attention-enhanced neural network to a compact student model in millimeter-wave systems that use infrastructure-mounted visual sensors. The teacher processes image sequences with convolutional feature extraction and gated recurrent units plus attention to capture temporal patterns and select beams from a codebook. The resulting student achieves comparable top-5 accuracies above 93 percent for the current slot and six future slots yet requires dramatically fewer parameters and lower complexity. A sympathetic reader would care because this directly addresses the high computational cost and latency barriers that currently limit practical deployment of sensing-assisted beamforming in real wireless systems. If the claim holds, it points to more efficient, lower-power operation without sacrificing long-term prediction performance.

Core claim

The central claim is that knowledge distillation from a large teacher network—which combines a convolutional neural network for compact image feature extraction with gated recurrent units and attention to model temporal dependencies in visual observation sequences—trains a lightweight student network that closely matches the teacher's top-5 beam prediction accuracies exceeding 93 percent for the current and six future time slots, while reducing model parameters by over 1670 percent, cutting complexity by over 450 percent, and operating on 60 percent shorter input sequences.

What carries the argument

Knowledge distillation in which a large attention-enhanced neural network teacher (convolutional feature extractor plus gated recurrent units with attention) guides training of a compact student network that preserves long-term beam prediction from shorter visual sequences.

If this is right

  • The teacher network itself reaches state-of-the-art-level accuracy with a 90 percent reduction in parameters compared to prior models.
  • The student retains long-term beam prediction capability despite the large cuts in size and input length.
  • The overall framework improves data efficiency while lowering latency and power consumption for both sensing and processing stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation approach could be tested with alternative sensors such as radar to improve robustness when visual data is obstructed.
  • Shorter input sequences open the door to real-time updates on edge hardware that cannot buffer long histories.
  • The parameter savings suggest the method could scale to multi-user or rapidly changing scenarios without proportional increases in compute.
  • Hardware measurements on actual mmWave testbeds would be needed to confirm that the complexity reductions translate into measurable power and latency gains.

Load-bearing premise

Visual observations from infrastructure sensors remain sufficiently informative and aligned with actual communication channel states in the tested scenarios and the reported accuracies will hold beyond the specific simulation or dataset.

What would settle it

Deploy the student model on a fresh dataset collected from a different physical environment or with real rather than simulated camera footage and check whether top-5 accuracy for the six future slots drops below 90 percent.

Figures

Figures reproduced from arXiv: 2509.11419 by A. Lee Swindlehurst, Markku Juntti, Mengyuan Ma, Nhan Thanh Nguyen, Nir Shlezinger, Yonina C. Eldar.

Figure 1
Figure 1. Figure 1: Illustration of the considered system model. The BS s [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of the ML model. GRU Embeding Prediction Student model (173K) GRU Multihead attention module Embeding Prediction Teacher model (1788K) BatchNorm2d Cov2d ReLU MaxPool2d FC layers LayerNorm AdaptiveMaxPool2d Elementwise addition [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the teacher and student model struct [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Scenario 9 of the DeepSense 6G dataset. 0 0 0 0 0 0 0  000 00 00 00 00 00     0  0     [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Statistics of the optimal beam index in the considere [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of data preprocessing with L = 8. (a): The original sequence of 8 images. (b): The sequence of 7 difference images (c): The sequence of 7 motion masks. TABLE II. Overall generalization performance of the teacher model in % for ATop-k accuracy and ADBA score. Metric W/o MHA With MHA Self-KD Optimal [56] Test loss 1.141 1.050 1.016 0.8158 ATop-1 40.77 42.91 44.94 50.20 ATop-3 77.44 79.97 81.45 8… view at source ↗
Figure 7
Figure 7. Figure 7: Performance of the teacher and student models with [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Top-3 and Top-5 prediction accuracy of the student model. The teacher models are trained and tested under L = 8. TABLE IV. Values of β and Γ for KD-aided learning. Term Teacher (self-KD) Student L = 8 L = 5 L = 3 β 0.3 0.3 0.5 0.5 Γ 2 5 3 4 corresponding values for the latter are 90.31% and 88.32%, respectively. Third, with the aid of KD, the student model can achieve performance comparable to the vanilla … view at source ↗
Figure 9
Figure 9. Figure 9: DBA score of the student model with L = 3, 5. The teacher models are trained and tested under L = 8. gaps in prediction accuracy compared to the teacher when implemented without KD-aided learning. In contrast, with KD, the student model achieves close to 95% Top-5 accuracy, which even surpasses the vanilla teacher without self-KD at the future time slots t5 and t6. Although the Top-3 accuracy of the teache… view at source ↗
Figure 10
Figure 10. Figure 10: Performance of the teacher and student models. [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Generalization performance of the teacher model (w [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗
read the original abstract

Infrastructure-mounted sensors can capture rich environmental information to enhance communications and facilitate beamforming in millimeter-wave systems. This work presents an efficient sensing-assisted long-term beam tracking framework that selects optimal beams from a codebook for current and multiple future time slots. We first design a large attention-enhanced neural network (NN) to fully exploit past visual observations for beam tracking. A convolutional NN extracts compact image features, while gated recurrent units with attention capture the temporal dependencies within sequences. The large NN then acts as the teacher to guide the training of a lightweight student NN via knowledge distillation. The student requires shorter input sequences yet preserves long-term beam prediction ability. Numerical results demonstrate that the teacher achieves Top-5 accuracies exceeding 93% for current and six future time slots, approaching state-of-the-art performance with a 90% reduction of model parameters. The student closely matches the teacher's performance while reducing the number of model parameters by over 1670% and cutting complexity by over 450%, despite operating with 60% shorter input sequences. This improvement significantly enhances data efficiency, reduces latency, and reduces power consumption in sensing and processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a knowledge-distillation framework for sensing-assisted long-term beam tracking in mmWave systems. A large teacher network combines a CNN for extracting features from infrastructure visual observations with attention-enhanced GRUs to predict optimal beams from a codebook for the current slot and six future slots. The teacher then distills knowledge to a lightweight student network that operates on 60% shorter input sequences while aiming to retain comparable long-term prediction performance. Numerical results report teacher Top-5 accuracies above 93% with 90% parameter reduction relative to prior work, and student performance that closely matches the teacher with over 1670% parameter reduction and over 450% complexity reduction.

Significance. If the empirical claims hold, the work shows that distillation can compress long-range temporal modeling from an attention-based teacher into a compact student, enabling lower-latency and lower-power beam tracking in practical mmWave deployments that use visual sensors. The reported scale of parameter and complexity savings, together with preservation of multi-slot accuracy, would be a concrete contribution to efficient sensing-assisted communications.

major comments (2)
  1. [Numerical Results / Experiments] The central claim that distillation successfully transfers long-term beam-tracking ability to the student despite 60% shorter sequences is load-bearing for the abstract and results. No ablation is presented that isolates the effect of the distillation loss from the convolutional feature extractor when both models are restricted to the shorter input length; without this comparison it remains unclear whether the student's future-slot accuracy stems from transferred temporal knowledge or from the CNN features alone.
  2. [System Model and Evaluation Setup] The evaluation assumes that visual observations remain sufficiently aligned with channel evolution across the tested mobility regimes. The manuscript does not report results under higher-mobility scenarios or quantify degradation of visual-channel correlation over the six future slots; such a test would directly address whether the student's compressed representation continues to encode future dynamics when attention mechanisms are removed.
minor comments (2)
  1. [Proposed Method] The description of the knowledge-distillation loss (temperature, weighting between soft and hard targets) is not fully specified; adding the exact formulation and hyper-parameter values would improve reproducibility.
  2. [Numerical Results] Figure captions and axis labels for the accuracy-versus-sequence-length plots should explicitly state the number of Monte-Carlo runs and whether error bars represent standard deviation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below, proposing revisions where appropriate to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Numerical Results / Experiments] The central claim that distillation successfully transfers long-term beam-tracking ability to the student despite 60% shorter sequences is load-bearing for the abstract and results. No ablation is presented that isolates the effect of the distillation loss from the convolutional feature extractor when both models are restricted to the shorter input length; without this comparison it remains unclear whether the student's future-slot accuracy stems from transferred temporal knowledge or from the CNN features alone.

    Authors: We agree that an ablation isolating the distillation loss would clarify whether the student's performance is due to transferred knowledge or the CNN features. We will add this ablation to the revised manuscript by training a baseline student without distillation on the short sequences and comparing its long-term prediction accuracy to the distilled student. This will provide direct evidence for the effectiveness of the knowledge distillation in transferring temporal modeling capabilities. revision: yes

  2. Referee: [System Model and Evaluation Setup] The evaluation assumes that visual observations remain sufficiently aligned with channel evolution across the tested mobility regimes. The manuscript does not report results under higher-mobility scenarios or quantify degradation of visual-channel correlation over the six future slots; such a test would directly address whether the student's compressed representation continues to encode future dynamics when attention mechanisms are removed.

    Authors: Our evaluation is conducted on the dataset reflecting realistic mobility regimes for the considered mmWave scenarios. We recognize that higher-mobility tests could further validate robustness. Since the current dataset does not encompass extreme mobility cases, we will include in the revision an analysis of the visual-channel correlation over the six future slots using the existing data to quantify degradation. This will help assess the student's ability to encode future dynamics without attention mechanisms. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on independent simulation measurements

full rationale

The paper presents a sensing-assisted beam tracking framework using a large attention-enhanced teacher NN (CNN + GRU with attention) distilled to a lightweight student NN that uses 60% shorter sequences. Central claims consist of reported Top-5 accuracies (>93% for teacher, closely matched by student) obtained from numerical experiments on simulated mmWave scenarios. No mathematical derivation chain exists that reduces predictions or uniqueness results to fitted parameters by construction, nor does any load-bearing premise rely on self-citations whose content is unverified. The performance numbers are measured directly against ground-truth beam selections in the test set and remain falsifiable outside the training procedure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied machine-learning paper whose claims rest on standard supervised training assumptions and empirical validation rather than explicit axioms or new physical postulates.

pith-pipeline@v0.9.0 · 5749 in / 1034 out tokens · 45757 ms · 2026-05-21T22:23:08.690167+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Knowledge Distillation for Lightweight Multimodal Sensing-Aided mmWave Beam Tracking

    eess.SP 2026-04 conditional novelty 4.0

    Knowledge distillation creates a lightweight student model that reaches over 96% top-5 beam prediction accuracy on real multimodal sensor data while using 27 times fewer parameters than the teacher.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    The roa d towards 6G: A comprehensive survey,

    W. Jiang, B. Han, M. A. Habibi, and H. D. Schotten, “The roa d towards 6G: A comprehensive survey,” IEEE Open J. Commun. Soc. , vol. 2, pp. 334–366, 2021

  2. [2]

    Beam training and trac king in mmwave communication: A survey,

    W. Yi, W. Zhiqing, and F. Zhiyong, “Beam training and trac king in mmwave communication: A survey,” China Commun. , vol. 21, no. 6, pp. 1–22, 2024

  3. [3]

    Environment sema ntic communication: Enabling distributed sensing aided networ ks,

    S. Imran, G. Charan, and A. Alkhateeb, “Environment sema ntic communication: Enabling distributed sensing aided networ ks,” IEEE Open J. Commun. Soc. , vol. 5, pp. 7767–7786, 2024

  4. [4]

    Integrated sensin g and communications (ISAC) for vehicular communication networ ks (VCN),

    X. Cheng, D. Duan, S. Gao, and L. Y ang, “Integrated sensin g and communications (ISAC) for vehicular communication networ ks (VCN),” IEEE Internet Things J. , vol. 9, no. 23, pp. 23 441–23 451, 2022

  5. [5]

    Artificial intelligence-empowered hybrid multiple-inpu t/multiple- output beamforming: Learning to optimize for high-through put scalable MIMO,

    N. Shlezinger, M. Ma, O. Lavi, N. T. Nguyen, Y . C. Eldar, an d M. Juntti, “Artificial intelligence-empowered hybrid multiple-inpu t/multiple- output beamforming: Learning to optimize for high-through put scalable MIMO,” IEEE V eh. Technol. Mag., vol. 19, no. 3, pp. 58–67, 2024

  6. [6]

    Model-based machine learning for Max-Min fairn ess beamforming design in JCAS systems,

    M. Ma, T. Fang, N. Shlezinger, A. Swindlehurst, M. Juntti , and N. Nguyen, “Model-based machine learning for Max-Min fairn ess beamforming design in JCAS systems,” in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process. , 2025

  7. [7]

    Hierarchical codeb ook design for beamforming training in millimeter-wave communicatio n,

    Z. Xiao, T. He, P . Xia, and X.-G. Xia, “Hierarchical codeb ook design for beamforming training in millimeter-wave communicatio n,” IEEE Trans. Wireless Commun. , vol. 15, no. 5, pp. 3380–3392, 2016

  8. [8]

    Explore and eliminate: Optimized two-stage search for millimeter- wave beam alignment,

    M. Li, C. Liu, S. V . Hanly, I. B. Collings, and P . Whiting, “ Explore and eliminate: Optimized two-stage search for millimeter- wave beam alignment,” IEEE Trans. Wireless Commun. , vol. 18, no. 9, pp. 4379–4393, 2019

  9. [9]

    Hierarchical cod ebook- based multiuser beam training for millimeter wave massive M IMO,

    C. Qi, K. Chen, O. A. Dobre, and G. Y . Li, “Hierarchical cod ebook- based multiuser beam training for millimeter wave massive M IMO,” IEEE Trans. Wireless Commun. , vol. 19, no. 12, pp. 8142–8152, 2020

  10. [10]

    Robust be am- tracking for mmWave mobile communications,

    S. Jayaprakasam, X. Ma, J. W. Choi, and S. Kim, “Robust be am- tracking for mmWave mobile communications,” IEEE Commun. Lett. , vol. 21, no. 12, pp. 2654–2657, 2017

  11. [11]

    Beam tracking under hig hly nonlinear mobile millimeter-wave channel,

    J. Lim, H.-M. Park, and D. Hong, “Beam tracking under hig hly nonlinear mobile millimeter-wave channel,” IEEE Commun. Lett. , vol. 23, no. 3, pp. 450–453, 2019

  12. [12]

    Robust adaptive beam tracking for mobile millimet re wave communications,

    C. Liu, M. Li, L. Zhao, P . Whiting, S. V . Hanly, I. B. Colli ngs, and M. Zhao, “Robust adaptive beam tracking for mobile millimet re wave communications,” IEEE Trans. Wireless Commun. , vol. 20, no. 3, pp. 1918–1934, 2020

  13. [13]

    Zhang, A

    D. Zhang, A. Li, M. Shirvanimoghaddam, P . Cheng, Y . Li, a nd B. Vucetic, “Codebook-based training beam sequence design for 13 Time slot DBA score (%) Fig. 11. Generalization performance of the teacher model (w ith self-KD). The teacher model is trained for L = 8 and tested for L = 5, 3. In contrast, the student models for comparison are trained and test...

  14. [14]

    Beam allocation for millimeter-wave MIMO tracking system s,

    D. Zhang, A. Li, H. Chen, N. Wei, M. Ding, Y . Li, and B. Vuce tic, “Beam allocation for millimeter-wave MIMO tracking system s,” IEEE Trans. V eh. Technol., vol. 69, no. 2, pp. 1595–1611, 2020

  15. [15]

    Closed-form hybrid b eamforming solution for spectral efficiency upper bound maximization i n mmWave MIMO-OFDM systems,

    M. Ma, N. T. Nguyen, and M. Juntti, “Closed-form hybrid b eamforming solution for spectral efficiency upper bound maximization i n mmWave MIMO-OFDM systems,” in Proc. IEEE V eh. Technol. Conf. , 2021

  16. [16]

    Deep learning coordinated beamforming for highly-mobile milli meter wave systems,

    A. Alkhateeb, S. Alex, P . V arkey, Y . Li, Q. Qu, and D. Tujk ovic, “Deep learning coordinated beamforming for highly-mobile milli meter wave systems,” IEEE Access , vol. 6, pp. 37 328–37 348, 2018

  17. [17]

    D esign and implementation for deep learning based adjustable beam forming training for millimeter wave communication systems,

    L.-H. Shen, T.-W. Chang, K.-T. Feng, and P .-T. Huang, “D esign and implementation for deep learning based adjustable beam forming training for millimeter wave communication systems,” IEEE Trans. V eh. Technol., vol. 70, no. 3, pp. 2413–2427, 2021

  18. [18]

    Model-driven dee p learning assisted beam tracking for millimeter-wave systems,

    J. Liu, X. Li, T. Fan, S. Lv, and M. Shi, “Model-driven dee p learning assisted beam tracking for millimeter-wave systems,” IEEE Commun. Lett., vol. 26, no. 10, pp. 2345–2349, 2022

  19. [19]

    Fast MIMO beamfo rming via deep reinforcement learning for high mobility mmWave conne ctivity,

    M. Fozi, A. R. Sharafat, and M. Bennis, “Fast MIMO beamfo rming via deep reinforcement learning for high mobility mmWave conne ctivity,” IEEE J. Sel. Areas Commun. , vol. 40, no. 1, pp. 127–142, 2022

  20. [20]

    Continuous-time m mWave beam prediction with ODE-LSTM learning architecture,

    K. Ma, F. Zhang, W. Tian, and Z. Wang, “Continuous-time m mWave beam prediction with ODE-LSTM learning architecture,” IEEE Wireless Commun. Lett. , vol. 12, no. 1, pp. 187–191, 2023

  21. [21]

    Deep unfolding hybrid beamfo rming designs for THz massive MIMO systems,

    N. T. Nguyen, M. Ma, O. Lavi, N. Shlezinger, Y . C. Eldar, A . L. Swindlehurst, and M. Juntti, “Deep unfolding hybrid beamfo rming designs for THz massive MIMO systems,” IEEE Trans. Signal Process. , vol. 71, pp. 3788–3804, 2023

  22. [22]

    Multimodal fusio n assisted mmwave beam training in dual-model networks,

    J. Liu, X. Li, T. Fan, S. Lv, and M. Shi, “Multimodal fusio n assisted mmwave beam training in dual-model networks,” IEEE Trans. V eh. Technol., vol. 73, no. 1, pp. 995–1011, 2024

  23. [23]

    In tegrated sensing and communication-enabled predictive beamformin g with deep learning in vehicular networks,

    J. Mu, Y . Gong, F. Zhang, Y . Cui, F. Zheng, and X. Jing, “In tegrated sensing and communication-enabled predictive beamformin g with deep learning in vehicular networks,” IEEE Commun. Lett. , vol. 25, no. 10, pp. 3301–3304, 2021

  24. [24]

    Learning-based predictive beamforming for integrated se nsing and communication in vehicular networks,

    C. Liu, W. Y uan, S. Li, X. Liu, H. Li, D. W. K. Ng, and Y . Li, “Learning-based predictive beamforming for integrated se nsing and communication in vehicular networks,” IEEE J. Sel. Areas Commun. , vol. 40, no. 8, pp. 2317–2334, 2022

  25. [25]

    LiDAR aided futu re beam prediction in real-world millimeter wave V2I communicatio ns,

    S. Jiang, G. Charan, and A. Alkhateeb, “LiDAR aided futu re beam prediction in real-world millimeter wave V2I communicatio ns,” IEEE Wireless Commun. Lett. , vol. 12, no. 2, pp. 212–216, 2023

  26. [26]

    LID AR data for deep learning-based mmWave beam-selection,

    A. Klautau, N. Gonz´ alez-Prelcic, and R. W. Heath, “LID AR data for deep learning-based mmWave beam-selection,” IEEE Wireless Commun. Lett. , vol. 8, no. 3, pp. 909–912, 2019

  27. [27]

    Radar aided 6G beam predi ction: Deep learning algorithms and real-world demonstration,

    U. Demirhan and A. Alkhateeb, “Radar aided 6G beam predi ction: Deep learning algorithms and real-world demonstration,” i n Proc. IEEE Wireless Commun. and Networking Conf. , 2022

  28. [28]

    Millimeter wave V2V beam tracking using radar: Algorithms and real-world demonstra tion,

    H. Luo, U. Demirhan, and A. Alkhateeb, “Millimeter wave V2V beam tracking using radar: Algorithms and real-world demonstra tion,” in Proc. European Signal Proc. Conf. , 2023

  29. [29]

    Environment s emantics aided wireless communications: A case study of mmWave beam prediction and blockage prediction,

    Y . Y ang, F. Gao, X. Tao, G. Liu, and C. Pan, “Environment s emantics aided wireless communications: A case study of mmWave beam prediction and blockage prediction,” IEEE J. Sel. Areas Commun. , vol. 41, no. 7, pp. 2025–2040, 2023

  30. [30]

    Vision-position multi-modal beam prediction using real m illimeter wave datasets,

    G. Charan, T. Osman, A. Hredzak, N. Thawdar, and A. Alkha teeb, “Vision-position multi-modal beam prediction using real m illimeter wave datasets,” in Proc. IEEE Wireless Commun. and Networking Conf., 2022

  31. [31]

    3D scene-based b eam selection for mmWave communications,

    W. Xu, F. Gao, S. Jin, and A. Alkhateeb, “3D scene-based b eam selection for mmWave communications,” IEEE Wireless Commun. Lett. , vol. 9, no. 11, pp. 1850–1854, 2020

  32. [32]

    Intelligent multi-modal sen sing- communication integration: Synesthesia of machines,

    X. Cheng, H. Zhang, J. Zhang, S. Gao, S. Li, Z. Huang, L. Ba i, Z. Y ang, X. Zheng, and L. Y ang, “Intelligent multi-modal sen sing- communication integration: Synesthesia of machines,” IEEE Commun. Surveys Tuts., vol. 26, no. 1, pp. 258–301, 2024

  33. [33]

    Se nsing- assisted high reliable communication: A transformer-base d beamforming approach,

    Y . Cui, J. Nie, X. Cao, T. Y u, J. Zou, J. Mu, and X. Jing, “Se nsing- assisted high reliable communication: A transformer-base d beamforming approach,” IEEE J. Sel. Topics Signal Process. , vol. 18, no. 5, pp. 782–795, 2024

  34. [34]

    Deep quantum-transformer networks for multimodal beam pr ediction in ISAC systems,

    S. Tariq, B. E. Arfeto, U. Khalid, S. Kim, T. Q. Duong, and H. Shin, “Deep quantum-transformer networks for multimodal beam pr ediction in ISAC systems,” IEEE Internet Things J. , vol. 11, no. 18, pp. 29 387–29 401, 2024

  35. [35]

    Multimodal transformers for wireless communications: A case study in b eam prediction,

    Y . Tian, Q. Zhao, F. Boukhalfa, K. Wu, F. Bader et al. , “Multimodal transformers for wireless communications: A case study in b eam prediction,” arXiv preprint arXiv:2309.11811 , 2023

  36. [36]

    Multimodal d eep learning empowered millimeter-wave beam prediction,

    B. Shi, M. Li, M.-M. Zhao, M. Lei, and L. Li, “Multimodal d eep learning empowered millimeter-wave beam prediction,” in Proc. IEEE V eh. Technol. Conf., 2024

  37. [37]

    Resource-efficient beam pre- diction in mmwave communications with multimodal realistic simulation framework,

    Y . M. Park, Y . K. Tun, W. Saad, and C. S. Hong, “Resource-e fficient beam prediction in mmwave communications with multimodal r ealistic simulation framework,” arXiv preprint arXiv:2504.05187 , 2025

  38. [38]

    Multimodal deep learning-empowered beam prediction in fu ture THz ISAC systems,

    K. Zhang, W. Y u, H. He, S. Song, J. Zhang, and K. B. Letaief , “Multimodal deep learning-empowered beam prediction in fu ture THz ISAC systems,” arXiv preprint arXiv:2505.02381 , 2025

  39. [39]

    Advancing mu lti-modal beam prediction with cross-modal feature enhancement and d ynamic fusion mechanism,

    Q. Zhu, Y . Wang, W. Li, H. Huang, and G. Gui, “Advancing mu lti-modal beam prediction with cross-modal feature enhancement and d ynamic fusion mechanism,” IEEE Trans. Commun. , Early access, 2025

  40. [40]

    Distilling the Knowledge in a Neural Network

    G. Hinton, O. Vinyals, and J. Dean, “Distilling the know ledge in a neural network,” arXiv preprint arXiv:1503.02531 , 2015

  41. [41]

    Compacting deep neural networks for Internet of Things: Me thods and applications,

    K. Zhang, H. Ying, H.-N. Dai, L. Li, Y . Peng, K. Guo, and H. Y u, “Compacting deep neural networks for Internet of Things: Me thods and applications,” IEEE Internet Things J. , vol. 8, no. 15, pp. 11 935–11 959, 2021

  42. [42]

    Towards understanding knowl edge distillation,

    M. Phuong and C. Lampert, “Towards understanding knowl edge distillation,” in Proc. International Conference on Machine Learning . PMLR, 2019, pp. 5142–5151

  43. [43]

    Understanding and improving knowledge distillation,

    J. Tang, R. Shivanna, Z. Zhao, D. Lin, A. Singh, E. H. Chi, and S. Jain, “Understanding and improving knowledge distillation,” arXiv preprint arXiv:2002.03532, 2020

  44. [44]

    Self-disti llation amplifies regularization in Hilbert space,

    H. Mobahi, M. Farajtabar, and P . Bartlett, “Self-disti llation amplifies regularization in Hilbert space,” Advances in Neural Information Processing Systems, vol. 33, pp. 3351–3361, 2020

  45. [45]

    Mode l compression,

    C. Buciluˇ a, R. Caruana, and A. Niculescu-Mizil, “Mode l compression,” in Proc. 12th ACM SIGKDD International Conference on Knowledg e Discovery and Data Mining , 2006, pp. 535–541

  46. [46]

    Do deep nets really need to be deep?

    J. Ba and R. Caruana, “Do deep nets really need to be deep? ” Advances in Neural Information Processing Systems , vol. 27, 2014

  47. [47]

    Model compression via distillation and quantization

    A. Polino, R. Pascanu, and D. Alistarh, “Model compress ion via distillation and quantization,” arXiv preprint arXiv:1802.05668 , 2018

  48. [48]

    Knowledge distillatio n-aided end-to-end learning for linear precoding in multiuser MIMO downlink systems with finite-rate feedback,

    K. Kong, W.-J. Song, and M. Min, “Knowledge distillatio n-aided end-to-end learning for linear precoding in multiuser MIMO downlink systems with finite-rate feedback,” IEEE Trans. V eh. Technol., vol. 70, no. 10, pp. 11 095–11 100, 2021

  49. [49]

    Kno wledge- distillation-aided lightweight neural network for massiv e MIMO CSI feedback,

    H. Tang, J. Guo, M. Matthaiou, C.-K. Wen, and S. Jin, “Kno wledge- distillation-aided lightweight neural network for massiv e MIMO CSI feedback,” in Proc. IEEE V eh. Technol. Conf. , 2021

  50. [50]

    De fensive distillation-based adversarial attack mitigation method for channel estimation using deep learning models in next-generation w ireless networks,

    F. O. Catak, M. Kuzlu, E. Catak, U. Cali, and O. Guler, “De fensive distillation-based adversarial attack mitigation method for channel estimation using deep learning models in next-generation w ireless networks,” IEEE Access , vol. 10, pp. 98 191–98 203, 2022

  51. [51]

    Environment know ledge-aided massive MIMO feedback codebook enhancement using artificia l intelli- gence,

    J. Guo, C.-K. Wen, M. Chen, and S. Jin, “Environment know ledge-aided massive MIMO feedback codebook enhancement using artificia l intelli- gence,” IEEE Trans. Commun. , vol. 70, no. 7, pp. 4527–4542, 2022. 14

  52. [52]

    Knowledge dist illation- based semantic communications for multiple users,

    C. Liu, Y . Zhou, Y . Chen, and S.-H. Y ang, “Knowledge dist illation- based semantic communications for multiple users,” IEEE Trans. Wireless Commun., vol. 23, no. 7, pp. 7000–7012, 2023

  53. [53]

    Knowledge distillation based deep lear ning model for user equipment positioning in massive MIMO systems usin g flying reconfigurable intelligent surfaces,

    A. Al-Ahmadi, “Knowledge distillation based deep lear ning model for user equipment positioning in massive MIMO systems usin g flying reconfigurable intelligent surfaces,” IEEE Access , vol. 12, pp. 20 679–20 691, 2024

  54. [54]

    Lear ning efficient and accurate detectors with dynamic knowledge dis tillation in remote sensing imagery,

    Y . Zhang, Z. Y an, X. Sun, W. Diao, K. Fu, and L. Wang, “Lear ning efficient and accurate detectors with dynamic knowledge dis tillation in remote sensing imagery,” IEEE Trans. Geosci. Remote Sens. , vol. 60, pp. 1–19, 2021

  55. [55]

    Cross-m odal knowledge distillation for vision-to-sensor action recog nition,

    J. Ni, R. Sarbajna, Y . Liu, A. H. Ngu, and Y . Y an, “Cross-m odal knowledge distillation for vision-to-sensor action recog nition,” in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process. , 2022

  56. [56]

    Computer vision aided beam t racking in a real-world millimeter wave deployment,

    S. Jiang and A. Alkhateeb, “Computer vision aided beam t racking in a real-world millimeter wave deployment,” in Proc. IEEE Global Commun. Conf. W orkshop, 2022

  57. [57]

    YOLOv4: Optimal Speed and Accuracy of Object Detection

    A. Bochkovskiy, C.-Y . Wang, and H.-Y . M. Liao, “Y olov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020

  58. [58]

    Thomas and A

    M. Thomas and A. T. Joy, Elements of Information Theory . Wiley- Interscience, 2006

  59. [59]

    Fundamentals of recurrent neural net work (RNN) and long short-term memory (LSTM) network,

    A. Sherstinsky, “Fundamentals of recurrent neural net work (RNN) and long short-term memory (LSTM) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020

  60. [60]

    Attention is all you ne ed,

    A. V aswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jone s, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you ne ed,” Advances in Neural Information Processing Systems , vol. 30, 2017

  61. [61]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L . Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774 , 2023

  62. [62]

    DeepSeek-V3 Technical Report

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. De ng, C. Zhang, C. Ruan et al. , “Deepseek-v3 Technical Report,” arXiv preprint arXiv:2412.19437, 2024

  63. [63]

    Gate-variants of gated recurren t unit (GRU) neural networks,

    R. Dey and F. M. Salem, “Gate-variants of gated recurren t unit (GRU) neural networks,” in Proc. IEEE International Midwest Symp. on Circuits and Systems (MWSCAS) , 2017, pp. 1597–1600

  64. [64]

    Rethinking the Value of Network Pruning

    Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethin king the value of network pruning,” arXiv preprint arXiv:1810.05270 , 2018

  65. [65]

    A compre hensive survey on model quantization for deep neural networks in ima ge classification,

    B. Rokh, A. Azarpeyvand, and A. Khanteymoori, “A compre hensive survey on model quantization for deep neural networks in ima ge classification,” ACM Transactions on Intelligent Systems and Technology, vol. 14, no. 6, pp. 1–50, 2023

  66. [66]

    Does knowledge distillation really work?

    S. Stanton, P . Izmailov, P . Kirichenko, A. A. Alemi, and A. G. Wilson, “Does knowledge distillation really work?” Advances in Neural Information Processing Systems , vol. 34, pp. 6906–6919, 2021

  67. [67]

    Focal loss for dense object detection,

    T.-Y . Lin, P . Goyal, R. Girshick, K. He, and P . Doll´ ar, “ Focal loss for dense object detection,” in Proc. IEEE International Conf. on Computer Vision, 2017, pp. 2980–2988

  68. [68]

    Deepsense 6G: A large-scale real-world multi-modal sensing and communication dataset ,

    A. Alkhateeb, G. Charan, T. Osman, A. Hredzak, J. Morais , U. Demirhan, and N. Srinivas, “Deepsense 6G: A large-scale real-world multi-modal sensing and communication dataset ,” IEEE Commun. Mag. , vol. 61, no. 9, pp. 122–128, 2023

  69. [69]

    Multi-modal beam prediction challenge 2022 : Towards generalization,

    G. Charan, U. Demirhan, J. Morais, A. Behboodi, H. Pezes hki, and A. Alkhateeb, “Multi-modal beam prediction challenge 2022 : Towards generalization,” arXiv preprint arXiv:2209.07519 , 2022