pith. machine review for the scientific record.

arxiv: 2604.15016 · v2 · submitted 2026-04-16 · 💻 cs.LG

Recognition: 2 Lean theorem links

DLink: Distilling Layer-wise and Dominant Knowledge from EEG Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords EEG foundation models · knowledge distillation · model compression · layer routing · spectral alignment · compact student models · brain signal decoding

The pith

DLink distills EEG foundation models into compact students by routing layer-wise knowledge and aligning spectral features to cut parameters and latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

EEG foundation models deliver strong cross-subject and cross-task performance but impose high computational costs that limit deployment. The paper finds that task-discriminative information is distributed across intermediate layers rather than concentrated in the final representation, so direct distillation from a fixed teacher underutilizes available knowledge. DLink therefore trains a lightweight input-conditioned router to aggregate relevant teacher layers for each sample and applies magnitude-plus-phase spectral alignment to counteract compression-induced distortions. The resulting student internalizes the routed knowledge through a project-then-compress pathway and operates without the teacher or router at inference time. On four EEG benchmarks this yields improved compact models that remain competitive with lightweight baselines while substantially lowering parameter count, FLOPs, and CPU latency.

Core claim

DLink is a spectrally guided distillation framework that employs an input-conditioned layer router to aggregate task-adapted representations from multiple layers of an EEG foundation model and aligns magnitude and phase spectra to transfer distributed discriminative information into a compact student, which is trained once and then deployed independently of the teacher.

What carries the argument

Lightweight input-conditioned router that dynamically aggregates teacher layers per input, combined with magnitude-and-phase spectral alignment to mitigate distortion during knowledge transfer.
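The paper does not spell out the router's exact form here; a minimal numpy sketch of one plausible input-conditioned layer router — a softmax gate over a pooled summary of the input, mixing teacher layer features per sample — might look like this (all names and shapes are our assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                         # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def route_teacher_layers(x, layer_feats, W_gate):
    """Aggregate teacher layer features with input-conditioned weights.

    x           : (d_in,)    pooled summary of one input EEG sample
    layer_feats : (L, d_feat) one feature vector per teacher layer
    W_gate      : (L, d_in)  gating projection (the lightweight router)
    """
    weights = softmax(W_gate @ x)           # (L,) per-layer mixing weights
    target = weights @ layer_feats          # (d_feat,) routed teacher target
    return weights, target

rng = np.random.default_rng(0)
L, d_in, d_feat = 12, 64, 128
x = rng.standard_normal(d_in)
feats = rng.standard_normal((L, d_feat))
W = rng.standard_normal((L, d_in)) * 0.1
w, target = route_teacher_layers(x, feats, W)
print(w.sum(), target.shape)                # weights sum to 1; target is (128,)
```

Because the routed target is just a convex combination of teacher layers, the router itself can be discarded once the student has regressed onto these targets — consistent with the teacher-free inference claim.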

If this is right

  • Matched compact students achieve higher decoding accuracy on four EEG benchmarks than standard distillation baselines.
  • DLink students remain competitive with existing lightweight models while narrowing the gap to full fine-tuned EFMs.
  • Parameter count, FLOPs, and CPU-only inference latency drop substantially relative to retaining the original EFM.
  • The teacher and router are required only during training, enabling fully independent student inference.
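The efficiency claims reduce to arithmetic over layer shapes. A toy sketch of how parameter and multiply-accumulate counts are typically tallied for dense stacks (the widths here are illustrative, not the paper's teacher/student configurations):

```python
def dense_stack_cost(widths):
    """Parameters and multiply-accumulate FLOPs for a stack of dense layers.

    widths = [in, h1, ..., out]; each layer contributes weights plus bias.
    """
    params = sum(a * b + b for a, b in zip(widths, widths[1:]))
    flops = sum(a * b for a, b in zip(widths, widths[1:]))  # one MAC per weight
    return params, flops

# Illustrative sizes only -- not the actual EFM teacher or MiC student.
teacher_p, teacher_f = dense_stack_cost([512, 1024, 1024, 1024, 256])
student_p, student_f = dense_stack_cost([512, 128, 64, 256])
print(f"params: {teacher_p / student_p:.1f}x fewer, "
      f"FLOPs: {teacher_f / student_f:.1f}x fewer")
```

CPU latency does not follow mechanically from FLOPs (memory traffic and kernel overheads intervene), which is why the paper reports it as a separate measurement.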

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing-plus-spectral mechanism could be tested on other biosignal foundation models such as those for ECG or MEG.
  • Edge-device EEG applications might become practical once the distilled students reach real-time CPU speeds.
  • Spectral alignment may prove useful for preserving signal structure in distillation pipelines outside EEG.

Load-bearing premise

A lightweight input-conditioned router plus magnitude-and-phase spectral alignment can reliably aggregate and transfer task-discriminative information distributed across intermediate layers without introducing new distortions or requiring the teacher at inference time.
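The premise hinges on aligning magnitude and phase spectra without adding distortion. A plausible numpy sketch of such a loss — our reconstruction, not the paper's stated formulation — penalizes FFT magnitude mismatch directly and compares phases on the complex unit circle to avoid 2π wrap-around:

```python
import numpy as np

def spectral_alignment_loss(student_feat, teacher_feat):
    """Illustrative magnitude-plus-phase alignment loss.

    Both inputs are (T,) time-domain feature traces; the loss penalizes
    mismatch in rFFT magnitudes and in phases (compared as unit vectors,
    so -pi and +pi count as the same angle).
    """
    S = np.fft.rfft(student_feat)
    Tf = np.fft.rfft(teacher_feat)
    mag_loss = np.mean((np.abs(S) - np.abs(Tf)) ** 2)
    phase_loss = np.mean(
        np.abs(np.exp(1j * np.angle(S)) - np.exp(1j * np.angle(Tf))) ** 2
    )
    return mag_loss + phase_loss

t = np.linspace(0, 1, 256, endpoint=False)
teacher = np.sin(2 * np.pi * 10 * t)
print(spectral_alignment_loss(teacher, teacher))  # 0.0 for identical signals
print(spectral_alignment_loss(np.sin(2 * np.pi * 12 * t), teacher) > 0)
```

One caveat the premise glosses over: phase terms at near-zero-magnitude frequency bins are noise-dominated, so a real implementation would likely need magnitude weighting to avoid introducing exactly the distortions it is meant to remove.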

What would settle it

If a compact student trained with DLink shows no accuracy gain over a student distilled only from the teacher's final layer on a held-out EEG benchmark, the benefit of layer-wise routing and spectral alignment would be refuted.
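That falsification test could be run as a simple paired comparison across training seeds, using the same data splits for both students. A sketch with synthetic accuracies (hypothetical numbers, not results from the paper):

```python
import numpy as np

def paired_t_statistic(a, b):
    """Paired t-statistic for per-seed accuracy differences a - b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Synthetic per-seed accuracies: DLink-routed vs final-layer-only students.
routed     = [0.71, 0.73, 0.70, 0.72, 0.74]
last_layer = [0.68, 0.69, 0.67, 0.70, 0.69]
t = paired_t_statistic(routed, last_layer)
# A t-statistic near 0 over enough seeds would support the "no gain"
# refutation; a consistently large positive t supports layer-wise routing.
print(f"t = {t:.2f}")
```

A full ablation would also separate the two mechanisms — routing without spectral alignment and vice versa — since the stated refutation bundles them together.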

Figures

Figures reproduced from arXiv: 2604.15016 by Chenyu Liu, Fang Li, Jingyuan Wang, Junfeng Yao, Meiyan Xu, Xinliang Zhou, Yi Ding, Yong Li, Zhihao Jia, Ziyu Jia.

Figure 1. DLink vs. traditional KD. Unlike rigid last-layer distillation, DLink utilizes a dynamic Router for multi-layer aggregation and Spectral Transformation for efficient distillation into the EEG MiC student, optimizing the accuracy–parameter trade-off.
Figure 2. Overview of the DLink pipeline. The framework distills knowledge from a multi-layer EEG Foundation Model (FM) teacher into the EEG MiC student (comprising a Mimicker and a Compressor). A dynamic Router adaptively aggregates the hierarchical FM layers guided by PSD scores, while frequency-domain alignment of magnitude and phase ensures efficient EEG decoding.
Figure 3. EEG MiC architecture comprising a Mimicker, a Compressor, and an MLP Classifier.
Figure 4. Accuracy vs. efficiency on FACED. Bubble size and color …
Figure 5. Interpretability analysis of the Router policy on the …
read the original abstract

EEG foundation models (EFMs) achieve strong cross-subject and cross-task generalization through large-scale pretraining and downstream fine-tuning. Through empirical analysis, we observe that (i) task-adapted EFMs provide strong decoding performance but incur substantial overhead when retained as inference backbones, making knowledge distillation a natural route for optimizing compact students; and (ii) direct distillation from a fixed teacher representation underutilizes EFM knowledge, as task-discriminative information is distributed across intermediate layers rather than concentrated in the final layer. These observations motivate DLink (Distilling Layer-wise and Dominant Knowledge), a spectrally guided distillation framework with input-conditioned layer routing for transferring EFM knowledge into compact students. DLink uses a lightweight router to aggregate teacher layers for each input, and aligns magnitude and phase spectra to mitigate compression-induced spectral distortion in learned representations. The routed teacher knowledge is internalized by a project-then-compress student; the teacher and router are used only during training. Experiments on four EEG benchmarks show that DLink improves matched compact students and remains competitive with lightweight baselines, narrowing the gap to fine-tuned EFMs while substantially reducing parameters, FLOPs, and CPU-only inference latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DLink, a spectrally guided knowledge distillation framework for EEG foundation models (EFMs). Motivated by observations that task-adapted EFMs incur high inference overhead and that task-discriminative information is distributed across intermediate layers, DLink employs a lightweight input-conditioned router to aggregate teacher layers per input and aligns magnitude and phase spectra to mitigate compression-induced distortions. Knowledge is internalized by a project-then-compress student; the teacher and router are discarded at inference. The central claim is that DLink improves matched compact students, remains competitive with lightweight baselines, narrows the gap to fine-tuned EFMs, and substantially reduces parameters, FLOPs, and CPU-only inference latency across four EEG benchmarks.

Significance. If the empirical results hold under rigorous validation, DLink could meaningfully advance practical deployment of EEG foundation models in resource-constrained settings such as wearables or real-time BCI systems. The combination of input-conditioned layer routing and spectral alignment offers a targeted way to capture distributed knowledge without test-time teacher overhead, which is a relevant direction for efficient model compression in signal-processing domains.

major comments (2)
  1. [Abstract and Experimental Results] Abstract and Experimental Results: the central claim that 'Experiments on four EEG benchmarks show that DLink improves matched compact students...' is unsupported by any numerical results, error bars, ablation studies, or statistical tests in the provided manuscript text. This is load-bearing for the performance and efficiency assertions.
  2. [Method] Method section: the framework is described only at the component level with no equations, loss formulations, or hyperparameter details for the router or the magnitude-and-phase spectral alignment. This prevents assessment of whether the mechanism reliably aggregates distributed layer knowledge without introducing new distortions.
minor comments (2)
  1. The abstract is clear but would benefit from one or two key quantitative highlights (e.g., relative accuracy gain and parameter reduction) to better convey the empirical contribution.
  2. Add explicit descriptions of the four EEG benchmarks, the architectures of the matched compact students, and the lightweight baselines used for comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on empirical support and methodological detail. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additions.

read point-by-point responses
  1. Referee: [Abstract and Experimental Results] Abstract and Experimental Results: the central claim that 'Experiments on four EEG benchmarks show that DLink improves matched compact students...' is unsupported by any numerical results, error bars, ablation studies, or statistical tests in the provided manuscript text. This is load-bearing for the performance and efficiency assertions.

    Authors: We agree that the abstract claim would be more robust with explicit numerical backing. In the revised manuscript we will augment the abstract with key quantitative results (accuracy improvements, parameter/FLOP/latency reductions on the four benchmarks) and will ensure the experimental section prominently displays error bars from repeated runs, ablation studies isolating the router and spectral alignment, and statistical significance tests. These additions will directly substantiate the central claim using the existing experimental data. revision: yes

  2. Referee: [Method] Method section: the framework is described only at the component level with no equations, loss formulations, or hyperparameter details for the router or the magnitude-and-phase spectral alignment. This prevents assessment of whether the mechanism reliably aggregates distributed layer knowledge without introducing new distortions.

    Authors: We will expand the Method section with the precise mathematical formulations: the input-conditioned router's aggregation function and weighting, the magnitude and phase spectral alignment losses, the overall distillation objective, and the project-then-compress student procedure. All relevant hyperparameters for router training and spectral alignment will be listed. These details will allow assessment of layer aggregation reliability and any potential distortions. revision: yes
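Pending that revision, one plausible shape for the combined objective — our reconstruction from the component descriptions, not the authors' stated loss — would combine the task loss with a routed feature-matching term and the two spectral terms, where $w_l(x)$ are the router's input-conditioned weights, $g_\theta$ is the student's projection head, and the $\lambda_i$ are hypothetical balancing weights:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{task}}
  \;+\; \lambda_{1}\,\Big\| g_{\theta}(f_S) - \sum_{l=1}^{L} w_l(x)\, f_T^{(l)} \Big\|_2^2
  \;+\; \lambda_{2}\,\mathcal{L}_{\text{mag}}
  \;+\; \lambda_{3}\,\mathcal{L}_{\text{phase}}
```

Whatever the exact form, the teacher features $f_T^{(l)}$ and the weights $w_l(x)$ appear only inside training-time targets, which is what licenses dropping both at inference.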

Circularity Check

0 steps flagged

No significant circularity; empirically motivated framework without derivations or self-referential reductions

full rationale

The paper introduces DLink as a distillation framework motivated by two explicit empirical observations about EFM overhead and distributed layer knowledge. No equations, derivations, fitted parameters, or mathematical claims appear in the provided text. The router, spectral alignment, and project-then-compress student are described directly as design choices, with the teacher and router discarded at inference. Claims of improved efficiency and competitive performance rest on benchmark experiments rather than any reduction to inputs by construction, self-citation chains, or renamed known results. The argument therefore depends on external validation through the reported results rather than on internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical method proposal with no explicit mathematical derivation, free parameters, or invented entities stated in the abstract; it relies on standard assumptions of neural-network training and knowledge distillation.

pith-pipeline@v0.9.0 · 5532 in / 1116 out tokens · 53223 ms · 2026-05-12T00:44:53.005997+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
