pith. machine review for the scientific record.

arxiv: 2605.01732 · v1 · submitted 2026-05-03 · 💻 cs.CL

Recognition: unknown

EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords knowledge distillation · token-level adaptation · entropy-guided learning · large language models · model compression · curriculum learning · adaptive temperature

The pith

Entropy from the teacher model dynamically adjusts token-level curriculum, temperature, and distillation branches to improve knowledge transfer to smaller student models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are too big for many uses, so knowledge distillation tries to compress their abilities into smaller students, but standard methods treat every token the same even though some tokens matter more for decisions. This paper claims that measuring the entropy of the teacher's output probabilities lets the training process adapt automatically: it starts with low-entropy tokens and moves to high-entropy ones, scales the distillation temperature per token to match teacher confidence, and switches between simple logit matching for easy tokens and richer feature matching for hard ones. The result is supposed to be more efficient and effective transfer because the student spends its capacity where the teacher is actually uncertain rather than wasting it on obvious cases. If the approach works, compact models could close more of the performance gap with their oversized teachers while using less training compute.
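
To make the signal concrete, here is a minimal PyTorch sketch of the per-token entropy the method keys on. The function name and tensor shapes are illustrative; the paper's abstract provides no code.

```python
import torch
import torch.nn.functional as F

def token_entropy(teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy of the teacher's next-token distribution.

    teacher_logits: [batch, seq_len, vocab] raw logits from the teacher.
    Returns [batch, seq_len] entropies in nats: high values mark tokens
    where the teacher is uncertain, low values mark "easy" tokens.
    """
    log_probs = F.log_softmax(teacher_logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)
```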

Core claim

We propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher's output entropy to guide three aspects of distillation: a token-level curriculum by dynamically shifting focus from low- to high-entropy tokens during training, adjustment of the distillation temperature based on token entropy to better capture teacher confidence patterns, and a dual-branch architecture for efficient logits-only distillation on easy tokens and deeper feature-based distillation on difficult tokens.

What carries the argument

Teacher output entropy, which measures uncertainty in the next-token distribution and is used to adapt curriculum order, temperature scaling, and choice between logit-only and feature-based distillation branches for each token.
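
The abstract does not spell out the curriculum schedule, so the following is only one plausible reading of "dynamically shifting focus from low- to high-entropy tokens": per-token loss weights that interpolate from easy-first to hard-first as training progresses. The batch-wise normalisation and the linear interpolation are assumptions of this sketch, not the paper's formula.

```python
import torch

def curriculum_weights(entropy: torch.Tensor, progress: float) -> torch.Tensor:
    """Hypothetical per-token loss weights for an entropy-ordered curriculum.

    entropy:  [batch, seq_len] teacher entropies (see the sketch above).
    progress: scalar training progress in [0, 1].
    Early in training the weights favour low-entropy (easy) tokens; as
    progress approaches 1 they shift toward high-entropy (hard) tokens.
    """
    # Normalise entropy to [0, 1] within the batch so the weighting is scale-free
    # (an illustrative choice, not taken from the paper).
    e = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-8)
    return (1.0 - progress) * (1.0 - e) + progress * e
```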

If this is right

  • Student models reach higher task performance for the same parameter count because training effort is concentrated on tokens where the teacher shows high uncertainty.
  • Overall distillation training time decreases because easy low-entropy tokens use a cheaper logits-only branch instead of full feature extraction.
  • Temperature scaling per token lets the student imitate the teacher's varying confidence levels instead of assuming a single global temperature (a combined sketch of the temperature and branch-routing logic follows this list).
  • The curriculum ordering produces a natural progression from simple to complex tokens, similar to human learning schedules but derived automatically from entropy.
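
A hedged sketch of how the per-token temperature and the dual-branch routing could combine into one loss. The temperature formula, the quantile cut-off for "hard" tokens, and the MSE feature loss are assumptions of this sketch; the teacher tensors are assumed to be detached and dimension-matched to the student's.

```python
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student_logits, teacher_logits,
                               student_feats, teacher_feats,
                               entropy, weights,
                               base_temp=2.0, temp_scale=1.0,
                               hard_quantile=0.8, feat_coeff=1.0):
    """Illustrative per-token loss: entropy-scaled temperature plus
    dual-branch routing. Logits are [B, T, V], features [B, T, D],
    entropy and weights [B, T]. Teacher tensors are assumed detached.
    """
    # Per-token temperature grows with teacher uncertainty (assumed form).
    e = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-8)
    temp = base_temp + temp_scale * e                       # [B, T]
    t = temp.unsqueeze(-1)                                  # [B, T, 1]

    # Logit-matching branch, applied to every token at its own temperature.
    kl = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                  F.softmax(teacher_logits / t, dim=-1),
                  reduction="none").sum(-1) * temp.pow(2)   # [B, T]

    # Feature-matching branch only for the hardest (high-entropy) tokens.
    hard = (e >= torch.quantile(e, hard_quantile)).float()  # [B, T]
    feat_mse = (student_feats - teacher_feats).pow(2).mean(-1)  # [B, T]

    per_token = kl + feat_coeff * hard * feat_mse
    return (weights * per_token).mean()
```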

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy signal could be reused to decide when to stop distilling a given token or to weight the loss dynamically beyond the three changes described.
  • If entropy correlates with token difficulty across languages, the method might improve cross-lingual distillation without language-specific tuning.
  • The dual-branch switch could be extended to other efficiency techniques such as early exiting or sparse attention on low-entropy tokens.

Load-bearing premise

The entropy of the teacher's predictions reliably marks tokens that are differentially important or difficult for the student, and the three adaptive changes together produce net gains without adding new training instabilities or biases.

What would settle it

On standard benchmarks such as GLUE or SuperGLUE, student models trained with the entropy-guided method show no accuracy gain or lower accuracy than identical students trained with uniform distillation under the same compute budget.

Figures

Figures reproduced from arXiv: 2605.01732 by Guangxin Wu, Hao Zhang, Jiafeng Guo, Wanyi Ning, Xueqi Cheng, Zhibin Zhang.

Figure 1: Performance comparison of different methods on the SST-2 benchmark.
Figure 2: Overview of EGAD. Given an input sequence, the teacher model produces …
Figure 3: Performance comparison among different distillation methods.
Figure 4: Token-level entropy predicted by the teacher model for two randomly …
Original abstract

Large language models (LLMs) have achieved remarkable performance across diverse domains, yet their enormous computational and memory requirements hinder deployment in resource-constrained environments. Knowledge distillation offers a promising solution by transferring knowledge from a large teacher model to a smaller student model. However, existing distillation methods typically treat all tokens equally, ignoring the fact that different tokens contribute unequally to model decisions. This can lead to inefficient knowledge transfer and reduced learning effectiveness. To address this limitation, we propose an entropy-based adaptive distillation strategy that dynamically adjusts the training process at the token level. Our method leverages the teacher's output entropy to guide three aspects of distillation. Specifically, we introduce a token-level curriculum by dynamically shifting focus from low- to high-entropy tokens during training. We further adjust the distillation temperature based on token entropy to better capture teacher confidence patterns. Moreover, we employ a dual-branch architecture for efficient logits-only distillation on easy tokens and deeper feature-based distillation on difficult tokens. Extensive experiments validate the soundness and effectiveness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes EGAD, an entropy-guided adaptive distillation strategy for token-level knowledge transfer from large teacher LLMs to smaller student models. It dynamically adjusts the distillation process at the token level by using the teacher's output entropy to implement (1) a curriculum that shifts focus from low- to high-entropy tokens, (2) entropy-dependent temperature scaling, and (3) a dual-branch architecture applying logits-only distillation to easy tokens and deeper feature-based distillation to difficult tokens. The authors assert that this addresses the limitation of treating all tokens equally in prior methods and that extensive experiments validate its soundness and effectiveness.

Significance. If the empirical results hold and demonstrate consistent gains over standard distillation baselines, the approach could meaningfully advance efficient LLM deployment by making knowledge transfer adaptive to token uncertainty, potentially improving student performance with reduced computational overhead. The design is a coherent heuristic extension of existing curriculum and temperature techniques, directly targeting a known inefficiency in uniform token treatment.

major comments (1)
  1. Abstract: The central claim that 'extensive experiments validate the soundness and effectiveness of our method' is load-bearing, yet the manuscript provides no quantitative results, specific baselines, ablation studies, or statistical significance tests. Without these, the effectiveness of the entropy-guided adjustments cannot be assessed.
minor comments (1)
  1. The description of how entropy is computed and thresholded for the curriculum and dual-branch decisions would benefit from explicit equations and pseudocode to ensure reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for your review of our manuscript. We appreciate the detailed feedback and address the concern regarding the abstract below.

Point-by-point responses
  1. Referee: Abstract: The central claim that 'extensive experiments validate the soundness and effectiveness of our method' is load-bearing, yet the manuscript provides no quantitative results, specific baselines, ablation studies, or statistical significance tests. Without these, the effectiveness of the entropy-guided adjustments cannot be assessed.

    Authors: We agree that the abstract makes a strong claim about experimental validation that is not supported by any quantitative results, baselines, ablations, or statistical tests in the manuscript text provided. This is a substantive shortcoming, as the effectiveness of the proposed entropy-guided curriculum, temperature scaling, and dual-branch design cannot be evaluated without such evidence. We will revise the abstract to remove the phrase 'extensive experiments validate the soundness and effectiveness of our method' and replace it with a neutral description of the proposed approach. In the revised submission, we will either incorporate a concise summary of key results (if the full experimental section exists) or ensure the main body includes the required quantitative comparisons, ablations, and significance testing before resubmission. revision: yes

Circularity Check

0 steps flagged

No significant circularity; heuristic design with no self-referential reductions

full rationale

The paper proposes EGAD as a heuristic entropy-guided adaptive distillation method that introduces three interlocking adjustments (token-level curriculum from low- to high-entropy tokens, entropy-based temperature scaling, and dual-branch logits-vs-feature distillation) to address unequal token importance. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described construction that would reduce any claimed result to its own inputs by definition. The method is presented as an empirical design choice validated by experiments, not a tautological or self-citation-forced outcome, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the approach rests on standard assumptions of knowledge distillation plus the unstated premise that entropy is a suitable proxy for token difficulty.

pith-pipeline@v0.9.0 · 5486 in / 1109 out tokens · 70599 ms · 2026-05-10T16:08:19.769167+00:00 · methodology

discussion (0)

