Dynamic temperature knowledge distillation. arXiv preprint arXiv:2404.12711

Wei, Yukang; Bai, Yu · arXiv 2404.12711

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

Consistently Informative Soft-Label Temperature for Knowledge Distillation

cs.LG · 2026-05-19 · unverdicted · novelty 5.0

CIST uses per-sample adaptive temperatures for both teacher and student in knowledge distillation to ensure consistent entropy in soft labels and reports gains on vision and language tasks.

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

cs.CL · 2026-05-30 · unverdicted · novelty 4.0

TS-OPSD internalizes temperature via on-policy self-distillation to reheat entropy-collapsed RL policies in LLMs, providing stronger initialization for further training than continued RL or rollout temperature adjustment.

citing papers explorer

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers · cs.LG · 2026-05-29 · unverdicted · none · ref 119
Consistently Informative Soft-Label Temperature for Knowledge Distillation · cs.LG · 2026-05-19 · unverdicted · none · ref 11
Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning · cs.CL · 2026-05-30 · unverdicted · none · ref 15
