1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
OISD: On-Policy Internal Self-Distillation of Language Models
OISD improves mathematical reasoning in language models by using the final layer as an internal teacher to align logits and attention patterns in selected intermediate layers via signed advantage-weighted Jensen-Shannon divergence during GRPO optimization.
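The summary above names a signed advantage-weighted Jensen-Shannon divergence between the final layer (teacher) and an intermediate layer (student). The following is a minimal NumPy sketch of what such a term could look like for logits only, not the paper's implementation. The function names, the (batch, vocab) shapes, the GRPO-style group normalization of rewards, and the choice to multiply the per-sample divergence directly by the signed advantage are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) per row of p and q."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def signed_advantage_weighted_js(teacher_logits, student_logits, advantages):
    """Hypothetical loss term: mean over the batch of advantage * JS(teacher, student).

    Assumption: a positive advantage penalizes teacher/student disagreement,
    while a negative advantage flips the sign of the term.
    """
    p = softmax(np.asarray(teacher_logits, dtype=float))  # final-layer distribution
    q = softmax(np.asarray(student_logits, dtype=float))  # intermediate-layer distribution
    return float(np.mean(np.asarray(advantages, dtype=float) * js_divergence(p, q)))
```

In an actual training loop the teacher distribution would typically be excluded from gradient computation and the term added to the GRPO objective with some coefficient; both details are assumptions here, since the summary does not state them.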