OPSDL: On-Policy Self-Distillation for Long-Context Language Models
Pith reviewed 2026-05-10 06:03 UTC · model grok-4.3
The pith
A model's short-context strength can supervise and improve its own long-context generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OPSDL establishes that an LLM can improve its long-context behavior by first generating responses conditioned on the complete long input and then receiving per-token supervision from its own short-context capability via point-wise reverse KL divergence on extracted relevant short contexts. This mechanism encourages faithful reliance on pertinent evidence and counters hallucinations induced by irrelevant content. The method yields consistent gains across context lengths and model sizes from 7B to 32B parameters while requiring fewer training samples than standard post-training baselines.
What carries the argument
On-policy self-distillation in which long-context generations receive per-token reverse KL supervision from the model's short-context conditioned distribution on relevant excerpts.
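The mechanism can be made concrete with a toy sketch of the per-token signal; the dictionary-based distributions below are illustrative stand-ins for full-vocabulary softmax outputs, not the paper's implementation:

```python
import math

def pointwise_reverse_kl(student_probs, teacher_probs):
    """Per-token reverse KL, D_KL(student || teacher), at one position.

    student_probs: next-token distribution under the full long context.
    teacher_probs: next-token distribution from the same model under the
    extracted short context (the self-teacher). Both map token -> prob.
    """
    kl = 0.0
    for tok, p in student_probs.items():
        if p > 0:
            q = teacher_probs.get(tok, 1e-12)  # floor to avoid log(0)
            kl += p * math.log(p / q)
    return kl

# Toy case: under the long context the student leaks probability onto a
# distractor ("rome"); the short-context teacher concentrates on the
# evidence-supported token ("paris"), so the loss pulls the student back.
student = {"paris": 0.6, "rome": 0.4}
teacher = {"paris": 0.95, "rome": 0.05}
loss = pointwise_reverse_kl(student, teacher)
```

Summing this quantity over generated positions yields the dense training signal; the mode-seeking character of reverse KL is what discourages the student from spreading mass onto distractor-induced tokens.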
If this is right
- Consistent and substantial gains appear across varying context lengths on standard long-context benchmarks.
- Training reaches higher performance with fewer samples than supervised fine-tuning or direct preference optimization.
- Short-context capabilities remain intact after the long-context training procedure.
- The approach scales stably to models ranging from 7 billion to 32 billion parameters.
Where Pith is reading between the lines
- The technique may reduce dependence on externally curated long-context datasets by recycling the model's existing short-context competence.
- Success likely hinges on accurate extraction of relevant short contexts; better relevance filters could therefore amplify the observed gains.
- Similar self-teaching loops could be explored for other uneven capabilities, such as multi-step reasoning under long inputs.
Load-bearing premise
The model's short-context capability must remain strong and accurate enough to supply reliable token-level signals without introducing its own biases or errors.
What would settle it
A direct test comparing hallucination rates on long-context tasks containing known distractors before and after OPSDL training; if error rates do not drop or if short-context supervision itself contains inaccuracies that persist, the central claim would be undermined.
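Such a test could be scored mechanically; the sketch below assumes an evaluation set with planted distractors and a `model_answer` callable, both of which are hypothetical names, not artifacts from the paper:

```python
def distractor_hallucination_rate(model_answer, eval_set):
    """Fraction of answers that repeat a planted distractor instead of
    the ground-truth answer.

    model_answer(context, query) is assumed to return a string; each
    eval item carries a known distractor span and a reference answer.
    """
    hits = 0
    for item in eval_set:
        ans = model_answer(item["context"], item["query"]).lower()
        if item["distractor"].lower() in ans and item["answer"].lower() not in ans:
            hits += 1
    return hits / len(eval_set)

# The claim is settled by comparing the rate before and after training:
# rate_before = distractor_hallucination_rate(base_model_answer, eval_set)
# rate_after = distractor_hallucination_rate(opsdl_model_answer, eval_set)
```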
read the original abstract
Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they either rely on high-quality supervision data or sparse sequence-level rewards, leading to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model's in-context learning ability to act as a teacher, OPSDL leverages the model's own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long-context, then the self-teacher provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short-context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes OPSDL, an on-policy self-distillation method for long-context LLMs. It generates responses under full long context and uses the model's short-context capability as a self-teacher to provide per-token supervision via point-wise reverse KL divergence on an extracted relevant short-context, with the goal of encouraging faithful evidence use and reducing irrelevant-context hallucinations. Evaluations across 7B-32B models claim consistent gains over SFT and DPO on long-context benchmarks, higher sample efficiency, and no degradation on short-context tasks.
Significance. If the empirical claims hold with proper controls, the approach offers a data-efficient, self-supervised route to long-context scaling that avoids external high-quality data or sparse rewards, which could be practically useful for post-training.
major comments (3)
- [Abstract and §3] Abstract and §3 (method): the central claim that the short-context self-teacher mitigates hallucinations rests on the extraction of a 'relevant short-context' being both complete and noise-free, yet no algorithm, pseudocode, or quality metric for this extraction step is provided; without it the dense token-level reverse-KL signal could equally propagate short-context errors.
- [§4] §4 (experiments): the abstract asserts 'consistent and substantial improvements' and 'higher sample efficiency' over SFT/DPO, but reports neither concrete benchmark scores, standard deviations, statistical tests, nor ablations on extraction quality or teacher error rate on the same long-context tasks; these omissions make it impossible to assess whether the gains are load-bearing or artifactual.
- [§3.2] §3.2 (training objective): the point-wise reverse KL is applied under the extracted short-context, but the manuscript supplies no analysis of distribution shift between short- and long-context regimes or measurement of how often the short-context teacher itself hallucinates on the target long-context queries; if teacher error exceeds a threshold the distillation can reinforce rather than correct mistakes.
minor comments (2)
- [Abstract and §3] Notation for the reverse-KL term and the extraction function should be defined once in §3 and used consistently; the abstract introduces 'point-wise reverse KL' without an equation reference.
- [§4] The claim of 'no degradation' on short-context performance would be stronger with a dedicated table or figure showing before/after scores on standard short-context suites.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and rigor, particularly around methodological details and empirical reporting. We address each major comment below and will incorporate revisions to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): the central claim that the short-context self-teacher mitigates hallucinations rests on the extraction of a 'relevant short-context' being both complete and noise-free, yet no algorithm, pseudocode, or quality metric for this extraction step is provided; without it the dense token-level reverse-KL signal could equally propagate short-context errors.
Authors: We agree that the extraction procedure is central to the method and was described at a high level in the original submission. The extraction identifies query-relevant segments from the long context via embedding similarity and attention-based filtering to form the short-context input for the teacher. In the revision we will add a dedicated subsection with full algorithm description, pseudocode, and quantitative quality metrics (e.g., precision/recall against human-annotated relevant spans on a validation set) to demonstrate that the extracted context is reliable and minimizes noise propagation. revision: yes
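As a rough illustration of the described pipeline, a bag-of-words cosine scorer can stand in for the embedding and attention-based filters; the chunking scheme, scorer, and parameters here are assumptions for the sketch, not the authors' algorithm:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def extract_short_context(long_context: str, query: str,
                          chunk_size: int = 64, top_k: int = 2) -> str:
    """Keep the top-k chunks most similar to the query.

    Stand-in for the extraction step: a real system would use learned
    embeddings and attention-based filtering, not word counts.
    """
    words = long_context.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    qv = Counter(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: cosine(Counter(c.lower().split()), qv),
                    reverse=True)
    return " ".join(scored[:top_k])
```

Precision/recall of the returned chunks against annotated relevant spans is the natural quality metric for this step.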
-
Referee: [§4] §4 (experiments): the abstract asserts 'consistent and substantial improvements' and 'higher sample efficiency' over SFT/DPO, but reports neither concrete benchmark scores, standard deviations, statistical tests, nor ablations on extraction quality or teacher error rate on the same long-context tasks; these omissions make it impossible to assess whether the gains are load-bearing or artifactual.
Authors: We acknowledge that the experimental presentation would be strengthened by additional quantitative detail. While §4 already contains benchmark tables, we will expand them in the revision to include per-task scores with standard deviations over multiple seeds, statistical significance tests (e.g., paired t-tests), and new ablations that vary extraction quality and measure teacher error rates directly on the long-context evaluation sets. These additions will make the claimed improvements and sample-efficiency gains fully verifiable. revision: yes
-
Referee: [§3.2] §3.2 (training objective): the point-wise reverse KL is applied under the extracted short-context, but the manuscript supplies no analysis of distribution shift between short- and long-context regimes or measurement of how often the short-context teacher itself hallucinates on the target long-context queries; if teacher error exceeds a threshold the distillation can reinforce rather than correct mistakes.
Authors: This is a valid concern about potential error reinforcement. The design assumes the short-context regime exhibits lower hallucination rates on relevant evidence, but we did not quantify this in the original version. In the revision we will add an analysis subsection that (1) measures short-context teacher hallucination rates on long-context queries (using available ground-truth answers) and (2) reports token-level agreement statistics between short- and long-context generations to characterize distribution shift. We will also discuss failure cases where teacher error could propagate. revision: yes
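The proposed agreement and teacher-error statistics are straightforward to compute once generations and references are in hand; the string-matching criteria below are simplifying assumptions rather than the paper's evaluation protocol:

```python
def token_agreement(short_gen, long_gen):
    """Fraction of aligned positions where the short- and long-context
    generations emit the same token. Positional alignment is a
    simplification; a real analysis might align by edit distance."""
    n = min(len(short_gen), len(long_gen))
    if n == 0:
        return 0.0
    return sum(s == t for s, t in zip(short_gen, long_gen)) / n

def hallucination_rate(answers, references):
    """Share of teacher answers failing an exact match against ground
    truth; a real evaluation would use a softer grader."""
    wrong = sum(a.strip().lower() != r.strip().lower()
                for a, r in zip(answers, references))
    return wrong / len(answers)
```

If the teacher's measured error rate exceeds the student's on the same queries, distillation would be expected to reinforce rather than correct mistakes, which is exactly the threshold the referee asks about.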
Circularity Check
No circularity: method is a standard self-distillation setup evaluated externally
full rationale
The OPSDL derivation defines a training procedure that conditions a self-teacher on extracted short context to supply per-token reverse-KL signals to the long-context policy. This construction does not equate the claimed performance gains to any fitted parameter, self-referential definition, or prior result by the same authors. No equations appear that rename an input as a prediction or smuggle an ansatz via self-citation. Evaluation occurs on independent long-context benchmarks, leaving the central claim independent of its own training loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The model's short-context capability is inherently strong and can serve as a reliable teacher for long-context scenarios without introducing its own errors.
Forward citations
Cited by 2 Pith papers
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
-
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Reference graph
Works this paper leans on
-
[1]
Guanzheng Chen, Xin Li, Michael Qizhe Shieh, and Lidong Bing. Longpo: Long context self-evolution of large language models through short-to-long preference optimization. arXiv preprint arXiv:2502.13922,
-
[2]
RULER: What's the Real Context Size of Your Long-Context Language Models?
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654,
-
[3]
Reinforcement Learning via Self-Distillation
Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802,
-
[4]
Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs
Norman Paulsen. Context is what you need: The maximum effective context window for real world limits of LLMs. arXiv preprint arXiv:2509.21361,
-
[5]
YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071,
-
[6]
Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, et al. Qwenlong-l1.5: Post-training recipe for long-context reasoning and memory management. arXiv preprint arXiv:2512.12967,
-
[7]
Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, et al. Solopo: Unlocking long-context capabilities in LLMs via short-to-long preference optimization. arXiv preprint arXiv:2505.11166,
-
[8]
MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, et al. Minicpm-sala: Hybridizing sparse and linear attention for efficient long-context modeling. arXiv preprint arXiv:2602.11761,
-
[9]
Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning
Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, and Ming Yan. Qwenlong-l1: Towards long-context large reasoning models with reinforcement learning. arXiv preprint arXiv:2505.17667,
-
[10]
An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383,
-
[11]
On-Policy Context Distillation for Language Models
Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275,
-
[12]
Self-distilled reasoner: On-policy self-distillation for large language models
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734,