MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation
Pith reviewed 2026-06-29 18:44 UTC · model grok-4.3
The pith
MAIGO trains language models via on-policy self-distillation from history-cleaned references to shrink the lost-in-conversation gap.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAIGO is an on-policy self-distillation procedure that removes prior assistant replies from middle-turn contexts while preserving the sharded user prefix, distills answer turns from paired full-view references, and applies a reliability weight; under the paired-view protocol with deterministic verifiers this procedure raises Qwen2.5-7B-Instruct sharded accuracy from 52.8 to 66.1, lifts the sharded-to-full ratio from 66.5 percent to 84.1 percent, and keeps full accuracy within 2.3 points, demonstrating that self-contamination is a trainable component of the lost-in-conversation gap.
What carries the argument
History-cleaned on-policy self-distillation that removes earlier assistant replies for middle turns and uses paired full-view references for answer turns, with a reliability weight to filter disagreeing samples.
If this is right
- Self-contamination becomes a controllable training target rather than an unavoidable inference artifact.
- Models can maintain near-full accuracy on complete prompts while improving substantially on sharded multi-turn versions of the same tasks.
- No external verifiers, state labels, or inference scaffolding are required to obtain the reported gains.
- The method applies directly to existing chat models without changing their architecture or decoding procedure.
Where Pith is reading between the lines
- The same cleaning step could be applied to other multi-turn settings where early model outputs risk polluting later context, such as tool-use chains or long agent trajectories.
- If the reliability weight proves stable across tasks, the approach might reduce reliance on longer context windows for tasks that can be reframed as sharded dialogues.
- Testing whether the gains hold when the verifier is replaced by human judgment or by a different model family would clarify how much the result depends on the deterministic verifier setup.
Load-bearing premise
The paired-view protocol with deterministic verifiers isolates self-contamination effects from other factors such as prompt formatting or verifier differences.
What would settle it
Replacing the history-cleaned references with ordinary full-history references during distillation and observing that the sharded accuracy gains disappear would falsify the claim that history cleaning is the operative mechanism.
Figures
read the original abstract
Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that self-contamination from intermediate assistant replies contributes to the lost-in-conversation (LiC) gap in LLMs. It proposes MAIGO, an on-policy self-distillation method that generates history-cleaned references from the model's own policy (removing prior assistant replies for middle turns while preserving sharded prefixes, and using paired full-view references for answer turns) along with a reliability weight to downweight disagreeing samples. No external verifiers, state labels, or inference scaffolding are required. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1% while keeping FULL accuracy within 2.3 points, concluding that self-contamination is a trainable component of the LiC gap.
Significance. If the paired-view protocol is shown to isolate self-contamination, the work offers a practical, reward-free approach to improving multi-turn consistency that could be broadly applicable. The explicit design choice to avoid verifier rewards and scaffolding is a clear strength for real-world deployment.
major comments (2)
- [Abstract] Abstract: The central claim that the reported gains (SHARDED accuracy 52.8→66.1, ratio 66.5%→84.1%) demonstrate self-contamination as a trainable component of the LiC gap depends on the paired-view protocol specifically isolating self-contamination rather than confounds such as prompt formatting, context length, or verifier sensitivity. The abstract describes the protocol and history-cleaning procedure but supplies no ablations, cross-protocol correlations, matched-context controls, or validation against alternative gap measures, leaving the attribution load-bearing yet unverified.
- [Abstract] Abstract (results paragraph): The numerical improvements are stated without reference to the number of evaluation runs, statistical significance testing, variance across seeds, or explicit baseline comparisons (beyond the base model), which is required to support the reliability of the gains and the conclusion that the method trains away self-contamination.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We respond point-by-point to the major comments and indicate planned revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the reported gains (SHARDED accuracy 52.8→66.1, ratio 66.5%→84.1%) demonstrate self-contamination as a trainable component of the LiC gap depends on the paired-view protocol specifically isolating self-contamination rather than confounds such as prompt formatting, context length, or verifier sensitivity. The abstract describes the protocol and history-cleaning procedure but supplies no ablations, cross-protocol correlations, matched-context controls, or validation against alternative gap measures, leaving the attribution load-bearing yet unverified.
Authors: The paired-view protocol holds the task, verifier, and user-side dialogue fixed while varying only the presence of prior assistant replies in the sharded condition; the full-view condition supplies the identical completed dialogue without any assistant history. This directly controls for prompt formatting and verifier sensitivity. Context length is matched by construction in the full-view reference. The history-cleaning step removes only assistant replies while preserving the user-visible sharded prefix, isolating the self-contamination mechanism. We agree that explicit ablations would strengthen the claim and will add a paragraph in Section 3 explaining the isolation logic together with a brief matched-context control experiment. revision: partial
-
Referee: [Abstract] Abstract (results paragraph): The numerical improvements are stated without reference to the number of evaluation runs, statistical significance testing, variance across seeds, or explicit baseline comparisons (beyond the base model), which is required to support the reliability of the gains and the conclusion that the method trains away self-contamination.
Authors: The reported figures use deterministic decoding and a single evaluation pass per condition, which is the standard protocol for the LiC benchmark to guarantee reproducibility. We will revise the abstract and results section to state the number of runs explicitly, note the deterministic setting, and include additional baseline comparisons. Variance across seeds will be reported in an appendix if space allows. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical method (MAIGO) for on-policy self-distillation using history-cleaned references from the model's own policy, with results reported as measured accuracy gains (e.g., SHARDED accuracy 52.8 to 66.1) under the paired-view protocol on Qwen2.5-7B-Instruct. No equations, derivations, or fitted parameters are described that reduce by construction to the inputs. The use of the model's policy is a standard on-policy design element, not a self-definitional loop, and the evaluation relies on external benchmarks and deterministic verifiers rather than internal tautologies or self-citation chains. The derivation is self-contained against the reported empirical outcomes.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
When attention closes: How LLMs lose the thread in multi-turn interaction.arXiv preprint arXiv:2605.12922. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783. Qisong He,...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
InPro- ceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–
A reduction of imitation learning and struc- tured prediction to no-regret online learning. InPro- ceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–
-
[3]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
JMLR Workshop and Conference Proceedings. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300. Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Sh...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Self-distilled reasoner: On-policy self- distillation for large language models.arXiv preprint arXiv:2601.18734. Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. 2026a. SCOPE: Signal-calibrated on-policy distillation enhancement with dual-path adaptive weighting.arXiv preprint arXiv:260...
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.