MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation

Haoyu Zheng; Jun Xiao; Qing Wang; Shangming Chen; Shu Yuan; Wenqiao Zhang; Yueting Zhuang; Yun Zhu

arxiv: 2605.27186 · v1 · pith:5XGF5DEJnew · submitted 2026-05-26 · 💻 cs.CL

MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation

Haoyu Zheng , Yun Zhu , Shu Yuan , Shangming Chen , Qing Wang , Wenqiao Zhang , Jun Xiao , Yueting Zhuang This is my paper

Pith reviewed 2026-06-29 18:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords lost-in-conversationself-contaminationself-distillationmulti-turn dialogueon-policy learninglarge language modelsconversation degradationsharded evaluation

0 comments

The pith

MAIGO trains language models via on-policy self-distillation from history-cleaned references to shrink the lost-in-conversation gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces part of the lost-in-conversation gap to self-contamination, where an assistant's own earlier replies enter the context and propagate early errors. MAIGO counters this by distilling middle-turn predictions from versions that drop prior assistant messages while keeping the user-visible prefix, and by distilling answer turns from paired full-view references. A reliability weight down-weights unreliable middle-turn samples. On Qwen2.5-7B-Instruct this raises sharded accuracy from 52.8 to 66.1 and the sharded-to-full ratio from 66.5 percent to 84.1 percent while full accuracy stays within 2.3 points. The result indicates that self-contamination is a component of the gap that can be reduced through training rather than inference-time fixes alone.

Core claim

MAIGO is an on-policy self-distillation procedure that removes prior assistant replies from middle-turn contexts while preserving the sharded user prefix, distills answer turns from paired full-view references, and applies a reliability weight; under the paired-view protocol with deterministic verifiers this procedure raises Qwen2.5-7B-Instruct sharded accuracy from 52.8 to 66.1, lifts the sharded-to-full ratio from 66.5 percent to 84.1 percent, and keeps full accuracy within 2.3 points, demonstrating that self-contamination is a trainable component of the lost-in-conversation gap.

What carries the argument

History-cleaned on-policy self-distillation that removes earlier assistant replies for middle turns and uses paired full-view references for answer turns, with a reliability weight to filter disagreeing samples.

If this is right

Self-contamination becomes a controllable training target rather than an unavoidable inference artifact.
Models can maintain near-full accuracy on complete prompts while improving substantially on sharded multi-turn versions of the same tasks.
No external verifiers, state labels, or inference scaffolding are required to obtain the reported gains.
The method applies directly to existing chat models without changing their architecture or decoding procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cleaning step could be applied to other multi-turn settings where early model outputs risk polluting later context, such as tool-use chains or long agent trajectories.
If the reliability weight proves stable across tasks, the approach might reduce reliance on longer context windows for tasks that can be reframed as sharded dialogues.
Testing whether the gains hold when the verifier is replaced by human judgment or by a different model family would clarify how much the result depends on the deterministic verifier setup.

Load-bearing premise

The paired-view protocol with deterministic verifiers isolates self-contamination effects from other factors such as prompt formatting or verifier differences.

What would settle it

Replacing the history-cleaned references with ordinary full-history references during distillation and observing that the sharded accuracy gains disappear would falsify the claim that history cleaning is the operative mechanism.

Figures

Figures reproduced from arXiv: 2605.27186 by Haoyu Zheng, Jun Xiao, Qing Wang, Shangming Chen, Shu Yuan, Wenqiao Zhang, Yueting Zhuang, Yun Zhu.

**Figure 1.** Figure 1: FULL-vs-SHARDED task delivery. The same requirements appear either in one complete prompt or across turns, where earlier assistant replies become part of the final context. contamination. The key idea is to train the model on references that preserve the user-visible conversational prefix while removing the assistant outputs that could otherwise bias the reference distribution. For middle turns, MAIGO co… view at source ↗

**Figure 2.** Figure 2: Overview of MAIGO’s sharded training branch. Student samples are generated on-policy, while middleturn and answer-turn losses use stop-gradient reference distributions computed from cleaner training-time contexts. is the stop-gradient reference distribution evaluated under a role-specific reference context C ref. Here πθ− is an exponential-moving-average (EMA) reference policy (Agarwal et al., 2024; Zhao… view at source ↗

**Figure 3.** Figure 3: Single-seed reveal-time diagnostic on Qwen2.5-7B-Instruct SHARDED traces; the 100% point equals final SHARDED accuracy. ments as short conversational constraints, which can make the required signature and late conditions explicit near generation time before the same unittest scorer is applied. The relevant pattern is therefore SHARDED improvement together with FULL preservation; by that measure, MAIGO gi… view at source ↗

**Figure 4.** Figure 4: Single-seed sensitivity to the FULLpreservation probability ρ on Qwen2.5-7B-Instruct; values are accuracy changes from the default ρ = 0.2. where removal improves SHARDED accuracy. 4.4 Hyperparameter Sensitivity The preservation branch is controlled by the probability ρ of sampling a FULL-view training step [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAIGO shows concrete gains on the LiC task through history-cleaned self-distillation, but the link to reduced self-contamination is not yet isolated from other factors.

read the letter

The main point for you is that this paper gives a workable on-policy distillation recipe that lifts sharded accuracy on Qwen2.5-7B from 52.8 to 66.1 while holding full accuracy within 2.3 points. The method cleans prior assistant turns out of the reference for middle steps and uses paired full-view references only for the final answer, plus a simple reliability weight. That combination is new enough to be worth noting.

What the work does cleanly is avoid external verifiers or state labels and stay inside the model's own policy. The history-cleaning step for middle turns and the reliability weighting are practical details that could transfer to other multi-turn setups. The reported ratio improvement to 84.1% is the kind of number that makes people look twice.

The soft spot is the attribution. The abstract ties the gains to reduced self-contamination, but the paired-view protocol is not shown to rule out confounds like prompt formatting, context length matching, or verifier quirks. No ablations or cross-checks are described, so the claim that self-contamination is the trainable component rests on an unvalidated measurement. Soundness is therefore limited until those controls appear.

This is for groups already working on conversational reliability or on-policy methods. A reader who needs a concrete baseline for LiC experiments could pull the method and test it themselves. The thinking is coherent on its own terms even if the evidence is still thin.

I would send it to peer review. The idea is direct, the numbers are specific, and the missing validation is fixable rather than fatal.

Referee Report

2 major / 0 minor

Summary. The paper claims that self-contamination from intermediate assistant replies contributes to the lost-in-conversation (LiC) gap in LLMs. It proposes MAIGO, an on-policy self-distillation method that generates history-cleaned references from the model's own policy (removing prior assistant replies for middle turns while preserving sharded prefixes, and using paired full-view references for answer turns) along with a reliability weight to downweight disagreeing samples. No external verifiers, state labels, or inference scaffolding are required. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1% while keeping FULL accuracy within 2.3 points, concluding that self-contamination is a trainable component of the LiC gap.

Significance. If the paired-view protocol is shown to isolate self-contamination, the work offers a practical, reward-free approach to improving multi-turn consistency that could be broadly applicable. The explicit design choice to avoid verifier rewards and scaffolding is a clear strength for real-world deployment.

major comments (2)

[Abstract] Abstract: The central claim that the reported gains (SHARDED accuracy 52.8→66.1, ratio 66.5%→84.1%) demonstrate self-contamination as a trainable component of the LiC gap depends on the paired-view protocol specifically isolating self-contamination rather than confounds such as prompt formatting, context length, or verifier sensitivity. The abstract describes the protocol and history-cleaning procedure but supplies no ablations, cross-protocol correlations, matched-context controls, or validation against alternative gap measures, leaving the attribution load-bearing yet unverified.
[Abstract] Abstract (results paragraph): The numerical improvements are stated without reference to the number of evaluation runs, statistical significance testing, variance across seeds, or explicit baseline comparisons (beyond the base model), which is required to support the reliability of the gains and the conclusion that the method trains away self-contamination.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the reported gains (SHARDED accuracy 52.8→66.1, ratio 66.5%→84.1%) demonstrate self-contamination as a trainable component of the LiC gap depends on the paired-view protocol specifically isolating self-contamination rather than confounds such as prompt formatting, context length, or verifier sensitivity. The abstract describes the protocol and history-cleaning procedure but supplies no ablations, cross-protocol correlations, matched-context controls, or validation against alternative gap measures, leaving the attribution load-bearing yet unverified.

Authors: The paired-view protocol holds the task, verifier, and user-side dialogue fixed while varying only the presence of prior assistant replies in the sharded condition; the full-view condition supplies the identical completed dialogue without any assistant history. This directly controls for prompt formatting and verifier sensitivity. Context length is matched by construction in the full-view reference. The history-cleaning step removes only assistant replies while preserving the user-visible sharded prefix, isolating the self-contamination mechanism. We agree that explicit ablations would strengthen the claim and will add a paragraph in Section 3 explaining the isolation logic together with a brief matched-context control experiment. revision: partial
Referee: [Abstract] Abstract (results paragraph): The numerical improvements are stated without reference to the number of evaluation runs, statistical significance testing, variance across seeds, or explicit baseline comparisons (beyond the base model), which is required to support the reliability of the gains and the conclusion that the method trains away self-contamination.

Authors: The reported figures use deterministic decoding and a single evaluation pass per condition, which is the standard protocol for the LiC benchmark to guarantee reproducibility. We will revise the abstract and results section to state the number of runs explicitly, note the deterministic setting, and include additional baseline comparisons. Variance across seeds will be reported in an appendix if space allows. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical method (MAIGO) for on-policy self-distillation using history-cleaned references from the model's own policy, with results reported as measured accuracy gains (e.g., SHARDED accuracy 52.8 to 66.1) under the paired-view protocol on Qwen2.5-7B-Instruct. No equations, derivations, or fitted parameters are described that reduce by construction to the inputs. The use of the model's policy is a standard on-policy design element, not a self-definitional loop, and the evaluation relies on external benchmarks and deterministic verifiers rather than internal tautologies or self-citation chains. The derivation is self-contained against the reported empirical outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new entities; full text would be required to populate the ledger.

pith-pipeline@v0.9.1-grok · 5775 in / 1100 out tokens · 39560 ms · 2026-06-29T18:44:44.330729+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 3 canonical work pages · 3 internal anchors

[1]

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

When attention closes: How LLMs lose the thread in multi-turn interaction.arXiv preprint arXiv:2605.12922. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783. Qisong He,...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

InPro- ceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–

A reduction of imitation learning and struc- tured prediction to no-regret online learning. InPro- ceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–
[3]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

JMLR Workshop and Conference Proceedings. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300. Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Sh...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Self-distilled reasoner: On-policy self- distillation for large language models.arXiv preprint arXiv:2601.18734. Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. 2026a. SCOPE: Signal-calibrated on-policy distillation enhancement with dual-path adaptive weighting.arXiv preprint arXiv:260...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

When attention closes: How LLMs lose the thread in multi-turn interaction.arXiv preprint arXiv:2605.12922. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783. Qisong He,...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

InPro- ceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–

A reduction of imitation learning and struc- tured prediction to no-regret online learning. InPro- ceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–

[3] [3]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

JMLR Workshop and Conference Proceedings. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300. Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Sh...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Self-distilled reasoner: On-policy self- distillation for large language models.arXiv preprint arXiv:2601.18734. Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. 2026a. SCOPE: Signal-calibrated on-policy distillation enhancement with dual-path adaptive weighting.arXiv preprint arXiv:260...

work page internal anchor Pith review Pith/arXiv arXiv 2026