Retaining by doing: The role of on-policy data in mitigating forgetting
8 Pith papers cite this work.
Citing papers
-
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.
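Teacher-student token interleaving can be pictured with a toy decoding loop. The acceptance rule below (keep the student's greedy token whenever the teacher still assigns it enough probability, otherwise substitute the teacher's token) is an illustrative assumption, not TESSY's published rule:

```python
def interleave(student_next, teacher_next, prompt, max_len=8, threshold=0.1):
    """Sketch of teacher-student token interleaving.

    student_next / teacher_next: fn(tokens) -> dict mapping token -> prob.
    Assumed rule: accept the student's token if the teacher finds it
    plausible enough; otherwise fall back to the teacher's greedy token.
    """
    tokens = list(prompt)
    for _ in range(max_len):
        s_dist = student_next(tokens)
        t_dist = teacher_next(tokens)
        s_tok = max(s_dist, key=s_dist.get)       # student's greedy pick
        if t_dist.get(s_tok, 0.0) >= threshold:   # teacher accepts it
            tokens.append(s_tok)
        else:                                     # teacher overrides
            tokens.append(max(t_dist, key=t_dist.get))
        if tokens[-1] == "<eos>":
            break
    return tokens
```

The point of a rule like this is that the resulting trace stays in the student's own style wherever the teacher tolerates it, while the teacher corrects only the tokens it would never produce.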
-
STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage by 19-42% with little accuracy loss on math benchmarks.
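The pruning idea reduces to finding the shortest prefix of a segmented trace that already yields the right answer; a minimal sketch, where `answer_of` is a hypothetical answer-extraction hook:

```python
def prune_to_earliest_correct(trace_nodes, answer_of, gold):
    """Return the shortest prefix of reasoning nodes whose extracted
    answer already matches the gold answer (a sketch of the pruning
    idea; `answer_of` is a hypothetical extraction hook)."""
    for i in range(1, len(trace_nodes) + 1):
        prefix = trace_nodes[:i]
        if answer_of(prefix) == gold:
            return prefix
    return trace_nodes  # no correct prefix: keep the full trace
```

Everything after the earliest correct node (double-checks, restatements) is dropped, which is where the token savings come from.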
-
Stabilizing LLM Supervised Fine-Tuning via Explicit Distributional Control
Anchored Learning stabilizes LLM supervised fine-tuning by interpolating a moving anchor between the current model and a frozen reference to create bounded local updates in distribution space.
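A minimal numeric sketch of the anchoring idea, assuming the anchor is a convex combination of the current weights and the frozen reference and that each update is pulled toward it (the paper's exact update may differ):

```python
def anchor_step(theta, theta_ref, grad, lr=0.1, lam=0.5, beta=0.1):
    """One anchored update on a parameter vector (plain Python lists).

    Assumed form: anchor = lam * current + (1 - lam) * frozen reference,
    then a gradient step plus a pull toward the anchor, which bounds how
    far a single update can move the model."""
    anchor = [lam * t + (1 - lam) * r for t, r in zip(theta, theta_ref)]
    return [t - lr * g - beta * (t - anchor_i)
            for t, g, anchor_i in zip(theta, grad, anchor)]
```

Because the anchor moves with the model, the pull term constrains each step locally rather than tethering training to the initial checkpoint forever.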
-
Watch Before You Answer: Learning from Visually Grounded Post-Training
Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
-
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.
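The name suggests a per-token statistic combining the teacher token's rank under the student with its surprisal. The formula below is one plausible reading for illustration only (an assumption, not the paper's definition):

```python
import math

def rank_surprisal_ratio(student_dists, teacher_tokens):
    """Assumed form of RSR (illustrative, not the published formula):
    for each teacher token, take its rank under the student's next-token
    distribution (1 = student's top choice) and its surprisal -log p,
    then average the per-token rank/surprisal ratios."""
    ratios = []
    for dist, tok in zip(student_dists, teacher_tokens):
        ranked = sorted(dist, key=dist.get, reverse=True)
        rank = ranked.index(tok) + 1
        surprisal = -math.log(dist[tok])  # nats
        ratios.append(rank / max(surprisal, 1e-9))
    return sum(ratios) / len(ratios)
```

Whatever the exact definition, the appeal of a metric in this family is that it needs only one student forward pass per candidate trajectory, which makes trajectory selection cheap compared with actually distilling and evaluating.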
-
On-Policy Distillation with Best-of-N Teacher Rollout Selection
BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one with a correctness-first, alignment-second priority rule, yielding gains on the AIME and AMC math benchmarks.
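The selection rule is a two-level argmax; a sketch with hypothetical `is_correct` and `alignment` hooks (alignment might be, e.g., the student's log-probability of the rollout):

```python
def best_rollout(rollouts, is_correct, alignment):
    """Correctness-first, then alignment: among N teacher rollouts,
    restrict to the correct ones when any exist, then break ties by an
    alignment score (both hooks are hypothetical stand-ins here)."""
    correct = [r for r in rollouts if is_correct(r)]
    pool = correct if correct else rollouts
    return max(pool, key=alignment)
```

The ordering matters: maximizing alignment alone would favor rollouts the student already finds easy, even when they are wrong.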
-
CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning
CRAFT is a continual learning method for LLMs that learns low-rank interventions on hidden representations and uses a unified KL-divergence objective to handle task routing (via output divergence), forgetting control (via prior-state regularization), and intervention merging.
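A low-rank intervention on a hidden state can be sketched as an additive edit h' = h + B(A h), where A and B are rank-r factors with r much smaller than the hidden size (a sketch of the general idea; CRAFT's exact parameterization may differ):

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def low_rank_intervention(h, A, B):
    """Additive low-rank edit of a hidden state: h' = h + B(A h).
    A is r x d, B is d x r with r << d, so the learned edit lives in a
    rank-r subspace of the d-dimensional hidden space."""
    z = matvec(A, h)      # project to the rank-r space
    delta = matvec(B, z)  # map the edit back to hidden size d
    return [hi + di for hi, di in zip(h, delta)]
```

Keeping the edit low-rank is what makes per-task interventions cheap to store and feasible to merge.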
-
Mind DeepResearch Technical Report
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.