Retaining by doing: The role of on-policy data in mitigating forgetting

Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen · 2025 · arXiv 2510.18874

11 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

method: 2

citation-polarity summary

use method: 2

representative citing papers

Self-Policy Distillation via Capability-Selective Subspace Projection

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

cs.CL · 2026-03-23 · conditional · novelty 7.0

TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

Stabilizing LLM Supervised Fine-Tuning via Explicit Distributional Control

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Anchored Learning stabilizes LLM supervised fine-tuning by interpolating a moving anchor between the current model and a frozen reference to create bounded local updates in distribution space.

Watch Before You Answer: Learning from Visually Grounded Post-Training

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

cs.RO · 2026-02-11 · unverdicted · novelty 6.0

LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

cs.CL · 2026-01-20 · conditional · novelty 6.0

Rank-Surprisal Ratio (RSR) correlates strongly (average Spearman 0.86) with post-distillation reasoning gains across five student models and trajectories from eleven teachers, outperforming existing selection metrics.

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

MOTAB is a new distillation pipeline that monitors on-policy student trajectories and backtracks with teacher intervention to mitigate dual exposure biases, improving reasoning performance by about 3%.

On-Policy Distillation with Best-of-N Teacher Rollout Selection

cs.CV · 2026-05-10 · unverdicted · novelty 5.0 · 2 refs

BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.

CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

CRAFT is a continual learning method for LLMs that learns low-rank interventions on hidden representations, using a unified KL-divergence objective to handle task routing by output divergence, forgetting control via prior-state regularization, and intervention merging.

Mind DeepResearch Technical Report

cs.AI · 2026-04-16 · unverdicted · novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Watch Before You Answer: Learning from Visually Grounded Post-Training

cs.CV · 2026-04-06 · unverdicted · none · ref 13

Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

Mind DeepResearch Technical Report

cs.AI · 2026-04-16 · unverdicted · none · ref 1

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
