{"total":11,"items":[{"citing_arxiv_id":"2605.22675","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Policy Distillation via Capability-Selective Subspace Projection","primary_cat":"cs.CL","submitted_at":"2026-05-21T16:18:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19433","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation","primary_cat":"cs.CL","submitted_at":"2026-05-19T06:42:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MOTAB is a new distillation pipeline that monitors on-policy student trajectories and backtracks with teacher intervention to mitigate dual exposure biases, improving reasoning performance by about 3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13165","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes","primary_cat":"cs.CL","submitted_at":"2026-05-13T08:28:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09725","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On-Policy Distillation with Best-of-N Teacher Rollout Selection","primary_cat":"cs.CV","submitted_at":"2026-05-10T19:49:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05732","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning","primary_cat":"cs.LG","submitted_at":"2026-05-07T06:24:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CRAFT is a continual learning method for LLMs that learns low-rank interventions on hidden representations, using a unified KL-divergence objective to handle task routing by output divergence, forgetting control via prior-state regularization, and intervention merging.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04468","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stabilizing LLM Supervised Fine-Tuning via Explicit Distributional Control","primary_cat":"cs.LG","submitted_at":"2026-05-06T03:48:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anchored Learning stabilizes LLM supervised fine-tuning by interpolating a moving anchor between the 
current model and a frozen reference to create bounded local updates in distribution space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14518","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mind DeepResearch Technical Report","primary_cat":"cs.AI","submitted_at":"2026-04-16T01:20:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Figure 4: Overview of the knowledge-graph-grounded query synthesis pipeline, consisting of four stages: graph construction and subgraph sampling, initial QA generation, text obfuscation and complexity enhancement, and reasoning validity filtering. stably without catastrophic forgetting, MindDR adopts on-policy self-improved framework with DPO [24] and Self-SFT [1] to align the final report quality with human expectations (Section 5.4). 4 Data Synthesis 4.1 Query Synthesis We propose an end-to-end framework for synthesizing multi-hop reasoning questions from structured knowledge graphs. The overall pipeline, illustrated in Fig. 
4, comprises four stages: graph construction and subgraph sampling, initial QA generation, condition obfuscation and complexity"},{"citing_arxiv_id":"2604.05117","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Watch Before You Answer: Learning from Visually Grounded Post-Training","primary_cat":"cs.CV","submitted_at":"2026-04-06T19:22:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"with a simple data curation method (described in §4.2). While VidGround can be applied to any base VLM, we adopt Qwen2.5-VL-7B-Instruct [7] for its video understanding capabilities and computational efficiency. 4.1 RL for video understanding post-training We use reinforcement learning (RL) for post-training based on recent evidence that RL improves underlying visual recognition capabilities [13] while exhibiting less catastrophic forgetting than supervised fine-tuning (SFT) [15]. Optimization objective. We adopt Group Relative Policy Optimization (GRPO) [44] augmented with techniques from DAPO [54] and temporal-aware rewards from Video-R1 [20]. Specifically, we employ token-level policy gradient loss with asymmetric clipping (increasing the value of ε_h) to make the training more efficient"},{"citing_arxiv_id":"2604.14164","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How to Fine-Tune a Reasoning Model? 
A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data","primary_cat":"cs.CL","submitted_at":"2026-03-23T22:00:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.10503","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning","primary_cat":"cs.RO","submitted_at":"2026-02-11T04:05:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14249","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment","primary_cat":"cs.CL","submitted_at":"2026-01-20T18:58:10+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}