Recycling failures: Salvaging exploration in rlvr via fine-grained off-policy guidance.arXiv preprint arXiv:2602.24110

11 Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang, Jiayan Qiu, Baosheng Yu, Quan Chen, Liu Liu · arXiv 2602.24110

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

representative citing papers

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.

citing papers explorer

Showing 2 of 2 citing papers.

Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration cs.LG · 2026-04-13 · unverdicted · none · ref 24
NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization cs.AI · 2026-06-05 · unverdicted · none · ref 14
PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.

Recycling failures: Salvaging exploration in rlvr via fine-grained off-policy guidance.arXiv preprint arXiv:2602.24110

fields

years

verdicts

representative citing papers

citing papers explorer