NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.
Recycling failures: Salvaging exploration in RLVR via fine-grained off-policy guidance. arXiv preprint arXiv:2602.24110
2 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
PTD-PO supplies step-wise token-distribution supervision to student policies via in-context privileged hints derived from spatial attention and intermediate reasoning, while keeping the student in an answer-free context and using Top-K Jensen-Shannon divergence for stable alignment.
citing papers explorer
- Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization