NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
Test-time prompt intervention
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 3polarities
background 3representative citing papers
Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
citing papers explorer
-
Near-Future Policy Optimization
NPO uses a policy's own near-future checkpoint as auxiliary trajectories to maximize effective learning signal S = Q/V, improving performance from 57.88 to 63.15 on Qwen3-VL-8B-Instruct with GRPO while accelerating convergence.
-
Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning
Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.