{"total":13,"items":[{"citing_arxiv_id":"2605.15726","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR","primary_cat":"cs.AI","submitted_at":"2026-05-15T08:22:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NudgeRL conditions RLVR rollouts on strategy-level contexts to drive diverse trajectories and applies an inter/intra-context reward decomposition plus distillation objective, outperforming GRPO and oracle baselines on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12652","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Rollout On-Policy Distillation via Peer Successes and Failures","primary_cat":"cs.LG","submitted_at":"2026-05-12T18:57:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11461","ref_index":35,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Breaking $\\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-12T03:20:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Text embeddings by weakly-supervised contrastive pre-training. 
arXiv preprint arXiv:2212.03533, 2022. [34] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266-95290, 2024. [35] Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why rlvr may or may not escape its origin. arXiv preprint arXiv:2507.14843, 2025. [36] Ting Wu, Xuefeng Li, and Pengfei Liu. Progress or regress? self-improvement reversal in post-training. arXiv preprint arXiv:2407.05013, 2024. [37] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie."},{"citing_arxiv_id":"2605.08817","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors","primary_cat":"cs.AI","submitted_at":"2026-05-09T09:10:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"IMAX trains soft prefixes with an InfoMax reward to drive diverse exploration in RLVR, yielding up to 11.60% gains in Pass@4 over standard RLVR across model scales.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RLVR paradigm suffers from limited exploration, which in turn influences reasoning capabilities. Existing approaches for exploration enhancement of RLVR typically intervene through the RL objective, reward design, or sampling procedure, aiming to encourage broader exploration on top of the base model prior [13, 49, 16, 29]. However, recent studies suggest that RL is strongly shaped by the base model's distribution [39, 48, 19].
This observation points to a complementary direction: rather than only modifying the RL process itself, one can also steer the model prior that RLVR builds on. In parallel, prompt-based adaptation has shown that learned prompts can effectively steer pretrained models by changing their generation context, thereby eliciting different behaviors"},{"citing_arxiv_id":"2605.07114","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-08T01:42:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HORA adaptively allocates rollouts using hit utility to improve Pass@K over compute-matched GRPO on math reasoning benchmarks while preserving Pass@1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15306","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generalization in LLM Problem Solving: The Case of the Shortest Path","primary_cat":"cs.AI","submitted_at":"2026-04-16T17:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14265","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning via Value Gradient Flow","primary_cat":"cs.LG","submitted_at":"2026-04-15T17:12:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VGF solves 
behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04066","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:34:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.","context_count":1,"top_context_role":"method","top_context_polarity":"baseline","context_text":"(Geometric Mean) Definition 2 (GMPO Objective). GMPO calculates the geometric mean of the importance sampling ratios over the sequence length, scaled by the advantage. This formulation fundamentally alters the weighting mechanism of individual token gradients compared to GRPO. $J_{\\mathrm{GMPO}}(\\theta) = \\frac{1}{G} \\sum_{i=1}^{G} \\underbrace{\\left( \\prod_{t=1}^{|o_i|} r_{i,t}(\\theta) \\right)^{\\frac{1}{|o_i|}}}_{U_i(\\theta)} \\cdot \\hat{A}_i$ (18) Proposition 2 (Global vs. Local Gradient Weighting). The gradient derivations reveal a structural divergence in weighting mechanisms. While GRPO assigns a local weight (i.e., $r_{i,t}$) to each gradient step, GMPO assigns a global weight (i.e., $U_i$) to every token in the sequence. Proof. We analyze the gradient contribution of a single sample $i$.
Applying the log-derivative trick"},{"citing_arxiv_id":"2605.04065","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07941","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning","primary_cat":"cs.CL","submitted_at":"2026-04-09T08:00:37+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Intuitively, E𝜖(𝜋; D𝑥) collects the state-action patterns that are behaviorally realized with non-negligible mass under policy 𝜋 and prompt distribution D𝑥. This is an operational rather than strictly measure-theoretic notion. Recent work has begun to operationalize reduced empirical proxies for support at the answer/completion level under finite sampling [68]. Although such proxies do not fully instantiate state-action support, they suggest that support-based analyses can admit measurable approximations in constrained settings.
In this survey, effective support is defined on the state-action space because deployment-relevant behavior depends jointly on reaching a state and taking the right action there."},{"citing_arxiv_id":"2604.00860","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Policy Improvement Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-01T13:10:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"most methods operate within isolated batches and overlook temporal consistency across policy updates. PIPO departs from this paradigm by introducing retrospective verification, which explicitly incorporates cross-iteration policy improvement signals. Exploration, Stability, and Policy Collapse. A well-known challenge in RLVR is policy collapse, where optimizing for Pass@1 degrades sample diversity and Pass@k performance [47, 56, 16, 8, 4, 3]. NSR [60] attributes this effect to excessive positive reinforcement and mitigates it by penalizing incorrect responses. Other lines of work encourage exploration through entropy regularization [17, 9, 12, 10], larger rollout budgets [22, 19], curriculum-based training schedules [5, 38, 43].
Alternative objectives, such as MaxRL [44], aim to preserve generative diversity by aligning"},{"citing_arxiv_id":"2509.25424","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Polychromic Objectives for Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-09-29T19:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.14234","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision","primary_cat":"cs.LG","submitted_at":"2025-09-17T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}