{"total":15,"items":[{"citing_arxiv_id":"2605.12058","ref_index":23,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Holder Policy Optimisation","primary_cat":"cs.LG","submitted_at":"2026-05-12T12:45:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11403","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"fg-expo: Frontier-guided exploration-prioritized policy optimization via adaptive kl and gaussian curriculum","primary_cat":"cs.LG","submitted_at":"2026-05-12T01:48:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FG-ExPO improves GRPO by adaptively scaling the KL penalty with batch accuracy and sampling questions via a Gaussian centered at 0.5 accuracy, delivering up to 13.34 point gains on AIME 2025 pass@32.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09923","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"expo: Exploration-prioritized policy optimization via adaptive kl regulation and gaussian curriculum sampling","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:19:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EXPO improves GRPO for LLM mathematical reasoning via accuracy-conditioned KL scaling and Gaussian curriculum sampling, delivering gains such as 13.34 points on AIME 2025 
pass@32.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.28005","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-04-30T15:27:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"is closely related to the rapidly growing literature on LLM reasoning. On the statistics side, it connects to classical work on nonparametric estimation and modern work on RL. LLM reasoning. LLM reasoning methods can be broadly grouped into three categories: (1) prompting-based approaches; (2) inference-time 1 approaches that enhance reasoning through search; and (3) training-time approaches via alignment or RL. Early work falls into the first two categories. In particular, prompting-based methods such as chain-of-thought prompting guide LLMs to produce step-by-step reasoning in a manner similar to humans (Wei et al., 2022). This is often achieved by simple, magical prompts such as \"Let us think step by step.\" Under such instructions, the model generates a chain of thought that decomposes a"},{"citing_arxiv_id":"2605.04066","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adapt to Thrive! 
Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:34:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"sgn( ˆAi)is a constant scalar, the gradient is: ∇θJi(θ) =sgn( ˆAi)· ∇ θMp(θ)(50) Now we compute the gradient of the power-mean term Mp(θ). Let S(θ) = 1 |oi| P t(ϕi,t(θ))p, so Mp(θ) = (S(θ)) 1 p . We can derive that: ∇θMp(θ) = 1 p(S(θ)) 1 p −1 · ∇θS(θ) = 1 p(S(θ)) 1−p p · 1 |oi| X t h p(ϕi,t(θ))p−1 · ∇θϕi,t(θ) i = (Mp(θ))1−p · 1 |oi| X t h (ϕi,t(θ))p−1 · ∇θϕi,t(θ) i (51) H.3 Gradient ofϕ i,t(θ) Let Ui,t(θ) = min(r i,t(θ) ˆAi, ρi,t(θ) ˆAi), we can know that ϕi,t(θ) =|U i,t(θ)|. Using the chain rule and the subgradient of the absolute value function ( d|x| dx =sgn(x)), we get: ∇θϕi,t(θ) =sgn(U i,t(θ))· ∇ θUi,t(θ)(52) The term Ui,t(θ) is a minimum of two terms. 
Both ri,t(θ) and ρi,t(θ) are positive ratios close to 1."},{"citing_arxiv_id":"2605.04065","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08539","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"g., OCR, Regression) Dense (e.g., Grounding) [0.81, 0.81, 0.82, 0.82, 0.83] [0, 0, 1, 1, 1] [−0.51,−0.51,−0.51,−0.25,1.79] G²RPO (Ours) Previous Approaches: Imbalanced Gradient Update Advantages: Heavy Skewed, Strong Update [−0.32,−0.32,−0.32,−0.32,1.28] [−0.008,−0.008,+0.002,+0.002,+0.012] Symmetrical Update [−1.22,−1.22,0.81,0.81,0.81] [−0.9,−0.9,0.6,0.6,0.6]Ours DR.GRPO [−0.2,−0.2,−0.2,−0.2, 0.8] [−0.008,−0.008,0.02,0.02,0.02 [−0.9,−0.9,+0.26,0.26,1.28] Previous: e.g. 
DR.GRPO Ours Binary Reward (e.g., Math, MCQ) Ours: Gaussian, Symmetric, Balanced Update GRPO Jagged Gradient Continuous Update Long-tail Distribution Bi-modal Distribution Low-Variance Lucky Outlier Outlier Impacts, Catastrophic Update"},{"citing_arxiv_id":"2604.02507","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reinforcement Learning from Human Feedback: A Statistical Perspective","primary_cat":"stat.ML","submitted_at":"2026-04-02T21:04:17+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03043","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OneThinker: All-in-one Reasoning Model for Image and Video","primary_cat":"cs.CV","submitted_at":"2025-12-02T18:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10150","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective","primary_cat":"cs.LG","submitted_at":"2025-10-11T10:17:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Derives a token-level entropy change 
approximation revealing four factors, identifies limitations in prior entropy interventions, and proposes STEER which adaptively reweights tokens to mitigate collapse and improve performance on math and coding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25454","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search","primary_cat":"cs.AI","submitted_at":"2025-09-29T20:00:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepSearch embeds MCTS into RLVR training with global frontier selection, entropy guidance, and adaptive replay to achieve 62.95% average accuracy on math reasoning benchmarks while using 5.7x fewer GPU hours than extended training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.01944","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2025-09-02T04:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AutoDrive-R² adds four-step CoT reasoning with self-reflection to VLA models via SFT on nuScenesR²-6K and GRPO RL under spatial, dynamic, and smoothness rewards, reporting SOTA results on nuScenes and Waymo.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.15778","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for 
RLVR","primary_cat":"cs.CL","submitted_at":"2025-07-21T16:34:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Archer introduces response-level entropy normalization and differentiated clipping/KL regularization in RLVR to encourage exploration on reasoning tokens while stabilizing knowledge tokens, yielding gains in pass@1 and pass@K on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17086","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advancing Multi-Agent RAG Systems with Minimalist Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2025-05-20T18:33:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mujica-MyGo decomposes multi-turn RAG interactions via multi-agent workflows and applies minimalist policy gradient optimization to improve performance on QA benchmarks while avoiding long-context problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.10978","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Group-in-Group Policy Optimization for LLM Agent Training","primary_cat":"cs.LG","submitted_at":"2025-05-16T08:26:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"On relatively difficult tasks (such as Look, Pick2, and 
WebShop), standard-deviation scaling (Fnorm = std) could exaggerate gradients from overly difficult samples or highly imbalanced groups, harming update stability; fixing Fnorm = 1 therefore yields higher success. Yet, Fnorm = 1 offers no clear advantage on other tasks and both variants perform similarly, which aligns with findings in [72]. This suggests that Fnorm = std can still be beneficial when reward variance is stable. 5.3 Performance on QA tasks As shown in Table 2, GiGPO achieves strong and consistent gains on multi-turn search-augmented QA tasks, reaching 42.1% at 3B and 47.2% at 7B, and significantly outperforming prior strong baselines such as Search-R1 and StepSearch. Although search-augmented QA is relatively short-horizon, the"}],"limit":50,"offset":0}