SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
hub
Right question is already half the answer: Fully unsupervised llm reasoning incentivization
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks plus cross-intersection generalization.
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
MedSSR improves LLM medical reasoning on rare diseases by up to 5.93% through knowledge-enhanced question synthesis and semi-supervised RL with self-generated pseudo-labels.
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.
citing papers explorer
-
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
-
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
-
OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks plus cross-intersection generalization.
-
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
-
Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach
MedSSR improves LLM medical reasoning on rare diseases by up to 5.93% through knowledge-enhanced question synthesis and semi-supervised RL with self-generated pseudo-labels.
-
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
-
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
-
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
-
Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.