hub

Right question is already half the answer: Fully unsupervised llm reasoning incentivization

Zhang, Q · 2025 · arXiv 2504.05812

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

cs.AI · 2026-03-30 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks plus cross-intersection generalization.

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MedSSR improves LLM medical reasoning on rare diseases by up to 5.93% through knowledge-enhanced question synthesis and semi-supervised RL with self-generated pseudo-labels.

ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

cs.SE · 2026-04-09 · unverdicted · novelty 6.0

ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.

Can LLMs Learn to Reason Robustly under Noisy Supervision?

cs.LG · 2026-04-05 · conditional · novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

cs.LG · 2025-05-21 · unverdicted · novelty 6.0

Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

cs.LG · 2026-05-20 · unverdicted · novelty 5.0

Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

cs.LG · 2026-02-13 · unverdicted · novelty 5.0

VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.

citing papers explorer

Showing 10 of 10 citing papers after filters.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media cs.CL · 2026-05-16 · unverdicted · none · ref 50
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 41 · 2 links
GCPO uses team-level credit assignment via determinant volume over reward-weighted semantic embeddings to promote non-redundant correct reasoning paths, improving both accuracy and diversity in LLM training.
OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control cs.AI · 2026-05-08 · unverdicted · none · ref 22
OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks plus cross-intersection generalization.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data cs.LG · 2026-04-20 · unverdicted · none · ref 16
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach cs.LG · 2026-04-13 · unverdicted · none · ref 8
MedSSR improves LLM medical reasoning on rare diseases by up to 5.93% through knowledge-enhanced question synthesis and semi-supervised RL with self-generated pseudo-labels.
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision? cs.SE · 2026-04-09 · unverdicted · none · ref 51
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning cs.LG · 2025-05-21 · unverdicted · none · ref 99
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization cs.LG · 2026-05-20 · unverdicted · none · ref 18
Identifies two gaps in entropy-based uncertainty for LLM post-training and proposes GCPO to align geometry-aware disagreement measures with reward-based calibration for better gradient regulation.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction cs.CL · 2026-05-07 · unverdicted · none · ref 45
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction cs.LG · 2026-02-13 · unverdicted · none · ref 36
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.

Right question is already half the answer: Fully unsupervised llm reasoning incentivization

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer