Reinforcing general reasoning without verifiers

· 2025 · arXiv 2505.21493

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

G-Zero: Self-Play for Open-Ended Generation from Zero Data

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

cs.LG · 2025-09-17 · unverdicted · novelty 6.0

Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.

Trust Region On-Policy Distillation

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

cs.LG · 2026-02-13 · unverdicted · novelty 5.0

VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

cs.AI · 2025-11-13 · unverdicted · novelty 5.0

A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reasoning benchmarks.

Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

cs.CL · 2025-06-16 · unverdicted · novelty 5.0

Direct Reasoning Optimization applies token-level Reasoning Reflection Reward (R3) focused on high-variance tokens and rubric-gating constraints to improve sample-efficient RL training of LLMs on unverifiable tasks.

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

cs.AI · 2025-03-31 · unverdicted · novelty 2.0

This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

citing papers explorer

Showing 11 of 11 citing papers.

Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities cs.LG · 2026-05-11 · unverdicted · none · ref 41
Presents a likelihood-based benchmark for equation-suffix prediction in technical papers with controls to detect shortcut vulnerabilities in model forecasts.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 187
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
G-Zero: Self-Play for Open-Ended Generation from Zero Data cs.LG · 2026-05-11 · unverdicted · none · ref 39
G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data cs.LG · 2026-04-20 · unverdicted · none · ref 11
A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision cs.LG · 2025-09-17 · unverdicted · none · ref 33
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
Trust Region On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 31
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts cs.LG · 2026-05-30 · unverdicted · none · ref 7
CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction cs.LG · 2026-02-13 · unverdicted · none · ref 38
VI-CuRL stabilizes verifier-independent RL for LLM reasoning via confidence-guided curriculum that reduces action and problem variance, with a claimed proof of asymptotic unbiasedness and empirical gains over baselines.
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis cs.AI · 2025-11-13 · unverdicted · none · ref 21
A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reasoning benchmarks.
Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks cs.CL · 2025-06-16 · unverdicted · none · ref 36
Direct Reasoning Optimization applies token-level Reasoning Reflection Reward (R3) focused on high-variance tokens and rubric-gating constraints to improve sample-efficient RL training of LLMs on unverifiable tasks.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems cs.AI · 2025-03-31 · unverdicted · none · ref 167
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

Reinforcing general reasoning without verifiers

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer