Outcome-supervised verifiers for planning in mathematical reasoning.arXiv preprint arXiv:2311.09724

· 2023 · arXiv 2311.09724

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

cs.LG · 2026-03-13 · unverdicted · novelty 7.0

A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

Process Supervision of Confidence Margin for Calibrated LLM Reasoning

cs.LG · 2026-04-25 · unverdicted · novelty 6.0

RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

cs.LG · 2024-10-10 · unverdicted · novelty 6.0

Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

cs.CL · 2024-06-05 · conditional · novelty 6.0

OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

cs.AI · 2023-12-14 · conditional · novelty 6.0

Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

Introduces IBPO, a counterfactual credit assignment method that turns sparse terminal rewards into process-level advantage estimates for more stable LLM reasoning training.

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

cs.LG · 2026-02-16 · unverdicted · novelty 5.0

A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

citing papers explorer

Showing 8 of 8 citing papers.

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents cs.LG · 2026-03-13 · unverdicted · none · ref 34
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
Process Supervision of Confidence Margin for Calibrated LLM Reasoning cs.LG · 2026-04-25 · unverdicted · none · ref 82
RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning cs.LG · 2024-10-10 · unverdicted · none · ref 23
Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision cs.CL · 2024-06-05 · conditional · none · ref 24
OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations cs.AI · 2023-12-14 · conditional · none · ref 90
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
Reducing Credit Assignment Variance via Counterfactual Reasoning Paths cs.LG · 2026-04-20 · unverdicted · none · ref 11
Introduces IBPO, a counterfactual credit assignment method that turns sparse terminal rewards into process-level advantage estimates for more stable LLM reasoning training.
Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning cs.LG · 2026-02-16 · unverdicted · none · ref 34
A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 183
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Outcome-supervised verifiers for planning in mathematical reasoning.arXiv preprint arXiv:2311.09724

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer