GRPO's group-mean baseline assigns identical advantages to all tokens under output-only rewards, inducing gradient sparsity and an intrinsic rank-2 structure proven from the zero-sum constraint and confirmed by SVD on Nemotron-4B gradients.
super hub Canonical reference
Let's Verify Step by Step
Canonical reference. 81% of citing Pith papers cite this work as background.
abstract
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, bu
authors
co-cited works
representative citing papers
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6.7 points.
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.
A paired-image benchmark reveals that many MLLMs fail to update predictions when task-critical visual evidence changes, even when they answer individual images correctly.
Introduces CHARM framework that detects cascading hallucinations in agentic RAG at 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.
ChemCoTBench-V2 is a new rule-verifiable benchmark with 5,620 samples across 18 tasks that evaluates LLM chemical reasoning traces using deterministic chemistry rules and reference traces rather than final answers alone.
ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.
Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.
PEFT-Arena reveals distinct stability-plasticity profiles across PEFT methods, with orthogonal finetuning achieving the best Pareto frontier under comparable parameter budgets, supported by weight-space spectral and activation-space retention analyses.
ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.
Causal diagnosis identifies the routing module as bottleneck in LLM agents but prompt patching there degrades results due to linguistic co-adaptation, while upstream patching improves them.
In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
DCDM replaces positional blocks with learnable semantic chunks via differentiable Chunking Attention, yielding consistent gains over block and unstructured diffusion baselines up to 1.5B parameters.
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Hallucination is detected as a transport-cost excursion in hidden-state trajectories, localized via contrastive PCA in a teacher model and distilled to a BiLSTM student.
RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-training while staying competitive on reasoning and coding.
citing papers explorer
No citing papers match the current filters.