pith. sign in

super hub Canonical reference

Let's Verify Step by Step

Canonical reference. 81% of citing Pith papers cite this work as background.

275 Pith papers citing it
Background 81% of classified citations
abstract

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.

hub tools

citation-role summary

background 25 dataset 4 method 2

citation-polarity summary

claims ledger

  • abstract In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare the both methods. Recent work has already begun this comparison, bu

authors

co-cited works

clear filters

representative citing papers

VCT: A Verifiable Transcript System for LLM Conversations

cs.CR · 2026-06-22 · unverdicted · novelty 7.0

VCT abstracts non-linear LLM operations into authenticated state transitions via atomic Q&A hash chains, session Merkle roots, and account-level roots with joint signatures, plus protocols for deletions and concurrency detection.

A Verifiable Search Is Not a Learnable Chain-of-Thought

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Verifiable search procedures cannot be learned as forward chain-of-thought by language models; they instead learn memorization, verification, or require precomputed catalogs.

Agreement in Representation Space for Open-Ended Self-Consistency

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.

The Power of Test-Time Training for Approximate Sampling

cs.DS · 2026-06-09 · unverdicted · novelty 7.0

Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.