Pith: machine review for the scientific record

arxiv: 2110.14168 · v2 · submitted 2021-10-27 · 💻 cs.LG · cs.CL

Recognition: unknown

Training Verifiers to Solve Math Word Problems

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.CL
keywords: performance · models · gsm8k · high · many · math · problems · test
Original abstract

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
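The test-time procedure the abstract describes is best-of-N selection under a learned verifier: sample many candidate solutions, score each one, and keep the top-ranked candidate. A minimal sketch in Python, assuming hypothetical sample and score callables that stand in for the finetuned generator and the trained verifier (the paper does not specify these interfaces):

    from typing import Callable

    def select_best(
        sample: Callable[[str], str],        # draws one candidate solution for a problem
        score: Callable[[str, str], float],  # verifier's correctness score for (problem, solution)
        problem: str,
        n: int = 100,
    ) -> str:
        # Generate n candidate solutions, then return the one the verifier ranks highest.
        candidates = [sample(problem) for _ in range(n)]
        return max(candidates, key=lambda sol: score(problem, sol))

Because the verifier only ranks completions rather than producing them, increasing n trades extra test-time compute for accuracy without retraining the generator.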

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work, sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 8.0

    HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...

  3. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  4. Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

    cs.LG 2026-05 accept novelty 8.0

    Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

  5. Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

    cs.CL 2026-05 unverdicted novelty 8.0

    Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.

  6. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  7. The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

    cs.LG 2026-05 unverdicted novelty 8.0

    Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budget...

  8. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  9. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  10. Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?

    cs.AI 2026-04 unverdicted novelty 8.0

    Agent^2 RL-Bench shows LLM agents can achieve large RL gains on some tasks like ALFWorld (5.97 to 93.28) via SFT and online rollouts but only marginal progress on others, with supervised pipelines dominating under fix...

  11. Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

    cs.CL 2026-04 unverdicted novelty 8.0

    A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

  12. Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

    cs.AI 2026-04 unverdicted novelty 8.0

    User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

  13. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  14. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  15. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  16. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  17. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    cs.CL 2023-10 conditional novelty 8.0

    DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

  18. PAL: Program-aided Language Models

    cs.CL 2022-11 conditional novelty 8.0

    PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

  19. Code as Policies: Language Model Programs for Embodied Control

    cs.RO 2022-09 accept novelty 8.0

    Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

  20. EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

    cs.CL 2026-05 conditional novelty 7.0

    EndPrompt induces reliable long-context generalization in LLaMA models from sparse positional supervision via a two-segment short-sequence construction with terminal anchoring.

  21. Uncertainty Quantification for Large Language Diffusion Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Uncertainty signals from LLDM denoising trajectories, including a proven lower bound on the training objective, approach sampling-based hallucination detection performance at up to 100x lower cost.

  22. Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

    cs.AI 2026-05 unverdicted novelty 7.0

    Cattle Trade benchmark shows heuristic code agents outperforming most LLMs in integrated strategic tasks like bidding, bluffing, and resource allocation across 242 games, with strategic coherence predicting rank bette...

  23. Dynamic Latent Routing

    cs.LG 2026-05 unverdicted novelty 7.0

    Dynamic Latent Routing jointly learns discrete latent codes, routing policies, and model parameters via dynamic search to match or exceed supervised fine-tuning by 6.6 points on average in low-data settings across fou...

  24. Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster ...

  25. BOOKMARKS: Efficient Active Storyline Memory for Role-playing

    cs.CL 2026-05 unverdicted novelty 7.0

    BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.

  26. Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    Mistletoe is a stealthy attack that collapses the speedup of speculative decoding by reducing average accepted length τ without changing output semantics or perplexity.

  27. Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

    cs.CL 2026-05 unverdicted novelty 7.0

    TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4...

  28. Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

    cs.CL 2026-05 unverdicted novelty 7.0

    Hallucination is detected as a transport-cost excursion in hidden-state trajectories, localized via contrastive PCA in a teacher model and distilled to a BiLSTM student.

  29. Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

    cs.LG 2026-05 conditional novelty 7.0

    TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.

  30. KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

    cs.DC 2026-05 conditional novelty 7.0

    KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

  31. Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...

  32. Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation

    cs.AR 2026-05 unverdicted novelty 7.0

    Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.

  33. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

  34. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.

  35. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

  36. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  37. From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

    cs.LG 2026-05 conditional novelty 7.0

    AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

  38. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  39. Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...

  40. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  41. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  42. From Noise to Diversity: Random Embedding Injection in LLM Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    Random Soft Prompts (RSPs) sampled from the embedding distribution improve Pass@N on reasoning benchmarks by increasing early-stage token diversity without any training.

  43. LOFT: Low-Rank Orthogonal Fine-Tuning via Task-Aware Support Selection

    cs.LG 2026-05 unverdicted novelty 7.0

    LOFT unifies orthogonal PEFT by treating adaptation as low-rank subspace rotation and adds task-aware support selection that improves efficiency under fixed budgets.

  44. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  45. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  46. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  47. Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    cs.CL 2026-05 unverdicted novelty 7.0

    A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

  48. Every Bit, Everywhere, All at Once: A Binomial Multibit LLM Watermark

    cs.CR 2026-05 unverdicted novelty 7.0

    A binomial multibit watermarking scheme encodes every payload bit at each LLM token with dynamic redirection, outperforming baselines in accuracy and robustness for large payloads.

  49. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  50. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  51. HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model

    cs.CL 2026-05 unverdicted novelty 7.0

    Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.

  52. EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

    cs.AI 2026-05 unverdicted novelty 7.0

    EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.

  53. The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

    cs.LG 2026-05 accept novelty 7.0

    Corruption studies on CoT chains detect the position of explicit answer statements rather than computational steps, as evidenced by format ablations collapsing suffix sensitivity 19x and models following conflicting a...

  54. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  55. Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LLMs rely on semantic cues for matrix-game equilibria but can acquire approximate computation via residual training on small instances, with a Lipschitz proof enabling transfer to larger anonymous games.

  56. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  57. MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

    cs.AI 2026-05 unverdicted novelty 7.0

    MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.

  58. GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

    cs.SI 2026-05 unverdicted novelty 7.0

    GraphInstruct is a progressive benchmark with six complexity levels for LLM graph generation that identifies multi-constraint composition as the hardest point and shows a verification-guided iterative framework outper...

  59. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  60. Entropy-informed Decoding: Adaptive Information-Driven Branching

    cs.LG 2026-05 unverdicted novelty 7.0

    EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fi...