Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt · 2021

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

ECC calibrates semantic embeddings with posterior model comparisons and Bradley-Terry capability profiles to create flexible, mixed-membership query clusters that improve LLM capability ranking.

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

cs.CL · 2026-03-09 · unverdicted · novelty 6.0

CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.

Unified Deployment-Aware Evaluation of Open Reasoning Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 4.0 · 2 refs

A controlled multi-model evaluation on shared data subsets shows that deployment metrics and prompting choices create important tradeoffs and alter model rankings beyond accuracy alone.

citing papers explorer

Showing 5 of 5 citing papers.

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards · cs.LG · 2026-05-20 · unverdicted · none · ref 10
Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models · cs.CL · 2026-05-17 · unverdicted · none · ref 43
Capturing LLM Capabilities via Evidence-Calibrated Query Clustering · cs.AI · 2026-05-16 · unverdicted · none · ref 16
CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning · cs.CL · 2026-03-09 · unverdicted · none · ref 12
Unified Deployment-Aware Evaluation of Open Reasoning Language Models · cs.CL · 2026-04-08 · unverdicted · none · ref 9 · 2 links
