Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Measuring Mathematical Problem Solving With the MATH Dataset
Baseline reference. 54% of citing Pith papers use this work as a benchmark or comparison.
abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.
representative citing papers
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.
Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.
TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.
Prefix gain measured via student-model solve-rate improvement is used to train a Prefix Utility Model (PUM) that supplies stronger supervision than correctness-based process rewards for mathematical reasoning.
OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
Elmes* automates fine-grained rubric construction for LLM educational evaluation via multi-agent interactions and a self-evolving SceneGen module, producing the Edu-330 benchmark that demonstrates multidimensional differences in model teaching performance.
Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.
STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.
Conformal language modeling samples from posterior approximations conditioned on high-scoring regions to achieve risk control with higher utility than post-hoc filtering in open-ended text generation.
An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.
citing papers explorer
- Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
- SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
- MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.
- Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories
DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.
- Tandem Reinforcement Learning with Verifiable Rewards
TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
- Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
- Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.
- Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.
- Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
- AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp performance drops with increasing depth.
- Verifiable Counterfactual Supervision for Process Reward Models
The paper presents verifiable counterfactual process supervision, which generates annotated trajectories via template-aware error injection on symbolic chains and improves Best-of-8 reranking on logical reasoning benchmarks, with preliminary transfer to math.
- SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials
SciEval is a new benchmark of expert-annotated K-12 science lessons for LLM-based automatic evaluation, where zero-shot models perform poorly but fine-tuning yields up to 11% gains.
- Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
- Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
- Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementary to SFT.
- ABD: Default Exception Abduction in Finite First Order Worlds
ABD benchmark evaluates LLMs on producing parsimonious first-order exception formulas in three observation regimes using SMT verification, finding high validity but persistent parsimony and generalization gaps.
- EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
EngiBench shows that LLM accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
The WE-MATH benchmark reveals that most LMMs rely on rote memorization for visual math, while GPT-4o has shifted toward knowledge generalization.
- When LLMs Develop Languages: Symbolic Communication for Efficient Multi-Agent Reasoning
CLSR lets LLM agents evolve and route symbolic languages that reduce generated tokens by 3-6x versus chain-of-thought while keeping accuracy on benchmarks.
- DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling
DyCon dynamically controls reasoning depth in LRMs by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.
- The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Systematic tests of LLM contamination detectors across 27 models show frequent failures from distribution shift and scale, concluding statistical methods cannot replace transparent data provenance.
- Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and implicitly encouraging better cognitive behaviors.
- PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate
PEAR is a permutation-equivariant adaptive routing protocol for multi-agent LLM debate that reconfigures sparse topologies each round to improve accuracy over fixed debate baselines.
- Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Premature confidence in LLM chains of thought predicts flawed reasoning and is mitigated by progressive confidence shaping, a label-free RL objective that yields accuracy gains on arithmetic, math, and science tasks.
- CLORE: Content-Level Optimization for Reasoning Efficiency
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
- ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
ArborKV uses search-structure awareness to evict low-reuse KV states in Tree-of-Thoughts inference, delivering up to 4x memory savings with near-full accuracy retention.
- Beyond Fixed Budgets: Characterizing the Inelasticity and Limitations of Tree-of-Thought Reasoning Strategies
DPTS shows cold-start bottlenecks at low budgets while SSDP exhibits frontier depletion, indicating fixed ToT strategies are inelastic across compute levels.
- Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
- Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
- Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces
Shepherd provides a reversible execution trace substrate for LLM agents that enables meta-agents to inspect and transform runs, yielding reported gains on coding and terminal benchmarks via supervision, counterfactual repair, and RL credit assignment.
- Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation
Self-evolving LLM agents exhibit capability erosion under continual adaptation, which Capability-Preserving Evolution mitigates by raising retained simple-task performance from 41.8% to 52.8% in workflow evolution under GPT-5.1.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
OracleTSC introduces a reward hurdle and uncertainty regularization to stabilize LLM-based reinforcement learning for traffic signal control, delivering 75% lower travel time and 67% lower queue length on benchmarks plus cross-intersection generalization.
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.
- Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
- Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation
A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 benchmarks.
- Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
- When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
- ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
- Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
- Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
- FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
- SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning
SCALER creates adaptive synthetic environments for RL-based LLM reasoning training; training in these environments outperforms fixed-dataset baselines and shows more stable long-term progress.
- Differentiable Evolutionary Reinforcement Learning
DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.
- Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
UCAS refines RLVR advantage signals with a logit-space self-confidence proxy for response-level modulation and asymmetric token-level penalties based on raw logit certainty to boost exploration and reduce entropy collapse.
- Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
- League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.