super hub Baseline reference

Measuring Mathematical Problem Solving With the MATH Dataset

Akul Arora, Collin Burns, Dan Hendrycks, Eric Tang, Saurav Kadavath, Steven Basart · 2021 · cs.LG · arXiv 2103.03874

Baseline reference. 54% of citing Pith papers use this work as a benchmark or comparison.

378 Pith papers citing it

Baseline 54% of classified citations

open full Pith review browse 378 citing papers more from Akul Arora arXiv PDF

abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 47 background 32 method 1

citation-polarity summary

use dataset 43 background 32 unclear 4 use method 1

claims ledger

abstract Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are

authors

Akul Arora Collin Burns Dan Hendrycks Eric Tang Saurav Kadavath Steven Basart

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

cs.CL · 2026-04-08 · unverdicted · novelty 8.0

A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

cs.AI · 2026-03-30 · conditional · novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

cs.AI · 2021-08-31 · accept · novelty 8.0

MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.

When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

Compositional Generalization in Autoregressive Models via Logit Composition

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Logit composition of autoregressive models is projective under factorized conditionals, preserved under smooth reparameterizations, and maintains length generalization when assumptions hold uniformly.

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

ATLAS traces RLVR data to 20 atomic sources, most datasets are variants, and DAPO++ curated with SCA improves RLVR performance while Q predicts training effectiveness.

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

The paper introduces a multi-turn interactive benchmark using 474 executable games to evaluate LLMs on evidence acquisition, belief updating, contextual robustness, and metacognitive adaptation, revealing large performance gaps and sensitivity to perturbations.

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

cs.LG · 2026-05-20 · conditional · novelty 7.0

X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.

citing papers explorer

Showing 36 of 36 citing papers after filters.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data q-fin.CP · 2026-04-03 · conditional · none · ref 11 · internal anchor
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology cs.AI · 2026-03-30 · conditional · none · ref 6 · internal anchor
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
Training Software Engineering Agents and Verifiers with SWE-Gym cs.SE · 2024-12-30 · conditional · none · ref 3 · internal anchor
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation cs.LG · 2026-05-20 · conditional · none · ref 15 · internal anchor
X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.
Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · conditional · none · ref 7 · 2 links · internal anchor
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 29 · 2 links · internal anchor
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models cs.CL · 2026-04-01 · conditional · none · ref 8 · internal anchor
S0 tuning optimizes initial recurrent states in hybrid models to outperform LoRA with zero inference cost on HumanEval and partial cross-domain transfer.
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese cs.CL · 2026-04-01 · conditional · none · ref 1 · internal anchor
Math-PT provides 1,729 native Portuguese math problems and shows frontier LLMs perform well on multiple-choice but drop on figures and open-ended items.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 19 · internal anchor
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning cs.CL · 2025-10-10 · conditional · none · ref 12 · internal anchor
DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.
Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework cs.CL · 2025-09-25 · conditional · none · ref 9 · internal anchor
Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeoff may be an artifact of task-agnostic measurement.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models cs.CL · 2024-10-10 · conditional · none · ref 59 · internal anchor
Omni-MATH supplies 4428 human-verified Olympiad math problems that expose top LLMs achieving only 52.55% to 60.54% accuracy on the most difficult items.
Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence cs.CL · 2026-05-30 · conditional · none · ref 32 · internal anchor
Parameter-based knowledge editing in LLMs induces reasoning collapse via dimensional collapse and is consistently outperformed by a retrieval baseline across varied edit counts, knowledge complexity, and evaluation metrics.
ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning cs.LG · 2026-05-20 · conditional · none · ref 37 · internal anchor
ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models cs.CL · 2026-05-04 · conditional · none · ref 9 · internal anchor
AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 50 · internal anchor
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
Can LLMs Learn to Reason Robustly under Noisy Supervision? cs.LG · 2026-04-05 · conditional · none · ref 9 · internal anchor
Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning benchmarks even at high noise levels.
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus cs.LG · 2026-04-04 · conditional · none · ref 8 · internal anchor
LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
CRISP: Compressed Reasoning via Iterative Self-Policy Distillation cs.LG · 2026-03-05 · conditional · none · ref 9 · internal anchor
CRISP achieves 57-59% token reduction on MATH-500 with 9-16 point accuracy gains on Qwen3 models via iterative self-distillation of concise reasoning behavior.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B cs.LG · 2025-12-10 · conditional · none · ref 13 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
Diffusion Language Models Know the Answer Before Decoding cs.CL · 2025-08-27 · conditional · none · ref 8 · internal anchor
DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
Process Reinforcement through Implicit Rewards cs.LG · 2025-02-03 · conditional · none · ref 136 · internal anchor
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 32 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving cs.CV · 2024-10-29 · conditional · none · ref 26 · internal anchor
Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine-tuning on nuScenes.
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs cs.LG · 2024-06-26 · conditional · none · ref 7 · internal anchor
Step-DPO performs preference optimization on individual reasoning steps rather than complete answers, producing nearly 3% accuracy gains on MATH for 70B+ parameter models with 10K preference pairs.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark cs.CL · 2024-06-03 · conditional · none · ref 19 · internal anchor
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 20 · internal anchor
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Gemini: A Family of Highly Capable Multimodal Models cs.CL · 2023-12-19 · conditional · none · ref 31 · internal anchor
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations cs.AI · 2023-12-14 · conditional · none · ref 62 · internal anchor
Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.
Towards Understanding Sycophancy in Language Models cs.CL · 2023-10-20 · conditional · none · ref 9 · internal anchor
Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 43 · internal anchor
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
Training Verifiers to Solve Math Word Problems cs.LG · 2021-10-27 · conditional · none · ref 4 · internal anchor
Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs cs.CL · 2026-05-08 · conditional · none · ref 62 · 2 links · internal anchor
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
Early Stopping Chain-of-thoughts in Large Language Models cs.CL · 2025-09-17 · conditional · none · ref 6 · internal anchor
ES-CoT shortens LLM chain-of-thought generation by tracking runs of identical step answers after linguistic markers, cutting tokens 16% on average while keeping accuracy comparable to full CoT across six datasets and three models.
Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O Bryan Mathematics Competition cs.LG · 2026-06-30 · conditional · none · ref 5 · internal anchor
Distilling CoT from DeepSeek-R1 to Qwen2.5-7B on competition problems yields 4.76 pp accuracy gain to 69.43% and 73.1% on MATH-500, with accuracy falling as response length decreases.
When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems physics.ed-ph · 2025-08-05 · conditional · none · ref 59 · internal anchor
An exploratory study finds that a curated core of automated checks on learner-initiated AI-generated physics problems aligns with expert judgments and student preferences.

Measuring Mathematical Problem Solving With the MATH Dataset

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer