pith. sign in

super hub Baseline reference

Measuring Mathematical Problem Solving With the MATH Dataset

Baseline reference. 54% of citing Pith papers use this work as a benchmark or comparison.

416 Pith papers citing it
Baseline 54% of classified citations
abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

hub tools

citation-role summary

dataset 47 background 32 method 1

citation-polarity summary

claims ledger

  • abstract Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are

authors

co-cited works

clear filters

representative citing papers

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.

Conformal Language Modeling via Posterior Sampling

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Conformal language modeling samples from posterior approximations conditioned on high-scoring regions to achieve risk control with higher utility than post-hoc filtering in open-ended text generation.

citing papers explorer

Showing 10 of 10 citing papers after filters.

  • GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 101 · internal anchor

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  • Let's Verify Step by Step cs.LG · 2023-05-31 · accept · none · ref 7 · internal anchor

    Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.

  • Gemini: A Family of Highly Capable Multimodal Models cs.CL · 2023-12-19 · conditional · none · ref 31 · internal anchor

    Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

  • Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations cs.AI · 2023-12-14 · conditional · none · ref 62 · internal anchor

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  • Towards Understanding Sycophancy in Language Models cs.CL · 2023-10-20 · conditional · none · ref 9 · internal anchor

    Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.

  • Reasoning with Language Model is Planning with World Model cs.CL · 2023-05-24 · unverdicted · none · ref 63 · internal anchor

    RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.

  • CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 43 · internal anchor

    CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

  • Mistral 7B cs.CL · 2023-10-10 · accept · none · ref 13 · internal anchor

    Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.

  • PaLM 2 Technical Report cs.CL · 2023-05-17 · unverdicted · none · ref 177 · internal anchor

    PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.

  • Baichuan 2: Open Large-scale Language Models cs.CL · 2023-09-19 · unverdicted · none · ref 30 · internal anchor

    Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.