Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
super hub Baseline reference
Measuring Mathematical Problem Solving With the MATH Dataset
Baseline reference. 54% of citing Pith papers use this work as a benchmark or comparison.
abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are
authors
co-cited works
representative citing papers
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
A nine-dimension algebraic complexity framework shows that LLMs suffer a scale-invariant working memory bottleneck, collapsing at 20-30 parallel branches regardless of parameter count from 8B to 235B.
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
MiniF2F is a new cross-system benchmark containing 488 Olympiad-level mathematics problems formalized in Metamath, Lean, Isabelle, and HOL Light, together with baseline results from a GPT-3-based prover.
Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.
ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.
TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
Heterogeneous agents achieve dense latent KV-cache communication via lightweight cross-model transformation and two-phase training, outperforming text at lower compute in context-aware settings and enabling context-unaware transfer.
ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.
SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.
EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.
INFRAMIND is an infrastructure-aware multi-agent orchestration framework that uses RL on a hierarchical constrained MDP to jointly optimize topology, model selection, and scheduling under dynamic load.
The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.
PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.
citing papers explorer
-
Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks
An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
Inference Cost Attacks for Retrieval-Augmented Large Language Models
Poisoning external knowledge bases with LLM-agent-crafted documents can increase RAG inference token consumption by up to 13.12 times at over 90% success rate while preserving answer quality.
-
You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation
NeWTral is a non-linear weight translation framework using MoE routing that reduces average attack success rate from 70% to 13% on unsafe domain adapters across Llama, Mistral, Qwen, and Gemma models up to 72B while retaining 90% knowledge fidelity.
-
Auditing Data Membership in Reinforcement Learning With Verifiable Rewards
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.
-
To trust or not to trust: Attention-based Trust Management for LLM Multi-Agent Systems
Introduces six-dimension trustworthiness definition and attention-based A-Trust score with a TMS to improve LLM-MAS robustness against malicious or unreliable messages.