From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

· 2026 · cs.AI · arXiv 2601.23048

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.

representative citing papers

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

cs.CL · 2026-05-19 · accept · novelty 7.0

LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.

citing papers explorer

Showing 1 of 1 citing paper.

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening cs.CL · 2026-05-19 · accept · none · ref 32 · internal anchor
LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

fields

years

verdicts

representative citing papers

citing papers explorer