arxiv: 2505.23851 · v2 · pith:OFSTEXT7new · submitted 2025-05-28 · 💻 cs.CL · cs.AI · cs.SC

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

classification 💻 cs.CL cs.AI cs.SC
keywords asymob symbolic models integration llms performance problems

Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, ASyMOB systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. ASyMOB serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.
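
To make the idea of perturbing a seed problem concrete, here is a minimal Python sketch. It is not the authors' pipeline: the seed integrand, the function names (`seed`, `numeric_variant`, `equivalent_variant`, `agree`), and the specific transformations are all hypothetical illustrations. It only shows the difference between a numeric perturbation, which changes the problem's answer, and an equivalence-preserving one, which changes the problem's surface form but not its value.

```python
import math
import random

# Hypothetical seed problem: the integrand f(x) = x * exp(x).
def seed(x):
    return x * math.exp(x)

# Numeric perturbation (assumed form): scale by a numeric coefficient.
# This changes the value of the integrand, and hence the answer.
def numeric_variant(x, a=3.0):
    return a * x * math.exp(x)

# Equivalence-preserving perturbation (assumed form): multiply by
# sin(x)^2 + cos(x)^2, which equals 1, so the value is unchanged
# even though the expression looks different.
def equivalent_variant(x):
    return seed(x) * (math.sin(x) ** 2 + math.cos(x) ** 2)

# Check numerically whether two functions agree at random sample points.
def agree(f, g, trials=100, tol=1e-9, rng_seed=0):
    rng = random.Random(rng_seed)
    for _ in range(trials):
        x = rng.uniform(-2.0, 2.0)
        if not math.isclose(f(x), g(x), rel_tol=tol, abs_tol=tol):
            return False
    return True
```

Under these assumptions, `agree(seed, equivalent_variant)` holds while `agree(seed, numeric_variant)` does not, which is the distinction a robustness benchmark of this kind would rely on: a model that answers the seed correctly should give the same answer for the equivalent variant and a correspondingly changed answer for the numeric one.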

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

  2. Robust Reasoning Benchmark

    cs.LG 2026-03 unverdicted novelty 7.0

    The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems d...

  3. SciML Agents: Write the Solver, Not the Solution

    cs.LG 2025-09 unverdicted novelty 7.0

    LLMs prompted with domain knowledge can generate runnable, numerically valid code for stiff and non-stiff ODEs on new diagnostic and 1000-task benchmarks.