Toward Automated Robustness Evaluation of Mathematical Reasoning

Fei Yu; Guanhua Chen; Hailiang Huang; Ma Shuguang; Yihan Jiang; Yun Chen; Yutao Hou; Zeguan Xiao; Zhaoqian Dai

arxiv: 2506.05038 · v2 · submitted 2025-06-05 · 💻 cs.CL

Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Ma Shuguang, Zhaoqian Dai, Hailiang Huang, Yun Chen, Guanhua Chen

Pith reviewed 2026-05-19 11:06 UTC · model grok-4.3

keywords mathematical reasoning · robustness evaluation · adversarial testing · large language models · automated framework · GSM8K · MATH-500 · fine-tuning

The pith

MaSTer generates adversarial variants of math problems via a rewrite-verify loop to test LLM robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MaSTer as an automated framework to evaluate the robustness of large language models on mathematical reasoning tasks. Current methods use fixed templates that may not catch model-specific weaknesses and risk data contamination. MaSTer instead dynamically creates new problem variants for each model using repeated rewriting and verification steps that keep the core meaning intact. Tests on GSM8K and MATH-500 show it can induce failures, and using these variants for fine-tuning improves model performance on similar tasks. The approach also works for non-math problems.

Core claim

MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the effectiveness of MaSTer on mathematical tasks. Additionally, we validate the framework's extensibility to non-mathematical tasks. Furthermore, we demonstrate that the synthesized variants generated by MaSTer can be utilized as a fine-tuning dataset to significantly enhance the model's robustness.

What carries the argument

The multi-round rewrite-verify loop within the Math Stress Tester (MaSTer) framework, which rewrites problems and verifies semantic equivalence to produce adversarial examples.
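
The control flow of such a loop can be pictured with a minimal Python sketch. Everything named here is an assumption for illustration, not the authors' implementation: `rewrite`, `verify`, and `solve` stand in for LLM calls, `max_rounds` is a hypothetical budget, and answers are compared by exact string match.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Variant:
    text: str
    rounds: int  # number of rewrite rounds used to reach this variant


def stress_test(
    problem: str,
    answer: str,
    rewrite: Callable[[str], str],       # hypothetical: proposes a reworded problem
    verify: Callable[[str, str], bool],  # hypothetical: True if meaning is preserved
    solve: Callable[[str], str],         # hypothetical: the model under test
    max_rounds: int = 5,                 # assumed budget, not taken from the paper
) -> Optional[Variant]:
    """Return the first verified variant the model answers incorrectly, else None."""
    current = problem
    for round_idx in range(1, max_rounds + 1):
        candidate = rewrite(current)
        # Always verify against the ORIGINAL problem so drift cannot accumulate.
        if not verify(problem, candidate):
            continue  # reject this rewrite and try again from the same text
        if solve(candidate) != answer:
            return Variant(candidate, round_idx)  # verified and failure-inducing
        current = candidate  # model still succeeds; keep rewriting from here
    return None
```

Verifying each candidate against the original problem, rather than against the previous round's text, is a design choice in this sketch: it keeps small semantic shifts from compounding across rounds.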

If this is right

LLMs can be evaluated on model-specific adversarial variants that avoid data contamination issues.
Synthesized adversarial problems serve as effective datasets for fine-tuning to boost robustness.
The framework extends beyond mathematical reasoning to other task types.
Dynamic generation allows probing latent vulnerabilities unique to specific models.
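
As a rough illustration of the fine-tuning implication above, verified variants could be packed into a prompt/completion file. This is only a sketch under stated assumptions: the record keys `variant` and `answer`, the output keys `prompt` and `completion`, and the JSON Lines layout are all hypothetical and not taken from the paper.

```python
import json
from typing import Iterable, Mapping


def to_finetune_jsonl(variants: Iterable[Mapping[str, str]]) -> str:
    """Serialize verified variants as JSON Lines, one prompt/completion pair per line."""
    lines = []
    seen = set()
    for record in variants:
        prompt = record["variant"].strip()  # hypothetical input key
        if not prompt or prompt in seen:
            continue  # skip empty or duplicate variants
        seen.add(prompt)
        row = {"prompt": prompt, "completion": record["answer"]}  # hypothetical schema
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)
```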

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeated application of MaSTer could create adaptive benchmarks that evolve with model capabilities.
Integration with human review might address cases where automated verification overlooks subtle semantic shifts.
Success here suggests similar stress-testing loops could apply to other AI evaluation domains like coding or logic.

Load-bearing premise

The automated verification step can reliably confirm semantic equivalence between the original problem and each rewritten variant without introducing or missing meaningful changes.

What would settle it

An expert review of the generated variants finding that a substantial portion contain unintended changes in meaning or difficulty that the verification step failed to detect.
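
One way to quantify such an audit is to put a confidence interval on the share of audited variants that experts flag as non-equivalent. The sketch below uses the standard Wilson score interval; the function name, the sampling design, and the default 95% level (`z = 1.96`) are assumptions for illustration, not a procedure described in the paper.

```python
import math


def wilson_interval(flagged: int, audited: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion of audited variants flagged by experts."""
    if audited <= 0:
        raise ValueError("audited must be positive")
    if not 0 <= flagged <= audited:
        raise ValueError("flagged must be between 0 and audited")
    p = flagged / audited
    denom = 1 + z**2 / audited
    center = (p + z**2 / (2 * audited)) / denom
    half = z * math.sqrt(p * (1 - p) / audited + z**2 / (4 * audited**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

A lower bound well above zero would indicate that the verifier lets through a meaningful share of altered problems; an upper bound near zero would support the load-bearing premise.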

Original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the effectiveness of MaSTer on mathematical tasks. Additionally, we validate the framework's extensibility to non-mathematical tasks, highlighting its broad applicability. Furthermore, we demonstrate that the synthesized variants generated by MaSTer can be utilized as a fine-tuning dataset to significantly enhance the model's robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MaSTer, an automated framework for robustness evaluation of LLMs on mathematical reasoning. It generates adversarial problem variants via a multi-round rewrite-verify loop intended to preserve semantic consistency while inducing failures, dynamically per model to avoid contamination. Experiments on GSM8K and MATH-500 are said to demonstrate effectiveness, with the generated variants usable as fine-tuning data to improve robustness; the framework is also shown to extend to non-mathematical tasks.

Significance. If the verification step reliably ensures that variants preserve the original solution and difficulty, MaSTer would provide a scalable, model-adaptive alternative to hand-crafted templates for probing LLM brittleness. This could reduce reliance on limited perturbation rules and support practical robustness gains through fine-tuning on synthesized examples.

major comments (2)

[framework description / rewrite-verify loop] The central claim that generated variants are valid adversarial tests of the same underlying reasoning task rests on automated verification confirming semantic equivalence. The manuscript provides no mechanistic details on the verifier implementation, such as the prompts or model employed, the equivalence criteria (e.g., final-answer matching versus step-wise reasoning comparison), or procedures for handling false positives/negatives. Without these, it is unclear whether subtle changes to numerical relations, variable scoping, or implicit assumptions are reliably detected, which directly undermines both the robustness evaluation results and the downstream fine-tuning claim.
[Experiments on GSM8K and MATH-500] The abstract states that experiments demonstrate effectiveness and that variants improve robustness via fine-tuning, yet the provided description contains no quantitative metrics, baseline comparisons, or statistical details on failure induction rates or post-fine-tuning gains. This leaves the effectiveness claims under-supported and makes it difficult to assess whether the framework outperforms existing perturbation-based methods.

minor comments (1)

The abstract claims 'significantly enhance' robustness without accompanying numbers or effect sizes; adding these (or cross-references to tables) would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and clarify our revisions to the manuscript.

Referee: [framework description / rewrite-verify loop] The central claim that generated variants are valid adversarial tests of the same underlying reasoning task rests on automated verification confirming semantic equivalence. The manuscript provides no mechanistic details on the verifier implementation, such as the prompts or model employed, the equivalence criteria (e.g., final-answer matching versus step-wise reasoning comparison), or procedures for handling false positives/negatives. Without these, it is unclear whether subtle changes to numerical relations, variable scoping, or implicit assumptions are reliably detected, which directly undermines both the robustness evaluation results and the downstream fine-tuning claim.

Authors: We agree that additional implementation details are necessary for full reproducibility and to substantiate the semantic equivalence claim. The original manuscript described the multi-round rewrite-verify loop at a high level in Section 3. In the revision, we will expand this section to specify: (1) the exact prompts for the rewrite and verify stages, (2) the verifier model (GPT-4o), (3) the equivalence criteria combining exact final-answer matching with an automated step-wise reasoning consistency check, and (4) our handling of false positives/negatives via sampling and manual review of discrepant cases. These additions directly address concerns about detecting subtle semantic shifts.
revision: yes

Referee: [Experiments on GSM8K and MATH-500] The abstract states that experiments demonstrate effectiveness and that variants improve robustness via fine-tuning, yet the provided description contains no quantitative metrics, baseline comparisons, or statistical details on failure induction rates or post-fine-tuning gains. This leaves the effectiveness claims under-supported and makes it difficult to assess whether the framework outperforms existing perturbation-based methods.

Authors: The experiments section reports quantitative results including per-model failure induction rates on GSM8K and MATH-500, accuracy drops on adversarial variants, and post-fine-tuning robustness gains. However, we acknowledge that baseline comparisons and statistical details could be strengthened. In the revision, we will add explicit comparisons to prior perturbation methods (e.g., template-based and rule-based approaches), include statistical significance tests with confidence intervals, and report specific metrics such as average accuracy recovery after fine-tuning. Updated tables and figures will be included to better support the effectiveness claims.
revision: yes
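
For readers who want a concrete handle on the metrics named in this exchange, the sketch below computes them from paired per-problem outcomes. The definitions are assumptions, since the page does not give the paper's formulas: here the failure induction rate is taken to be the share of originally solved problems whose variant is answered incorrectly, and the accuracy drop is original accuracy minus variant accuracy.

```python
from typing import Sequence


def robustness_metrics(
    orig_correct: Sequence[bool],     # per-problem outcome on the original text
    variant_correct: Sequence[bool],  # per-problem outcome on the adversarial variant
) -> dict[str, float]:
    """Summarize paired outcomes; metric definitions are illustrative assumptions."""
    if len(orig_correct) != len(variant_correct) or not orig_correct:
        raise ValueError("need two non-empty, equal-length outcome lists")
    n = len(orig_correct)
    acc_orig = sum(orig_correct) / n
    acc_var = sum(variant_correct) / n
    # Restrict to problems the model solved originally, then count induced failures.
    solved = [v for o, v in zip(orig_correct, variant_correct) if o]
    induced = sum(1 for v in solved if not v)
    fir = induced / len(solved) if solved else 0.0
    return {
        "acc_original": acc_orig,
        "acc_variant": acc_var,
        "accuracy_drop": acc_orig - acc_var,
        "failure_induction_rate": fir,
    }
```

Conditioning the failure induction rate on originally solved problems separates failures caused by the rewrite from problems the model could not solve in the first place.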

Circularity Check

0 steps flagged

No circularity in methodological framework or results

The paper introduces MaSTer as a generative framework using a multi-round rewrite-verify loop to create adversarial variants of math problems, with effectiveness shown via experiments on external benchmarks GSM8K and MATH-500. No equations, fitted parameters, or results are presented where a claimed prediction or outcome reduces by construction to quantities defined inside the paper. The verification process is described as part of the automated method and evaluated empirically rather than derived tautologically. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to force the central claims. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the reliability of automated semantic verification and the assumption that induced failures are both meaningful and useful for fine-tuning; no free parameters or new entities are introduced.

axioms (1)

domain assumption Automated verification can accurately determine semantic consistency for rewritten math problems without human oversight.
Invoked in the description of the multi-round rewrite-verify loop to ensure variants remain valid tests.

pith-pipeline@v0.9.0 · 5735 in / 1401 out tokens · 77851 ms · 2026-05-19T11:06:25.026502+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Robust Reasoning Benchmark
cs.LG 2026-03 unverdicted novelty 7.0

Perturbations to math problem text cause up to 55% average accuracy drops in open-weight LLMs and sequential solving reveals context pollution in attention mechanisms.

Robust Reasoning Benchmark
cs.LG 2026-03 unverdicted novelty 7.0

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems d...