Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Bishal Lakha; Edoardo Serra; Francesco Gullo; Malia Barker

arxiv: 2606.03606 · v2 · pith:F5M6AVAUnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Malia Barker , Bishal Lakha , Edoardo Serra , Francesco Gullo This is my paper

Pith reviewed 2026-06-28 09:25 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM robustnessarithmetic reasoningnumeric remappingGSM8Kadversarial attacksword problemsreasoning generalization

0 comments

The pith

Numeric-remapping attacks that keep the original reasoning program drop LLM accuracy on GSM8K by 12 to 26 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an automatic method to create numeric-remapping attacks on arithmetic word problems. These attacks change the numbers in a problem while preserving the exact reasoning steps and recomputing the correct answers, without relying on manual templates. When applied to models like DeepSeek-R1, Gemma4, and GPT-OSS, the attacks cause large conditional accuracy drops on GSM8K but leave shorter datasets like MAWPS and MultiArith nearly unaffected. This shows that apparent robustness on some benchmarks does not extend to longer, more varied problems even when the underlying logic stays identical.

Core claim

An automatic pipeline that derives symbolic representations of problems, generates constrained numeric remappings, recomputes gold answers, and applies deterministic edits via LLM-generated plans produces reliable attacks; on GSM8K these attacks reduce conditional accuracy by 12.16 to 25.82 points for the tested models while MAWPS and MultiArith remain stable near or above 98 percent.

What carries the argument

The automatic numeric-remapping attack generator that produces problem-specific symbolic forms, constrained remappings, recomputed answers, and LLM-guided deterministic edits followed by stage-wise validation.

If this is right

GSM8K remains sensitive to numeric variation even when reasoning programs are preserved and answers are recomputed.
Shorter and more regular datasets like MAWPS and MultiArith show near-perfect stability under the same attacks.
Trustworthy direct reasoning on word problems requires robustness to small numeric changes that preserve logic.
Dataset structure strongly determines how much numeric variation affects model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the attacks scale to other arithmetic benchmarks, current evaluation practices may systematically overestimate generalization.
Models that pass short regular problems may still fail on longer problems that require the same logic with varied numbers.
Future robustness tests could incorporate automatic remapping as a standard check rather than relying only on original problem sets.

Load-bearing premise

The generated numeric changes and edit plans keep the original reasoning steps exactly the same and yield valid new answers without creating ambiguities or calculation mistakes.

What would settle it

A manual audit of a sample of the generated attacks that finds any remapping where the reasoning program changes or the recomputed answer is incorrect.

Figures

Figures reproduced from arXiv: 2606.03606 by Bishal Lakha, Edoardo Serra, Francesco Gullo, Malia Barker.

**Figure 2.** Figure 2: Numeric-remapping attack-generation runtime by dataset on the ZGX Nano and GH200 [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗

**Figure 3.** Figure 3: Baseline evaluation runtime for DeepSeek-R1 (70B). [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: Baseline evaluation runtime for Gemma4 (31B). [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Baseline evaluation runtime for GPT-OSS (120B). [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

read the original abstract

Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustworthy models should solve small-number arithmetic word problems without external tools. Prior work shows that LLMs are sensitive to numerical variation: a model may solve an original problem but fail on structurally similar variants requiring the same reasoning procedure with different numbers. We ask whether this fragility persists under a stricter setting involving small, schema-preserving numeric changes that retain the original reasoning program and avoid large-number stress tests. We introduce an automatic algorithm for generating numeric-remapping attacks on arithmetic word problems. Unlike template-based perturbation methods requiring manual schemas or constraints, our approach derives problem-specific symbolic representations, generates constrained numeric remappings, recomputes gold answers, and realizes transformed questions through deterministic edits guided by LLM-generated edit plans. Stage-wise validation and a high-confidence audit retain reliable attacks, making the pipeline scalable with limited human intervention. We evaluate DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on GSM8K, MAWPS, and MultiArith. On GSM8K, completed runs show conditional accuracy drops of 12.16 to 25.82 percentage points. MAWPS and MultiArith are far more stable, with most attacked accuracies near or above 98%. These results show that numeric-remapping robustness depends strongly on dataset structure: GSM8K remains sensitive even when reasoning programs are preserved and answers are recomputed, while shorter, more regular datasets are more robust.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The automated numeric-remapping pipeline is a concrete technical step beyond template methods, but the GSM8K drops rest on unshown audit details that need to be checked in the full text.

read the letter

The main takeaway is an automated way to generate numeric variants of arithmetic word problems that keep the original reasoning program intact. On GSM8K it produces 12-26 point accuracy drops for the three models, while MAWPS and MultiArith stay near ceiling.

What is actually new is the pipeline itself: it derives a problem-specific symbolic representation, samples constrained numeric remappings, recomputes the gold answers, then applies deterministic edits from LLM-generated plans followed by stage-wise validation and audit. This removes the manual schema work that earlier perturbation papers required, and the approach scales with limited human effort.

The dataset comparison is also useful. It shows that numeric sensitivity is not uniform and depends on how structured and regular the problems are.

The soft spot is exactly the one the stress-test note flags. The headline numbers only hold if the retained attacks genuinely preserve the reasoning program and produce error-free gold labels. The abstract describes the audit but supplies no rejection rates, no failure-mode examples, and no comparison to random numeric changes. If those details are thin in the full paper, the drops could be overstated.

This is for readers working on LLM robustness evaluation or benchmark construction. Someone already running arithmetic tests would get practical value from the generation method.

It deserves peer review. The core idea is new enough and the empirical contrast across datasets is clear enough that referees should see the full validation numbers and any additional controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces an automatic pipeline for generating numeric-remapping attacks on arithmetic word problems. The method derives symbolic representations, samples constrained remappings, recomputes gold answers, and applies LLM-guided deterministic edits, followed by stage-wise validation and a high-confidence audit to retain only reliable attacks that preserve the original reasoning program. Experiments on DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) across GSM8K, MAWPS, and MultiArith report conditional accuracy drops of 12.16–25.82 pp on GSM8K while the shorter datasets remain near or above 98% accuracy, concluding that numeric-remapping robustness is dataset-dependent even under preserved reasoning.

Significance. If the attack validity holds, the result provides evidence that LLM arithmetic reasoning remains brittle to small numeric changes that preserve reasoning structure and gold labels, with clear dataset-structure dependence. The automatic, scalable pipeline (symbolic derivation + constrained remapping + audit) is a methodological strength over manual template methods.

major comments (2)

[Abstract / pipeline description] Abstract and pipeline description: the central claim of 12–26 pp drops on GSM8K is interpretable only if retained attacks preserve the reasoning program and produce error-free gold answers. No quantitative audit statistics (fraction filtered, failure modes, inter-rater protocol, or examples of rejected cases) are reported, leaving open the possibility that residual edit errors or introduced ambiguities drive the observed drops.
[Evaluation / results] Evaluation section: the reported conditional accuracy drops lack error bars, details on the number of attacks retained per model/dataset after the high-confidence audit, and a baseline comparison against random numeric perturbations that do not preserve reasoning. These omissions make it difficult to assess whether the drops exceed what would be expected from any numeric change.

minor comments (1)

[Abstract] The abstract states 'completed runs' without clarifying how many problems were ultimately evaluated after filtering; this should be stated explicitly with counts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional transparency would strengthen the paper. We address both major comments below and will incorporate the suggested additions in the revision.

read point-by-point responses

Referee: [Abstract / pipeline description] Abstract and pipeline description: the central claim of 12–26 pp drops on GSM8K is interpretable only if retained attacks preserve the reasoning program and produce error-free gold answers. No quantitative audit statistics (fraction filtered, failure modes, inter-rater protocol, or examples of rejected cases) are reported, leaving open the possibility that residual edit errors or introduced ambiguities drive the observed drops.

Authors: We agree that quantitative audit statistics are necessary for full interpretability of the results. In the revised manuscript we will add a new subsection (or appendix table) reporting: (i) the fraction of candidates filtered at each stage of the pipeline, (ii) the main failure modes observed during validation, (iii) the inter-rater protocol and agreement rate used in the high-confidence audit, and (iv) representative examples of rejected attacks. These additions will directly address the concern that residual errors might explain the accuracy drops. revision: yes
Referee: [Evaluation / results] Evaluation section: the reported conditional accuracy drops lack error bars, details on the number of attacks retained per model/dataset after the high-confidence audit, and a baseline comparison against random numeric perturbations that do not preserve reasoning. These omissions make it difficult to assess whether the drops exceed what would be expected from any numeric change.

Authors: We will add bootstrap-derived error bars on all reported accuracy figures and explicitly state the number of retained attacks per model–dataset combination after the audit. We will also include a new baseline experiment using random numeric perturbations (subject to the same magnitude and non-zero constraints) that do not preserve the reasoning program. This comparison will allow readers to evaluate whether the observed drops exceed those attributable to unstructured numeric variation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical attack generation and evaluation on external benchmarks

full rationale

The paper describes an algorithmic pipeline for generating numeric-remapping attacks on arithmetic word problems and reports measured accuracy drops on fixed datasets (GSM8K, MAWPS, MultiArith) for specific LLMs. No mathematical derivations, equations, fitted parameters, or predictions appear. Central claims rest on empirical results from model outputs rather than any self-referential reduction, self-citation chain, or ansatz. The work is self-contained against external benchmarks with no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are invoked; the central claim is an empirical measurement of model behavior under a new attack generation procedure.

pith-pipeline@v0.9.1-grok · 5835 in / 1119 out tokens · 19446 ms · 2026-06-28T09:25:45.570197+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 2 canonical work pages

[1]

TrustLLM: Trustworthiness in Large Language Models

L. Sun, Y. Huang, H. Wang, S. Wu, et al. “TrustLLM: Trustworthiness in Large Language Models”. In:ArXivabs/2401.05561 (2024).url:https : / / api . semanticscholar . org / CorpusID:266933236. 22

Pith/arXiv arXiv 2024
[2]

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

B. Kosti’c, C. Fallon, J. Risch, and A. Loser. “Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation”. In:ArXivabs/2602.17316 (2026).url:https : //api.semanticscholar.org/CorpusID:285787510

arXiv 2026
[3]

On the Adversarial Robustness of Instruction-Tuned Large Lan- guage Models for Code

M. I. Hossen and X. S. Hei. “On the Adversarial Robustness of Instruction-Tuned Large Lan- guage Models for Code”. In:ArXivabs/2411.19508 (2024).url:https://api.semanticscholar. org/CorpusID:274422941

arXiv 2024
[4]

A Hierarchical Language Model For Interpretable Graph Reasoning

S. Khurana, X. Li, S. Gui, and S. Ji. “A Hierarchical Language Model For Interpretable Graph Reasoning”. In:ArXivabs/2410.22372 (2024).url:https://api.semanticscholar.org/ CorpusID:273695352

arXiv 2024
[5]

EvolProver: Advancing Automated Theorem Prov- ing by Evolving Formalized Problems via Symmetry and Difficulty

Y. Tian, R. Huang, X. Wang, J. Ma, et al. “EvolProver: Advancing Automated Theorem Prov- ing by Evolving Formalized Problems via Symmetry and Difficulty”. In:ArXivabs/2510.00732 (2025).url:https://api.semanticscholar.org/CorpusID:281706210

arXiv 2025
[6]

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

N. Cohen-Inger, Y. Elisha, B. Shapira, L. Rokach, et al. “Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon”. In:Conference on Empirical Methods in Natural Language Processing. 2025.url:https://api.semanticscholar.org/CorpusID:276259405

2025
[7]

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

R. Lunardi, V. D. Mea, S. Mizzaro, and K. Roitero. “On Robustness and Reliability of Benchmark-Based Evaluation of LLMs”. In:European Conference on Artificial Intelligence. 2025.url:https://api.semanticscholar.org/CorpusID:281103089

2025
[8]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

I. Mirzadeh, K. Alizadeh-Vahid, H. Shahrokhi, O. Tuzel, et al. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”. In:ArXivabs/2410.05229 (2024).url:https://api.semanticscholar.org/CorpusID:273186279

Pith/arXiv arXiv 2024
[9]

Q. Li, L. Cui, X. Zhao, L. Kong, et al.GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers. 2024. arXiv:2402.19255 [cs.CL]. url:https://arxiv.org/abs/2402.19255

arXiv 2024
[10]

Evaluating Robustness of LLMs to Numerical Varia- tions in Mathematical Reasoning

Y. Yang, H. Yamada, and T. Tokunaga. “Evaluating Robustness of LLMs to Numerical Varia- tions in Mathematical Reasoning”. In:The Sixth Workshop on Insights from Negative Results in NLP. Ed. by A. Drozd, J. Sedoc, S. Tafreshi, A. Akula, et al. Albuquerque, New Mexico: As- sociation for Computational Linguistics, May 2025, pp. 171–180.isbn: 979-8-89176-240-...

work page doi:10.18653/v1/2025.insights- 2025
[11]

Huang, J

K. Huang, J. Guo, Z. Li, X. Ji, et al.MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations. 2025. arXiv:2502.06453 [cs.LG].url:https://arxiv. org/abs/2502.06453

arXiv 2025
[12]

Training verifiers to solve math word problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, et al. “Training verifiers to solve math word problems”. In:arXiv preprint arXiv:2110.14168(2021)

Pith/arXiv arXiv 2021
[13]

Are NLP Models really able to Solve Simple Math Word Problems?

A. Patel, S. Bhattamishra, and N. Goyal. “Are NLP Models really able to Solve Simple Math Word Problems?” In:North American Chapter of the Association for Computational Linguis- tics. 2021.url:https://api.semanticscholar.org/CorpusID:232223322

2021
[14]

Learning to Solve Arithmetic Word Problems with Verb Categorization

M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. “Learning to Solve Arithmetic Word Problems with Verb Categorization”. In:Conference on Empirical Methods in Natural Language Processing. 2014.url:https://api.semanticscholar.org/CorpusID:428579

2014
[15]

Solving General Arithmetic Word Problems

S. Roy and D. Roth. “Solving General Arithmetic Word Problems”. In:ArXivabs/1608.01413 (2016).url:https://api.semanticscholar.org/CorpusID:560565

Pith/arXiv arXiv 2016
[16]

A Diverse Corpus for Evaluating and Developing En- glish Math Word Problem Solvers

S.-Y. Miao, C.-C. Liang, and K.-Y. Su. “A Diverse Corpus for Evaluating and Developing En- glish Math Word Problem Solvers”. In:Annual Meeting of the Association for Computational Linguistics. 2020.url:https://api.semanticscholar.org/CorpusID:220047831. 23

2020
[17]

Program Induction by Rationale Gener- ation: Learning to Solve and Explain Algebraic Word Problems

W. Ling, D. Yogatama, C. Dyer, and P. Blunsom. “Program Induction by Rationale Gener- ation: Learning to Solve and Explain Algebraic Word Problems”. In:Annual Meeting of the Association for Computational Linguistics. 2017.url:https://api.semanticscholar.org/ CorpusID:12777818

2017
[18]

Parsing Algebraic Word Problems into Equations

R. Koncel-Kedziorski, H. Hajishirzi, A. Sabharwal, O. Etzioni, et al. “Parsing Algebraic Word Problems into Equations”. In:Transactions of the Association for Computational Linguistics 3 (2015), pp. 585–597.url:https://api.semanticscholar.org/CorpusID:4894130

2015
[19]

Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, et al. “Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning”. In:ArXivabs/2209.14610 (2022).url:https: //api.semanticscholar.org/CorpusID:252595921

arXiv 2022
[20]

Measuring Mathematical Problem Solving With the MATH Dataset

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, et al. “Measuring Mathematical Problem Solving With the MATH Dataset”. In:ArXivabs/2103.03874 (2021).url:https://api. semanticscholar.org/CorpusID:232134851

Pith/arXiv arXiv 2021
[21]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

C. He, R. Luo, Y. Bai, S. Hu, et al. “OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems”. In:Annual Meeting of the Association for Computational Linguistics. 2024.url:https://api.semanticscholar. org/CorpusID:267770504

2024
[22]

PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

J. Li, R. Hu, K. Huang, Z. Yan, et al. “PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations”. In:ArXivabs/2405.19740 (2024).url:https : //api.semanticscholar.org/CorpusID:270123642

arXiv 2024
[23]

Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

J. Roh, V. Gandhi, S. Anilkumar, and A. Garg. “Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation”. In:ArXivabs/2506.06971 (2025). url:https://api.semanticscholar.org/CorpusID:279251869

arXiv 2025
[24]

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

N. A. Alzahrani, H. A. Alyahya, S. Y. Alnumay, S. Z. Alsubaie, et al. “When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards”. In:Annual Meet- ing of the Association for Computational Linguistics. 2024.url:https://api.semanticscholar. org/CorpusID:267412932

2024
[25]

Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

P. Kumar and S. Mishra. “Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics”. In:Trans. Mach. Learn. Res.2025 (2025).url:https: //api.semanticscholar.org/CorpusID:278905678

2025
[26]

Using Natural Language Explana- tions to Improve Robustness of In-context Learning for Natural Language Inference

X. He, Y. Wu, O.-M. Camburu, P. Minervini, et al. “Using Natural Language Explana- tions to Improve Robustness of In-context Learning for Natural Language Inference”. In: Annual Meeting of the Association for Computational Linguistics. 2023.url:https://api. semanticscholar.org/CorpusID:265150621

2023
[27]

Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations

L. Yuan, Y. Chen, G. Cui, H. Gao, et al. “Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations”. In:ArXivabs/2306.04618 (2023).url:https: //api.semanticscholar.org/CorpusID:259096157

arXiv 2023
[28]

Standard Benchmarks Fail – Auditing LLM Agents in Finance Must Prioritize Risk

Z. Chen, J. Chen, J. Chen, and M. Sra. “Standard Benchmarks Fail – Auditing LLM Agents in Finance Must Prioritize Risk”. In: 2025.url:https : / / api . semanticscholar . org / CorpusID:276575244

2025
[29]

From Calibration to Col- laboration: LLM Uncertainty Quantification Should Be More Human-Centered

S. Devic, T. Srinivasan, J. Thomason, W. Neiswanger, et al. “From Calibration to Col- laboration: LLM Uncertainty Quantification Should Be More Human-Centered”. In:ArXiv abs/2506.07461 (2025).url:https://api.semanticscholar.org/CorpusID:279251692

arXiv 2025
[30]

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

B. Wang, C. Xu, S. Wang, Z. Gan, et al. “Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models”. In:CoRRabs/2111.02840 (2021). arXiv: 2111.02840.url:https://arxiv.org/abs/2111.02840. 24

arXiv 2021
[31]

Chain of Thought Prompting Elicits Reasoning in Large Language Models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, et al. “Chain of Thought Prompting Elicits Reasoning in Large Language Models”. In:CoRRabs/2201.11903 (2022). arXiv:2201.11903. url:https://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2022
[32]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, et al.Large Language Models are Zero-Shot Rea- soners. 2023. arXiv:2205.11916 [cs.CL].url:https://arxiv.org/abs/2205.11916

Pith/arXiv arXiv 2023
[33]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

X. Wang, J. Wei, D. Schuurmans, Q. Le, et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In:ArXivabs/2203.11171 (2022).url:https : / / api . semanticscholar.org/CorpusID:247595263

Pith/arXiv arXiv 2022
[34]

D. Zhou, N. Sch¨ arli, L. Hou, J. Wei, et al.Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. 2023. arXiv:2205.10625 [cs.AI].url:https://arxiv.org/ abs/2205.10625

Pith/arXiv arXiv 2023
[35]

S. Yao, D. Yu, J. Zhao, I. Shafran, et al.Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 2023. arXiv:2305.10601 [cs.CL].url:https://arxiv.org/abs/ 2305.10601

Pith/arXiv arXiv 2023
[36]

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, et al. “Graph of Thoughts: Solving Elab- orate Problems with Large Language Models”. In:Proceedings of the AAAI Conference on Artificial Intelligence38.16 (Mar. 2024), pp. 17682–17690.issn: 2159-5399.doi:10.1609/ aaai.v38i16.29720.url:http://dx.doi.org/10.1609/aaai.v38i16.29720

work page doi:10.1609/aaai.v38i16.29720 2024
[37]

L. Gao, A. Madaan, S. Zhou, U. Alon, et al.PAL: Program-aided Language Models. 2023. arXiv:2211.10435 [cs.CL].url:https://arxiv.org/abs/2211.10435

Pith/arXiv arXiv 2023
[38]

W. Chen, X. Ma, X. Wang, and W. W. Cohen.Program of Thoughts Prompting: Disentan- gling Computation from Reasoning for Numerical Reasoning Tasks. 2023. arXiv:2211.12588 [cs.CL].url:https://arxiv.org/abs/2211.12588

Pith/arXiv arXiv 2023
[39]

S. Yao, J. Zhao, D. Yu, N. Du, et al.ReAct: Synergizing Reasoning and Acting in Language Models. 2023. arXiv:2210.03629 [cs.CL].url:https://arxiv.org/abs/2210.03629

Pith/arXiv arXiv 2023
[40]

Madaan, N

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, et al.Self-Refine: Iterative Refinement with Self-Feedback. 2023. arXiv:2303.17651 [cs.CL].url:https://arxiv.org/abs/2303.17651

Pith/arXiv arXiv 2023
[41]

expression

N. Shinn, F. Cassano, E. Berman, A. Gopinath, et al.Reflexion: Language Agents with Verbal Reinforcement Learning. 2023. arXiv:2303.11366 [cs.AI].url:https://arxiv.org/abs/ 2303.11366. 11 Appendix A Additional Numeric-Remapping Examples Table 7 provides additional examples of valid numeric-remapping attacks. Each attacked problem changes the concrete quan...

Pith/arXiv arXiv 2023

[1] [1]

TrustLLM: Trustworthiness in Large Language Models

L. Sun, Y. Huang, H. Wang, S. Wu, et al. “TrustLLM: Trustworthiness in Large Language Models”. In:ArXivabs/2401.05561 (2024).url:https : / / api . semanticscholar . org / CorpusID:266933236. 22

Pith/arXiv arXiv 2024

[2] [2]

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

B. Kosti’c, C. Fallon, J. Risch, and A. Loser. “Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation”. In:ArXivabs/2602.17316 (2026).url:https : //api.semanticscholar.org/CorpusID:285787510

arXiv 2026

[3] [3]

On the Adversarial Robustness of Instruction-Tuned Large Lan- guage Models for Code

M. I. Hossen and X. S. Hei. “On the Adversarial Robustness of Instruction-Tuned Large Lan- guage Models for Code”. In:ArXivabs/2411.19508 (2024).url:https://api.semanticscholar. org/CorpusID:274422941

arXiv 2024

[4] [4]

A Hierarchical Language Model For Interpretable Graph Reasoning

S. Khurana, X. Li, S. Gui, and S. Ji. “A Hierarchical Language Model For Interpretable Graph Reasoning”. In:ArXivabs/2410.22372 (2024).url:https://api.semanticscholar.org/ CorpusID:273695352

arXiv 2024

[5] [5]

EvolProver: Advancing Automated Theorem Prov- ing by Evolving Formalized Problems via Symmetry and Difficulty

Y. Tian, R. Huang, X. Wang, J. Ma, et al. “EvolProver: Advancing Automated Theorem Prov- ing by Evolving Formalized Problems via Symmetry and Difficulty”. In:ArXivabs/2510.00732 (2025).url:https://api.semanticscholar.org/CorpusID:281706210

arXiv 2025

[6] [6]

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

N. Cohen-Inger, Y. Elisha, B. Shapira, L. Rokach, et al. “Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon”. In:Conference on Empirical Methods in Natural Language Processing. 2025.url:https://api.semanticscholar.org/CorpusID:276259405

2025

[7] [7]

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

R. Lunardi, V. D. Mea, S. Mizzaro, and K. Roitero. “On Robustness and Reliability of Benchmark-Based Evaluation of LLMs”. In:European Conference on Artificial Intelligence. 2025.url:https://api.semanticscholar.org/CorpusID:281103089

2025

[8] [8]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

I. Mirzadeh, K. Alizadeh-Vahid, H. Shahrokhi, O. Tuzel, et al. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”. In:ArXivabs/2410.05229 (2024).url:https://api.semanticscholar.org/CorpusID:273186279

Pith/arXiv arXiv 2024

[9] [9]

Q. Li, L. Cui, X. Zhao, L. Kong, et al.GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers. 2024. arXiv:2402.19255 [cs.CL]. url:https://arxiv.org/abs/2402.19255

arXiv 2024

[10] [10]

Evaluating Robustness of LLMs to Numerical Varia- tions in Mathematical Reasoning

Y. Yang, H. Yamada, and T. Tokunaga. “Evaluating Robustness of LLMs to Numerical Varia- tions in Mathematical Reasoning”. In:The Sixth Workshop on Insights from Negative Results in NLP. Ed. by A. Drozd, J. Sedoc, S. Tafreshi, A. Akula, et al. Albuquerque, New Mexico: As- sociation for Computational Linguistics, May 2025, pp. 171–180.isbn: 979-8-89176-240-...

work page doi:10.18653/v1/2025.insights- 2025

[11] [11]

Huang, J

K. Huang, J. Guo, Z. Li, X. Ji, et al.MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations. 2025. arXiv:2502.06453 [cs.LG].url:https://arxiv. org/abs/2502.06453

arXiv 2025

[12] [12]

Training verifiers to solve math word problems

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, et al. “Training verifiers to solve math word problems”. In:arXiv preprint arXiv:2110.14168(2021)

Pith/arXiv arXiv 2021

[13] [13]

Are NLP Models really able to Solve Simple Math Word Problems?

A. Patel, S. Bhattamishra, and N. Goyal. “Are NLP Models really able to Solve Simple Math Word Problems?” In:North American Chapter of the Association for Computational Linguis- tics. 2021.url:https://api.semanticscholar.org/CorpusID:232223322

2021

[14] [14]

Learning to Solve Arithmetic Word Problems with Verb Categorization

M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. “Learning to Solve Arithmetic Word Problems with Verb Categorization”. In:Conference on Empirical Methods in Natural Language Processing. 2014.url:https://api.semanticscholar.org/CorpusID:428579

2014

[15] [15]

Solving General Arithmetic Word Problems

S. Roy and D. Roth. “Solving General Arithmetic Word Problems”. In:ArXivabs/1608.01413 (2016).url:https://api.semanticscholar.org/CorpusID:560565

Pith/arXiv arXiv 2016

[16] [16]

A Diverse Corpus for Evaluating and Developing En- glish Math Word Problem Solvers

S.-Y. Miao, C.-C. Liang, and K.-Y. Su. “A Diverse Corpus for Evaluating and Developing En- glish Math Word Problem Solvers”. In:Annual Meeting of the Association for Computational Linguistics. 2020.url:https://api.semanticscholar.org/CorpusID:220047831. 23

2020

[17] [17]

Program Induction by Rationale Gener- ation: Learning to Solve and Explain Algebraic Word Problems

W. Ling, D. Yogatama, C. Dyer, and P. Blunsom. “Program Induction by Rationale Gener- ation: Learning to Solve and Explain Algebraic Word Problems”. In:Annual Meeting of the Association for Computational Linguistics. 2017.url:https://api.semanticscholar.org/ CorpusID:12777818

2017

[18] [18]

Parsing Algebraic Word Problems into Equations

R. Koncel-Kedziorski, H. Hajishirzi, A. Sabharwal, O. Etzioni, et al. “Parsing Algebraic Word Problems into Equations”. In:Transactions of the Association for Computational Linguistics 3 (2015), pp. 585–597.url:https://api.semanticscholar.org/CorpusID:4894130

2015

[19] [19]

Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, et al. “Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning”. In:ArXivabs/2209.14610 (2022).url:https: //api.semanticscholar.org/CorpusID:252595921

arXiv 2022

[20] [20]

Measuring Mathematical Problem Solving With the MATH Dataset

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, et al. “Measuring Mathematical Problem Solving With the MATH Dataset”. In:ArXivabs/2103.03874 (2021).url:https://api. semanticscholar.org/CorpusID:232134851

Pith/arXiv arXiv 2021

[21] [21]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

C. He, R. Luo, Y. Bai, S. Hu, et al. “OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems”. In:Annual Meeting of the Association for Computational Linguistics. 2024.url:https://api.semanticscholar. org/CorpusID:267770504

2024

[22] [22]

PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

J. Li, R. Hu, K. Huang, Z. Yan, et al. “PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations”. In:ArXivabs/2405.19740 (2024).url:https : //api.semanticscholar.org/CorpusID:270123642

arXiv 2024

[23] [23]

Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

J. Roh, V. Gandhi, S. Anilkumar, and A. Garg. “Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation”. In:ArXivabs/2506.06971 (2025). url:https://api.semanticscholar.org/CorpusID:279251869

arXiv 2025

[24] [24]

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

N. A. Alzahrani, H. A. Alyahya, S. Y. Alnumay, S. Z. Alsubaie, et al. “When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards”. In:Annual Meet- ing of the Association for Computational Linguistics. 2024.url:https://api.semanticscholar. org/CorpusID:267412932

2024

[25] [25]

Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

P. Kumar and S. Mishra. “Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics”. In:Trans. Mach. Learn. Res.2025 (2025).url:https: //api.semanticscholar.org/CorpusID:278905678

2025

[26] [26]

Using Natural Language Explana- tions to Improve Robustness of In-context Learning for Natural Language Inference

X. He, Y. Wu, O.-M. Camburu, P. Minervini, et al. “Using Natural Language Explana- tions to Improve Robustness of In-context Learning for Natural Language Inference”. In: Annual Meeting of the Association for Computational Linguistics. 2023.url:https://api. semanticscholar.org/CorpusID:265150621

2023

[27] [27]

Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations

L. Yuan, Y. Chen, G. Cui, H. Gao, et al. “Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations”. In:ArXivabs/2306.04618 (2023).url:https: //api.semanticscholar.org/CorpusID:259096157

arXiv 2023

[28] [28]

Standard Benchmarks Fail – Auditing LLM Agents in Finance Must Prioritize Risk

Z. Chen, J. Chen, J. Chen, and M. Sra. “Standard Benchmarks Fail – Auditing LLM Agents in Finance Must Prioritize Risk”. In: 2025.url:https : / / api . semanticscholar . org / CorpusID:276575244

2025

[29] [29]

From Calibration to Col- laboration: LLM Uncertainty Quantification Should Be More Human-Centered

S. Devic, T. Srinivasan, J. Thomason, W. Neiswanger, et al. “From Calibration to Col- laboration: LLM Uncertainty Quantification Should Be More Human-Centered”. In:ArXiv abs/2506.07461 (2025).url:https://api.semanticscholar.org/CorpusID:279251692

arXiv 2025

[30] [30]

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

B. Wang, C. Xu, S. Wang, Z. Gan, et al. “Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models”. In:CoRRabs/2111.02840 (2021). arXiv: 2111.02840.url:https://arxiv.org/abs/2111.02840. 24

arXiv 2021

[31] [31]

Chain of Thought Prompting Elicits Reasoning in Large Language Models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, et al. “Chain of Thought Prompting Elicits Reasoning in Large Language Models”. In:CoRRabs/2201.11903 (2022). arXiv:2201.11903. url:https://arxiv.org/abs/2201.11903

Pith/arXiv arXiv 2022

[32] [32]

Kojima, S

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, et al.Large Language Models are Zero-Shot Rea- soners. 2023. arXiv:2205.11916 [cs.CL].url:https://arxiv.org/abs/2205.11916

Pith/arXiv arXiv 2023

[33] [33]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

X. Wang, J. Wei, D. Schuurmans, Q. Le, et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In:ArXivabs/2203.11171 (2022).url:https : / / api . semanticscholar.org/CorpusID:247595263

Pith/arXiv arXiv 2022

[34] [34]

D. Zhou, N. Sch¨ arli, L. Hou, J. Wei, et al.Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. 2023. arXiv:2205.10625 [cs.AI].url:https://arxiv.org/ abs/2205.10625

Pith/arXiv arXiv 2023

[35] [35]

S. Yao, D. Yu, J. Zhao, I. Shafran, et al.Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 2023. arXiv:2305.10601 [cs.CL].url:https://arxiv.org/abs/ 2305.10601

Pith/arXiv arXiv 2023

[36] [36]

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, et al. “Graph of Thoughts: Solving Elab- orate Problems with Large Language Models”. In:Proceedings of the AAAI Conference on Artificial Intelligence38.16 (Mar. 2024), pp. 17682–17690.issn: 2159-5399.doi:10.1609/ aaai.v38i16.29720.url:http://dx.doi.org/10.1609/aaai.v38i16.29720

work page doi:10.1609/aaai.v38i16.29720 2024

[37] [37]

L. Gao, A. Madaan, S. Zhou, U. Alon, et al.PAL: Program-aided Language Models. 2023. arXiv:2211.10435 [cs.CL].url:https://arxiv.org/abs/2211.10435

Pith/arXiv arXiv 2023

[38] [38]

W. Chen, X. Ma, X. Wang, and W. W. Cohen.Program of Thoughts Prompting: Disentan- gling Computation from Reasoning for Numerical Reasoning Tasks. 2023. arXiv:2211.12588 [cs.CL].url:https://arxiv.org/abs/2211.12588

Pith/arXiv arXiv 2023

[39] [39]

S. Yao, J. Zhao, D. Yu, N. Du, et al.ReAct: Synergizing Reasoning and Acting in Language Models. 2023. arXiv:2210.03629 [cs.CL].url:https://arxiv.org/abs/2210.03629

Pith/arXiv arXiv 2023

[40] [40]

Madaan, N

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, et al.Self-Refine: Iterative Refinement with Self-Feedback. 2023. arXiv:2303.17651 [cs.CL].url:https://arxiv.org/abs/2303.17651

Pith/arXiv arXiv 2023

[41] [41]

expression

N. Shinn, F. Cassano, E. Berman, A. Gopinath, et al.Reflexion: Language Agents with Verbal Reinforcement Learning. 2023. arXiv:2303.11366 [cs.AI].url:https://arxiv.org/abs/ 2303.11366. 11 Appendix A Additional Numeric-Remapping Examples Table 7 provides additional examples of valid numeric-remapping attacks. Each attacked problem changes the concrete quan...

Pith/arXiv arXiv 2023