pith. sign in

arxiv: 2606.03606 · v2 · pith:F5M6AVAUnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Pith reviewed 2026-06-28 09:25 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM robustnessarithmetic reasoningnumeric remappingGSM8Kadversarial attacksword problemsreasoning generalization
0
0 comments X

The pith

Numeric-remapping attacks that keep the original reasoning program drop LLM accuracy on GSM8K by 12 to 26 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an automatic method to create numeric-remapping attacks on arithmetic word problems. These attacks change the numbers in a problem while preserving the exact reasoning steps and recomputing the correct answers, without relying on manual templates. When applied to models like DeepSeek-R1, Gemma4, and GPT-OSS, the attacks cause large conditional accuracy drops on GSM8K but leave shorter datasets like MAWPS and MultiArith nearly unaffected. This shows that apparent robustness on some benchmarks does not extend to longer, more varied problems even when the underlying logic stays identical.

Core claim

An automatic pipeline that derives symbolic representations of problems, generates constrained numeric remappings, recomputes gold answers, and applies deterministic edits via LLM-generated plans produces reliable attacks; on GSM8K these attacks reduce conditional accuracy by 12.16 to 25.82 points for the tested models while MAWPS and MultiArith remain stable near or above 98 percent.

What carries the argument

The automatic numeric-remapping attack generator that produces problem-specific symbolic forms, constrained remappings, recomputed answers, and LLM-guided deterministic edits followed by stage-wise validation.

If this is right

  • GSM8K remains sensitive to numeric variation even when reasoning programs are preserved and answers are recomputed.
  • Shorter and more regular datasets like MAWPS and MultiArith show near-perfect stability under the same attacks.
  • Trustworthy direct reasoning on word problems requires robustness to small numeric changes that preserve logic.
  • Dataset structure strongly determines how much numeric variation affects model performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the attacks scale to other arithmetic benchmarks, current evaluation practices may systematically overestimate generalization.
  • Models that pass short regular problems may still fail on longer problems that require the same logic with varied numbers.
  • Future robustness tests could incorporate automatic remapping as a standard check rather than relying only on original problem sets.

Load-bearing premise

The generated numeric changes and edit plans keep the original reasoning steps exactly the same and yield valid new answers without creating ambiguities or calculation mistakes.

What would settle it

A manual audit of a sample of the generated attacks that finds any remapping where the reasoning program changes or the recomputed answer is incorrect.

Figures

Figures reproduced from arXiv: 2606.03606 by Bishal Lakha, Edoardo Serra, Francesco Gullo, Malia Barker.

Figure 1
Figure 1. Figure 1: Overview of the numeric-remapping attack-generation pipeline. Starting from an original [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Numeric-remapping attack-generation runtime by dataset on the ZGX Nano and GH200 [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Baseline evaluation runtime for DeepSeek-R1 (70B). [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Baseline evaluation runtime for Gemma4 (31B). [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Baseline evaluation runtime for GPT-OSS (120B). [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
read the original abstract

Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustworthy models should solve small-number arithmetic word problems without external tools. Prior work shows that LLMs are sensitive to numerical variation: a model may solve an original problem but fail on structurally similar variants requiring the same reasoning procedure with different numbers. We ask whether this fragility persists under a stricter setting involving small, schema-preserving numeric changes that retain the original reasoning program and avoid large-number stress tests. We introduce an automatic algorithm for generating numeric-remapping attacks on arithmetic word problems. Unlike template-based perturbation methods requiring manual schemas or constraints, our approach derives problem-specific symbolic representations, generates constrained numeric remappings, recomputes gold answers, and realizes transformed questions through deterministic edits guided by LLM-generated edit plans. Stage-wise validation and a high-confidence audit retain reliable attacks, making the pipeline scalable with limited human intervention. We evaluate DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on GSM8K, MAWPS, and MultiArith. On GSM8K, completed runs show conditional accuracy drops of 12.16 to 25.82 percentage points. MAWPS and MultiArith are far more stable, with most attacked accuracies near or above 98%. These results show that numeric-remapping robustness depends strongly on dataset structure: GSM8K remains sensitive even when reasoning programs are preserved and answers are recomputed, while shorter, more regular datasets are more robust.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces an automatic pipeline for generating numeric-remapping attacks on arithmetic word problems. The method derives symbolic representations, samples constrained remappings, recomputes gold answers, and applies LLM-guided deterministic edits, followed by stage-wise validation and a high-confidence audit to retain only reliable attacks that preserve the original reasoning program. Experiments on DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) across GSM8K, MAWPS, and MultiArith report conditional accuracy drops of 12.16–25.82 pp on GSM8K while the shorter datasets remain near or above 98% accuracy, concluding that numeric-remapping robustness is dataset-dependent even under preserved reasoning.

Significance. If the attack validity holds, the result provides evidence that LLM arithmetic reasoning remains brittle to small numeric changes that preserve reasoning structure and gold labels, with clear dataset-structure dependence. The automatic, scalable pipeline (symbolic derivation + constrained remapping + audit) is a methodological strength over manual template methods.

major comments (2)
  1. [Abstract / pipeline description] Abstract and pipeline description: the central claim of 12–26 pp drops on GSM8K is interpretable only if retained attacks preserve the reasoning program and produce error-free gold answers. No quantitative audit statistics (fraction filtered, failure modes, inter-rater protocol, or examples of rejected cases) are reported, leaving open the possibility that residual edit errors or introduced ambiguities drive the observed drops.
  2. [Evaluation / results] Evaluation section: the reported conditional accuracy drops lack error bars, details on the number of attacks retained per model/dataset after the high-confidence audit, and a baseline comparison against random numeric perturbations that do not preserve reasoning. These omissions make it difficult to assess whether the drops exceed what would be expected from any numeric change.
minor comments (1)
  1. [Abstract] The abstract states 'completed runs' without clarifying how many problems were ultimately evaluated after filtering; this should be stated explicitly with counts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional transparency would strengthen the paper. We address both major comments below and will incorporate the suggested additions in the revision.

read point-by-point responses
  1. Referee: [Abstract / pipeline description] Abstract and pipeline description: the central claim of 12–26 pp drops on GSM8K is interpretable only if retained attacks preserve the reasoning program and produce error-free gold answers. No quantitative audit statistics (fraction filtered, failure modes, inter-rater protocol, or examples of rejected cases) are reported, leaving open the possibility that residual edit errors or introduced ambiguities drive the observed drops.

    Authors: We agree that quantitative audit statistics are necessary for full interpretability of the results. In the revised manuscript we will add a new subsection (or appendix table) reporting: (i) the fraction of candidates filtered at each stage of the pipeline, (ii) the main failure modes observed during validation, (iii) the inter-rater protocol and agreement rate used in the high-confidence audit, and (iv) representative examples of rejected attacks. These additions will directly address the concern that residual errors might explain the accuracy drops. revision: yes

  2. Referee: [Evaluation / results] Evaluation section: the reported conditional accuracy drops lack error bars, details on the number of attacks retained per model/dataset after the high-confidence audit, and a baseline comparison against random numeric perturbations that do not preserve reasoning. These omissions make it difficult to assess whether the drops exceed what would be expected from any numeric change.

    Authors: We will add bootstrap-derived error bars on all reported accuracy figures and explicitly state the number of retained attacks per model–dataset combination after the audit. We will also include a new baseline experiment using random numeric perturbations (subject to the same magnitude and non-zero constraints) that do not preserve the reasoning program. This comparison will allow readers to evaluate whether the observed drops exceed those attributable to unstructured numeric variation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical attack generation and evaluation on external benchmarks

full rationale

The paper describes an algorithmic pipeline for generating numeric-remapping attacks on arithmetic word problems and reports measured accuracy drops on fixed datasets (GSM8K, MAWPS, MultiArith) for specific LLMs. No mathematical derivations, equations, fitted parameters, or predictions appear. Central claims rest on empirical results from model outputs rather than any self-referential reduction, self-citation chain, or ansatz. The work is self-contained against external benchmarks with no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are invoked; the central claim is an empirical measurement of model behavior under a new attack generation procedure.

pith-pipeline@v0.9.1-grok · 5835 in / 1119 out tokens · 19446 ms · 2026-06-28T09:25:45.570197+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 2 canonical work pages

  1. [1]

    TrustLLM: Trustworthiness in Large Language Models

    L. Sun, Y. Huang, H. Wang, S. Wu, et al. “TrustLLM: Trustworthiness in Large Language Models”. In:ArXivabs/2401.05561 (2024).url:https : / / api . semanticscholar . org / CorpusID:266933236. 22

  2. [2]

    Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

    B. Kosti’c, C. Fallon, J. Risch, and A. Loser. “Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation”. In:ArXivabs/2602.17316 (2026).url:https : //api.semanticscholar.org/CorpusID:285787510

  3. [3]

    On the Adversarial Robustness of Instruction-Tuned Large Lan- guage Models for Code

    M. I. Hossen and X. S. Hei. “On the Adversarial Robustness of Instruction-Tuned Large Lan- guage Models for Code”. In:ArXivabs/2411.19508 (2024).url:https://api.semanticscholar. org/CorpusID:274422941

  4. [4]

    A Hierarchical Language Model For Interpretable Graph Reasoning

    S. Khurana, X. Li, S. Gui, and S. Ji. “A Hierarchical Language Model For Interpretable Graph Reasoning”. In:ArXivabs/2410.22372 (2024).url:https://api.semanticscholar.org/ CorpusID:273695352

  5. [5]

    EvolProver: Advancing Automated Theorem Prov- ing by Evolving Formalized Problems via Symmetry and Difficulty

    Y. Tian, R. Huang, X. Wang, J. Ma, et al. “EvolProver: Advancing Automated Theorem Prov- ing by Evolving Formalized Problems via Symmetry and Difficulty”. In:ArXivabs/2510.00732 (2025).url:https://api.semanticscholar.org/CorpusID:281706210

  6. [6]

    Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon

    N. Cohen-Inger, Y. Elisha, B. Shapira, L. Rokach, et al. “Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon”. In:Conference on Empirical Methods in Natural Language Processing. 2025.url:https://api.semanticscholar.org/CorpusID:276259405

  7. [7]

    On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

    R. Lunardi, V. D. Mea, S. Mizzaro, and K. Roitero. “On Robustness and Reliability of Benchmark-Based Evaluation of LLMs”. In:European Conference on Artificial Intelligence. 2025.url:https://api.semanticscholar.org/CorpusID:281103089

  8. [8]

    GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    I. Mirzadeh, K. Alizadeh-Vahid, H. Shahrokhi, O. Tuzel, et al. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”. In:ArXivabs/2410.05229 (2024).url:https://api.semanticscholar.org/CorpusID:273186279

  9. [9]

    Q. Li, L. Cui, X. Zhao, L. Kong, et al.GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers. 2024. arXiv:2402.19255 [cs.CL]. url:https://arxiv.org/abs/2402.19255

  10. [10]

    Evaluating Robustness of LLMs to Numerical Varia- tions in Mathematical Reasoning

    Y. Yang, H. Yamada, and T. Tokunaga. “Evaluating Robustness of LLMs to Numerical Varia- tions in Mathematical Reasoning”. In:The Sixth Workshop on Insights from Negative Results in NLP. Ed. by A. Drozd, J. Sedoc, S. Tafreshi, A. Akula, et al. Albuquerque, New Mexico: As- sociation for Computational Linguistics, May 2025, pp. 171–180.isbn: 979-8-89176-240-...

  11. [11]

    Huang, J

    K. Huang, J. Guo, Z. Li, X. Ji, et al.MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations. 2025. arXiv:2502.06453 [cs.LG].url:https://arxiv. org/abs/2502.06453

  12. [12]

    Training verifiers to solve math word problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, et al. “Training verifiers to solve math word problems”. In:arXiv preprint arXiv:2110.14168(2021)

  13. [13]

    Are NLP Models really able to Solve Simple Math Word Problems?

    A. Patel, S. Bhattamishra, and N. Goyal. “Are NLP Models really able to Solve Simple Math Word Problems?” In:North American Chapter of the Association for Computational Linguis- tics. 2021.url:https://api.semanticscholar.org/CorpusID:232223322

  14. [14]

    Learning to Solve Arithmetic Word Problems with Verb Categorization

    M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. “Learning to Solve Arithmetic Word Problems with Verb Categorization”. In:Conference on Empirical Methods in Natural Language Processing. 2014.url:https://api.semanticscholar.org/CorpusID:428579

  15. [15]

    Solving General Arithmetic Word Problems

    S. Roy and D. Roth. “Solving General Arithmetic Word Problems”. In:ArXivabs/1608.01413 (2016).url:https://api.semanticscholar.org/CorpusID:560565

  16. [16]

    A Diverse Corpus for Evaluating and Developing En- glish Math Word Problem Solvers

    S.-Y. Miao, C.-C. Liang, and K.-Y. Su. “A Diverse Corpus for Evaluating and Developing En- glish Math Word Problem Solvers”. In:Annual Meeting of the Association for Computational Linguistics. 2020.url:https://api.semanticscholar.org/CorpusID:220047831. 23

  17. [17]

    Program Induction by Rationale Gener- ation: Learning to Solve and Explain Algebraic Word Problems

    W. Ling, D. Yogatama, C. Dyer, and P. Blunsom. “Program Induction by Rationale Gener- ation: Learning to Solve and Explain Algebraic Word Problems”. In:Annual Meeting of the Association for Computational Linguistics. 2017.url:https://api.semanticscholar.org/ CorpusID:12777818

  18. [18]

    Parsing Algebraic Word Problems into Equations

    R. Koncel-Kedziorski, H. Hajishirzi, A. Sabharwal, O. Etzioni, et al. “Parsing Algebraic Word Problems into Equations”. In:Transactions of the Association for Computational Linguistics 3 (2015), pp. 585–597.url:https://api.semanticscholar.org/CorpusID:4894130

  19. [19]

    Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning

    P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, et al. “Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning”. In:ArXivabs/2209.14610 (2022).url:https: //api.semanticscholar.org/CorpusID:252595921

  20. [20]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, et al. “Measuring Mathematical Problem Solving With the MATH Dataset”. In:ArXivabs/2103.03874 (2021).url:https://api. semanticscholar.org/CorpusID:232134851

  21. [21]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    C. He, R. Luo, Y. Bai, S. Hu, et al. “OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems”. In:Annual Meeting of the Association for Computational Linguistics. 2024.url:https://api.semanticscholar. org/CorpusID:267770504

  22. [22]

    PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

    J. Li, R. Hu, K. Huang, Z. Yan, et al. “PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations”. In:ArXivabs/2405.19740 (2024).url:https : //api.semanticscholar.org/CorpusID:270123642

  23. [23]

    Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

    J. Roh, V. Gandhi, S. Anilkumar, and A. Garg. “Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation”. In:ArXivabs/2506.06971 (2025). url:https://api.semanticscholar.org/CorpusID:279251869

  24. [24]

    When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

    N. A. Alzahrani, H. A. Alyahya, S. Y. Alnumay, S. Z. Alsubaie, et al. “When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards”. In:Annual Meet- ing of the Association for Computational Linguistics. 2024.url:https://api.semanticscholar. org/CorpusID:267412932

  25. [25]

    Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics

    P. Kumar and S. Mishra. “Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics”. In:Trans. Mach. Learn. Res.2025 (2025).url:https: //api.semanticscholar.org/CorpusID:278905678

  26. [26]

    Using Natural Language Explana- tions to Improve Robustness of In-context Learning for Natural Language Inference

    X. He, Y. Wu, O.-M. Camburu, P. Minervini, et al. “Using Natural Language Explana- tions to Improve Robustness of In-context Learning for Natural Language Inference”. In: Annual Meeting of the Association for Computational Linguistics. 2023.url:https://api. semanticscholar.org/CorpusID:265150621

  27. [27]

    Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations

    L. Yuan, Y. Chen, G. Cui, H. Gao, et al. “Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations”. In:ArXivabs/2306.04618 (2023).url:https: //api.semanticscholar.org/CorpusID:259096157

  28. [28]

    Standard Benchmarks Fail – Auditing LLM Agents in Finance Must Prioritize Risk

    Z. Chen, J. Chen, J. Chen, and M. Sra. “Standard Benchmarks Fail – Auditing LLM Agents in Finance Must Prioritize Risk”. In: 2025.url:https : / / api . semanticscholar . org / CorpusID:276575244

  29. [29]

    From Calibration to Col- laboration: LLM Uncertainty Quantification Should Be More Human-Centered

    S. Devic, T. Srinivasan, J. Thomason, W. Neiswanger, et al. “From Calibration to Col- laboration: LLM Uncertainty Quantification Should Be More Human-Centered”. In:ArXiv abs/2506.07461 (2025).url:https://api.semanticscholar.org/CorpusID:279251692

  30. [30]

    Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

    B. Wang, C. Xu, S. Wang, Z. Gan, et al. “Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models”. In:CoRRabs/2111.02840 (2021). arXiv: 2111.02840.url:https://arxiv.org/abs/2111.02840. 24

  31. [31]

    Chain of Thought Prompting Elicits Reasoning in Large Language Models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, et al. “Chain of Thought Prompting Elicits Reasoning in Large Language Models”. In:CoRRabs/2201.11903 (2022). arXiv:2201.11903. url:https://arxiv.org/abs/2201.11903

  32. [32]

    Kojima, S

    T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, et al.Large Language Models are Zero-Shot Rea- soners. 2023. arXiv:2205.11916 [cs.CL].url:https://arxiv.org/abs/2205.11916

  33. [33]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    X. Wang, J. Wei, D. Schuurmans, Q. Le, et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models”. In:ArXivabs/2203.11171 (2022).url:https : / / api . semanticscholar.org/CorpusID:247595263

  34. [34]

    D. Zhou, N. Sch¨ arli, L. Hou, J. Wei, et al.Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. 2023. arXiv:2205.10625 [cs.AI].url:https://arxiv.org/ abs/2205.10625

  35. [35]

    S. Yao, D. Yu, J. Zhao, I. Shafran, et al.Tree of Thoughts: Deliberate Problem Solving with Large Language Models. 2023. arXiv:2305.10601 [cs.CL].url:https://arxiv.org/abs/ 2305.10601

  36. [36]

    Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, et al. “Graph of Thoughts: Solving Elab- orate Problems with Large Language Models”. In:Proceedings of the AAAI Conference on Artificial Intelligence38.16 (Mar. 2024), pp. 17682–17690.issn: 2159-5399.doi:10.1609/ aaai.v38i16.29720.url:http://dx.doi.org/10.1609/aaai.v38i16.29720

  37. [37]

    L. Gao, A. Madaan, S. Zhou, U. Alon, et al.PAL: Program-aided Language Models. 2023. arXiv:2211.10435 [cs.CL].url:https://arxiv.org/abs/2211.10435

  38. [38]

    W. Chen, X. Ma, X. Wang, and W. W. Cohen.Program of Thoughts Prompting: Disentan- gling Computation from Reasoning for Numerical Reasoning Tasks. 2023. arXiv:2211.12588 [cs.CL].url:https://arxiv.org/abs/2211.12588

  39. [39]

    S. Yao, J. Zhao, D. Yu, N. Du, et al.ReAct: Synergizing Reasoning and Acting in Language Models. 2023. arXiv:2210.03629 [cs.CL].url:https://arxiv.org/abs/2210.03629

  40. [40]

    Madaan, N

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, et al.Self-Refine: Iterative Refinement with Self-Feedback. 2023. arXiv:2303.17651 [cs.CL].url:https://arxiv.org/abs/2303.17651

  41. [41]

    expression

    N. Shinn, F. Cassano, E. Berman, A. Gopinath, et al.Reflexion: Language Agents with Verbal Reinforcement Learning. 2023. arXiv:2303.11366 [cs.AI].url:https://arxiv.org/abs/ 2303.11366. 11 Appendix A Additional Numeric-Remapping Examples Table 7 provides additional examples of valid numeric-remapping attacks. Each attacked problem changes the concrete quan...