Robust Reasoning Benchmark

Evgenii Opryshko; Gennady Pekhimenko; Mark C. Jeffrey; Pavel Golikov

arXiv: 2604.08571 · v2 · pith:DKVJN7NYnew · submitted 2026-03-26 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-22 11:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords robust reasoning benchmark · LLM robustness · AIME problems · attention dilution · chain of thought · textual perturbations · mathematical reasoning · model failure modes

The pith

Open-weights reasoning models suffer up to 54% accuracy drops on perturbed math problems and decay on later problems due to attention dilution from their own reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies 13 deterministic textual perturbations to AIME 2024 and 2025 problems to create the Robust Reasoning Benchmark. It evaluates eight state-of-the-art models and shows that open-weights reasoning models exhibit failure modes such as cognitive thrashing, tokenization breakdown, and reasoning collapse, with large accuracy losses. The work isolates one mechanism by having models solve multiple independent problems in one context window and documents progressive accuracy decay on later problems. These results matter because they point to concrete limits on reliable mathematical reasoning when prompt formatting varies or contexts lengthen.

Core claim

Open-weights reasoning models exhibit a range of failure modes under structural noise, with up to 54% average accuracy drops across perturbations and up to 100% on some. When models solve multiple independent mathematical problems sequentially within a single context window, accuracy decays on subsequent problems because intermediate reasoning steps progressively pollute standard dense attention mechanisms, a phenomenon the authors term Intra-Query Attention Dilution. Frontier models are largely resilient except for Claude, which refuses many transformed prompts. The authors argue that reliable reasoning requires future architectures to integrate explicit contextual resets within models' own chain-of-thought.

What carries the argument

The Robust Reasoning Benchmark pipeline of 13 deterministic textual perturbations on AIME problems, together with the isolation of Intra-Query Attention Dilution through sequential multi-problem prompts.
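The idea of a deterministic perturbation pipeline can be pictured with a minimal Python sketch. The two transformations below (`insert_delimiters`, `double_whitespace`) and the `PERTURBATIONS` registry are hypothetical stand-ins chosen for illustration; they are not the paper's 13 perturbations, which this page does not enumerate. The property being illustrated is determinism: the same problem text always maps to the same perturbed text, and the mathematical tokens themselves are left intact.

```python
import re

def insert_delimiters(text: str, delimiter: str = " | ") -> str:
    """Hypothetical perturbation: join whitespace-separated tokens with a neutral delimiter."""
    return delimiter.join(text.split())

def double_whitespace(text: str) -> str:
    """Hypothetical perturbation: replace every run of whitespace with two spaces."""
    return re.sub(r"\s+", "  ", text.strip())

# Assumed registry name -> transformation; a real benchmark would list its own set here.
PERTURBATIONS = {
    "delimiters": insert_delimiters,
    "whitespace": double_whitespace,
}

def perturb_all(problem: str) -> dict[str, str]:
    """Apply every registered perturbation to one problem statement."""
    return {name: fn(problem) for name, fn in PERTURBATIONS.items()}

if __name__ == "__main__":
    for name, variant in perturb_all("Find x if x + 2 = 5.").items():
        print(f"{name}: {variant}")
```

Because no randomness is involved, re-running the pipeline reproduces the same benchmark variants, which is what makes per-perturbation accuracy comparable across models.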

If this is right

Open-weights models from 7B to 120B parameters exhibit accuracy decay on subsequent problems in multi-problem contexts.
Explicit contextual resets within the model's own chain-of-thought are required to achieve reliable reasoning.
Standard dense attention mechanisms become polluted by intermediate reasoning steps.
Frontier models remain largely resilient except for categorical refusals on some transformed prompts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures could add automatic context clearing after each solved sub-problem to limit dilution.
Similar attention pollution may affect other long-context tasks beyond mathematics.
Varying perturbation types or model training regimes might identify mitigation strategies for the observed failures.

Load-bearing premise

The 13 deterministic textual perturbations preserve the original mathematical content and difficulty of the AIME problems so that observed performance changes can be attributed to model robustness rather than altered problem semantics.

What would settle it

Models maintaining their original accuracy across all 13 perturbations and showing no performance decline when solving multiple problems sequentially in one context would falsify the claims of failure modes and attention dilution.

Figures

Figures reproduced from arXiv: 2604.08571 by Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey, Pavel Golikov.

**Figure 1.** The Intra-Query Attention Dilution. Left: The sequential cognitive overload setup, where models are prompted to solve multiple independent AIME 2024 problems within a single prompt. Right: Mathematical accuracy strictly on the final problem of the sequence. While frontier APIs like Gemini 3.1 Pro and GPT-5.4 exhibit strong resilience, Claude Opus 4.6 and all tested open-weights models suffer a degradation …

**Figure 2.** Examples of the 13 structural transformations applied to a sample mathematical query.

**Figure 3.** Full bars represent Achieved Accuracy on the AIME 2024 benchmark modified with our …

**Figure 4.** Vulnerability Profiling: Average accuracy on transformations from each of the 4 categories

**Figure 5.** Average Accuracy Drop across all models and transformations. (Text extracted from the plot: axis label "Accuracy on Last Problem (%)"; bar annotations, in extraction order: -0.4%, -1.2%, -13.8%, -6.7%, -7.1%, -7.1%, -10.6%, -7.1%; legend, in extraction order: Gemini 3.1 Pro, gpt-5.4, Claude Opus 4.6, Qwen3-30B-A3B, Nemotron-32B, Nemotron-7B, GPT-OSS-120B, DSR1-Llama-70B.)

**Figure 7.** Reasoning Efficiency: Average output token length by task. On top of each bar is output …

**Figure 8.** Agent Context Leaks (1/3): Example of the …

**Figure 9.** Agent Context Leaks (2/3): Observed internal reasoning steps. Images are displayed at full …

**Figure 10.** Agent Context Leaks (3/3): Note "Wait, the system message explicitly said You have …

Original abstract

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with up to 54% average accuracy drops across perturbations and up to 100% on some. We further study one of these failure modes in isolation: attention dilution caused by the model's own chain-of-thought. By tasking models with solving multiple independent mathematical problems sequentially within a single context window, we identify Intra-Query Attention Dilution. Open-weights models ranging from 7B to 120B parameters exhibit accuracy decay on subsequent problems, suggesting that intermediate reasoning steps progressively pollute standard dense attention mechanisms. We argue that in order to achieve reliable reasoning, future architectures need to integrate explicit contextual resets within models' own chain-of-thought, leading to open research questions regarding the optimal granularity of reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces the Robust Reasoning Benchmark (RRB) consisting of 13 deterministic textual perturbations applied to AIME 2024 and 2025 problems. It evaluates 8 state-of-the-art LLMs and reports that frontier models are largely resilient (except Claude's refusals on transformed prompts), while open-weights reasoning models exhibit failure modes including cognitive thrashing, tokenization breakdown, and reasoning collapse, with average accuracy drops up to 54% and up to 100% on individual cases. The paper further isolates one failure mode by placing multiple independent AIME problems sequentially in a single context window and attributes observed per-problem accuracy decay in open-weights models (7B to 120B) to Intra-Query Attention Dilution from prior chain-of-thought, arguing for explicit contextual resets in future architectures.

Significance. If the empirical results and proposed mechanism hold after addressing controls, the work would be significant for LLM robustness research by providing a reproducible benchmark for structural noise and highlighting a concrete limitation in dense attention for long reasoning traces. The distinction between closed and open-weights model behaviors, plus the call for architectural resets, offers actionable insights for reliable multi-step reasoning systems.

major comments (3)

[§5] §5 (Multi-Problem Context Experiments): The central attribution of accuracy decay on subsequent problems to Intra-Query Attention Dilution is not isolated from confounds; the setup measures per-problem accuracy in sequential independent AIME problems but lacks a control holding total context length and token count fixed while replacing generated CoT with neutral fixed-length text, so the decay could stem from generic long-context degradation or task-switching costs rather than dilution by prior mathematical reasoning.
[§3] §3 (Benchmark Construction): The assumption that the 13 perturbations preserve original mathematical content and difficulty is load-bearing for attributing drops to robustness rather than semantics, yet the manuscript provides no explicit verification such as human equivalence ratings, semantic similarity metrics, or difficulty calibration against the unperturbed AIME problems.
[§4] §4 (Model Evaluations): Claims of up to 54% average accuracy drops and specific failure modes lack reported statistical significance tests, error bars, or controls for context length variations across perturbations, leaving the magnitude and reliability of the reported drops only partially supported.
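The control requested in the first major comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' harness: the function names (`build_context`, `filler_like`, `count_tokens`), the prompt layout, and the filler word are all hypothetical, and whitespace splitting stands in for a real model tokenizer. The sketch builds two contexts with matched token counts, one containing prior reasoning traces and one in which each trace is swapped for neutral filler.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a model tokenizer: count whitespace-separated tokens."""
    return len(text.split())

def filler_like(trace: str, word: str = "lorem") -> str:
    """Neutral filler with the same (whitespace) token count as the given trace."""
    return " ".join([word] * count_tokens(trace))

def build_context(problems: list[str], traces: list[str], final_problem: str) -> str:
    """Lay out prior problems, each followed by a trace, then the final problem."""
    parts = [
        f"Problem {i}: {problem}\n{trace}"
        for i, (problem, trace) in enumerate(zip(problems, traces), start=1)
    ]
    parts.append(f"Problem {len(problems) + 1}: {final_problem}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    problems = ["What is 1 + 1?", "What is 2 + 3?"]
    traces = ["1 plus 1 is 2 . Answer: 2", "2 plus 3 is 5 . Answer: 5"]
    final = "What is 4 + 4?"

    with_reasoning = build_context(problems, traces, final)
    with_filler = build_context(problems, [filler_like(t) for t in traces], final)

    # Both conditions have identical length under the stand-in tokenizer.
    print(count_tokens(with_reasoning), count_tokens(with_filler))
```

If accuracy on the final problem falls equally in both conditions, generic long-context degradation would be the simpler explanation; a larger fall with real traces would point toward the reasoning content itself.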

minor comments (3)

[Abstract] The abstract and §4 would benefit from explicit listing of the 13 perturbation types with one-sentence definitions for reproducibility.
[Figures] Figure captions and legends should include sample sizes per model and perturbation to clarify the basis for reported averages.
[§5] Notation for 'Intra-Query Attention Dilution' is introduced without a formal definition or equation; adding a short mathematical characterization would improve clarity.
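One way the formal characterization requested in the last minor comment might look is sketched below. This is an illustrative formulation, not the authors' definition. It assumes a single softmax attention head, a query position $q$ inside the final problem, pre-softmax scores $s_{qj}$, and a partition of the key positions into the final problem's own tokens $P$ and all earlier context $C$ (prior problems and their reasoning traces); all of these symbols are introduced here for illustration.

```latex
% Share of attention that query position q places on the final problem's own tokens.
A_P(q) = \frac{\sum_{j \in P} e^{s_{qj}}}
              {\sum_{j \in P} e^{s_{qj}} + \sum_{j \in C} e^{s_{qj}}}

% With P and its scores held fixed, every token added to C contributes a
% strictly positive term to the denominator only, so A_P(q) strictly decreases
% as the earlier context grows.
```

Under this reading, dilution is a statement about how much attention mass remains on the problem being solved; whether that loss of mass actually drives the reported accuracy decay is the empirical question raised in the first major comment.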

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas for strengthening the empirical rigor of the Robust Reasoning Benchmark. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses

Referee: [§5] §5 (Multi-Problem Context Experiments): The central attribution of accuracy decay on subsequent problems to Intra-Query Attention Dilution is not isolated from confounds; the setup measures per-problem accuracy in sequential independent AIME problems but lacks a control holding total context length and token count fixed while replacing generated CoT with neutral fixed-length text, so the decay could stem from generic long-context degradation or task-switching costs rather than dilution by prior mathematical reasoning.

Authors: We agree that the current experimental design leaves open the possibility of confounds from generic long-context effects or task-switching costs. To isolate Intra-Query Attention Dilution more cleanly, we will add the suggested control condition in the revised §5: a variant in which prior problems are followed by neutral, fixed-length filler text of equivalent token count instead of generated CoT. Results from this control will be reported alongside the original sequential-problem results to demonstrate that accuracy decay is specifically tied to the presence of prior mathematical reasoning traces.
Revision: yes

Referee: [§3] §3 (Benchmark Construction): The assumption that the 13 perturbations preserve original mathematical content and difficulty is load-bearing for attributing drops to robustness rather than semantics, yet the manuscript provides no explicit verification such as human equivalence ratings, semantic similarity metrics, or difficulty calibration against the unperturbed AIME problems.

Authors: The perturbations were constructed to be purely structural (e.g., reordering clauses, altering whitespace, or inserting neutral delimiters) while leaving the underlying mathematical statements and solution paths unchanged. We acknowledge, however, that explicit verification strengthens the attribution. In the revised manuscript we will add (i) human equivalence ratings from three independent annotators on a random sample of 20 perturbed problems and (ii) cosine similarity scores between sentence embeddings of original and perturbed problem statements. These results will be presented in an expanded §3.
Revision: yes

Referee: [§4] §4 (Model Evaluations): Claims of up to 54% average accuracy drops and specific failure modes lack reported statistical significance tests, error bars, or controls for context length variations across perturbations, leaving the magnitude and reliability of the reported drops only partially supported.

Authors: We appreciate this observation. The original manuscript reported raw accuracy differences without formal statistical support. In the revision we will (i) add bootstrap 95% confidence intervals for all reported accuracy drops, (ii) include paired t-test p-values comparing each perturbation condition to the unperturbed baseline, and (iii) explicitly control for and report total context length (in tokens) for every evaluated prompt so that length variation does not confound the robustness results. These additions will appear in §4 and the associated figures.
Revision: yes
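The bootstrap confidence intervals promised in the last response could be computed along these lines. This is a minimal sketch, assuming per-problem correctness flags are available for the same problems under the baseline and perturbed conditions; the function names, the percentile method, the resample count, and the toy data are assumptions for illustration, not the authors' procedure or results.

```python
import random

def accuracy(flags: list[int]) -> float:
    """Fraction of problems answered correctly (flags are 0 or 1)."""
    return sum(flags) / len(flags)

def bootstrap_drop_ci(baseline, perturbed, n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile bootstrap for the accuracy drop (baseline minus perturbed).

    Problems are resampled with replacement, and the same resampled indices are
    used for both conditions so that the pairing between them is preserved.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("conditions must cover the same problems")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(baseline)
    drops = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        drops.append(
            accuracy([baseline[i] for i in idx]) - accuracy([perturbed[i] for i in idx])
        )
    drops.sort()
    lower = drops[int((alpha / 2) * n_resamples)]
    upper = drops[int((1 - alpha / 2) * n_resamples) - 1]
    return accuracy(baseline) - accuracy(perturbed), lower, upper

if __name__ == "__main__":
    # Toy data only: 30 problems, 24 correct at baseline, 12 correct when perturbed.
    baseline = [1] * 24 + [0] * 6
    perturbed = [1] * 12 + [0] * 18
    point, lower, upper = bootstrap_drop_ci(baseline, perturbed)
    print(f"drop = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With benchmark splits this small, the interval is wide, which is part of why the referee asks for error bars before reading much into differences between perturbations.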

Circularity Check

0 steps flagged

Empirical benchmark evaluation exhibits no circular derivation or self-referential reduction

full rationale

The paper presents an empirical study introducing 13 deterministic textual perturbations applied to external AIME 2024/2025 problems, followed by direct model evaluations measuring accuracy drops and sequential decay in multi-problem contexts. No equations, fitted parameters, or derivations are described that reduce to inputs by construction; the identification of Intra-Query Attention Dilution rests on observed performance changes against independent problems rather than self-definition or load-bearing self-citations. The central claims derive from external benchmarks and controlled prompt variations, remaining self-contained without tautological equivalence to the experimental inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work rests on the domain assumption that textual perturbations leave mathematical semantics unchanged and introduces the explanatory label Intra-Query Attention Dilution without an independent falsifiable prediction beyond the reported accuracy decay.

axioms (1)

Domain assumption: The 13 deterministic textual perturbations preserve the mathematical content and solution of the original AIME problems.
Invoked to ensure performance changes reflect robustness rather than problem alteration.

invented entities (1)

Intra-Query Attention Dilution (no independent evidence)
Purpose: To name and explain the observed progressive accuracy decay when solving multiple independent problems in one context window.
New descriptive term for the phenomenon; no separate falsifiable prediction or external evidence is supplied.

pith-pipeline@v0.9.0 · 5771 in / 1127 out tokens · 37186 ms · 2026-05-22T11:15:02.776104+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We identify Intra-Query Attention Dilution... intermediate reasoning steps progressively pollute standard dense attention mechanisms."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and orbit embedding
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "future architectures must integrate explicit contextual resets within a model’s own Chain-of-Thought"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 9 internal anchors

[1]

More agents helps but adversarial robustness gap persists, 2025

Khashayar Alavi, Zhastay Yeltay, Lucie Flek, and Akbar Karimi. More agents helps but adversarial robustness gap persists, 2025. URLhttps://arxiv.org/abs/2511.07112

work page arXiv 2025
[2]

Cutting through the noise: Boosting llm performance on math word problems,

Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, and Swaroop Mishra. Cutting through the noise: Boosting llm performance on math word problems,

work page
[3]

URLhttps://arxiv.org/abs/2406.15444

work page arXiv
[4]

The claude 4.6 model family

Anthropic. The claude 4.6 model family. Technical report, Anthropic, 2026. URL https: //www.anthropic.com/news/claude-opus-4-6

work page 2026
[5]

Prompt caching, 2026

Google Gemini API. Prompt caching, 2026. URL https://ai.google.dev/gemini-api/ docs/caching

work page 2026
[6]

Fragile thoughts: How large language models handle chain-of-thought perturbations, 2026

Ashwath Vaithinathan Aravindan and Mayank Kejriwal. Fragile thoughts: How large language models handle chain-of-thought perturbations, 2026. URL https://arxiv.org/abs/2603. 03332

work page 2026
[7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Deepseek-r1-distill-llama-70b, January 2025

DeepSeek-AI. Deepseek-r1-distill-llama-70b, January 2025. URL https://huggingface. co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B

work page 2025
[9]

Prompt caching, 2026

Claude API Docs. Prompt caching, 2026. URL https://platform.claude.com/docs/en/ build-with-claude/prompt-caching

work page 2026
[10]

Math verify, 2026

Hugging Face. Math verify, 2026. URLhttps://pypi.org/project/math-verify/

work page 2026
[11]

Google antigravity, 2026

Google. Google antigravity, 2026. URLhttps://antigravity.google/

work page 2026
[12]

Gemini 3.1: Advancing the state of the art in multimodal reason- ing

Google DeepMind. Gemini 3.1: Advancing the state of the art in multimodal reason- ing. Technical report, Google, 2026. URL https://blog.google/innovation-and-ai/ models-and-research/gemini-models/gemini-3-1-pro/

work page 2026
[13]

An investigation of robustness of llms in mathematical reasoning: Benchmarking with mathematically-equivalent transformation of advanced mathematical problems, 2025

Yuren Hao, Xiang Wan, and ChengXiang Zhai. An investigation of robustness of llms in mathematical reasoning: Benchmarking with mathematically-equivalent transformation of advanced mathematical problems, 2025. URLhttps://arxiv.org/abs/2508.08833

work page arXiv 2025
[14]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,

work page
[15]

URLhttps://arxiv.org/abs/2103.03874

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Evaluating LLMs’ mathematical and coding competency through ontology- guided interventions

Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, and Soujanya Poria. Evaluating LLMs’ mathematical and coding competency through ontology- guided interventions. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22811–22849, Vienna, Austria, jul 2025. Association for Computational Linguis- tics. doi:...

work page doi:10.18653/v1/2025.findings-acl.1172 2025
[17]

Automatic robustness stress testing of llms as mathematical problem solvers,

Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Xuetao Wei, Hailiang Huang, Yun Chen, and Guanhua Chen. Automatic robustness stress testing of llms as mathematical problem solvers,

work page
[18]

URLhttps://arxiv.org/abs/2506.05038. 10

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations

Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, and Mengdi Wang. Math-perturb: Benchmarking llms’ math reasoning abilities against hard perturbations. InForty-second International Conference on Machine...

work page arXiv 2025
[20]

Quantifying artificial intelligence through algorithmic generalization

Takuya Ito, Murray Campbell, Lior Horesh, Tim Klinger, and Parikshit Ram. Quantifying artificial intelligence through algorithmic generalization. InNature Machine Intelligence, 2025. URLhttps://arxiv.org/abs/2411.05943

work page arXiv 2025
[21]

Su, Camillo Jose Taylor, and Dan Roth

Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo Jose Taylor, and Dan Roth. A peek into token bias: Large language models are not yet genuine reasoners. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4722–4756, Miami, Florida, USA, nov 2024. Association fo...

work page doi:10.18653/v1/2024.emnlp-main.272 2024
[22]

Enhancing robustness in large language models: Prompting for mitigating the impact of irrelevant information, 2025

Ming Jiang, Tingting Huang, Biao Guo, Yao Lu, and Feng Zhang. Enhancing robustness in large language models: Prompting for mitigating the impact of irrelevant information, 2025. URLhttps://arxiv.org/abs/2408.10615

work page arXiv 2025
[23]

Farrar, Straus and Giroux, New York, 2011

Daniel Kahneman.Thinking, fast and slow. Farrar, Straus and Giroux, New York, 2011. ISBN 9780374275631 0374275637. URL https://mlsu.ac.in/econtents/2950_Daniel% 20Kahneman%20-%20Thinking,%20Fast%20and%20Slow%20(2013).pdf

work page 2011
[24]

Lost in Cultural Translation: Do LLMs Struggle with Math Across Cultural Contexts?

Aabid Karim, Abdul Karim, Bhoomika Lohana, Matt Keon, Jaswinder Singh, and Abdul Sattar. Lost in cultural translation: Do llms struggle with math across cultural contexts?, 2025. URL https://arxiv.org/abs/2503.18018

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Mathrobust-lv: Evaluation of large language models’ robustness to linguistic variations in mathematical reasoning, 2025

Neeraja Kirtane, Yuvraj Khanna, and Peter Relan. Mathrobust-lv: Evaluation of large language models’ robustness to linguistic variations in mathematical reasoning, 2025. URL https: //arxiv.org/abs/2510.06430

work page arXiv 2025
[26]

GSM-plus: A comprehensive benchmark for evaluating the robustness of LLMs as mathematical prob- lem solvers

Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. GSM-plus: A comprehensive benchmark for evaluating the robustness of LLMs as mathematical prob- lem solvers. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 2961–2984, Bangkok, Thailand, aug

work page
[27]

doi: 10.18653/v1/2024.acl-long.163

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.163. URL https://aclanthology.org/2024.acl-long.163

work page doi:10.18653/v1/2024.acl-long.163 2024
[28]

Ride: Difficulty evolving perturbation with item response theory for mathematical reasoning,

Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao, Jipeng Zhang, and Yunshi Lan. Ride: Difficulty evolving perturbation with item response theory for mathematical reasoning,

work page
[29]

URLhttps://arxiv.org/abs/2511.04120

work page arXiv
[30]

Aime 2024 dataset, 2024

Mathematical Association of America. Aime 2024 dataset, 2024. URL https:// huggingface.co/datasets/HuggingFaceH4/aime_2024

work page 2024
[31]

Thomas and Yao, Shunyu and Friedman, Dan and Hardy, Mathew D

R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L. Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve. InProceedings of the National Academy of Sciences (PNAS), 2023. doi: 10. 1073/pnas.2322420121. URLhttps://www.pnas.org/doi/10.1073/pnas.2322420121

work page doi:10.1073/pnas.2322420121 2023
[32]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025. URLhttps://arxiv.org/abs/2410.05229

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Allen Newell and Herbert A. Simon. Computer science as empirical inquiry: symbols and search.Commun. ACM, 19(3), March 1976. ISSN 0001-0782. doi: 10.1145/360018.360022. URLhttps://doi.org/10.1145/360018.360022

work page doi:10.1145/360018.360022 1976
[34]

Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models,

Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, and Jenia Jitsev. Alice in wonderland: Simple tasks showing complete reasoning breakdown in state-of-the-art large language models,

work page
[35]

URLhttps://arxiv.org/abs/2406.02061. 11

work page arXiv
[36]

Openreasoning-nemotron-32b, 2025

NVIDIA. Openreasoning-nemotron-32b, 2025. URL https://huggingface.co/nvidia/ OpenReasoning-Nemotron-32B

work page 2025
[37]

Openreasoning-nemotron-7b, 2025

NVIDIA. Openreasoning-nemotron-7b, 2025. URL https://huggingface.co/nvidia/ OpenReasoning-Nemotron-7B

work page 2025
[38]

gpt-oss-120b, 2025

OpenAI. gpt-oss-120b, 2025. URLhttps://huggingface.co/openai/gpt-oss-120b

work page 2025
[39]

Gpt-5.4 technical report

OpenAI. Gpt-5.4 technical report. Technical report, OpenAI, 2026. URL https://openai. com/index/introducing-gpt-5-4/

work page 2026
[40]

Prompt caching, 2026

OpenAI. Prompt caching, 2026. URL https://openai.com/index/ api-prompt-caching/

work page 2026
[41]

VarBench: Robust language model benchmarking through dynamic variable perturbation

Kun Qian, Shunji Wan, Claudia Tang, Youzhi Wang, Xuanming Zhang, Maximillian Chen, and Zhou Yu. VarBench: Robust language model benchmarking through dynamic variable perturbation. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 16131–16161, Miami, Florida, USA, nov 2024. Association for Computational Linguis- tics. doi: 10.1...

work page doi:10.18653/v1/2024.findings-emnlp.946 2024
[42]

Qwen3: The next generation of qwen large language models, 2026

Qwen Team. Qwen3: The next generation of qwen large language models, 2026. URL https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507

work page 2026
[43]

Lost in the middle: An emergent property from information retrieval demands in llms, 2025

Nikolaus Salvatore, Hao Wang, and Qiong Zhang. Lost in the middle: An emergent property from information retrieval demands in llms, 2025. URL https://arxiv.org/abs/2510. 10276

work page 2025
[44]

Asymob: Algebraic symbolic mathematical operations benchmark, 2025

Michael Shalyt, Rotem Elimelech, and Ido Kaminer. Asymob: Algebraic symbolic mathematical operations benchmark, 2025. URLhttps://arxiv.org/abs/2505.23851

work page arXiv 2025
[45]

Chi, Nathanael Schärli, and Denny Zhou

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. InProceedings of the 40th International Conference on Machine Learning (ICML), pages 31210–31227, 2023. URLhttps://proceedings.mlr.press/v202/shi23a.html

work page 2023
[46]

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025. URL https://arxiv.org/abs/ 2506.06941

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap, 2024

Saurabh Srivastava, Annarose M B, Anto P V , Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, and Sooraj Thomas. Functional benchmarks for robust evaluation of reasoning performance, and the reasoning gap, 2024. URL https://arxiv. org/abs/2402.19450

work page arXiv 2024
[48]

Numerical sensitivity and robustness: Exploring the flaws of mathematical reasoning in large language models, 2025

Zhishen Sun, Guang Dai, Ivor Tsang, and Haishan Ye. Numerical sensitivity and robustness: Exploring the flaws of mathematical reasoning in large language models, 2025. URL https: //arxiv.org/abs/2511.08022

work page arXiv 2025
[49]

Mscr: Exploring the vulnerability of llms’ mathematical reasoning abilities using multi-source candidate replacement, 2025

Zhishen Sun, Guang Dai, and Haishan Ye. Mscr: Exploring the vulnerability of llms’ mathematical reasoning abilities using multi-source candidate replacement, 2025. URL https://arxiv.org/abs/2511.08055

work page arXiv 2025
[50]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y . Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51] Technology Innovation Institute. Falcon-h1r: Pushing the reasoning frontiers with a hybrid model for efficient test-time scaling, January 2026. URL https://huggingface.co/tiiuae/Falcon-H1R-7B

[52] Reddit User. Leaked: I caught google antigravity’s hidden inner prompt, 2026. URL https://www.reddit.com/r/google_antigravity/comments/1rl1vjx/leaked_i_caught_google_antigravitys_hidden_inner/

[53] Alexander von Recum, Leander Girrbach, and Zeynep Akata. Are reasoning llms robust to interventions on their chain-of-thought? In The Fourteenth International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2602.07470

[54] Yuqing Wang and Yun Zhao. Rupbench: Benchmarking reasoning under perturbations for robustness evaluation in large language models, 2024. URL https://arxiv.org/abs/2406.11020

[55] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088.

[56] Wikipedia. Rail fence cipher, 2026. URL https://en.wikipedia.org/wiki/Rail_fence_cipher

[57] Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, and Ravi Kumar. On memorization of large language models in logical reasoning. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Li...

[58] Roy Xie, Chengxuan Huang, Junlin Wang, and Bhuwan Dhingra. Adversarial math word problem generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5075–5093, Miami, Florida, USA, 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.292. URL https://aclanthology.org/2024.findings-emnlp.292

[59] Yuli Yang, Hiroaki Yamada, and Takenobu Tokunaga. Evaluating robustness of LLMs to numerical variations in mathematical reasoning. In The Sixth Workshop on Insights from Negative Results in NLP, pages 171–180, Albuquerque, New Mexico, May 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.insights-1.16. URL https://aclanthology.org/2025...

[60] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Proceedings of the Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2305.10601

[61] Alex L. Zhang, Tim Kraska, and Omar Khattab. Recursive language models, 2026. URL https://arxiv.org/abs/2512.24601

[62] Yilun Zhao, Guo Gan, Chengye Wang, Chen Zhao, and Arman Cohan. Are multimodal LLMs robust against adversarial perturbations? RoMMath: A systematic evaluation on multimodal math reasoning. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Lingui...

[63] Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, and Xing Xie. Dyval 2: Dynamic evaluation of large language models by meta probing agents. In Forty-first International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2402.14865

A Appendix

A.1 A - Methodology Details

When the model is given a transformed user query, it is given instructions ...

Word Reversal: The order of words (words are defined as sequences of symbols separated by spaces) in the user query has been reversed.

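As a hedged illustration of the Word Reversal rule above, the following Python sketch reverses the order of space-separated words. The function name `word_reversal` is hypothetical and not taken from the benchmark's code; splitting on a single space is an assumption that follows the stated definition of a word.

```python
def word_reversal(query: str) -> str:
    """Reverse the order of words, where a word is any run of
    symbols between single spaces (assumed reading of the rule)."""
    # split(" ") keeps empty strings for repeated spaces, so the
    # transformation is exactly invertible.
    return " ".join(query.split(" ")[::-1])


if __name__ == "__main__":
    transformed = word_reversal("Find the sum of all positive integers")
    print(transformed)
    # Applying the same function again recovers the original query.
    print(word_reversal(transformed))
```

Because reversing twice restores the input, the same function also serves as the decoder in this sketch.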
Sentence Reversal: The order of sentences in the user query has been reversed. Sentences are defined as sequences of symbols separated by periods.

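A minimal sketch of the Sentence Reversal rule, assuming the most literal reading: the query is split on every period and the pieces are emitted in reverse order. The name `sentence_reversal` is hypothetical, and how the benchmark treats whitespace around periods, a trailing period, or decimal numbers such as 3.5 is not stated in the text above, so this sketch does nothing special for them.

```python
def sentence_reversal(query: str) -> str:
    """Reverse the order of period-separated segments.

    Assumption: every '.' is a separator, including periods inside
    decimal numbers; the source does not say how those are handled.
    """
    return ".".join(query.split(".")[::-1])


if __name__ == "__main__":
    original = "Let x be a positive integer. Find the smallest x"
    transformed = sentence_reversal(original)
    print(transformed)
    # The literal split/join is its own inverse.
    print(sentence_reversal(transformed) == original)
```

Under this literal reading a decimal such as 3.5 would be split into two segments, which is one reason a real implementation may need a more careful sentence boundary.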
Interleaved Context Word: User query will consist of two problems - A and B, whose statements are interleaved word by word. First word belongs to problem A, second word belongs to problem B, third word belongs to problem A, and so on. You need to solve only problem A. Words are defined as sequences of symbols separated by spaces. If one problem statement ...

Interleaved Context Line: User query will consist of two problems - A and B, whose statements are split into line segments at most 60 symbols long. Each segment is followed by a space and a problem tag (e.g. problem A or B). The segments are interleaved. You need to solve only problem A. If one problem statement is shorter than the other, the empty lines ...

Word Split Swap: Every word (words are defined as sequences of symbols separated by spaces) in user query is split into 2 parts down the middle. If the word has odd number of symbols, the first part has one symbol less than the second part. After splitting, the 2 parts are swapped.

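The Word Split Swap rule above can be sketched as follows. The function names are hypothetical and the code is an illustration under the stated rule, not the benchmark's implementation. Integer division gives the first part one symbol less than the second when the word length is odd, as the rule requires.

```python
def split_swap_word(word: str) -> str:
    """Split a word down the middle and swap the two parts."""
    mid = len(word) // 2          # odd length: first part is shorter
    return word[mid:] + word[:mid]


def undo_split_swap_word(word: str) -> str:
    """Invert split_swap_word for a single word."""
    # After the swap, the original second part (the longer one for
    # odd lengths) comes first, so it has len - len // 2 symbols.
    k = len(word) - len(word) // 2
    return word[k:] + word[:k]


def word_split_swap(query: str) -> str:
    return " ".join(split_swap_word(w) for w in query.split(" "))


def undo_word_split_swap(query: str) -> str:
    return " ".join(undo_split_swap_word(w) for w in query.split(" "))


if __name__ == "__main__":
    encoded = word_split_swap("three triangle")
    print(encoded)
    print(undo_word_split_swap(encoded))
```

Unlike the reversal rules, this transformation is not its own inverse for odd-length words, which is why the sketch carries a separate decoder.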
Split Reversal: Every word (words are defined as sequences of symbols separated by spaces) in user query has its symbols in reverse order.

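For the per-word symbol reversal described above, a short Python sketch is given below. The name `reverse_symbols_in_words` is hypothetical; word order is kept and only the symbols inside each space-separated word are reversed.

```python
def reverse_symbols_in_words(query: str) -> str:
    """Reverse the symbols of every space-separated word in place."""
    return " ".join(word[::-1] for word in query.split(" "))


if __name__ == "__main__":
    encoded = reverse_symbols_in_words("Find x+1 quickly")
    print(encoded)
    # The transformation is an involution, so it also decodes.
    print(reverse_symbols_in_words(encoded))
```

Note that punctuation attached to a word moves with it, so a trailing comma ends up at the front of the reversed word.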
Opposites: There will be terms remapped in the user query. The remappings are defined inside ’defyn’ block in the middle of user query.

Wrappers: There will be terms remapped in the user query. The remappings are defined inside ’defyn’ block in the middle of user query.

Rail Fence: The user query is encoded using the Rail Fence Cipher. The input is provided as a visual grid where the symbols (including spaces) of the encoded message string (message string does NOT contain any newline characters) are placed in a zigzag pattern across multiple rails (rows), and empty spaces are filled with dots (.). To decode, read the cha...

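The grid layout described in the Rail Fence rule can be sketched as below. This is an illustration of the standard rail fence zigzag, not the benchmark's code: the function names and the default of three rails are assumptions, and the decoding instructions are truncated in the text above, so the decoder here simply retraces the zigzag.

```python
def zigzag_rows(n: int, rails: int) -> list[int]:
    """Row index visited by each of n symbols on a zigzag path."""
    if rails < 2:
        return [0] * n
    period = 2 * (rails - 1)
    rows = []
    for i in range(n):
        p = i % period
        rows.append(p if p < rails else period - p)
    return rows


def rail_fence_grid(message: str, rails: int = 3) -> list[str]:
    """Place each symbol in its own column; fill other cells with dots."""
    grid = [["."] * len(message) for _ in range(rails)]
    for col, (ch, row) in enumerate(zip(message, zigzag_rows(len(message), rails))):
        grid[row][col] = ch
    return ["".join(r) for r in grid]


def rail_fence_decode(grid: list[str]) -> str:
    """Recover the message by walking the same zigzag over the grid."""
    width = len(grid[0]) if grid else 0
    rows = zigzag_rows(width, len(grid))
    return "".join(grid[r][c] for c, r in enumerate(rows))


if __name__ == "__main__":
    for line in rail_fence_grid("HELLO WORLD", rails=3):
        print(line)
```

Retracing the zigzag, rather than picking the non-dot symbol in each column, keeps decoding correct when the message itself contains periods.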
Rectangle Perimeter: "The user query is mapped onto the perimeter of a rectangle. The message is written as a single continuous string following the edges of the shape in a clockwise manner, beginning at the top-left. The TRANSFORMED INPUT is provided as a visual text block representing this rectangle with GRID START and GRID END markers. The center of th...

Snake Vertical: The user query is written into a grid using a vertical ’snake’ (zigzag) pattern. Starting from the top-left, the text is written down the first column, then up the second column, then down the third, and so on. The TRANSFORMED INPUT is provided as a visual grid with GRID START and GRID END markers.

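A hedged sketch of the Snake Vertical layout follows. The number of rows, the padding symbol for unused cells, and the function names are all assumptions made for illustration; the text above does not specify the grid dimensions or padding, and the GRID START and GRID END markers are omitted here.

```python
def snake_vertical_grid(text: str, rows: int, pad: str = " ") -> list[str]:
    """Write text down column 0, up column 1, down column 2, and so on."""
    cols = -(-len(text) // rows)              # ceiling division
    grid = [[pad] * cols for _ in range(rows)]
    for i, ch in enumerate(text):
        col, k = divmod(i, rows)
        # Even columns run top to bottom, odd columns bottom to top.
        row = k if col % 2 == 0 else rows - 1 - k
        grid[row][col] = ch
    return ["".join(r) for r in grid]


def snake_vertical_decode(grid: list[str]) -> str:
    """Read the grid back along the same path (padding included)."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    out = []
    for col in range(cols):
        order = range(rows) if col % 2 == 0 else range(rows - 1, -1, -1)
        out.extend(grid[r][col] for r in order)
    return "".join(out)


if __name__ == "__main__":
    for line in snake_vertical_grid("ABCDEFG", rows=3):
        print(line)
```

The decoder returns any padding cells it passes through, so a caller would need to know the original length or the padding convention to trim the result.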
Snake Horizontal: The user query is written into a grid using a horizontal ’snake’ (zigzag) pattern. Starting from the top-left, the text is written across the first row, then left across the second row, then right across the third, and so on. The TRANSFORMED INPUT is provided as a visual grid with GRID START and GRID END markers.

Disclosure: This research...

Read the "TRANSFORMATION RULE" provided by the user and reverse the transformation on the "TRANSFORMED INPUT" to obtain the original problem statement.

Once you have the original problem statement, proceed to solve the math problem.

Put your final answer within \boxed{}.

A.3 Cognitive Thrashing

Figure 7 shows the average output token length for each transformation. The number in the center of each bar is the accuracy of the model on that transformation. Analysis of the figure reveals a

[Figure 7 axis labels: Baseline, Not-Not, Opposites, Wrappers, Interleave-L, Interleave-W, Interleave-S, Context, Sentence-Rev, Word...]
