
arxiv: 2603.03332 · v3 · submitted 2026-02-11 · 💻 cs.CL · cs.AI · cs.LG

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Pith reviewed 2026-05-16 03:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords chain-of-thought · LLM robustness · reasoning perturbations · mathematical reasoning · model scaling · vulnerability patterns · empirical evaluation

The pith

Large language models show uneven drops in accuracy when their chain-of-thought steps contain different kinds of errors, with effects that depend on error type and model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well LLMs maintain correct answers on math problems when the provided reasoning chain is altered by one of five specific perturbations. Thirteen models ranging from small to large parameter counts were measured for accuracy loss after the perturbations were injected into otherwise correct chains. Math errors caused the largest drops in small models, unit conversions stayed difficult across sizes, and extra steps barely changed outcomes. These patterns indicate that chain-of-thought prompting is not uniformly fragile and that size helps against some but not all error types. The results point to the need for targeted checks when using LLMs in longer reasoning pipelines.

Core claim

LLM robustness to CoT perturbations is heterogeneous across five error types. MathError perturbations produce the most severe degradation in small models with 50-60% accuracy loss but show strong scaling benefits. UnitConversion remains challenging across all scales with greater than 5% loss even for midsized models. ExtraSteps incur minimal accuracy degradation of 0-6% even for the smallest models. Sycophancy and SkippedSteps produce modest effects of about 10% loss for small models and improve slightly with scale. Model size serves as a protective factor against many perturbations but not always.
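The accuracy-loss figures above can be read as the gap between accuracy on unperturbed chains and accuracy on perturbed ones. The following is a minimal sketch of that bookkeeping, assuming loss is measured in percentage points (the page does not say whether the paper reports absolute or relative loss). The function names and the toy outcomes are hypothetical and are not taken from the paper's repository.

```python
def accuracy(outcomes):
    """Fraction of correct answers in a list of booleans."""
    return sum(outcomes) / len(outcomes)


def accuracy_loss(baseline, perturbed):
    """Percentage-point accuracy drop for each perturbation type.

    baseline:  list of bools, one per problem, for unperturbed chains
    perturbed: dict mapping a perturbation name to a list of bools
    """
    base_acc = accuracy(baseline)
    return {
        name: 100.0 * (base_acc - accuracy(results))
        for name, results in perturbed.items()
    }


# Made-up outcomes for illustration only; these are not the paper's data.
baseline = [True] * 8 + [False] * 2             # 8 of 10 correct
perturbed = {
    "MathError": [True] * 3 + [False] * 7,      # 3 of 10 correct
    "ExtraSteps": [True] * 8 + [False] * 2,     # 8 of 10 correct
}
losses = accuracy_loss(baseline, perturbed)
for name, loss in losses.items():
    print(f"{name}: {loss:.1f} points lost")
```

Keeping the loss per perturbation type in one dictionary makes it straightforward to compare types for a single model before looking at trends across model sizes.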

What carries the argument

A taxonomy of five CoT perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) injected into reasoning chains for mathematical tasks, measured by accuracy change across models of different sizes.
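To make the idea of injecting a perturbation into a reasoning chain concrete, here is a small sketch for two of the five types. It assumes a chain is stored as a list of step strings; the helpers `inject_math_error` and `inject_extra_step` are hypothetical illustrations, and the authors' actual injection procedure may differ.

```python
import re


def inject_math_error(steps, step_index, offset=1):
    """Return a copy of `steps` in which the last integer of one step is wrong.

    The chosen step keeps its wording; only its final number is shifted
    by `offset`, which mimics an arithmetic slip.
    """
    target = steps[step_index]
    matches = list(re.finditer(r"\d+", target))
    if not matches:
        raise ValueError("step contains no integer to corrupt")
    last = matches[-1]
    wrong_value = int(last.group()) + offset
    corrupted = target[:last.start()] + str(wrong_value) + target[last.end():]
    return steps[:step_index] + [corrupted] + steps[step_index + 1:]


def inject_extra_step(steps, step_index, filler):
    """Return a copy of `steps` with a redundant step after `step_index`."""
    return steps[:step_index + 1] + [filler] + steps[step_index + 1:]


# Toy chain for illustration only.
chain = [
    "Each box holds 12 pens.",
    "3 boxes hold 3 * 12 = 36 pens.",
    "The answer is 36.",
]
corrupted = inject_math_error(chain, 1)
padded = inject_extra_step(chain, 0, "Pens are used for writing.")
print(corrupted[1])
print(len(padded))
```

Both helpers return new lists and leave the original chain untouched, so the same clean chain can be reused for every perturbation type.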

Load-bearing premise

The five perturbation types represent the main real-world errors that occur in LLM chain-of-thought reasoning and the chosen math tasks isolate those errors without other confounding influences.

What would settle it

Finding that all five perturbation types produce roughly equal accuracy losses with no consistent scaling benefit would contradict the reported heterogeneous vulnerability patterns.

Figures

Figures reproduced from arXiv: 2603.03332 by Ashwath Vaithinathan Aravindan, Mayank Kejriwal.

Figure 1: Fixed-effects regression lines quantifying the relationship between model size and robustness across perturbation types. The x-axis shows log10 model size (billions of parameters), and the y-axis displays robustness under perturbation. Each colored line represents a fitted regression with slope and intercept shown in the legend. The heterogeneous regression slopes reveal fundamentally different scaling beh…
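The slope and intercept in a plot of this kind come from regressing robustness on log10 model size. The sketch below fits an ordinary least-squares line for a single perturbation type, which is a simplification of the fixed-effects regression named in the caption. The function `fit_line` and the data points are hypothetical and are not the paper's measurements.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points: model size in billions of parameters and a robustness score.
sizes_b = [1, 10, 100]
robustness = [0.4, 0.6, 0.8]

log_sizes = [math.log10(s) for s in sizes_b]
slope, intercept = fit_line(log_sizes, robustness)
print(f"slope per decade of parameters: {slope:.3f}, intercept: {intercept:.3f}")
```

A steeper positive slope would indicate a perturbation type whose effect shrinks quickly with scale, while a slope near zero would indicate one that persists across sizes.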
Original abstract

Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (>5% loss even for midsized models); ExtraSteps incur minimal accuracy degradation (0-6%) even for the smallest models; Sycophancy and SkippedSteps produce modest effects (~10% loss for small models) and slightly improve with scale. Scaling relationships show that model size serves as a protective factor against many perturbations but not always. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at https://github.com/Mystic-Slice/CoTPerturbation

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents an empirical study evaluating the robustness of 13 LLMs (spanning three orders of magnitude in size) to five author-defined perturbations injected into chain-of-thought reasoning chains on mathematical tasks: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. It reports heterogeneous vulnerability patterns, including 50-60% accuracy loss from MathError in small models with scaling benefits, persistent >5% loss from UnitConversion across scales, minimal 0-6% loss from ExtraSteps, and modest ~10% effects from the others that improve slightly with scale, concluding that model size is a partial protective factor with implications for multi-stage reasoning pipelines.

Significance. If the central empirical patterns hold after addressing methodological gaps, the work provides a useful large-scale measurement of CoT fragility across model sizes and perturbation types, highlighting that scaling mitigates some but not all vulnerabilities and underscoring the need for task-specific robustness checks in deployed reasoning systems. The public code release strengthens reproducibility.

major comments (2)
  1. [Abstract and Results] The abstract and results sections report specific accuracy losses (e.g., 50-60% for MathError in small models) without providing base CoT accuracies per task or per model prior to perturbation, making it impossible to determine whether the deltas reflect perturbation-specific effects or interactions with inherent task difficulty.
  2. [Introduction and Methods] The taxonomy of five perturbation types is presented as structured and representative, but the manuscript provides no derivation from observed LLM error distributions or validation that the injection process isolates each type without confounds such as altered prompt length, token count, or solvability across types.
minor comments (1)
  1. [Abstract] The GitHub link is provided but the manuscript does not specify which exact tasks, prompts, or exclusion criteria were used, limiting independent verification.
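The first major comment turns on the difference between an absolute and a relative drop in accuracy. The sketch below, using made-up numbers, shows how the same 30-point absolute loss looks very different once the unperturbed baseline is known. The function names are hypothetical.

```python
def absolute_loss(base_acc, perturbed_acc):
    """Accuracy drop expressed as a difference of fractions."""
    return base_acc - perturbed_acc


def relative_loss(base_acc, perturbed_acc):
    """Accuracy drop as a share of the baseline accuracy."""
    if base_acc == 0:
        raise ValueError("relative loss is undefined for a zero baseline")
    return (base_acc - perturbed_acc) / base_acc


# Two hypothetical models that lose the same 30 points of accuracy.
strong = (0.90, 0.60)   # (baseline, perturbed)
weak = (0.40, 0.10)

for label, (base, pert) in (("strong", strong), ("weak", weak)):
    print(
        f"{label}: absolute {absolute_loss(base, pert):.2f}, "
        f"relative {relative_loss(base, pert):.2f}"
    )
```

In this toy case the weaker model loses three quarters of what it could originally solve, while the stronger model loses one third, which is why per-model baselines are needed to interpret a reported delta.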

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to incorporate additional baseline data and methodological details.

  1. Referee: [Abstract and Results] The abstract and results sections report specific accuracy losses (e.g., 50-60% for MathError in small models) without providing base CoT accuracies per task or per model prior to perturbation, making it impossible to determine whether the deltas reflect perturbation-specific effects or interactions with inherent task difficulty.

    Authors: We agree that baseline CoT accuracies are necessary to interpret the reported deltas. The revised manuscript now includes Table 1, which reports unperturbed CoT accuracy for every model and task. This table makes clear that the 50-60% MathError losses occur on tasks where baseline accuracy is already moderate (60-75%), while the smaller losses for ExtraSteps occur on tasks with higher baselines, confirming the effects are perturbation-specific rather than solely driven by task difficulty. revision: yes

  2. Referee: [Introduction and Methods] The taxonomy of five perturbation types is presented as structured and representative, but the manuscript provides no derivation from observed LLM error distributions or validation that the injection process isolates each type without confounds such as altered prompt length, token count, or solvability across types.

    Authors: The taxonomy draws from error categories documented in prior CoT studies (e.g., arithmetic mistakes, unit inconsistencies, and step omissions). To address potential confounds, the revised Methods section now reports mean prompt lengths and token counts per perturbation type (all within 5% of the unperturbed baseline) and includes a human solvability check confirming that perturbed chains remain solvable at comparable rates. We have also added a short paragraph explaining the construction rules used to isolate each perturbation type. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical robustness evaluation


The paper conducts a direct empirical evaluation by defining five perturbation types, injecting them into CoT chains on mathematical tasks, and reporting observed accuracy changes across 13 models. There are no equations, derivations, fitted parameters, or self-citations that reduce the reported vulnerability patterns (e.g., MathError degradation or scaling benefits) to quantities defined by the authors' own choices. Results are presented as measured outcomes from external model testing, with no load-bearing steps that collapse by construction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical results from the perturbation experiments. It assumes the perturbation taxonomy captures meaningful error modes and that the math tasks isolate the effect of those perturbations.

axioms (2)
  • domain assumption: The five perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) accurately represent potential real-world corruptions in LLM chain-of-thought reasoning.
    The taxonomy is introduced by the authors without external validation or citation to prior taxonomies in the abstract.
  • domain assumption: Accuracy on the chosen mathematical reasoning tasks is a valid proxy for general CoT robustness.
    The abstract does not discuss whether task selection or domain specificity limits generalization.

pith-pipeline@v0.9.0 · 5565 in / 1390 out tokens · 123012 ms · 2026-05-16T03:38:26.537951+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

    cs.AI 2026-03 unverdicted novelty 7.0

    WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.

  2. SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

    cs.CL 2026-04 unverdicted novelty 6.0

    SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.

Author Contributions

The experiments, data analysis and interpretation of the results were performed by A V A under the supervision of M...