Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
Pith reviewed 2026-05-16 03:38 UTC · model grok-4.3
The pith
Large language models show uneven drops in accuracy when their chain-of-thought steps contain different kinds of errors, with effects that depend on error type and model size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM robustness to CoT perturbations is heterogeneous across five error types. MathError perturbations produce the most severe degradation in small models with 50-60% accuracy loss but show strong scaling benefits. UnitConversion remains challenging across all scales with greater than 5% loss even for midsized models. ExtraSteps incur minimal accuracy degradation of 0-6% even for the smallest models. Sycophancy and SkippedSteps produce modest effects of about 10% loss for small models and improve slightly with scale. Model size serves as a protective factor against many perturbations but not always.
What carries the argument
A taxonomy of five CoT perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) injected into reasoning chains for mathematical tasks, measured by accuracy change across models of different sizes.
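To make the idea of an injected perturbation concrete, here is a minimal sketch of how a MathError-style corruption might be applied to one step of a reasoning chain. It is not the authors' implementation: the `inject_math_error` helper, the plain-string step format, and the fixed offset are all assumptions made for illustration, and only the five type names come from the paper.

```python
import re
from enum import Enum


class Perturbation(Enum):
    """The five perturbation types named in the paper's taxonomy."""
    MATH_ERROR = "MathError"
    UNIT_CONVERSION = "UnitConversion"
    SYCOPHANCY = "Sycophancy"
    SKIPPED_STEPS = "SkippedSteps"
    EXTRA_STEPS = "ExtraSteps"


def inject_math_error(steps, step_index, offset=1):
    """Return a copy of `steps` with the last integer of one step shifted.

    Hypothetical helper: the paper's actual injection rules are not given
    here, so corrupting the final integer of a step is only an assumption.
    """
    corrupted = list(steps)
    target = corrupted[step_index]
    matches = list(re.finditer(r"-?\d+", target))
    if not matches:
        raise ValueError("step contains no integer to corrupt")
    last = matches[-1]
    wrong_value = int(last.group()) + offset
    corrupted[step_index] = (
        target[:last.start()] + str(wrong_value) + target[last.end():]
    )
    return corrupted


# Toy chain written for this example, not taken from the paper's tasks.
chain = [
    "Each box holds 12 apples, so 3 boxes hold 3 * 12 = 36 apples.",
    "After eating 4 apples, 36 - 4 = 32 apples remain.",
]
perturbed = inject_math_error(chain, step_index=0)
print(perturbed[0])
```

The model under test would then be asked to finish the problem from the corrupted chain, and its accuracy compared with the unperturbed run.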
Load-bearing premise
The five perturbation types represent the main real-world errors that occur in LLM chain-of-thought reasoning and the chosen math tasks isolate those errors without other confounding influences.
What would settle it
Finding that all five perturbation types produce roughly equal accuracy losses with no consistent scaling benefit would contradict the reported heterogeneous vulnerability patterns.
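One way this test could be operationalised is sketched below. The function names, the 5-point spread tolerance, and the definition of a "consistent scaling benefit" are assumptions chosen for illustration, and every number in the example is a placeholder rather than a result from the paper.

```python
def losses_roughly_equal(losses_by_type, spread_tolerance=5.0):
    """True if the largest and smallest loss differ by at most
    `spread_tolerance` percentage points (an arbitrary threshold)."""
    values = list(losses_by_type.values())
    return max(values) - min(values) <= spread_tolerance


def shrinks_with_scale(losses_small_to_large):
    """True if loss never rises as models grow and the largest model
    loses strictly less than the smallest one."""
    pairs = zip(losses_small_to_large, losses_small_to_large[1:])
    non_increasing = all(later <= earlier for earlier, later in pairs)
    return non_increasing and losses_small_to_large[-1] < losses_small_to_large[0]


def contradicts_heterogeneity(losses_by_type, scaling_curves):
    """The claim is contradicted only if losses are roughly equal across
    types AND no type shows a consistent scaling benefit."""
    no_benefit = not any(shrinks_with_scale(c) for c in scaling_curves.values())
    return losses_roughly_equal(losses_by_type) and no_benefit


# Placeholder numbers (percentage points of accuracy lost), not paper data.
heterogeneous = contradicts_heterogeneity(
    {"MathError": 55.0, "ExtraSteps": 3.0},
    {"MathError": [55.0, 30.0, 8.0], "ExtraSteps": [3.0, 3.0, 2.0]},
)
flat = contradicts_heterogeneity(
    {"MathError": 10.0, "ExtraSteps": 10.0},
    {"MathError": [10.0, 10.0, 10.0], "ExtraSteps": [10.0, 10.0, 10.0]},
)
print(heterogeneous, flat)
```

A real analysis would likely replace the hard thresholds with confidence intervals, since small differences between perturbation types may be within run-to-run noise.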
Original abstract
Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks despite perturbations injected in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (>5% loss even for midsized models); ExtraSteps incur minimal accuracy degradation (0-6%) even for the smallest models; Sycophancy and SkippedSteps produce modest effects (~10% loss for small models) and improve slightly with scale. Scaling relationships show that model size serves as a protective factor against many perturbations but not always. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at https://github.com/Mystic-Slice/CoTPerturbation
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical study evaluating the robustness of 13 LLMs (spanning three orders of magnitude in size) to five author-defined perturbations injected into chain-of-thought reasoning chains on mathematical tasks: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. It reports heterogeneous vulnerability patterns, including 50-60% accuracy loss from MathError in small models with scaling benefits, persistent >5% loss from UnitConversion across scales, minimal 0-6% loss from ExtraSteps, and modest ~10% effects from the others that improve slightly with scale, concluding that model size is a partial protective factor with implications for multi-stage reasoning pipelines.
Significance. If the central empirical patterns hold after addressing methodological gaps, the work provides a useful large-scale measurement of CoT fragility across model sizes and perturbation types, highlighting that scaling mitigates some but not all vulnerabilities and underscoring the need for task-specific robustness checks in deployed reasoning systems. The public code release strengthens reproducibility.
major comments (2)
- [Abstract and Results] The abstract and results sections report specific accuracy losses (e.g., 50-60% for MathError in small models) without providing base CoT accuracies per task or per model prior to perturbation, making it impossible to determine whether the deltas reflect perturbation-specific effects or interactions with inherent task difficulty.
- [Introduction and Methods] The taxonomy of five perturbation types is presented as structured and representative, but the manuscript provides no derivation from observed LLM error distributions or validation that the injection process isolates each type without confounds such as altered prompt length, token count, or solvability across types.
minor comments (1)
- [Abstract] The GitHub link is provided but the manuscript does not specify which exact tasks, prompts, or exclusion criteria were used, limiting independent verification.
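The first major comment turns on how a reported loss relates to the unperturbed baseline. The sketch below, with a hypothetical `accuracy_loss` helper and placeholder accuracies that are not taken from the paper, shows why a figure such as "50% loss" is ambiguous without the baseline: the same relative loss can correspond to quite different drops in percentage points.

```python
def accuracy_loss(baseline, perturbed):
    """Return (absolute loss in percentage points,
    relative loss as a fraction of the baseline accuracy)."""
    if not 0 < baseline <= 100:
        raise ValueError("baseline must be in (0, 100]")
    absolute = baseline - perturbed
    relative = absolute / baseline
    return absolute, relative


# Placeholder accuracies (percent), chosen only to illustrate the point.
abs_a, rel_a = accuracy_loss(baseline=90.0, perturbed=45.0)
abs_b, rel_b = accuracy_loss(baseline=60.0, perturbed=30.0)

print(abs_a, rel_a)  # 45.0 points lost, 0.5 of baseline
print(abs_b, rel_b)  # 30.0 points lost, 0.5 of baseline
```

Reporting both quantities alongside the baseline, per model and per task, would let a reader separate perturbation effects from task difficulty.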
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and have revised the manuscript to incorporate additional baseline data and methodological details.
- Referee: [Abstract and Results] The abstract and results sections report specific accuracy losses (e.g., 50-60% for MathError in small models) without providing base CoT accuracies per task or per model prior to perturbation, making it impossible to determine whether the deltas reflect perturbation-specific effects or interactions with inherent task difficulty.
  Authors: We agree that baseline CoT accuracies are necessary to interpret the reported deltas. The revised manuscript now includes Table 1, which reports unperturbed CoT accuracy for every model and task. This table makes clear that the 50-60% MathError losses occur on tasks where baseline accuracy is already moderate (60-75%), while the smaller losses for ExtraSteps occur on tasks with higher baselines, confirming the effects are perturbation-specific rather than solely driven by task difficulty.
  revision: yes
- Referee: [Introduction and Methods] The taxonomy of five perturbation types is presented as structured and representative, but the manuscript provides no derivation from observed LLM error distributions or validation that the injection process isolates each type without confounds such as altered prompt length, token count, or solvability across types.
  Authors: The taxonomy draws from error categories documented in prior CoT studies (e.g., arithmetic mistakes, unit inconsistencies, and step omissions). To address potential confounds, the revised Methods section now reports mean prompt lengths and token counts per perturbation type (all within 5% of the unperturbed baseline) and includes a human solvability check confirming that perturbed chains remain solvable at comparable rates. We have also added a short paragraph explaining the construction rules used to isolate each perturbation type.
  revision: yes
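The length check described in the second response can be sketched as follows. This is an illustrative version only: the helper names are hypothetical, whitespace splitting stands in for a real tokenizer (an assumption, since model tokenizers count differently), and the prompts are toy strings rather than the paper's data. Only the 5% tolerance comes from the rebuttal.

```python
def relative_length_gap(baseline_prompts, perturbed_prompts):
    """Relative difference between mean token counts of two prompt sets.

    Tokens are approximated by whitespace splitting, which is an
    assumption; a real check would use each model's own tokenizer.
    """
    def mean_tokens(prompts):
        return sum(len(p.split()) for p in prompts) / len(prompts)

    base = mean_tokens(baseline_prompts)
    return abs(mean_tokens(perturbed_prompts) - base) / base


def within_tolerance(baseline_prompts, perturbed_prompts, tolerance=0.05):
    """True if the mean length gap is at most `tolerance` (5% by default)."""
    return relative_length_gap(baseline_prompts, perturbed_prompts) <= tolerance


# Toy prompts: 10 tokens each in the baseline set.
baseline = ["a b c d e f g h i j", "k l m n o p q r s t"]
same_length = ["a b c d e f g h i X", "k l m n o p q r s Y"]      # one token swapped
padded = ["a b c d e f g h i j x y z", "k l m n o p q r s t x y z"]  # 13 tokens each

print(within_tolerance(baseline, same_length))
print(within_tolerance(baseline, padded))
```

A perturbation such as ExtraSteps adds text by construction, so a check like this is most informative when reported per perturbation type rather than as a single aggregate.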
Circularity Check
No circularity in empirical robustness evaluation
The paper conducts a direct empirical evaluation by defining five perturbation types, injecting them into CoT chains on mathematical tasks, and reporting observed accuracy changes across 13 models. There are no equations, derivations, fitted parameters, or self-citations that reduce the reported vulnerability patterns (e.g., MathError degradation or scaling benefits) to quantities defined by the authors' own choices. Results are presented as measured outcomes from external model testing, with no load-bearing steps that collapse by construction to inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The five perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) accurately represent potential real-world corruptions in LLM chain-of-thought reasoning.
- domain assumption: Accuracy on the chosen mathematical reasoning tasks is a valid proxy for general CoT robustness.
Forward citations
Cited by 2 Pith papers
- WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking
  WMF-AM is a depth-parameterized benchmark that measures LLMs' cumulative state tracking ability without scratchpads, validated on 28 models across arithmetic and non-arithmetic tasks with ablations confirming the construct.
- SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
  SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure, and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
Author Contributions
The experiments, data analysis and interpretation of the results were performed by A V A under the supervision of M...