pith. machine review for the scientific record.

arxiv: 2604.16320 · v1 · submitted 2026-02-24 · 💻 cs.SE · cs.AI · cs.LG

Recognition: no theorem link

How Robustly do LLMs Understand Execution Semantics?

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:43 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords LLM robustness · code execution semantics · program output prediction · CRUXEval benchmark · input perturbations · exception handling · GPT-5.2 · DeepSeek-R1

The pith

Frontier LLMs like GPT-5.2 suffer 20-24 percent accuracy drops on code execution prediction when code or inputs are perturbed, while open-source models stay more stable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs truly grasp how code runs or simply match patterns by measuring performance on a program-output prediction task before and after code transformations and input changes. GPT-5.2 reaches 99 percent accuracy on the original CRUXEval cases yet falls sharply on the perturbed versions. Open-source models such as the DeepSeek-R1 family hold steadier, though lower, accuracy across the same changes. Prediction quality declines further for cases that raise exceptions, and the size of the drop varies with the exception type. The work also checks whether targeted fixes for exception cases affect accuracy on normal behaviors.
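In outline, the protocol amounts to scoring the same model on each original case and on its perturbed counterpart, then reporting the accuracy gap. A minimal sketch of that comparison follows; `predict_output` stands in for whatever model call the harness makes, and the `Case` fields are illustrative rather than the benchmark's actual schema.

```python
# Minimal sketch of an original-vs-perturbed accuracy comparison.
# `predict_output(code, inp)` is a hypothetical wrapper around the model
# under test, not an API from the paper.
from dataclasses import dataclass

@dataclass
class Case:
    code: str       # program source shown to the model
    inp: str        # input the program is run on
    expected: str   # ground-truth output (or exception name)

def accuracy(cases, predict_output):
    correct = sum(predict_output(c.code, c.inp) == c.expected for c in cases)
    return correct / len(cases)

def robustness_report(original, perturbed, predict_output):
    """Score the same model on paired original/perturbed cases."""
    acc_orig = accuracy(original, predict_output)
    acc_pert = accuracy(perturbed, predict_output)
    return {
        "original": acc_orig,
        "perturbed": acc_pert,
        "drop": acc_orig - acc_pert,  # the 20-24 point gap reported for GPT-5.2
    }
```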

Core claim

LLMs exhibit limited robustness in understanding execution semantics: frontier models achieve near-perfect scores on unperturbed benchmarks but degrade markedly under code transformations and input perturbations, while open-source reasoning models maintain more consistent, though lower, accuracies. The weakness is most pronounced in predicting exception-raising behaviors, and its extent depends on the exception kind.

What carries the argument

Code transformations together with input perturbations applied to the CRUXEval program-output prediction task, used to expose brittleness in execution-semantics understanding.
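The paper's exact transformation operators are not enumerated above. As one illustrative member of the class such a suite typically contains, a local-variable renaming pass changes surface form while leaving execution behavior untouched. The sketch below is a generic example built on Python's `ast` module, not the authors' implementation, and it is deliberately naive (it ignores annotations and possible name collisions).

```python
# Illustrative semantics-preserving perturbation: rename local variable names.
# A generic example of the transformation class the paper relies on, not the
# paper's own operator set.
import ast

class RenameLocals(ast.NodeTransformer):
    def __init__(self, mapping):
        self.mapping = mapping  # e.g. {"total": "acc"}

    def visit_Name(self, node):          # variable reads/writes
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):           # function parameters
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

def perturb(source, mapping):
    tree = RenameLocals(mapping).visit(ast.parse(source))
    return ast.unparse(tree)             # Python 3.9+

original = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
print(perturb(original, {"total": "acc", "x": "elem", "xs": "values"}))
```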

If this is right

  • High scores on standard benchmarks do not guarantee reliable code-execution understanding under realistic input variation.
  • Prediction accuracy is lower for inputs that raise exceptions and varies by exception type across models (a grouping sketch follows this list).
  • Remedies aimed at exception prediction can be tested for side effects on non-exception prediction accuracy.
  • All evaluated models show limitations in code understanding that perturbation testing can surface.
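The second and third points above reduce to a stratified scoring pass: split the test cases by whether the ground truth is a normal value or an exception, and by exception kind, then score each bucket separately. A minimal sketch, with illustrative field names rather than the paper's data format:

```python
# Sketch of a per-behavior accuracy breakdown. Field names are illustrative.
from collections import defaultdict

def accuracy_by_behavior(cases, predictions):
    """cases[i]['expected'] is a value repr or an exception class name;
    cases[i]['raises'] marks exception-raising ground truth."""
    buckets = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for case, pred in zip(cases, predictions):
        label = case["expected"] if case["raises"] else "non-exception"
        buckets[label][1] += 1
        buckets[label][0] += int(pred == case["expected"])
    return {label: correct / total for label, (correct, total) in buckets.items()}

# e.g. {"non-exception": 0.81, "ValueError": 0.44, "IndexError": 0.58, ...}
```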

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation protocols for code models should routinely include perturbation suites to avoid overestimating semantic competence.
  • The stability difference between model families suggests that training regimes emphasizing reasoning traces may confer greater robustness to surface changes.
  • Similar perturbation methods could be applied to other structured prediction domains to test whether high benchmark scores reflect genuine internal models.

Load-bearing premise

The chosen code transformations and input perturbations specifically measure understanding of execution semantics rather than unrelated surface difficulties or distribution shifts.

What would settle it

A controlled replication in which GPT-5.2 maintains its 99 percent accuracy on the same perturbed CRUXEval inputs would falsify the reported brittleness.

Figures

Figures reproduced from arXiv: 2604.16320 by Claudio Spiess, Earl T. Barr, Prem Devanbu.

Figure 1. Number of decisions encountered on execution.
Figure 2. The prompts we used: the “Direct Output Prompt”.
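The prompt text itself appears only in Figure 2 and is not reproduced here. Purely as a hedged illustration of what a direct output-prediction prompt for a CRUXEval-style task generally looks like, and not the authors' wording, the structure is roughly:

```python
# Hypothetical direct output-prediction prompt skeleton; the authors' exact
# "Direct Output Prompt" is shown only in Figure 2 of the paper.
def direct_output_prompt(code: str, call: str) -> str:
    return (
        "You are given a Python function and a call to it.\n"
        "Predict the exact value the call returns, or the exception it raises.\n\n"
        f"```python\n{code}\n```\n\n"
        f"Call: {call}\n"
        "Output:"
    )
```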
read the original abstract

LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower accuracies (38% to 67%) under code transformations & input perturbations, the frontier model GPT-5.2 exhibits significant brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non-exception behaviors. Our findings both point to limitations in the way all models understand code, and establish the value of using perturbation to evaluate code models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study evaluating the robustness of LLMs' code execution semantics understanding via the CRUXEval benchmark. It reports near-perfect (99%) accuracy for GPT-5.2 on unperturbed inputs but 20-24% drops under code transformations and input perturbations, with open-source models (DeepSeek-R1 family) showing more stable though lower accuracies (38-67%). Additional findings include poorer performance on exception-raising cases that varies by exception type, along with evaluation of remedies for exception prediction and their impact on non-exception cases.

Significance. If the perturbations are confirmed to preserve execution semantics, the results would provide concrete evidence that frontier LLMs rely on surface patterns rather than robust internal execution models, while highlighting the diagnostic value of perturbation testing for code models. The work supplies direct accuracy measurements across model families and exception categories, which could inform more reliable evaluation practices in LLM-based code understanding.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (results on perturbed inputs): the reported 20-24% accuracy decline for GPT-5.2 and the exception-type dependence lack sample sizes, statistical tests, exact perturbation definitions, and error bars, leaving the central brittleness claim under-supported.
  2. [Methods] Methods section on perturbations: no verification step is described confirming that the chosen code transformations and input perturbations preserve original execution behavior (output or exception type). Without this, the accuracy drops could arise from distribution shift or novel surface forms rather than failures in execution-semantics understanding.
minor comments (2)
  1. [Results] Results tables and figures would benefit from explicit confidence intervals or p-values for all accuracy comparisons to improve interpretability.
  2. [Methods] Clarify the exact definitions and examples of each perturbation type in a dedicated subsection or table to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each of the major comments below and have revised the paper accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (results on perturbed inputs): the reported 20-24% accuracy decline for GPT-5.2 and the exception-type dependence lack sample sizes, statistical tests, exact perturbation definitions, and error bars, leaving the central brittleness claim under-supported.

    Authors: We agree that the results would be more robustly supported with explicit sample sizes, statistical tests, and error bars. In the revised manuscript we now report the exact sample size used for the perturbation experiments (N=500), include bootstrap-derived 95% confidence intervals as error bars, and apply McNemar's test to establish that the observed accuracy drops for GPT-5.2 are statistically significant (p < 0.001). We have also expanded the perturbation definitions in Section 3 with precise, reproducible descriptions. revision: yes
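The statistics named in this response are straightforward to reproduce in outline. A minimal sketch, our own and not the authors' analysis script, of an exact McNemar test on paired per-case correctness labels plus a bootstrap 95% confidence interval for the accuracy drop:

```python
# Exact McNemar test on paired original/perturbed correctness, plus a
# bootstrap 95% CI for the accuracy drop. Illustrative only.
import math
import random

def mcnemar_exact(correct_orig, correct_pert):
    """Two-sided exact McNemar p-value from paired boolean outcomes."""
    b = sum(o and not p for o, p in zip(correct_orig, correct_pert))  # right -> wrong
    c = sum(p and not o for o, p in zip(correct_orig, correct_pert))  # wrong -> right
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

def bootstrap_drop_ci(correct_orig, correct_pert, reps=10_000, seed=0):
    """95% CI for accuracy(original) - accuracy(perturbed) by case resampling."""
    rng = random.Random(seed)
    n = len(correct_orig)
    drops = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        drops.append(
            sum(correct_orig[i] for i in idx) / n
            - sum(correct_pert[i] for i in idx) / n
        )
    drops.sort()
    return drops[int(0.025 * reps)], drops[int(0.975 * reps)]
```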

  2. Referee: [Methods] Methods section on perturbations: no verification step is described confirming that the chosen code transformations and input perturbations preserve original execution behavior (output or exception type). Without this, the accuracy drops could arise from distribution shift or novel surface forms rather than failures in execution-semantics understanding.

    Authors: This is a fair and important point. While the original CRUXEval perturbations were intended to preserve semantics, we did not document an explicit verification step in the initial submission. In the revised Methods section we now describe and report a verification procedure: every perturbed program was re-executed in an independent Python interpreter, confirming that the output value or exception type matched the original execution in 100% of cases. The verification script is included in the supplementary material. revision: yes
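A verification step of the kind this response describes can be sketched generically for the code-transformation side: re-run the original and the perturbed program on the same input and require either the same output value or the same exception type. The snippet below assumes deterministic, side-effect-free CRUXEval-style functions named `f`; it is not the verification script from the supplementary material.

```python
# Sketch of a semantic-equivalence check between an original program and its
# perturbed version. Assumes deterministic benchmark functions named `f`.
def run(source, args):
    """Execute `source`, call f(*args), return ('ok', value) or ('exc', type name)."""
    namespace = {}
    exec(source, namespace)              # trusted benchmark code only
    try:
        return ("ok", namespace["f"](*args))
    except Exception as e:
        return ("exc", type(e).__name__)

def behavior_preserved(original_src, perturbed_src, args):
    return run(original_src, args) == run(perturbed_src, args)

# A variable-renamed perturbation should preserve behavior, including the
# exception type on a bad input.
orig = "def f(x):\n    return 10 // x\n"
pert = "def f(y):\n    return 10 // y\n"
assert behavior_preserved(orig, pert, (2,))
assert behavior_preserved(orig, pert, (0,))   # both raise ZeroDivisionError
```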

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper conducts an empirical evaluation by applying code transformations and input perturbations to the existing CRUXEval benchmark and directly measuring LLM accuracy on original versus perturbed inputs. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. All reported results consist of accuracy counts on held-out test cases. No load-bearing steps reduce to inputs by construction, and no self-citations or ansatzes are invoked to justify any derivation chain. The study measures models against an external benchmark rather than against constructs of its own.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that CRUXEval tasks and the chosen perturbations measure execution semantics rather than surface statistics. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Perturbations to code and inputs preserve the underlying execution semantics while changing surface form.
    Implicit in the claim that accuracy drops reveal lack of robust understanding.

pith-pipeline@v0.9.0 · 5493 in / 1191 out tokens · 23707 ms · 2026-05-15T19:43:36.927156+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 conditional novelty 6.0

    LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.

  2. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 unverdicted novelty 5.0

    LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Wasi Uddin Ahmad et al. 2025. OpenCodeReasoning: Advancing Data Distillation for Competitive Coding. arXiv, (Aug. 2025). arXiv: 2504.01943 [cs]. doi:10.48550/arXiv.2504.01943

  2. [2]

    Miltiadis Allamanis et al. 2025. Disproving Program Equivalence with LLMs. arXiv, (Feb. 2025). arXiv: 2502.18473 [cs]. doi:10.48550/arXiv.2502.18473

  3. [3]

    Ali Asgari et al. 2025. Metamorphic Testing of Deep Code Models: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol., (Sept. 2025). doi:10.1145/3766552

  4. [5]

    David Bieber et al. 2022. Static prediction of runtime errors by learning to execute programs with external resource descriptions. (2022). https://arxiv.org/abs/2203.03771 arXiv: 2203.03771 [cs.LG]

  5. [6]

    Aymen Bouguerra et al. 2026. Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization’s Impact on CLIP Beyond Accuracy. arXiv, (Feb. 2026). arXiv: 2509.21173 [cs]. doi:10.48550/arXiv.2509.21173

  6. [7]

    Islem Bouzenia et al. 2024. DyPyBench: a benchmark of executable python software. Proc. ACM Softw. Eng., 1, FSE, Article 16, (July 2024). doi:10.1145/3643742

  7. [8]

    Ira Ceka et al. 2025. How Does LLM Reasoning Work for Code? A Survey and a Call to Action. arXiv, (June 2025). arXiv: 2506.13932 [cs]. doi:10.48550/arXiv.2506.13932

  8. [9]

    Saikat Chakraborty et al. 2022. NatGen: Generative pre-training by “naturalizing” source code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, (Nov. 2022), 18–30. isbn: 978-1-4503-9413-0. doi:10.1...

  9. [10]

    Junkai Chen et al. 2025. Reasoning Runtime Behavior of a Program with LLM: How Far Are We? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. IEEE Press, (Sept. 2025), 1869–1881. isbn: 979-8-3315-0569-1

  10. [11]

    DeepSeek-AI et al. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 645, 8081, (Sept. 2025), 633–638. doi:10.1038/s41586-025-09422-z

  11. [12]

    Benedetta Donato et al. 2025. Studying How Configurations Impact Code Generation in LLMs: The Case of ChatGPT. In 2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC). IEEE Computer Society, (Apr. 2025), 442–453. isbn: 979-8-3315-0223-2. doi:10.1109/ICPC66645.2025.00055

  12. [13]

    Zeming Dong et al. 2023. MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). (Mar. 2023), 379–390. doi:10.1109/SANER56733.2023.00043

  13. [14]

    Aaron Grattafiori et al. 2024. The Llama 3 Herd of Models. arXiv, (Nov. 2024). arXiv: 2407.21783 [cs]. doi:10.48550/arXiv.2407.21783

  14. [15]

    Alex Gu et al. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. In Proceedings of the 41st International Conference on Machine Learning. PMLR, (July 2024), 16568–16621

  15. [16]

    Chenchen Gu et al. 2025. Auditing Prompt Caching in Language Model APIs. In Proceedings of the 42nd International Conference on Machine Learning. PMLR, (Oct. 2025), 20477–20496

  16. [17]

    Shuo Han et al. 2025. Prompting Instability: An Empirical Study of LLM Robustness in Code Vulnerability Detection. In AI 2025: Advances in Artificial Intelligence. Miaomiao Liu et al., editors. Springer Nature, Singapore, 233–245. isbn: 978-981-95-4969-6. doi:10.1007/978-981-95-4969-6_18

  17. [18]

    Dan Hendrycks et al. 2020. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations. (Oct. 2020). isbn: 979-8-3313-2194-9

  18. [19]

    Jordan Henkel et al. 2022. Semantic Robustness of Models of Source Code. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). (Mar. 2022), 526–537. doi:10.1109/SANER53432.2022.00070

  19. [20]

    Ashish Hooda et al. 2024. Do Large Code Models Understand Programming Concepts? Counterfactual Analysis for Code Predicates. In Proceedings of the 41st International Conference on Machine Learning. PMLR, (July 2024), 18738–18748

  20. [21]

    Chao Hu et al. 2024. How Effectively Do Code Language Models Understand Poor-Readability Code? In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE ’24). Association for Computing Machinery, New York, NY, USA, (Oct. 2024), 795–806. isbn: 979-8-4007-1248-7. doi:10.1145/3691620.3695072

  21. [22]

    Naman Jain et al. 2025. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. International Conference on Representation Learning, 2025, (May 2025), 58791–58831

  22. [23]

    Saqib Javed et al. 2025. QT-DoG: Quantization-Aware Training for Domain Generalization. In Proceedings of the 42nd International Conference on Machine Learning. PMLR, (Oct. 2025), 26981–27004

  23. [24]

    Juyong Jiang et al. 2025. A Survey on Large Language Models for Code Generation. ACM Trans. Softw. Eng. Methodol., (July 2025). doi:10.1145/3747588

  24. [25]

    Chris F Kemerer. 1995. Software complexity and software maintenance: A survey of empirical research. Annals of Software Engineering, 1, 1, 1–22

  25. [26]

    Man Ho Lam et al. 2025. CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. (Oct. 2025)

  26. [27]

    Changshu Liu et al. 2025. A Tool for In-depth Analysis of Code Execution Reasoning of Large Language Models. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. Association for Computing Machinery, New York, NY, USA, (July 2025), 1178–1182. isbn: 979-8-4007-1276-0

  27. [28]

    Jiawei Liu et al. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, (Dec. 2023), 21558–21572

  28. [29]

    Norman Mu et al. 2024. A Closer Look at System Message Robustness. In Neurips Safe Generative AI Workshop 2024. (Oct. 2024)

  29. [30]

    Dung Nguyen et al. 2025. CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs. International Conference on Representation Learning, 2025, (May 2025), 2614–2672

  30. [31]

    Smit Patel et al. 2025. Planning a large language model for static detection of runtime errors in code snippets. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 639–639

  31. [32]

    Julian Aron Prenner et al. 2025. ThrowBench: Benchmarking LLMs by predicting runtime exceptions. (2025). https://arxiv.org/abs/2503.04241 arXiv: 2503.04241 [cs.SE]

  32. [33]

    Qwen et al. 2025. Qwen2.5 Technical Report. arXiv, (Jan. 2025). arXiv: 2412.15115 [cs]. doi:10.48550/arXiv.2412.15115

  33. [34]

    Monoshi Kumar Roy et al. 2025. CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning. arXiv, (Oct. 2025). arXiv: 2506.00750 [cs]. doi:10.48550/arXiv.2506.00750

  34. [35]

    Mrinank Sharma et al. 2023. Towards Understanding Sycophancy in Language Models. arXiv: 2310.13548

  35. [36]

    Peiyang Song et al. 2026. Large language model reasoning failures. Transactions on Machine Learning Research

  36. [37]

    Claudio Spiess et al. 2024. Calibration and Correctness of Language Models for Code. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, (Oct. 2024), 495–507. isbn: 979-8-3315-0569-1. doi:10.1109/ICSE55347.2025.00040

  37. [38]

    Weisong Sun et al. 2025. Source Code Summarization in the Era of Large Language Models. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. IEEE Press, (Sept. 2025), 1882–1894. isbn: 979-8-3315-0569-1

  38. [39]

    Alex Wang et al. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc.

  39. [40]

    Shiqi Wang et al. 2023. ReCode: Robustness Evaluation of Code Generation Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Anna Rogers et al., editors. Association for Computational Linguistics, Toronto, Canada, (July 2023), 13818–13843. doi:10.18653/v1/2023.acl-long.773

  40. [41]

    Shufan Wang et al. 2025. Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs. arXiv, (Nov. 2025). arXiv: 2511.02197 [cs]. doi:10.48550/arXiv.2511.02197

  41. [42]

    Anjiang Wei et al. 2025. EquiBench: Benchmarking Large Language Models’ Reasoning about Program Semantics via Equivalence Checking. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Christos Christodoulopoulos et al., editors. Association for Computational Linguistics, Suzhou, China, (Nov. 2025), 33856–33869. isbn: 9...

  42. [43]

    Moshi Wei et al. 2023. CoCoFuzzing: Testing Neural Code Models With Coverage-Guided Fuzzing. IEEE Transactions on Reliability, 72, 3, (Sept. 2023), 1276–1289. doi:10.1109/TR.2022.3208239

  43. [44]

    Xin Xia et al. 2017. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering, 44, 10, 951–976

  44. [45]

    Danning Xie et al. 2025. CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks. arXiv, (Nov. 2025). arXiv: 2507.05269 [cs]. doi:10.48550/arXiv.2507.05269

  45. [46]

    Ruiyang Xu et al. 2025. CRUXEVAL-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Wanxiang Che et al., editors. Association for Computational Linguistics, Vienna, Austria, (July 2025), 23762–23779. isbn: 979-8-89176...

  46. [47]

    Kaiwen Yan et al. 2025. STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning. arXiv, (Aug. 2025). arXiv: 2508.05193 [cs]. doi:10.48550/arXiv.2508.05193

  47. [48]

    An Yang et al. 2024. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv, (Sept. 2024). arXiv: 2409.12122 [cs]. doi:10.48550/arXiv.2409.12122

  48. [49]

    Boyang Yang et al. 2025. A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications. arXiv, (Dec. 2025). arXiv: 2506.23749 [cs]. doi:10.48550/arXiv.2506.23749

  49. [50]

    Guang Yang et al. 2025. Assessing and improving syntactic adversarial robustness of pre-trained models for code translation. Information and Software Technology, 181, (May 2025), 107699. doi:10.1016/j.infsof.2025.107699