Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?
Pith reviewed 2026-05-22 14:45 UTC · model grok-4.3
The pith
Large language models for code often reach correct answers through flawed reasoning and shift predictions when code is rewritten in ways that preserve exact meaning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art LLMs produce correct predictions on program output tasks based on flawed reasoning in between 10% and 50% of cases according to human expert analysis. When the same programs undergo semantics-preserving mutations, the models frequently alter their predictions, producing performance drops reaching up to 70%. This shows that current LLMs do not yet exhibit stable, semantically grounded reasoning about code even when their initial accuracy is high.
What carries the argument
Five semantics-preserving mutations (variable renaming, mirroring comparisons, if-else branch swaps, for-to-while conversion, and loop unrolling) are applied to Python programs to test whether predictions stay consistent when the syntax changes but the semantics do not.
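To make the five mutations concrete, here is a minimal sketch on a toy function. The function, its inputs, and the exact rewrite chosen for each mutation are assumptions made for illustration; this summary does not give the paper's own rewrite rules, so treat each variant as one plausible reading of the mutation's name.

```python
# Toy program written for illustration; it is not from LiveCodeBench or CruxEval.
def original(nums, limit):
    total = 0
    for i in range(3):
        if nums[i] < limit:
            total += nums[i]
        else:
            total -= 1
    return total

def renamed(a, b):                      # 1. variable renaming
    c = 0
    for d in range(3):
        if a[d] < b:
            c += a[d]
        else:
            c -= 1
    return c

def mirrored(nums, limit):              # 2. mirrored comparison: x < y becomes y > x
    total = 0
    for i in range(3):
        if limit > nums[i]:
            total += nums[i]
        else:
            total -= 1
    return total

def swapped(nums, limit):               # 3. if-else swap: negate condition, exchange branches
    total = 0
    for i in range(3):
        if not (nums[i] < limit):
            total -= 1
        else:
            total += nums[i]
    return total

def as_while(nums, limit):              # 4. for-to-while conversion
    total = 0
    i = 0
    while i < 3:
        if nums[i] < limit:
            total += nums[i]
        else:
            total -= 1
        i += 1
    return total

def unrolled(nums, limit):              # 5. loop unrolling (fixed trip count of 3)
    total = 0
    if nums[0] < limit:
        total += nums[0]
    else:
        total -= 1
    if nums[1] < limit:
        total += nums[1]
    else:
        total -= 1
    if nums[2] < limit:
        total += nums[2]
    else:
        total -= 1
    return total

VARIANTS = [renamed, mirrored, swapped, as_while, unrolled]
INPUTS = [([1, 5, 2], 3), ([0, 0, 0], 0), ([-4, 7, 7], 7)]

def all_agree():
    """True if every variant returns the same value as the original on every input."""
    return all(
        variant(list(nums), limit) == original(list(nums), limit)
        for variant in VARIANTS
        for nums, limit in INPUTS
    )

print(all_agree())
```

A model that reasons about semantics should predict the same output for all six functions; the study measures how often that fails.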
If this is right
- High accuracy on code benchmarks does not guarantee that the model reasons correctly about program semantics.
- LLMs can produce inconsistent results on functionally equivalent code that differs only in syntax.
- In real-world use, LLM-based tools may fail unexpectedly after developers refactor programs without altering their behavior.
- Robustness checks against semantics-preserving transformations should become part of standard LLM evaluation for software engineering.
Where Pith is reading between the lines
- Developers could cross-check LLM outputs by submitting multiple syntax variants of the same program to detect instability.
- Future training objectives that explicitly penalize changes in prediction under semantics-preserving edits may produce more grounded models.
- The same mutation approach could be used to test whether LLMs in other domains rely on superficial cues rather than underlying meaning.
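The cross-checking idea in the first bullet can be sketched as a small agreement check. Everything here is hypothetical: `cross_check` is an illustrative helper, and `fake_predict` is a stand-in for a real model call that deliberately keys on surface syntax so the instability is visible.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

def cross_check(variants: Sequence[str], predict: Callable[[str], str]) -> Tuple[str, float]:
    """Query the predictor on each syntax variant of one program.

    Returns the most common answer and the fraction of variants that agree with it.
    An agreement below 1.0 signals that the prediction depends on surface syntax.
    """
    if not variants:
        raise ValueError("need at least one program variant")
    answers = [predict(source) for source in variants]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

def fake_predict(source: str) -> str:
    # Hypothetical stand-in for an LLM call: answers differently when it sees "while".
    return "3" if "while" in source else "6"

variants = [
    "for i in range(4): s += i",
    "for k in range(4): acc += k",
    "while i < 4: s += i; i += 1",
]
answer, agreement = cross_check(variants, fake_predict)
print(answer, agreement)
```

Majority voting is only one possible aggregation rule; a stricter policy would reject any program whose variants do not all agree.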
Load-bearing premise
The five chosen mutations are assumed to be purely semantics-preserving and sufficient to reveal any lack of genuine semantic reasoning rather than testing only surface sensitivity.
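Why the premise is load-bearing can be seen from a small example of a transformation that looks semantics-preserving but is not when applied naively. This is an illustrative sketch only; nothing in this summary indicates that the paper's tooling makes this mistake. In Python, the loop variable of a `for` loop keeps its last assigned value after the loop, whereas a naive `while` rewrite leaves the counter one step past it.

```python
def last_index_for(n):
    i = -1
    for i in range(n):
        pass
    return i            # last value assigned by the loop, or -1 if it never ran

def last_index_while_naive(n):
    # Naive for-to-while rewrite: the counter overshoots by one after the loop.
    i = 0
    while i < n:
        i += 1
    return i

def last_index_while_careful(n):
    # Careful rewrite: a separate counter keeps the visible loop variable faithful.
    i = -1
    next_i = 0
    while next_i < n:
        i = next_i
        next_i += 1
    return i

print(last_index_for(3), last_index_while_naive(3), last_index_while_careful(3))
```

Cases like this are why an execution-based equivalence check is a useful companion to any mutation pipeline.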
What would settle it
A large set of programs where the LLM gives the same correct answer and the same sound reasoning explanation under every one of the five mutations with no drop in accuracy.
Original abstract
With the widespread adoption of vibe coding, understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies assess LLMs' ability to predict program outputs, most focus on accuracy alone, without evaluating the underlying reasoning. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this paper we assess whether state-of-the-art LLMs can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling. These mutations maintain program semantics while altering its syntax. We evaluated nine LLMs, including both open-source and closed-access models, and performed a human expert analysis using LiveCodeBench to assess whether correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. While proprietary models achieve the strongest predictive accuracy and reasoning quality in the expert evaluation, our robustness analysis reveals substantial fragility under semantics-preserving transformations. Our findings show that LLMs trained for code produce correct predictions based on flawed reasoning in between 10% and 50% of cases. Furthermore, LLMs often change predictions in response to our code mutations, with performance drops reaching up to 70%, indicating that they do not yet exhibit stable, semantically grounded reasoning, even when initial accuracy is high.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates nine LLMs on Python code output prediction tasks using LiveCodeBench and CruxEval. It applies five syntax-altering but purportedly semantics-preserving mutations (variable renaming, comparison mirroring, if-else swapping, for-to-while conversion, loop unrolling), measures accuracy drops and prediction instability, and supplements with human expert review of reasoning quality on a subset of LiveCodeBench cases. The central claims are that correct predictions rest on flawed reasoning in 10–50% of examined cases and that performance can drop by as much as 70% under the mutations, indicating that current LLMs lack stable, semantically grounded code understanding.
Significance. If the mutations are verifiably semantics-preserving and the human judgments are reliable, the work supplies concrete empirical evidence that high initial accuracy on code tasks does not imply robust semantic reasoning. The multi-model, multi-benchmark design together with the human analysis component strengthens the contribution relative to accuracy-only studies.
major comments (1)
- [§4] §4 (Mutation Design and Robustness Experiments): The paper states that the five transformations are semantics-preserving, yet no execution-based equivalence verification is reported (i.e., running original and mutated programs on the same LiveCodeBench/CruxEval inputs and confirming identical outputs). Because the central claim equates prediction changes with lack of semantic reasoning, the absence of such checks leaves open the possibility that observed drops partly reflect unintended semantic alterations rather than model fragility.
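The execution-based verification requested here could look roughly like the following sketch. The helper names, the toy programs, and the choice to compare floats with a tolerance are assumptions for illustration, not the authors' procedure. `exec` is used only because benchmark programs are assumed to be trusted; untrusted code would need sandboxing.

```python
import math

def run_function(source, func_name, args):
    """Execute the program text and call one function; capture result or exception type."""
    namespace = {}
    exec(source, namespace)  # assumption: trusted benchmark code only
    try:
        return ("ok", namespace[func_name](*args))
    except Exception as exc:
        return ("error", type(exc).__name__)

def outputs_match(a, b, rel_tol=1e-9):
    """Same outcome kind and same value; floats are compared with a tolerance."""
    if a[0] != b[0]:
        return False
    if a[0] == "ok" and isinstance(a[1], float) and isinstance(b[1], float):
        return math.isclose(a[1], b[1], rel_tol=rel_tol)
    return a[1] == b[1]

def find_mismatches(original_src, mutated_src, func_name, inputs):
    """Return the inputs on which the two programs behave differently."""
    return [
        args for args in inputs
        if not outputs_match(
            run_function(original_src, func_name, args),
            run_function(mutated_src, func_name, args),
        )
    ]

ORIGINAL = """
def f(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""

MUTATED = ORIGINAL.replace("x > 0", "0 < x")   # correct mirroring
BROKEN = ORIGINAL.replace("x > 0", "0 > x")    # operand swap without flipping the operator

INPUTS = [([1, -2, 3],), ([],), ([-1],)]
print(find_mismatches(ORIGINAL, MUTATED, "f", INPUTS))
print(find_mismatches(ORIGINAL, BROKEN, "f", INPUTS))
```

An empty mismatch list on the benchmark inputs does not prove equivalence in general, but it does rule out mutation artifacts on exactly the cases used to measure the accuracy drop.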
minor comments (2)
- [Results] Results tables and figures do not report exact instance counts, confidence intervals, or error bars for the 10–50% flawed-reasoning range and the 70% drop figures, reducing interpretability of the quantitative claims.
- [Human Evaluation] The human-review protocol (number of experts, inter-annotator agreement, exact sampling procedure) is described only at a high level; additional detail would improve reproducibility.
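Both minor comments ask for standard statistics that are cheap to compute. The sketch below shows a Wilson score interval for a proportion (such as a flawed-reasoning rate) and Cohen's kappa for two annotators. The counts and labels are made up for illustration; the paper's instance counts are not reported in this summary, and the choice of these two statistics is an assumption about what a revision might include.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical numbers: 15 of 50 correct predictions judged to rest on flawed reasoning.
low, high = wilson_interval(15, 50)

# Hypothetical labels from two experts on four predictions.
expert_1 = ["sound", "sound", "flawed", "flawed"]
expert_2 = ["sound", "flawed", "flawed", "flawed"]
kappa = cohens_kappa(expert_1, expert_2)
print(round(low, 3), round(high, 3), kappa)
```

With small samples the interval is wide, which is exactly why a bare "10% to 50%" range is hard to interpret without instance counts.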
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the suggested verification to strengthen the empirical support for our claims.
- Referee: [§4] §4 (Mutation Design and Robustness Experiments): The paper states that the five transformations are semantics-preserving, yet no execution-based equivalence verification is reported (i.e., running original and mutated programs on the same LiveCodeBench/CruxEval inputs and confirming identical outputs). Because the central claim equates prediction changes with lack of semantic reasoning, the absence of such checks leaves open the possibility that observed drops partly reflect unintended semantic alterations rather than model fragility.
Authors: We agree that explicit execution-based verification would eliminate any residual doubt about whether the observed prediction changes could arise from unintended semantic shifts. Although the five transformations (variable renaming, comparison mirroring, if-else swapping, for-to-while conversion, and loop unrolling) were chosen as standard, well-documented semantics-preserving operations in Python, the original submission did not report running the original and mutated programs on the LiveCodeBench and CruxEval inputs to confirm identical outputs. In the revised manuscript we will add this verification step to Section 4, documenting that outputs match for all evaluated cases (with any edge-case discrepancies, such as floating-point tolerance, explicitly noted and handled). This addition directly addresses the concern and reinforces that instability under mutation reflects fragility in semantic reasoning rather than mutation artifacts.
Revision: yes
Circularity Check
No circularity: direct empirical measurements on benchmarks
The paper reports an empirical evaluation of LLMs on output prediction tasks using five explicitly listed mutations applied to programs from LiveCodeBench and CruxEval, followed by accuracy comparisons and human expert review of reasoning quality. No equations, fitted parameters, or first-principles derivations appear; the central claims rest on observed accuracy drops and flawed-reasoning counts measured against external benchmarks and human judgment rather than reducing to self-defined inputs or self-citation chains. The assertion that the mutations preserve semantics is presented as a premise supported by the listed transformations, not derived circularly from the experimental outcomes themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The five mutations maintain identical program semantics while changing syntax.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "LLMs often change predictions in response to our code mutations, with performance drops reaching up to 70%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.