Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Marta Kwiatkowska; Pedro Orvalho

arxiv: 2505.10443 · v3 · submitted 2025-05-15 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-22 14:45 UTC · model grok-4.3

keywords large language models · code understanding · semantics-preserving mutations · robustness · flawed reasoning · program output prediction · LiveCodeBench · CruxEval

The pith

Large language models for code often reach correct answers through flawed reasoning and shift predictions when code is rewritten in ways that preserve exact meaning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether LLMs truly grasp what programs mean or simply guess from surface patterns. It applies five syntax changes that leave program behavior unchanged: variable renaming, comparison mirroring, if-else swaps, for-to-while conversion, and loop unrolling. On LiveCodeBench, expert review finds that correct predictions rest on unsound reasoning in 10 to 50 percent of cases. When the mutations are applied to programs from LiveCodeBench and CruxEval, many models change their outputs and suffer performance drops as high as 70 percent. The results indicate that even high-performing models lack stable, semantics-based understanding of code.

Core claim

State-of-the-art LLMs produce correct predictions on program output tasks based on flawed reasoning in between 10% and 50% of cases according to human expert analysis. When the same programs undergo semantics-preserving mutations, the models frequently alter their predictions, producing performance drops reaching up to 70%. This shows that current LLMs do not yet exhibit stable, semantically grounded reasoning about code even when their initial accuracy is high.

What carries the argument

Five semantics-preserving mutations (variable renaming, mirroring comparisons, if-else branch swaps, for-to-while conversion, and loop unrolling) applied to Python programs to test whether predictions remain consistent under syntax changes that leave semantics identical.
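To make the five mutations concrete, here is a minimal hand-written sketch. It is not the authors' tooling: the function `count_below` and every variant name are hypothetical, the for-to-while variant assumes `values` is a list, and the unrolling shown (peeling the first iteration out of the loop) is only one possible reading of "loop unrolling".

```python
def count_below(values, limit):
    """Original program: count the items strictly below limit."""
    count = 0
    for v in values:
        if v < limit:
            count += 1
    return count


def renamed(a, b):
    """Variable renaming: same logic, different identifiers."""
    c = 0
    for x in a:
        if x < b:
            c += 1
    return c


def mirrored(values, limit):
    """Comparison mirroring: `v < limit` becomes `limit > v`."""
    count = 0
    for v in values:
        if limit > v:
            count += 1
    return count


def swapped(values, limit):
    """If-else swap: negate the condition and exchange the branches."""
    count = 0
    for v in values:
        if not (v < limit):
            pass
        else:
            count += 1
    return count


def as_while(values, limit):
    """For-to-while conversion with an explicit index (assumes a list)."""
    count = 0
    i = 0
    while i < len(values):
        if values[i] < limit:
            count += 1
        i += 1
    return count


def unrolled(values, limit):
    """Loop unrolling, read here as peeling the first iteration."""
    count = 0
    if values:
        if values[0] < limit:
            count += 1
        for v in values[1:]:
            if v < limit:
                count += 1
    return count


VARIANTS = [renamed, mirrored, swapped, as_while, unrolled]


def all_agree(values, limit):
    """True if every variant returns the same result as the original."""
    expected = count_below(values, limit)
    return all(f(values, limit) == expected for f in VARIANTS)
```

Every variant computes the same function on lists of integers, so a model that reasons about semantics should predict the same output for all six versions.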

If this is right

High accuracy on code benchmarks does not guarantee that the model reasons correctly about program semantics.
LLMs can produce inconsistent results on functionally equivalent code that differs only in syntax.
Real-world code tasks may lead to unexpected failures when developers refactor programs without altering behavior.
Robustness checks against semantics-preserving transformations should become part of standard LLM evaluation for software engineering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Developers could cross-check LLM outputs by submitting multiple syntax variants of the same program to detect instability.
Future training objectives that explicitly penalize changes in prediction under semantics-preserving edits may produce more grounded models.
The same mutation approach could be used to test whether LLMs in other domains rely on superficial cues rather than underlying meaning.
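The cross-checking idea in the first point above could be sketched as follows. This is an illustration under stated assumptions, not something the paper provides: `predict` is a hypothetical callable that wraps whatever model is being queried and returns its predicted output as a string, and `fake_predict` is a stand-in used only to make the example runnable.

```python
from collections import Counter
from typing import Callable, Dict


def stability_report(variants: Dict[str, str],
                     predict: Callable[[str], str]) -> dict:
    """Query the predictor once per variant and summarise agreement.

    `variants` maps a label (e.g. "original") to the source code of a
    semantically equivalent version of the same program.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    predictions = {name: predict(src) for name, src in variants.items()}
    counts = Counter(predictions.values())
    majority, votes = counts.most_common(1)[0]
    return {
        "predictions": predictions,
        "majority": majority,
        "agreement": votes / len(predictions),  # share backing the majority
        "stable": len(counts) == 1,             # True if all answers match
    }


def fake_predict(source: str) -> str:
    """Hypothetical stand-in for a model; it is thrown off by `while`."""
    return "3" if "while" in source else "2"


example_variants = {
    "original": "for v in xs:\n    total += v",
    "renamed": "for a in b:\n    c += a",
    "for_to_while": "i = 0\nwhile i < len(xs):\n    total += xs[i]\n    i += 1",
}
report = stability_report(example_variants, fake_predict)
```

A report with `stable` equal to `False` does not say which answer is right; it only signals that the prediction depends on surface form and deserves a closer look.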

Load-bearing premise

The five chosen mutations are assumed to be purely semantics-preserving and sufficient to reveal any lack of genuine semantic reasoning rather than testing only surface sensitivity.

What would settle it

A large set of programs where the LLM gives the same correct answer and the same sound reasoning explanation under every one of the five mutations with no drop in accuracy.

Figures

Figures reproduced from arXiv: 2505.10443 by Marta Kwiatkowska, Pedro Orvalho.

**Figure 1.** LLM-Based Program Output Prediction.

Original abstract

With the widespread adoption of vibe coding, understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies assess LLMs' ability to predict program outputs, most focus on accuracy alone, without evaluating the underlying reasoning. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this paper we assess whether state-of-the-art LLMs can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling. These mutations maintain program semantics while altering its syntax. We evaluated nine LLMs, including both open-source and closed-access models, and performed a human expert analysis using LiveCodeBench to assess whether correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. While proprietary models achieve the strongest predictive accuracy and reasoning quality in the expert evaluation, our robustness analysis reveals substantial fragility under semantics-preserving transformations. Our findings show that LLMs trained for code produce correct predictions based on flawed reasoning in between 10% and 50% of cases. Furthermore, LLMs often change predictions in response to our code mutations, with performance drops reaching up to 70%, indicating that they do not yet exhibit stable, semantically grounded reasoning, even when initial accuracy is high.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LLMs often reach correct code predictions via flawed reasoning and lose up to 70% of their performance under simple syntax changes that preserve meaning, though the mutations need explicit equivalence checks to fully support the fragility claim.

The key point from this paper is that state-of-the-art LLMs for code understanding often arrive at correct predictions through flawed reasoning, in 10 to 50 percent of cases according to human experts. On top of that, simple semantics-preserving mutations cause prediction changes and accuracy drops as high as 70 percent, suggesting the models lack stable semantic grounding.

What the work does well is extend beyond pure accuracy metrics by including human analysis of reasoning quality. They test nine models on LiveCodeBench with expert review to see if correct answers come from sound logic. The five mutations (variable renaming, mirroring comparisons, if-else branch swaps, for-to-while conversion, and loop unrolling) are applied to check robustness on both LiveCodeBench and CruxEval. This targeted probe adds to prior work that mostly looked at final outputs without digging into the reasoning process. The results highlight practical issues for using these models in real software workflows where code gets refactored often. Proprietary models come out ahead in both accuracy and reasoning soundness, but the fragility shows up across the board.

One area that could be stronger is confirmation that the mutations truly preserve semantics in all tested cases. The abstract claims they do, and without explicit execution equivalence tests on the benchmark inputs, it's possible some mutations introduce unintended changes, especially with loop transformations in Python. If the full paper includes such checks or formal arguments, that would address the concern directly. Otherwise, it leaves some ambiguity in interpreting the drops as pure model issues. The human expert analysis is a plus, but more transparency on the annotation process would help.

Overall, this is relevant for anyone building or relying on LLMs in programming tools. It raises good questions about what high accuracy really means for code tasks. I would recommend sending it for peer review. The empirical setup is clear enough to warrant referee feedback on the details and implications.

Referee Report

1 major / 2 minor

Summary. The paper evaluates nine LLMs on Python code output prediction tasks using LiveCodeBench and CruxEval. It applies five syntax-altering but purportedly semantics-preserving mutations (variable renaming, comparison mirroring, if-else swapping, for-to-while conversion, loop unrolling), measures accuracy drops and prediction instability, and supplements with human expert review of reasoning quality on a subset of LiveCodeBench cases. The central claims are that correct predictions rest on flawed reasoning in 10–50% of examined cases and that performance can drop by as much as 70% under the mutations, indicating that current LLMs lack stable, semantically grounded code understanding.
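The two quantities named in the summary, accuracy drop and prediction instability, can be computed as in the sketch below. The page does not say whether the reported "up to 70%" drop is absolute or relative, so the function returns both; the name `robustness_metrics` and the toy data are assumptions made for illustration.

```python
def robustness_metrics(gold, before, after):
    """Compare predictions on original programs with predictions on mutants.

    gold   : expected outputs, one per program
    before : model predictions on the original programs
    after  : model predictions on the mutated programs
    """
    if not gold or not (len(gold) == len(before) == len(after)):
        raise ValueError("need three non-empty lists of equal length")
    n = len(gold)
    acc_before = sum(p == g for p, g in zip(before, gold)) / n
    acc_after = sum(p == g for p, g in zip(after, gold)) / n
    absolute_drop = acc_before - acc_after
    # Relative drop is undefined when nothing was correct to begin with.
    relative_drop = absolute_drop / acc_before if acc_before else 0.0
    # A flip is any change of prediction, whether or not either was correct.
    flip_rate = sum(b != a for b, a in zip(before, after)) / n
    return {
        "acc_before": acc_before,
        "acc_after": acc_after,
        "absolute_drop": absolute_drop,
        "relative_drop": relative_drop,
        "flip_rate": flip_rate,
    }


# Toy data, invented purely to exercise the function.
metrics = robustness_metrics(
    gold=["1", "2", "3", "4"],
    before=["1", "2", "3", "0"],
    after=["1", "9", "8", "0"],
)
```

Reporting the flip rate alongside accuracy matters because a model can change a wrong answer into a different wrong answer, which is instability that accuracy alone does not show.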

Significance. If the mutations are verifiably semantics-preserving and the human judgments are reliable, the work supplies concrete empirical evidence that high initial accuracy on code tasks does not imply robust semantic reasoning. The multi-model, multi-benchmark design together with the human analysis component strengthens the contribution relative to accuracy-only studies.

major comments (1)

[§4] §4 (Mutation Design and Robustness Experiments): The paper states that the five transformations are semantics-preserving, yet no execution-based equivalence verification is reported (i.e., running original and mutated programs on the same LiveCodeBench/CruxEval inputs and confirming identical outputs). Because the central claim equates prediction changes with lack of semantic reasoning, the absence of such checks leaves open the possibility that observed drops partly reflect unintended semantic alterations rather than model fragility.
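The execution-based check requested here can be sketched in a few lines. This is a minimal illustration, assuming each benchmark program is available as a Python callable and each input as a tuple of arguments; the helper names and the three `sum_below*` functions are hypothetical and do not come from the paper.

```python
import copy


def run(fn, args):
    """Call fn on a private copy of args and capture the outcome.

    Returns ("ok", value) on success or ("error", exception type name)
    if the call raises, so that crashes can be compared as well.
    """
    try:
        return ("ok", fn(*copy.deepcopy(args)))
    except Exception as exc:
        return ("error", type(exc).__name__)


def find_divergences(original, mutated, inputs):
    """Return the inputs on which the two programs behave differently."""
    return [args for args in inputs if run(original, args) != run(mutated, args)]


def sum_below(n):
    """Original program: sum of 0 .. n-1 using a for loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def sum_below_while(n):
    """Correct for-to-while conversion."""
    total, i = 0, 0
    while i < n:
        total += i
        i += 1
    return total


def sum_below_buggy(n):
    """Faulty conversion: `<=` instead of `<` adds one extra iteration."""
    total, i = 0, 0
    while i <= n:
        total += i
        i += 1
    return total


test_inputs = [(0,), (1,), (5,)]
good = find_divergences(sum_below, sum_below_while, test_inputs)
bad = find_divergences(sum_below, sum_below_buggy, test_inputs)
```

The faulty conversion agrees with the original on `n = 0` and diverges only on the other inputs, which is why the check should run on every benchmark input rather than a single one. Passing such a check shows agreement on the tested inputs only, not equivalence in general.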

minor comments (2)

[Results] Results tables and figures do not report exact instance counts, confidence intervals, or error bars for the 10–50% flawed-reasoning range and the 70% drop figures, reducing interpretability of the quantitative claims.
[Human Evaluation] The human-review protocol (number of experts, inter-annotator agreement, exact sampling procedure) is described only at a high level; additional detail would improve reproducibility.
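For the first minor comment, one common way to attach an interval to a proportion such as a flawed-reasoning rate is the Wilson score interval. The sketch below is an illustration only: the counts are hypothetical, since the page reports no instance counts, and nothing here indicates which interval method the authors would choose.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes : number of cases with the property (e.g. flawed reasoning)
    n         : number of cases examined
    z         : normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 10 flawed-reasoning cases among 100 correct predictions.
low, high = wilson_interval(10, 100)
```

With small samples the interval is wide, so the same observed rate can be compatible with quite different underlying rates; reporting the counts lets readers judge that directly.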

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and will incorporate the suggested verification to strengthen the empirical support for our claims.

Referee: [§4] §4 (Mutation Design and Robustness Experiments): The paper states that the five transformations are semantics-preserving, yet no execution-based equivalence verification is reported (i.e., running original and mutated programs on the same LiveCodeBench/CruxEval inputs and confirming identical outputs). Because the central claim equates prediction changes with lack of semantic reasoning, the absence of such checks leaves open the possibility that observed drops partly reflect unintended semantic alterations rather than model fragility.

Authors: We agree that explicit execution-based verification would eliminate any residual doubt about whether the observed prediction changes could arise from unintended semantic shifts. Although the five transformations (variable renaming, comparison mirroring, if-else swapping, for-to-while conversion, and loop unrolling) were chosen as standard, well-documented semantics-preserving operations in Python, the original submission did not report running the original and mutated programs on the LiveCodeBench and CruxEval inputs to confirm identical outputs. In the revised manuscript we will add this verification step to Section 4, documenting that outputs match for all evaluated cases (with any edge-case discrepancies, such as floating-point tolerance, explicitly noted and handled). This addition directly addresses the concern and reinforces that instability under mutation reflects fragility in semantic reasoning rather than mutation artifacts.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on benchmarks

The paper reports an empirical evaluation of LLMs on output prediction tasks using five explicitly listed mutations applied to programs from LiveCodeBench and CruxEval, followed by accuracy comparisons and human expert review of reasoning quality. No equations, fitted parameters, or first-principles derivations appear; the central claims rest on observed accuracy drops and flawed-reasoning counts measured against external benchmarks and human judgment rather than reducing to self-defined inputs or self-citation chains. The assertion that the mutations preserve semantics is presented as a premise supported by the listed transformations, not derived circularly from the experimental outcomes themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that the listed mutations preserve semantics exactly and that human experts can reliably distinguish sound from flawed reasoning; no free parameters or new entities are introduced.

axioms (1)

domain assumption: The five mutations maintain identical program semantics while changing syntax.
Invoked in the experimental design to isolate reasoning from surface form.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: LLMs often change predictions in response to our code mutations, with performance drops reaching up to 70%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

