pith. machine review for the scientific record.

arxiv: 2604.16320 · v1 · submitted 2026-02-24 · 💻 cs.SE · cs.AI · cs.LG

Recognition: no theorem link

How Robustly do LLMs Understand Execution Semantics?

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:43 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords LLM robustness · code execution semantics · program output prediction · CRUXEval benchmark · input perturbations · exception handling · GPT-5.2 · DeepSeek-R1

The pith

Frontier LLMs like GPT-5.2 suffer 20-24 percent accuracy drops on code execution prediction when code or inputs are perturbed, while open-source models stay more stable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs truly grasp how code runs or simply match patterns by measuring performance on a program-output prediction task before and after code transformations and input changes. GPT-5.2 reaches 99 percent accuracy on the original CRUXEval cases yet falls sharply on the perturbed versions. Open-source models such as the DeepSeek-R1 family hold steadier, though lower, accuracy across the same changes. Prediction quality declines further for cases that raise exceptions, and the size of the drop varies with the exception type. The work also checks whether targeted fixes for exception cases affect accuracy on normal behaviors.
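In outline, the protocol amounts to scoring the same model on each original case and on its perturbed counterpart, then reporting the accuracy gap. A minimal sketch of that comparison follows; `predict_output` stands in for whatever model call the harness makes, and the `Case` fields are illustrative rather than the benchmark's actual schema.

```python
# Minimal sketch of an original-vs-perturbed accuracy comparison.
# `predict_output(code, inp)` is a hypothetical wrapper around the model
# under test, not an API from the paper.
from dataclasses import dataclass

@dataclass
class Case:
    code: str       # program source shown to the model
    inp: str        # input the program is run on
    expected: str   # ground-truth output (or exception name)

def accuracy(cases, predict_output):
    correct = sum(predict_output(c.code, c.inp) == c.expected for c in cases)
    return correct / len(cases)

def robustness_report(original, perturbed, predict_output):
    """Score the same model on paired original/perturbed cases."""
    acc_orig = accuracy(original, predict_output)
    acc_pert = accuracy(perturbed, predict_output)
    return {
        "original": acc_orig,
        "perturbed": acc_pert,
        "drop": acc_orig - acc_pert,  # the 20-24 point gap reported for GPT-5.2
    }
```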

Core claim

LLMs exhibit limited robustness in understanding execution semantics: frontier models achieve near-perfect scores on unperturbed benchmarks but degrade markedly under code transformations and input perturbations, while open-source reasoning models maintain more consistent, though lower, accuracies. The weakness is most pronounced in predicting exception-raising behaviors, and its extent depends on the exception kind.

What carries the argument

Code transformations together with input perturbations applied to the CRUXEval program-output prediction task, used to expose brittleness in execution-semantics understanding.
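The paper's exact transformation operators are not enumerated above. As one illustrative member of the class such a suite typically contains, a local-variable renaming pass changes surface form while leaving execution behavior untouched. The sketch below is a generic example built on Python's `ast` module, not the authors' implementation, and it is deliberately naive (it ignores annotations and possible name collisions).

```python
# Illustrative semantics-preserving perturbation: rename local variable names.
# A generic example of the transformation class the paper relies on, not the
# paper's own operator set.
import ast

class RenameLocals(ast.NodeTransformer):
    def __init__(self, mapping):
        self.mapping = mapping  # e.g. {"total": "acc"}

    def visit_Name(self, node):          # variable reads/writes
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):           # function parameters
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node

def perturb(source, mapping):
    tree = RenameLocals(mapping).visit(ast.parse(source))
    return ast.unparse(tree)             # Python 3.9+

original = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
print(perturb(original, {"total": "acc", "x": "elem", "xs": "values"}))
```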

If this is right

  • High scores on standard benchmarks do not guarantee reliable code-execution understanding under realistic input variation.
  • Prediction accuracy is lower for inputs that raise exceptions and varies by exception type across models (a grouping sketch follows this list).
  • Remedies aimed at exception prediction can be tested for side effects on non-exception prediction accuracy.
  • All evaluated models show limitations in code understanding that perturbation testing can surface.
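The second and third points above reduce to a stratified scoring pass: split the test cases by whether the ground truth is a normal value or an exception, and by exception kind, then score each bucket separately. A minimal sketch, with illustrative field names rather than the paper's data format:

```python
# Sketch of a per-behavior accuracy breakdown. Field names are illustrative.
from collections import defaultdict

def accuracy_by_behavior(cases, predictions):
    """cases[i]['expected'] is a value repr or an exception class name;
    cases[i]['raises'] marks exception-raising ground truth."""
    buckets = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for case, pred in zip(cases, predictions):
        label = case["expected"] if case["raises"] else "non-exception"
        buckets[label][1] += 1
        buckets[label][0] += int(pred == case["expected"])
    return {label: correct / total for label, (correct, total) in buckets.items()}

# e.g. {"non-exception": 0.81, "ValueError": 0.44, "IndexError": 0.58, ...}
```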

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation protocols for code models should routinely include perturbation suites to avoid overestimating semantic competence.
  • The stability difference between model families suggests that training regimes emphasizing reasoning traces may confer greater robustness to surface changes.
  • Similar perturbation methods could be applied to other structured prediction domains to test whether high benchmark scores reflect genuine internal models.

Load-bearing premise

The chosen code transformations and input perturbations specifically measure understanding of execution semantics rather than unrelated surface difficulties or distribution shifts.

What would settle it

A controlled replication in which GPT-5.2 maintains its 99 percent accuracy on the same perturbed CRUXEval inputs would falsify the reported brittleness.

Figures

Figures reproduced from arXiv: 2604.16320 by Claudio Spiess, Earl T. Barr, Prem Devanbu.

Figure 1. Number of decisions encountered on execution.
Figure 2. The prompts we used: the “Direct Output Prompt”.
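The prompt text itself appears only in Figure 2 and is not reproduced here. Purely as a hedged illustration of what a direct output-prediction prompt for a CRUXEval-style task generally looks like, and not the authors' wording, the structure is roughly:

```python
# Hypothetical direct output-prediction prompt skeleton; the authors' exact
# "Direct Output Prompt" is shown only in Figure 2 of the paper.
def direct_output_prompt(code: str, call: str) -> str:
    return (
        "You are given a Python function and a call to it.\n"
        "Predict the exact value the call returns, or the exception it raises.\n\n"
        f"```python\n{code}\n```\n\n"
        f"Call: {call}\n"
        "Output:"
    )
```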
read the original abstract

LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a standard program-output prediction task. Our results reveal a stark divergence in model behavior: while open-source reasoning models (DeepSeek-R1 family) maintain stable, albeit somewhat lower accuracies (38% to 67%) under code transformations & input perturbations, the frontier model GPT-5.2 exhibits significant brittleness. Despite achieving a near-perfect score of 99% on the original, unperturbed CRUXEval benchmark, perturbed inputs trigger accuracy declines between 20% and 24%. In addition, we find that many models perform much worse at predicting behavior on perturbed inputs that raise exceptions, and that prediction performance depends on the kind of exception. We study remedies to address this deficiency in exception prediction, and evaluate the effect of these remedies on the ability to predict non-exception behaviors. Our findings both point to limitations in the way all models understand code, and establish the value of using perturbation to evaluate code models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study evaluating the robustness of LLMs' code execution semantics understanding via the CRUXEval benchmark. It reports near-perfect (99%) accuracy for GPT-5.2 on unperturbed inputs but 20-24% drops under code transformations and input perturbations, with open-source models (DeepSeek-R1 family) showing more stable though lower accuracies (38-67%). Additional findings include poorer performance on exception-raising cases that varies by exception type, along with evaluation of remedies for exception prediction and their impact on non-exception cases.

Significance. If the perturbations are confirmed to preserve execution semantics, the results would provide concrete evidence that frontier LLMs rely on surface patterns rather than robust internal execution models, while highlighting the diagnostic value of perturbation testing for code models. The work supplies direct accuracy measurements across model families and exception categories, which could inform more reliable evaluation practices in LLM-based code understanding.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (results on perturbed inputs): the reported 20-24% accuracy decline for GPT-5.2 and the exception-type dependence lack sample sizes, statistical tests, exact perturbation definitions, and error bars, leaving the central brittleness claim under-supported.
  2. [Methods] Methods section on perturbations: no verification step is described confirming that the chosen code transformations and input perturbations preserve original execution behavior (output or exception type). Without this, the accuracy drops could arise from distribution shift or novel surface forms rather than failures in execution-semantics understanding.
minor comments (2)
  1. [Results] Results tables and figures would benefit from explicit confidence intervals or p-values for all accuracy comparisons to improve interpretability.
  2. [Methods] Clarify the exact definitions and examples of each perturbation type in a dedicated subsection or table to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. We address each of the major comments below and have revised the paper accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (results on perturbed inputs): the reported 20-24% accuracy decline for GPT-5.2 and the exception-type dependence lack sample sizes, statistical tests, exact perturbation definitions, and error bars, leaving the central brittleness claim under-supported.

    Authors: We agree that the results would be more robustly supported with explicit sample sizes, statistical tests, and error bars. In the revised manuscript we now report the exact sample size used for the perturbation experiments (N=500), include bootstrap-derived 95% confidence intervals as error bars, and apply McNemar's test to establish that the observed accuracy drops for GPT-5.2 are statistically significant (p < 0.001). We have also expanded the perturbation definitions in Section 3 with precise, reproducible descriptions. revision: yes
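The statistics named in this response are straightforward to reproduce in outline. A minimal sketch, our own and not the authors' analysis script, of an exact McNemar test on paired per-case correctness labels plus a bootstrap 95% confidence interval for the accuracy drop:

```python
# Exact McNemar test on paired original/perturbed correctness, plus a
# bootstrap 95% CI for the accuracy drop. Illustrative only.
import math
import random

def mcnemar_exact(correct_orig, correct_pert):
    """Two-sided exact McNemar p-value from paired boolean outcomes."""
    b = sum(o and not p for o, p in zip(correct_orig, correct_pert))  # right -> wrong
    c = sum(p and not o for o, p in zip(correct_orig, correct_pert))  # wrong -> right
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

def bootstrap_drop_ci(correct_orig, correct_pert, reps=10_000, seed=0):
    """95% CI for accuracy(original) - accuracy(perturbed) by case resampling."""
    rng = random.Random(seed)
    n = len(correct_orig)
    drops = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        drops.append(
            sum(correct_orig[i] for i in idx) / n
            - sum(correct_pert[i] for i in idx) / n
        )
    drops.sort()
    return drops[int(0.025 * reps)], drops[int(0.975 * reps)]
```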

  2. Referee: [Methods] Methods section on perturbations: no verification step is described confirming that the chosen code transformations and input perturbations preserve original execution behavior (output or exception type). Without this, the accuracy drops could arise from distribution shift or novel surface forms rather than failures in execution-semantics understanding.

    Authors: This is a fair and important point. While the original CRUXEval perturbations were intended to preserve semantics, we did not document an explicit verification step in the initial submission. In the revised Methods section we now describe and report a verification procedure: every perturbed program was re-executed in an independent Python interpreter, confirming that the output value or exception type matched the original execution in 100% of cases. The verification script is included in the supplementary material. revision: yes
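A verification step of the kind this response describes can be sketched generically for the code-transformation side: re-run the original and the perturbed program on the same input and require either the same output value or the same exception type. The snippet below assumes deterministic, side-effect-free CRUXEval-style functions named `f`; it is not the verification script from the supplementary material.

```python
# Sketch of a semantic-equivalence check between an original program and its
# perturbed version. Assumes deterministic benchmark functions named `f`.
def run(source, args):
    """Execute `source`, call f(*args), return ('ok', value) or ('exc', type name)."""
    namespace = {}
    exec(source, namespace)              # trusted benchmark code only
    try:
        return ("ok", namespace["f"](*args))
    except Exception as e:
        return ("exc", type(e).__name__)

def behavior_preserved(original_src, perturbed_src, args):
    return run(original_src, args) == run(perturbed_src, args)

# A variable-renamed perturbation should preserve behavior, including the
# exception type on a bad input.
orig = "def f(x):\n    return 10 // x\n"
pert = "def f(y):\n    return 10 // y\n"
assert behavior_preserved(orig, pert, (2,))
assert behavior_preserved(orig, pert, (0,))   # both raise ZeroDivisionError
```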

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper conducts an empirical evaluation by applying code transformations and input perturbations to the existing CRUXEval benchmark and directly measuring LLM accuracy on original versus perturbed inputs. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. All reported results consist of accuracy counts on held-out test cases. No load-bearing steps reduce to inputs by construction, and no self-citations or ansatzes are invoked to justify any derivation chain. The study measures models against an external benchmark rather than against constructs of its own.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that CRUXEval tasks and the chosen perturbations measure execution semantics rather than surface statistics. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Perturbations to code and inputs preserve the underlying execution semantics while changing surface form.
    Implicit in the claim that accuracy drops reveal lack of robust understanding.

pith-pipeline@v0.9.0 · 5493 in / 1191 out tokens · 23707 ms · 2026-05-15T19:43:36.927156+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 conditional novelty 6.0

    LLMs achieve only modest understanding of HMSC formal semantics at 52 percent accuracy, performing strongly on basic constructs but weakly on abstractions and traces.

  2. (How) Do Large Language Models Understand High-Level Message Sequence Charts?

    cs.SE 2026-05 unverdicted novelty 5.0

    LLMs achieve only 52% overall accuracy on HMSC semantic tasks, performing well on basic concepts but poorly on abstractions, compositions, and trace calculations.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Wasi Uddin Ahmad et al. 2025. OpenCodeReasoning: Advancing Data Distillation for Competitive Coding. arXiv, (Aug. 2025). arXiv: 2504.01943 [cs]. doi:10.48550/arXiv.2504.01943

  2. [2]

    Miltiadis Allamanis et al. 2025. Disproving Program Equivalence with LLMs. arXiv, (Feb. 2025). arXiv: 2502.18473 [cs]. doi:10.48550/arXiv.2502.18473

  3. [3]

    Ali Asgari et al. 2025. Metamorphic Testing of Deep Code Models: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol., (Sept. 2025). doi:10.1145/3766552

  4. [5]

    David Bieber et al. 2022. Static prediction of runtime errors by learning to execute programs with external resource descriptions. (2022). https://arxiv.org/abs/2203.03771 arXiv: 2203.03771 [cs.LG]

  5. [6]

    Aymen Bouguerra et al. 2026. Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization’s Impact on CLIP Beyond Accuracy. arXiv, (Feb. 2026). arXiv: 2509.21173 [cs]. doi:10.48550/arXiv.2509.21173

  6. [7]

    Islem Bouzenia et al. 2024. DyPyBench: a benchmark of executable python software. Proc. ACM Softw. Eng., 1, FSE, Article 16, (July 2024). doi:10.1145/3643742

  7. [8]

    Ira Ceka et al. 2025. How Does LLM Reasoning Work for Code? A Survey and a Call to Action. arXiv, (June 2025). arXiv: 2506.13932 [cs]. doi:10.48550/arXiv.2506.13932

  8. [9]

    Saikat Chakraborty et al. 2022. NatGen: Generative pre-training by “naturalizing” source code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing Machinery, New York, NY, USA, (Nov. 2022), 18–30. isbn: 978-1-4503-9413-0. doi:10.1...

  9. [10]

    Junkai Chen et al. 2025. Reasoning Runtime Behavior of a Program with LLM: How Far Are We? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. IEEE Press, (Sept. 2025), 1869–1881. isbn: 979-8-3315-0569-1

  10. [11]

    DeepSeek-AI et al. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 645, 8081, (Sept. 2025), 633–638. doi:10.1038/s41586-025-09422-z

  11. [12]

    Benedetta Donato et al. 2025. Studying How Configurations Impact Code Generation in LLMs: The Case of ChatGPT. In 2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC). IEEE Computer Society, (Apr. 2025), 442–453. isbn: 979-8-3315-0223-2. doi:10.1109/ICPC66645.2025.00055

  12. [13]

    Zeming Dong et al. 2023. MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). (Mar. 2023), 379–390. doi:10.1109/SANER56733.2023.00043

  13. [14]

    Aaron Grattafiori et al. 2024. The Llama 3 Herd of Models. arXiv, (Nov. 2024). arXiv: 2407.21783 [cs]. doi:10.48550/arXiv.2407.21783

  14. [15]

    Alex Gu et al. 2024. CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution. In Proceedings of the 41st International Conference on Machine Learning. PMLR, (July 2024), 16568–16621

  15. [16]

    Chenchen Gu et al. 2025. Auditing Prompt Caching in Language Model APIs. In Proceedings of the 42nd International Conference on Machine Learning. PMLR, (Oct. 2025), 20477–20496

  16. [17]

    Shuo Han et al. 2025. Prompting Instability: An Empirical Study of LLM Robustness in Code Vulnerability Detection. In AI 2025: Advances in Artificial Intelligence. Miaomiao Liu et al., editors. Springer Nature, Singapore, 233–245. isbn: 978-981-95-4969-6. doi:10.1007/978-981-95-4969-6_18

  17. [18]

    Dan Hendrycks et al. 2020. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations. (Oct. 2020). isbn: 979-8-3313-2194-9

  18. [19]

    Jordan Henkel et al. 2022. Semantic Robustness of Models of Source Code. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). (Mar. 2022), 526–537. doi:10.1109/SANER53432.2022.00070

  19. [20]

    Ashish Hooda et al. 2024. Do Large Code Models Understand Programming Concepts? Counterfactual Analysis for Code Predicates. In Proceedings of the 41st International Conference on Machine Learning. PMLR, (July 2024), 18738–18748

  20. [21]

    Chao Hu et al. 2024. How Effectively Do Code Language Models Understand Poor-Readability Code? In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE ’24). Association for Computing Machinery, New York, NY, USA, (Oct. 2024), 795–806. isbn: 979-8-4007-1248-7. doi:10.1145/3691620.3695072

  21. [22]

    Naman Jain et al. 2025. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. International Conference on Representation Learning, 2025, (May 2025), 58791–58831

  22. [23]

    Saqib Javed et al. 2025. QT-DoG: Quantization-Aware Training for Domain Generalization. In Proceedings of the 42nd International Conference on Machine Learning. PMLR, (Oct. 2025), 26981–27004

  23. [24]

    Juyong Jiang et al. 2025. A Survey on Large Language Models for Code Generation. ACM Trans. Softw. Eng. Methodol., (July 2025). doi:10.1145/3747588

  24. [25]

    Chris F Kemerer. 1995. Software complexity and software maintenance: A survey of empirical research. Annals of Software Engineering, 1, 1, 1–22

  25. [26]

    Man Ho Lam et al. 2025. CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. (Oct. 2025)

  26. [27]

    Changshu Liu et al. 2025. A Tool for In-depth Analysis of Code Execution Reasoning of Large Language Models. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. Association for Computing Machinery, New York, NY, USA, (July 2025), 1178–1182. isbn: 979-8-4007-1276-0

  27. [28]

    Jiawei Liu et al. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS ’23). Curran Associates Inc., Red Hook, NY, USA, (Dec. 2023), 21558–21572

  28. [29]

    Norman Mu et al. 2024. A Closer Look at System Message Robustness. In Neurips Safe Generative AI Workshop 2024. (Oct. 2024)

  29. [30]

    Dung Nguyen et al. 2025. CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs. International Conference on Representation Learning, 2025, (May 2025), 2614–2672

  30. [31]

    Smit Patel et al. 2025. Planning a large language model for static detection of runtime errors in code snippets. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 639–639

  31. [32]

    Julian Aron Prenner et al. 2025. ThrowBench: Benchmarking LLMs by predicting runtime exceptions. (2025). https://arxiv.org/abs/2503.04241 arXiv: 2503.04241 [cs.SE]

  32. [33]

    Qwen et al. 2025. Qwen2.5 Technical Report. arXiv, (Jan. 2025). arXiv: 2412.15115 [cs]. doi:10.48550/arXiv.2412.15115

  33. [34]

    Monoshi Kumar Roy et al. 2025. CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning. arXiv, (Oct. 2025). arXiv: 2506.00750 [cs]. doi:10.48550/arXiv.2506.00750

  34. [35]

    Mrinank Sharma et al. 2023. Towards Understanding Sycophancy in Language Models. arXiv: 2310.13548

  35. [36]

    Peiyang Song et al. 2026. Large language model reasoning failures. Transactions on Machine Learning Research

  36. [37]

    Claudio Spiess et al. 2024. Calibration and Correctness of Language Models for Code. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, (Oct. 2024), 495–507. isbn: 979-8-3315-0569-1. doi:10.1109/ICSE55347.2025.00040

  37. [38]

    Weisong Sun et al. 2025. Source Code Summarization in the Era of Large Language Models. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. IEEE Press, (Sept. 2025), 1882–1894. isbn: 979-8-3315-0569-1

  38. [39]

    Alex Wang et al. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc.

  39. [40]

    Shiqi Wang et al. 2023. ReCode: Robustness Evaluation of Code Generation Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Anna Rogers et al., editors. Association for Computational Linguistics, Toronto, Canada, (July 2023), 13818–13843. doi:10.18653/v1/2023.acl-long.773

  40. [41]

    Shufan Wang et al. 2025. Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs. arXiv, (Nov. 2025). arXiv: 2511.02197 [cs]. doi:10.48550/arXiv.2511.02197

  41. [42]

    Anjiang Wei et al. 2025. EquiBench: Benchmarking Large Language Models’ Reasoning about Program Semantics via Equivalence Checking. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Christos Christodoulopoulos et al., editors. Association for Computational Linguistics, Suzhou, China, (Nov. 2025), 33856–33869. isbn: 9...

  42. [43]

    Moshi Wei et al. 2023. CoCoFuzzing: Testing Neural Code Models With Coverage-Guided Fuzzing. IEEE Transactions on Reliability, 72, 3, (Sept. 2023), 1276–1289. doi:10.1109/TR.2022.3208239

  43. [44]

    Xin Xia et al. 2017. Measuring program comprehension: A large-scale field study with professionals. IEEE Transactions on Software Engineering, 44, 10, 951–976

  44. [45]

    Danning Xie et al. 2025. CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks. arXiv, (Nov. 2025). arXiv: 2507.05269 [cs]. doi:10.48550/arXiv.2507.05269

  45. [46]

    Ruiyang Xu et al. 2025. CRUXEVAL-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Wanxiang Che et al., editors. Association for Computational Linguistics, Vienna, Austria, (July 2025), 23762–23779. isbn: 979-8-89176...

  46. [47]

    Kaiwen Yan et al. 2025. STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning. arXiv, (Aug. 2025). arXiv: 2508.05193 [cs]. doi:10.48550/arXiv.2508.05193

  47. [48]

    An Yang et al. 2024. Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv, (Sept. 2024). arXiv: 2409.12122 [cs]. doi:10.48550/arXiv.2409.12122

  48. [49]

    Boyang Yang et al. 2025. A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications. arXiv, (Dec. 2025). arXiv: 2506.23749 [cs]. doi:10.48550/arXiv.2506.23749

  49. [50]

    Guang Yang et al. 2025. Assessing and improving syntactic adversarial robustness of pre-trained models for code translation. Information and Software Technology, 181, (May 2025), 107699. doi:10.1016/j.infsof.2025.107699