On-the-Fly Input Adaptation for Reliable Code Intelligence

Ravishka Rathnasuriya; Wei Yang

arxiv: 2605.19365 · v1 · pith:SO3IJ26Inew · submitted 2026-05-19 · 💻 cs.SE


Pith reviewed 2026-05-20 04:54 UTC · model grok-4.3


keywords: code language models · input adaptation · misprediction reduction · software engineering · reliable code intelligence · on-the-fly processing · code understanding tasks


The pith

Code language models reduce mispredictions by adapting inputs on the fly with syntax- and semantics-preserving operations, without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that code language models still err on real inputs despite up-to-date training, and that common fixes like retraining or prompt changes cost too much and fail to generalize. It offers on-the-fly input adaptation as an alternative: first validate which inputs are likely to fail, then transform them using operations that keep syntax and meaning intact so the inputs better match what the model already knows. This two-part process lifts accuracy on multiple code understanding tasks while leaving the model itself untouched. A reader should care because the method is lightweight, requires no new labels or redeployment, and targets exactly the reliability problems that matter in deployed software tools.

Core claim

The paper claims that an on-the-fly input adaptation strategy reduces mispredictions across diverse code understanding tasks and boosts model performance without retraining or additional supervision. The strategy has two stages: input validation, which detects inputs likely to be mispredicted, and input adaptation, which transforms those inputs with syntax- and semantics-preserving operations.

What carries the argument

On-the-fly input adaptation, a two-stage process of validation to flag risky inputs and transformation using syntax- and semantics-preserving operations to align them with the model's learned behavior.
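
The two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_model`, `needs_adaptation`, `adapt`, the 0.6 confidence threshold, and the two whitespace transformations are hypothetical stand-ins, since the abstract does not say which validation signal or which operations the paper uses.

```python
from typing import Callable, Dict, List

Probs = Dict[str, float]
Model = Callable[[str], Probs]
Transform = Callable[[str], str]


def toy_model(code: str) -> Probs:
    """Hypothetical frozen classifier that is less confident on tab-indented input."""
    p_clean = 0.55 if "\t" in code else 0.9
    return {"clean": p_clean, "defective": 1.0 - p_clean}


def confidence(probs: Probs) -> float:
    """Maximum class probability, used here as a simple validation signal."""
    return max(probs.values())


def needs_adaptation(model: Model, code: str, threshold: float = 0.6) -> bool:
    """Stage 1 (validation): flag inputs whose confidence is below the threshold."""
    return confidence(model(code)) < threshold


def tabs_to_spaces(code: str) -> str:
    """Whitespace-only change, assumed not to alter behavior in a C-like language."""
    return code.replace("\t", "    ")


def strip_trailing_whitespace(code: str) -> str:
    """Another whitespace-only change under the same assumption."""
    return "\n".join(line.rstrip() for line in code.split("\n"))


def adapt(model: Model, code: str, transforms: List[Transform],
          threshold: float = 0.6) -> str:
    """Stage 2 (adaptation): keep the variant the frozen model is most confident on."""
    if not needs_adaptation(model, code, threshold):
        return code  # unflagged inputs pass through untouched
    candidates = [code] + [t(code) for t in transforms]
    return max(candidates, key=lambda c: confidence(model(c)))


source = "int f(int a) {\n\treturn a + 1;\n}"
adapted = adapt(toy_model, source, [tabs_to_spaces, strip_trailing_whitespace])
```

The model's parameters are never touched; only the input changes. Selecting by confidence is one possible choice among many, and it says nothing about whether the chosen variant keeps the task label, which is the concern the referee report raises.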

If this is right

Model performance rises on code tasks at inference time rather than through expensive retraining cycles.
The same method applies across multiple code understanding tasks without task-specific retraining or tuning.
Computational and labeling costs drop because no model updates or new supervised data are required.
Reliability improves in high-stakes software engineering settings where consistent outputs matter most.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested as a lightweight complement to prompt-engineering techniques on the same models.
Similar validation-plus-transformation steps might prove useful for other sequence models outside code.
Deployment in environments with restricted model access becomes more feasible since no internal changes are needed.

Load-bearing premise

Syntax- and semantics-preserving operations can be identified and applied on the fly to transform inputs likely to cause mispredictions into ones that align better with the model's learned behavior.

What would settle it

An experiment in which the identified syntax- and semantics-preserving transformations are applied to mispredicted inputs and produce no reduction, or an increase, in error rates across the tested code understanding tasks.
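
That experiment amounts to comparing error rates on the same labeled inputs before and after adaptation. A minimal sketch follows; `error_rate` and `adaptation_helps` are hypothetical helper names, and the label lists are made-up placeholders rather than results from the paper.

```python
from typing import Sequence


def error_rate(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that disagree with the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


def adaptation_helps(before: Sequence[str], after: Sequence[str],
                     labels: Sequence[str]) -> bool:
    """True only if the error rate strictly drops once inputs are adapted."""
    return error_rate(after, labels) < error_rate(before, labels)


# Placeholder data for illustration only.
labels = ["defective", "clean", "defective", "clean"]
before = ["defective", "defective", "defective", "defective"]
after = ["defective", "clean", "defective", "defective"]
helped = adaptation_helps(before, after, labels)
```

The comparison is only meaningful if the labels still apply to the adapted inputs, so it presupposes the label-invariance check discussed in the referee report.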

Figures

Figures reproduced from arXiv: 2605.19365 by Ravishka Rathnasuriya, Wei Yang.

Original abstract

Code language models (CLMs) play a central role in software engineering across both generation and classification tasks. However, these models still exhibit notable mispredictions in real-world applications, even when trained on up-to-date data. Existing solutions address this by retraining the model, modifying its architecture, or re-engineering prompts. These approaches incur high computational cost requiring substantial effort in data labeling, model updates, and redeployment, and often suffer from poor generalization across tasks and tuning instability across models. This work proposes an alternative strategy based on on-the-fly input adaptation, which improves model behavior without altering its parameters or requiring additional supervision. The method consists of two stages: input validation, which detects inputs likely to cause mispredictions, and input adaptation, which transforms them using syntax- and semantics-preserving operations to better align with the model's learned behavior. This dual strategy reduces mispredictions across diverse code understanding tasks, boosting model performance without necessitating retraining. As a scalable and resource-efficient solution, this framework holds significant promise for high-stakes applications in software engineering where reliability is critical.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes detecting and adapting problematic inputs on the fly for code models to cut mispredictions without retraining, but the abstract supplies no experiments or checks to back the claims.


This paper's main idea is to improve code language models by validating inputs that are likely to cause mistakes and then adapting them on the fly using operations that keep syntax and semantics intact. It avoids the need to retrain or modify the model itself. The approach is presented as a new alternative to retraining, architecture changes, and prompt engineering. It does well in highlighting the practical drawbacks of those methods, such as high costs and poor generalization across tasks. The dual-stage method sounds straightforward and could be useful for real-world applications where reliability matters. However, the abstract provides no experimental details, so it's impossible to assess the actual gains or the baselines used. The concern about whether the adaptations truly preserve the ground-truth labels for tasks like defect detection is a real one. Without task-specific validation or an equivalence checker, reported improvements might partly result from unintended label shifts rather than the model performing better on the original task. This work would interest practitioners and researchers in software engineering focused on deploying reliable code intelligence tools. If the full paper includes proper controls and reproducible results, it could offer a practical contribution. I would recommend sending it for peer review so that the experiments can be properly evaluated.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes an on-the-fly input adaptation framework for code language models (CLMs) to improve reliability on code understanding tasks. It consists of an input validation stage that detects inputs likely to cause mispredictions and an input adaptation stage that applies syntax- and semantics-preserving operations to transform those inputs into ones better aligned with the model's learned behavior. The approach is positioned as a low-cost alternative to retraining, architectural changes, or prompt re-engineering, claiming reduced mispredictions across diverse tasks without requiring additional supervision or parameter updates.

Significance. If the method is shown to preserve task labels and deliver consistent, verifiable gains, the work would offer a practical, resource-efficient path to higher reliability for CLMs in high-stakes software engineering applications. It directly addresses the computational and generalization drawbacks of retraining-based solutions and could be broadly applicable across models and tasks.

major comments (1)

Input adaptation stage (method description): The central claim requires that the proposed syntax- and semantics-preserving operations transform misprediction-prone inputs while leaving the ground-truth label for the downstream task unchanged. For tasks such as defect detection or clone detection, local edits (e.g., variable renaming, dead-code insertion, or formatting) can be syntax-preserving yet alter semantic equivalence or the task label. The manuscript provides no task-specific validation—such as an equivalence checker, formal semantic analysis, or human annotation on the evaluation sets—to confirm label invariance. Without this, reported accuracy gains risk partial attribution to label drift rather than genuine distributional alignment.

minor comments (1)

Abstract: The performance claims would be strengthened by briefly indicating the concrete metrics, baselines, and code understanding tasks on which gains were measured.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern regarding validation of label invariance in the input adaptation stage below.


Referee: Input adaptation stage (method description): The central claim requires that the proposed syntax- and semantics-preserving operations transform misprediction-prone inputs while leaving the ground-truth label for the downstream task unchanged. For tasks such as defect detection or clone detection, local edits (e.g., variable renaming, dead-code insertion, or formatting) can be syntax-preserving yet alter semantic equivalence or the task label. The manuscript provides no task-specific validation—such as an equivalence checker, formal semantic analysis, or human annotation on the evaluation sets—to confirm label invariance. Without this, reported accuracy gains risk partial attribution to label drift rather than genuine distributional alignment.

Authors: We thank the referee for raising this important point. Our input adaptation operations are drawn from established code transformation techniques (e.g., semantics-preserving refactorings and obfuscations documented in prior work) that are intended to maintain both syntactic validity and functional equivalence, thereby preserving downstream task labels by construction. For example, variable renaming is applied consistently across all references without altering control or data flow, and dead-code insertion is restricted to unreachable or non-affecting statements. Nevertheless, we acknowledge that the current manuscript lacks explicit empirical confirmation of label invariance on the evaluation sets. In the revised manuscript we will add a dedicated subsection that (i) formally justifies label preservation for each operation per task, (ii) reports results from an automated equivalence checker (AST-based semantic comparison for clone detection and control-flow analysis for defect detection), and (iii) includes a human annotation study on a random sample of 200 adapted inputs to verify that ground-truth labels remain unchanged. These additions will directly address the risk of label drift.

revision: yes
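
To make the rebuttal's renaming example concrete, here is a minimal sketch for Python source using the standard `ast` module (`ast.unparse` needs Python 3.9 or later). It is one possible reading of "consistent renaming" and "AST-based comparison", not the authors' tooling: `bound_names`, `rename_variable`, `canonical_dump`, and `same_up_to_renaming` are hypothetical names, and the sketch ignores nested scopes, `global`/`nonlocal`, attribute names, and keyword arguments at call sites.

```python
import ast
import keyword
from typing import Dict, Set


def bound_names(tree: ast.AST) -> Set[str]:
    """Parameters and assigned names; names that are only read (builtins) are excluded."""
    bound: Set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
    return bound


class _Renamer(ast.NodeTransformer):
    """Apply a name mapping to every parameter and every name reference."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        self.generic_visit(node)  # also rewrite names inside annotations
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename_variable(source: str, old: str, new: str) -> str:
    """Rename one bound variable everywhere, refusing names that could collide."""
    tree = ast.parse(source)
    taken = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    taken |= {n.arg for n in ast.walk(tree) if isinstance(n, ast.arg)}
    taken |= {n.name for n in ast.walk(tree)
              if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}
    if old not in bound_names(tree):
        raise ValueError(f"{old!r} is not a parameter or assigned variable")
    if new in taken or not new.isidentifier() or keyword.iskeyword(new):
        raise ValueError(f"{new!r} is not a safe replacement name")
    return ast.unparse(_Renamer({old: new}).visit(tree))


def canonical_dump(source: str) -> str:
    """Dump the AST with bound names replaced by v0, v1, ... in first-seen order."""
    tree = ast.parse(source)
    bound = bound_names(tree)
    mapping: Dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name):
            name = node.id
        else:
            continue
        if name in bound and name not in mapping:
            mapping[name] = f"v{len(mapping)}"
    return ast.dump(_Renamer(mapping).visit(tree))


def same_up_to_renaming(a: str, b: str) -> bool:
    """Structural check: identical ASTs once bound variables are canonicalized."""
    return canonical_dump(a) == canonical_dump(b)


original = "def add(a, b):\n    total = a + b\n    return total\n"
renamed = rename_variable(original, "total", "s")
```

Equality up to renaming is a sufficient condition for identical behavior, not a necessary one. It would not certify dead-code insertion, for example, which changes the tree's shape and needs a different argument such as reachability analysis.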

Circularity Check

0 steps flagged

No circularity: procedural method without derivation chain


The paper describes a two-stage procedural framework (input validation followed by syntax- and semantics-preserving adaptation) that is applied at inference time to reduce mispredictions. No equations, fitted parameters, uniqueness theorems, or first-principles derivations appear in the provided abstract or method outline. The claimed performance gain is presented as an empirical outcome of the heuristic transformations rather than a quantity derived from prior fitted values or self-referential definitions. The central claim therefore does not reduce to its inputs by construction, satisfying the criteria for a self-contained engineering proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the core premise that preserving transformations exist and improve alignment is treated as a domain assumption rather than derived.

pith-pipeline@v0.9.0 · 5711 in / 1119 out tokens · 44112 ms · 2026-05-20T04:54:41.764935+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "input adaptation, which transforms them using syntax- and semantics-preserving operations to better align with the model’s learned behavior"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "variance-based validation via sub-models ... representation distance metrics"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 4 internal anchors

[1] Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, and Zhi Jin. 2025. Codescore: Evaluating code generation by learning code execution. ACM Transactions on Software Engineering and Methodology 34, 3 (2025), 1–22

[2] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning. PMLR, 1050–1059

[3] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024)

[4] Dan Hendrycks and Kevin Gimpel. 2018. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. arXiv:1610.02136 [cs.NE]

[5] Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Lei Ma, and Yves Le Traon. 2023. CodeS: towards code model generalization under distribution shift. In International Conference on Software Engineering (ICSE): New Ideas and Emerging Results (NIER)

[6] Xue Jiang, Yihong Dong, Lecheng Wang, Zheng Fang, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. 2024. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–30

[7] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems 30 (2017)

[8] Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, and André FT Martins. 2024. Doce: Finding the sweet spot for execution-based code generation. arXiv preprint arXiv:2408.13745 (2024)

[9] Yufei Li, Simin Chen, and Wei Yang. 2021. Estimating predictive uncertainty under program data distribution shift. arXiv preprint arXiv:2107.10989 (2021)

[10] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7

[11] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572

[12] Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. 2024. Evaluating Language Models for Efficient Code Generation. In First Conference on Language Modeling. https://openreview.net/forum?id=IBCBMeAhmC

[13] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021)

[14] Robert Munro Monarch. 2021. Human-in-the-Loop Machine Learning: Active learning and annotation for human-centered AI. Simon and Schuster

[15] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. PMLR, 26106–26128

[16] Hui Peng, Yan Shoshitaishvili, and Mathias Payer. 2018. T-Fuzz: fuzzing by program transformation. In 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 697–710

[17] Yun Peng, Akhilesh Deepak Gotmare, Michael R Lyu, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2025. Perfcodegen: Improving performance of llm generated code with execution feedback. In 2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge). IEEE, 1–13

[18] Ravishka Rathnasuriya. 2025. A Framework for On the Fly Input Refinement for Deep Learning Models. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 140–144

[19] Ravishka Rathnasuriya. 2025. On the Fly Input Refinement for Code Language Models. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 230–231

[20] Ravishka Rathnasuriya, Zijie Zhao, and Wei Yang. 2025. CodeImprove: Program Adaptation for Deep Code Models. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 676–676

[21] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023)

[22] Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell system technical journal 27, 3 (1948), 379–423

[23] Jacob Steinhardt and Percy S Liang. 2016. Unsupervised risk estimation using only conditional independence structure. Advances in Neural Information Processing Systems 29 (2016)

[24] Zhao Tian, Junjie Chen, and Xiangyu Zhang. 2023. On-the-fly Improving Performance of Deep Code Models via Input Denoising. arXiv preprint arXiv:2308.09969 (2023)

[25] Zhao Tian, Junjie Chen, and Xiangyu Zhang. 2025. Fixing Large Language Models’ Specification Misunderstanding for Better Code Generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 645–645

[26] Rijnard van Tonder and Claire Le Goues. 2020. Tailoring programs for static analysis via program transformation. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 824–834

[27] Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Akim Tsvigun, Daniil Vasilev, Rui Xing, Abdelrahman Boda Sadallah, Kirill Grishchenkov, Sergey Petrakov, et al. 2024. Benchmarking uncertainty quantification methods for large language models with lm-polygraph. arXiv preprint arXiv:2406.15627 (2024)

[28] Yan Xiao, Ivan Beschastnikh, David S. Rosenblum, Changsheng Sun, Sebastian Elbaum, Yun Lin, and Jin Song Dong. 2021. Self-Checking Deep Neural Networks in Deployment. arXiv:2103.02371 [cs.SE]

[29] Yan Xiao, Yun Lin, Ivan Beschastnikh, Changsheng Sun, David S Rosenblum, and Jin Song Dong. 2022. Repairing Failure-inducing Inputs with Input Reflection. In The 37th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE

[30] Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022. Natural Attack for Pre-Trained Models of Code. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22). Association for Computing Machinery, New York, NY, USA, 1482–1493. doi:10.1145/3510003.3510146

[31] Noam Yefet, Uri Alon, and Eran Yahav. 2020. Adversarial examples for models of code. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 1–30

[32] Shiwen Yu, Ting Wang, and Ji Wang. 2022. Data Augmentation by Program Transformation. Journal of Systems and Software 190 (Aug. 2022), 111304. doi:10.1016/j.jss.2022.111304

[33] Weiwei Zhang, Shengjian Guo, Hongyu Zhang, Yulei Sui, Yinxing Xue, and Yun Xu. 2023. Challenging Machine Learning-based Clone Detectors via Semantic-preserving Code Transformations. IEEE Transactions on Software Engineering 49, 5 (May 2023), 3052–3070. doi:10.1109/TSE.2023.3240118 arXiv:2111.10793 [cs]

[34] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in neural information processing systems 32 (2019)
