pith. machine review for the scientific record.

arxiv: 2604.21579 · v1 · submitted 2026-04-23 · 💻 cs.SE · cs.AI

Recognition: unknown

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:58 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords metamorphic testing · data leakage · LLM memorization · automated program repair · negative log-likelihood · program repair benchmarks

The pith

LLM program repair tools succeed less often on semantics-preserving bug variants, revealing memorization of training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies metamorphic testing, transforming bug benchmarks while preserving semantics, to test whether LLMs rely on memorized fixes. It evaluates seven LLMs on original and transformed versions of Defects4J and GitBug-Java. Repair success rates drop on the transformed variants, and the drops correlate with negative log-likelihood (NLL) on the originals. Together, these signals provide evidence of data leakage inflating performance estimates. Such transformations can help build more reliable evaluations for LLM-based automated program repair.

Core claim

All evaluated LLMs exhibit substantial drops in patch generation success rates on transformed benchmarks, from -4.1% for GPT-4o to -15.98% for Llama-3.1, and this degradation strongly correlates with NLL on the original benchmarks, indicating that models perform better on instances they are more likely to have memorized.

What carries the argument

Metamorphic testing using semantics-preserving transformations on program repair benchmarks, paired with negative log-likelihood as a proxy for memorization detection.
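As an illustration (not the paper's actual tooling, which targets Java benchmarks), a minimal semantics-preserving transformation of the kind described can be sketched in Python with the standard `ast` module: renaming a local variable changes the token sequence a model sees without changing behavior.

```python
import ast

class RenameLocals(ast.NodeTransformer):
    """Rename selected identifiers; a semantics-preserving transformation."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rewrite both loads and stores of the mapped names.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

def make_variant(source: str, mapping: dict) -> str:
    """Return source with identifiers renamed per `mapping`."""
    tree = RenameLocals(mapping).visit(ast.parse(source))
    return ast.unparse(tree)

# A toy function standing in for a buggy benchmark instance.
buggy = (
    "def mid(a, b, c):\n"
    "    nums = sorted([a, b, c])\n"
    "    return nums[1]\n"
)
variant = make_variant(buggy, {"nums": "sorted_values"})
```

The variant behaves identically on every input, so any change in a model's ability to handle it points at surface-level familiarity rather than difficulty — the metamorphic relation the paper exploits.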

If this is right

  • APR evaluations on standard benchmarks like Defects4J overestimate LLM performance due to potential memorization.
  • Combining metamorphic variants with NLL analysis strengthens detection of data leakage.
  • Metamorphic testing can be used to create leakage-resistant benchmarks for future LLM evaluations.
  • State-of-the-art LLMs vary in their susceptibility, with some showing larger performance drops than others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data for these LLMs likely includes the common bug fixes from Defects4J and similar datasets.
  • The method could be extended to diagnose memorization in other code-related LLM tasks such as code completion or summarization.
  • Developers of APR tools might consider training or fine-tuning on transformed or augmented bug data to improve generalization.

Load-bearing premise

The semantics-preserving transformations do not change the inherent difficulty of repairing the bugs beyond removing memorization advantages.

What would settle it

Observing no significant drop in repair success rates on the transformed benchmarks or finding no correlation between performance degradation and NLL values would falsify the claim of data leakage via memorization.
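The correlation half of that test can be sketched with a rank correlation on hypothetical per-bug data (the numbers below are made up, and the paper does not specify which coefficient it uses; on the memorization reading, low NLL should pair with a large post-transformation drop, i.e. a negative correlation):

```python
def ranks(xs):
    """1-based ranks (assumes no ties, which holds for the toy data below)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-bug values: lower NLL (more familiar) pairs with a
# larger success-rate drop after transformation.
nll = [0.4, 0.9, 1.3, 1.8, 2.6]
drop = [0.22, 0.15, 0.11, 0.06, 0.01]
rho = spearman(nll, drop)  # perfectly monotone toy data, so rho == -1.0
```

A rho near zero on real per-bug data would be the "no correlation" outcome that undercuts the leakage interpretation.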

Figures

Figures reproduced from arXiv: 2604.21579 by Ali Asgari, Annibale Panichella, Milan De Koning, Pouria Derakhshanfar.

Figure 1
Figure 1. Experimental pipeline. The paper's three research questions: RQ1, what is the impact of metamorphic transformations on the performance of LLM-based program repair? RQ2, how is the observed LLMs' performance drop under metamorphic transformations related to potential data leakage in the benchmark code? RQ3, which types of code pattern… view at source ↗
Figure 2
Figure 2. Distribution of success rate differences. view at source ↗
Figure 4
Figure 4. Mean success rate drop (SRdiff) across NLL-based confidence categories. Bugs are grouped into five bins by NLL percentile, where category 0 contains the most familiar (lowest NLL) instances and category 4 the least familiar (highest NLL). This analysis covers the four open-source models: Gemma, Llama, Mistral, and StarCoder; NLL values are not available f… view at source ↗
Figure 5
Figure 5. Defects4J projects by average success rate. view at source ↗
read the original abstract

LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization. We construct variant benchmarks by applying semantics-preserving transformations to two widely used datasets, Defects4J and GitBug-Java. Using these benchmarks, we evaluate the repair success rates of seven LLMs on both original and transformed versions, and analyze the relationship between performance degradation and NLL. Our results show that all evaluated state-of-the-art LLMs exhibit substantial drops in patch generation success rates on transformed benchmarks, ranging from -4.1% for GPT-4o to -15.98% for Llama-3.1. Furthermore, we find that this degradation strongly correlates with NLL on the original benchmarks, suggesting that models perform better on instances they are more likely to have memorized. These findings show that combining MT with NLL provides stronger and more reliable evidence of data leakage, while metamorphic testing alone can help mitigate its effects in LLM-based APR evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that combining metamorphic testing (MT) via semantics-preserving transformations on Defects4J and GitBug-Java with negative log-likelihood (NLL) as a memorization proxy provides stronger evidence of data leakage in LLM-based APR. Across seven LLMs, it reports consistent drops in patch generation success on transformed benchmarks (e.g., -4.1% for GPT-4o to -15.98% for Llama-3.1) that correlate with original NLL, concluding that MT mitigates leakage effects in evaluations.

Significance. If the performance degradation can be isolated to loss of memorization rather than transformation-induced difficulty, the work would supply a practical diagnostic for more trustworthy LLM-APR benchmarking. The multi-model, multi-dataset design and direct use of NLL correlation add empirical weight, though the approach remains observational rather than providing machine-checked proofs or parameter-free derivations.

major comments (2)
  1. Abstract and evaluation description: The attribution of success-rate drops to reduced memorization assumes that the semantics-preserving transformations (variable renaming, statement reordering, equivalent expression substitution) do not independently increase repair difficulty via altered token sequences or syntactic patterns. No control experiments—such as difficulty metrics on non-memorized models, human repair times, or number of plausible patches—are reported to rule out this confound, making the causal link to leakage load-bearing but unisolated.
  2. Abstract: The reported 'strong correlation' between degradation and NLL lacks any mention of the correlation coefficient, statistical significance tests, confidence intervals, or controls for other variables (e.g., bug complexity or test-suite size) that could affect repair difficulty, weakening the claim that NLL serves as a clean proxy.
minor comments (1)
  1. Abstract: The exact transformation rules and their implementation details are not specified, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we provide point-by-point responses to the major comments and describe the revisions we will make.

read point-by-point responses
  1. Referee: Abstract and evaluation description: The attribution of success-rate drops to reduced memorization assumes that the semantics-preserving transformations (variable renaming, statement reordering, equivalent expression substitution) do not independently increase repair difficulty via altered token sequences or syntactic patterns. No control experiments—such as difficulty metrics on non-memorized models, human repair times, or number of plausible patches—are reported to rule out this confound, making the causal link to leakage load-bearing but unisolated.

    Authors: We agree that additional controls would help isolate the effect. Our approach uses the correlation with NLL as evidence that the degradation is linked to memorization. We will revise the evaluation section to explicitly discuss this potential limitation and propose future experiments involving non-memorized models or human studies. The observed consistency across models and datasets supports our interpretation, but we acknowledge the causal link is not fully proven. revision: partial

  2. Referee: Abstract: The reported 'strong correlation' between degradation and NLL lacks any mention of the correlation coefficient, statistical significance tests, confidence intervals, or controls for other variables (e.g., bug complexity or test-suite size) that could affect repair difficulty, weakening the claim that NLL serves as a clean proxy.

    Authors: We will update the abstract to include the specific correlation statistics from our analysis, including the coefficient, p-value, and confidence intervals. We will also add a brief discussion on controlling for variables such as bug complexity, noting that the transformations preserve semantics and test suites, and the correlation holds after accounting for dataset variations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations only

full rationale

The paper is an empirical study that applies semantics-preserving transformations to existing benchmarks (Defects4J, GitBug-Java), measures repair success rates of seven LLMs on original vs. transformed versions, and reports correlations with NLL values. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims rest on direct experimental measurements rather than any reduction of outputs to inputs by construction. The reader's noted assumption about transformation neutrality is a validity concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that the chosen transformations isolate memorization effects without altering inherent repair difficulty, plus the assumption that NLL is a reliable memorization proxy.

axioms (2)
  • domain assumption Semantics-preserving transformations preserve bug repair difficulty except for memorization effects.
    Invoked to interpret performance drops as evidence of leakage rather than transformation artifacts.
  • domain assumption Negative log-likelihood on original benchmarks is a valid proxy for memorization.
    Used to link performance degradation to memorization.
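The NLL proxy itself is simple to state: given per-token log-probabilities from an open model (the paper uses Gemma, Llama, Mistral, and StarCoder; obtaining the log-probabilities is model-specific and not shown here), the memorization-suspect instances are those with low mean NLL. A minimal sketch:

```python
import math

def mean_nll(token_logprobs):
    """Mean negative log-likelihood of a token sequence.

    `token_logprobs` are log p(token_i | tokens_<i) from a language model.
    Low values mean the model finds the sequence familiar, which the paper
    treats as a memorization signal.
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    return -sum(token_logprobs) / len(token_logprobs)

# A sequence assigned probability 0.5 at every step has mean NLL ln 2.
familiar = mean_nll([math.log(0.5)] * 8)
```

Comparing this statistic across benchmark instances, rather than its absolute value, is what the correlation analysis depends on.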

pith-pipeline@v0.9.0 · 5556 in / 1163 out tokens · 23449 ms · 2026-05-09T20:58:09.489165+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Bidirectional Empowerment of Metamorphic Testing and Large Language Models: A Systematic Survey

    cs.SE 2026-05 accept novelty 4.0

    A systematic survey of 93 studies that maps the bidirectional relationship between metamorphic testing and LLMs, proposing a taxonomy for MT applied to LLMs and LLMs applied to MT.

Reference graph

Works this paper leans on

65 extracted references · 7 canonical work pages · cited by 1 Pith paper · 7 internal anchors
