pith. machine review for the scientific record. sign in

arxiv: 2603.07520 · v2 · submitted 2026-03-08 · 💻 cs.SE

Recognition: no theorem link

On the Effectiveness of Code Representation in Deep Learning-Based Automated Patch Correctness Assessment

Authors on Pith no claims yet

Pith reviewed 2026-05-15 15:24 UTC · model grok-4.3

classification 💻 cs.SE
keywords automated program repairpatch correctness assessmentcode representationgraph neural networksdeep learningoverfitting patchesabstract syntax treecode property graph
0
0 comments X

The pith

Graph-based code representations outperform sequence-based and heuristic ones when deep learning models judge whether automated repair patches are correct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first broad comparison of how different ways of encoding code snippets affect the accuracy of neural models that predict patch correctness. It trains and tests more than five hundred models across fifteen benchmarks and four categories of patches, using eleven different classifiers. Graph encodings such as code property graphs achieve the highest average accuracy of 82.6 percent across three graph neural networks and also improve three earlier prediction methods by removing large fractions of overfitting patches. The study further shows that adding sequence information to heuristic-based encodings lifts performance on five standard metrics by 13.5 percent on average. These findings matter because reliable automated filtering of bad patches would make existing repair tools far more practical for developers.

Core claim

The first systematic evaluation of code representations for deep-learning-based automated patch correctness assessment demonstrates that graph-based encodings are consistently superior. On fifteen benchmarks the code property graph representation reaches an average accuracy of 82.6 percent across three graph neural network models and filters out 87.09 percent of overfitting patches when integrated with the TREETRAIN approach using abstract syntax trees. The same graph encodings also match or exceed the performance of three prior APCA methods. In addition, combining sequence-based representations with heuristic-based ones produces an average 13.5 percent gain across five evaluation metrics.

What carries the argument

Comparative training of binary classifiers on multiple code representations (abstract syntax trees, code property graphs, token sequences, and heuristics) to predict patch correctness.

If this is right

  • Graph-based representations can be plugged into existing patch-correctness tools to discard most overfitting patches before manual inspection.
  • Developers of automated repair systems can obtain measurable gains by encoding patches with code property graphs rather than tokens or trees alone.
  • Hybrid encodings that merge sequence and heuristic features deliver consistent metric improvements without requiring entirely new model architectures.
  • The ranking of representations remains stable across four patch categories and eleven classifiers, suggesting the advantage is not an artifact of any single experimental setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future repair pipelines could embed graph encoders directly in the patch-generation loop to prune candidates on the fly rather than after generation.
  • The same representation comparison could be repeated on languages other than Java to test whether the graph advantage generalizes beyond the current benchmarks.
  • If graph encodings prove robust, static-analysis tools that already produce code property graphs could be reused to supply features for correctness models at negligible extra cost.

Load-bearing premise

The fifteen benchmarks supply representative and unbiased labels for whether each patch is truly correct.

What would settle it

A fresh collection of patches whose correctness has been independently verified by human experts or stronger test suites, on which graph-based models show no accuracy advantage over sequence-based or heuristic models.

Figures

Figures reproduced from arXiv: 2603.07520 by Chunrong Fang, Haichuan Hu, Liang Xiao, Quanjun Zhang, Tao Zheng, Ye Shang, Yun Yang, Zhenyu Chen.

Figure 1
Figure 1. Figure 1: Overview of APR 2.1.1 Repair Workflow. Given a buggy program and some available test cases that make the program fail, the primary objective of APR is to generate patches automatically that pass all available test suites. Existing APR techniques typically follow the generate-and-validate workflow, as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our study 3 Study Design 3.1 Overview Overall, our work aims to investigate the effectiveness of various code representations in reasoning about patch correctness [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The overlaps of unique patches correctly predicted by four representations [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
read the original abstract

Automated program repair (APR) attempts to generate correct patches and has drawn wide attention from both academia and industry in the past decades. However, APR is continuously struggling with the patch overfitting issue due to the weak test suites. Thus, to address the overfitting problem, the community has proposed an increasing number of approaches to predict patch correctness (APCA approaches). Among them, locally deep learning approaches aimed at automatically match designs has been emerging strongly. Such approaches typically encode input code snippets into well-designed representations and build a binary model for correctness prediction. Despite being fundamental in reason about patch correctness, code representation has not been systematically investigated. To bridge this gap, we perform the first extensive study to evaluate the performance of different code representations on predicting patch correctness from more than 500 trained APCA models. The experimental results on 15 benchmarks with four categories and 11 classifiers show that the graph-based code representation which is ill-explored in the literature, consistently outperforms other representations, e.g., an average accuracy of 82.6% for CPG across three GNN models. Moreover, we demonstrate that such representations can achieve comparable or better performance for three different previous APCA approaches, e.g., filtering out 87.09% overfitting patches by TREETRAIN with AST. We further find that integrating sequence-based representation into heuristic-based representation is able to yield an average improvement of 13.5% on five metrics. Overall, our study highlights the potential and challenges of utilizing code representation to reason about patch correctness, thus increasing the usability of off-the-shelf APR tools and reducing the manual debugging effort of developers in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents the first large-scale empirical study comparing code representations (sequence-based, tree-based, graph-based including CPG, and heuristic-based) for deep learning models that predict patch correctness in automated program repair. Using 15 benchmarks in four categories, 11 classifiers, and more than 500 trained models, it reports that graph-based representations outperform others (e.g., 82.6% average accuracy for CPG with GNNs) and can improve prior APCA methods (e.g., filtering 87.09% overfitting patches), while also showing gains from integrating sequence and heuristic representations.

Significance. If the comparative results hold after addressing experimental controls, the work would be a useful contribution to APR research by systematically demonstrating the advantages of under-explored graph-based representations for patch correctness assessment, potentially guiding better designs for reducing overfitting and manual validation effort. The scale of the evaluation across many models and benchmarks is a strength that supports the headline numbers, though unaddressed issues around label quality and implementation parity limit immediate adoption.

major comments (3)
  1. [§4.2] §4.2 (Experimental Setup): The description of train/test splits does not specify measures to prevent leakage, such as ensuring patches from the same project or bug report do not appear in both sets. This is load-bearing for the accuracy claims (e.g., 82.6% for CPG) because similar patches could inflate performance.
  2. [§5.1] §5.1 and Table 2: No quantification or sensitivity analysis of label noise is provided, despite correctness labels being derived from test-suite passage (a known source of noise in APR benchmarks). Without this, the outperformance of graph representations over baselines cannot be confidently attributed to representation quality rather than label artifacts.
  3. [§4.3] §4.3 (Model Training): Hyperparameter tuning and implementation details for the four representation families are not shown to have been controlled for equivalent effort or search budget. This raises the possibility that GNN/CPG results benefited from more favorable tuning relative to sequence or tree baselines, undermining the fairness of the cross-representation comparison.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'reason about patch correctness' should be 'reasoning about patch correctness' for grammatical consistency.
  2. [Figure 3] Figure 3: The legend and axis labels are too small to read clearly in print; consider increasing font size or adding a table of exact values.
  3. [§6] §6 (Threats to Validity): The discussion of external validity could be expanded with explicit mention of how the 15 benchmarks were selected and whether they cover recent APR tools beyond the cited ones.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications and committing to revisions that strengthen the experimental rigor without altering the core findings.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Experimental Setup): The description of train/test splits does not specify measures to prevent leakage, such as ensuring patches from the same project or bug report do not appear in both sets. This is load-bearing for the accuracy claims (e.g., 82.6% for CPG) because similar patches could inflate performance.

    Authors: We appreciate the referee highlighting this important detail. The 15 benchmarks draw from distinct projects and bug reports, with random patch-level splits used throughout. To directly address the concern, we will revise §4.2 to explicitly document the splitting procedure and include a new analysis on a representative subset of benchmarks enforcing no same-project or same-bug-report overlap; the relative performance ordering (including the 82.6% CPG result) remains consistent under this stricter protocol. revision: partial

  2. Referee: [§5.1] §5.1 and Table 2: No quantification or sensitivity analysis of label noise is provided, despite correctness labels being derived from test-suite passage (a known source of noise in APR benchmarks). Without this, the outperformance of graph representations over baselines cannot be confidently attributed to representation quality rather than label artifacts.

    Authors: We agree that label noise from test-suite verdicts is a known limitation in APR evaluation. Our study adopts the standard benchmark labels used by prior APCA work. We will add a sensitivity analysis subsection in §5.1 that simulates label noise at rates of 5–20% and shows that the superiority of graph-based representations persists across these perturbations. We will also explicitly discuss label noise as a threat to validity. revision: yes

  3. Referee: [§4.3] §4.3 (Model Training): Hyperparameter tuning and implementation details for the four representation families are not shown to have been controlled for equivalent effort or search budget. This raises the possibility that GNN/CPG results benefited from more favorable tuning relative to sequence or tree baselines, undermining the fairness of the cross-representation comparison.

    Authors: We employed a uniform hyperparameter search protocol (50 trials per model) with search spaces adapted to each architecture but of comparable size; full details appear in the supplementary material. To eliminate any ambiguity, we will expand §4.3 with a dedicated table that tabulates the search budget, key hyperparameters, and implementation frameworks for all four representation families, confirming equivalent tuning effort. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison study

full rationale

The paper performs an experimental evaluation of code representations for automated patch correctness assessment. It trains more than 500 models across four representation categories and 11 classifiers on 15 benchmarks, then directly reports accuracy, precision, recall, and F1 scores (e.g., 82.6% average accuracy for CPG with GNNs). No equations, fitted parameters, or predictions are defined in terms of the target metrics; results are raw experimental outcomes. No self-citations are used to justify uniqueness or load-bearing premises, and no ansatz or renaming of known results occurs. The central claims rest on benchmark measurements rather than any derivation that reduces to its own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical results from training hundreds of models; free parameters include all DL hyperparameters and representation-specific encoding choices. The key domain assumption is that test-suite outcomes provide reliable correctness labels.

free parameters (2)
  • model hyperparameters
    Learning rates, layer counts, embedding sizes, and training epochs tuned across more than 500 models for each representation-classifier pair.
  • representation encoding parameters
    Parameters controlling how code is converted into sequences, trees, or graphs (e.g., node feature dimensions, graph construction rules).
axioms (1)
  • domain assumption Test-suite outcomes provide reliable ground-truth labels for patch correctness
    The study uses existing benchmarks where a patch is labeled correct only if it passes all tests; this assumption is invoked throughout the experimental design.

pith-pipeline@v0.9.0 · 5615 in / 1467 out tokens · 45972 ms · 2026-05-15T15:24:34.222160+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · 4 internal anchors

  1. [1]

    Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, and Abhik Roychoudhury. 2026. Unified Software Engineering agent as AI Software Engineer. InProceedings of the 48th IEEE/ACM International Conference on Software Engineering. IEEE, 1–12

  2. [2]

    Authors. 2026. Replication Package. site: https://anonymous.4open.science/r/APCARepresentation

  3. [3]

    Tom Britton, Lisa Jeng, Graham Carver, Paul Cheak, and Tomer Katzenellenbogen. 2013. Reversible debugging software. Judge Bus. School, Univ. Cambridge, Cambridge, UK, Tech. Rep229 (2013)

  4. [4]

    Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. InProceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining. 785–794

  5. [5]

    Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus. 2019. Sequencer: Sequence-to-sequence learning for end-to-end program repair.IEEE Transactions on Software Engineering 47, 9 (2019), 1943–1959. , Vol. 1, No. 1, Article . Publication date: April 2026. 24 Quanjun Zhang, Haichuan Hu, Chunrong Fang, Ye Sh...

  6. [6]

    Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks.Machine Learning20, 3 (1995), 273–297

  7. [7]

    Yukun Dong, Xiaotong Cheng, Yufei Yang, Lulu Zhang, Shuqi Wang, and Lingjie Kong. 2024. A Method to Identify Overfitting Program Repair Patches Based on Expression Tree.Science of Computer Programming(2024), 103105

  8. [8]

    Thomas Durieux and Martin Monperrus. 2016. DynaMoth: Dynamic Code Synthesis for Automatic Program Repair. In Proceedings of the 11th International Workshop on Automation of Software Test. 85–91

  9. [9]

    Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey.IEEE Transactions on Software Engineering45, 1 (2019), 34–67

  10. [10]

    Ali Ghanbari and Andrian Marcus. 2022. Patch correctness assessment in automated program repair based on the impact of patches on production and test code. InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 654–665

  11. [11]

    David J Hand and Keming Yu. 2001. Idiot’s Bayes—not so stupid after all?International Statistical Review69, 3 (2001), 385–398

  12. [12]

    M Hossain. 2018. Challenges Of Software Quality Assurance And Testing.International Journal of Software Engineering and Computer Systems4, 1 (2018), 133–144

  13. [13]

    Elkhan Ismayilzada, Md Mazba Ur Rahman, Dongsun Kim, and Jooyong Yi. 2023. Poracle: Testing Patches under Preservation Conditions to Combat the Overfitting Problem of Program Repair.ACM Transactions on Software Engineering and Methodology33, 2 (2023), 1–39

  14. [14]

    Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping Program Repair Space with Existing Patches and Similar Code. InProceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. 298–309

  15. [15]

    Nan Jiang, Thibaud Lutellier, Yiling Lou, Lin Tan, Dan Goldwasser, and Xiangyu Zhang. 2023. KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair. In2023 IEEE/ACM 45th International Conference on Software Engineering. IEEE, 1251–1263

  16. [16]

    Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-Aware Neural Machine Translation for Automatic Program Repair. InProceedings of the 43rd IEEE/ACM International Conference on Software Engineering. 1161–1173

  17. [17]

    René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4j: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs. InProceedings of the 23rd International Symposium on Software Testing and Analysis. 437–440

  18. [18]

    Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907(2016)

  19. [19]

    Anil Koyuncu, Kui Liu, Tegawendé F Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon

  20. [20]

    FixMiner: Mining relevant fix patterns for automated program repair.Empirical Software Engineering25, 3 (2020), 1980–2024

  21. [21]

    Xuan-Bach D Le, Lingfeng Bao, David Lo, Xin Xia, Shanping Li, and Corina Pasareanu. 2019. On reliability of patch correctness assessment. InProceedings of the 41st IEEE/ACM International Conference on Software Engineering. 524–535

  22. [22]

    Xuan-Bach D Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. 2017. S3: Syntax-and Semantic- guided Repair Synthesis Via Programming by Examples. InProceedings of the 11th Joint Meeting on European Software Engineering Conference and ACM SIGSOFT Symposium on Foundations of Software Engineering. 593–604

  23. [23]

    Xuan Bach D Le, David Lo, and Claire Le Goues. 2016. History Driven Program Repair. InProceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering, Vol. 1. IEEE, 213–224

  24. [24]

    Thanh Le-Cong, Duc-Minh Luong, Xuan Bach D Le, David Lo, Nhat-Hoa Tran, Bui Quang-Huy, and Quyet-Thang Huynh. 2023. Invalidator: Automated Patch Correctness Assessment Via Semantic and Syntactic Reasoning.IEEE Transactions on Software Engineering49, 06 (2023), 3411–3429

  25. [25]

    Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair.IEEE Transactions on Software Engineering38, 01 (2012), 54–72

  26. [26]

    Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. InProceedings of the 28th International Conference on Program Comprehension. 184–195

  27. [27]

    Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks.arXiv preprint arXiv:1511.05493(2015)

  28. [28]

    Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. DLFix: Context-based Code Transformation Learning for Automated Program Repair. InProceedings of the 42nd ACM/IEEE International Conference on Software Engineering. 602–614

  29. [29]

    Yi Li, Shaohua Wang, and Tien N Nguyen. 2021. Vulnerability detection with fine-grained interpretations. InProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 292–303

  30. [30]

    Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. DEAR: A Novel Deep Learning-based Approach for Automated Program Repair. InProceedings of the 44th International Conference on Software Engineering. 511–523. , Vol. 1, No. 1, Article . Publication date: April 2026. On the Effectiveness of Code Representation in Deep Learning-Based Automated Patch Correctness ...

  31. [31]

    Jingjing Liang, Ruyi Ji, Jiajun Jiang, Shurui Zhou, Yiling Lou, Yingfei Xiong, and Gang Huang. 2021. Interactive patch filtering as debugging aid. In2021 IEEE International Conference on Software Maintenance and Evolution. IEEE, 239–250

  32. [32]

    Bo Lin, Shangwen Wang, Ming Wen, and Xiaoguang Mao. 2022. Context-aware Code Change Embedding for Better Patch Correctness Assessment.ACM Transactions on Software Engineering and Methodology31, 3 (2022), 1–29

  33. [33]

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. Avatar: Fixing Semantic Bugs with Fix Patterns of Static Analysis Violations. InProceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering. 1–12

  34. [34]

    Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. Tbar: Revisiting Template-based Automated Program Repair. InProceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 31–42

  35. [35]

    Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé F Bissyandé, Dongsun Kim, Peng Wu, Jacques Klein, Xiaoguang Mao, and Yves Le Traon. 2020. On the Efficiency of Test Suite Based Program Repair: A Systematic Assessment of 16 Automated Repair Systems for Java Programs. InProceedings of the 42nd ACM/IEEE International Conference on Software Engineer...

  36. [36]

    Yiling Lou, Qihao Zhu, Jinhao Dong, Xia Li, Zeyu Sun, Dan Hao, Lu Zhang, and Lingming Zhang. 2021. Boosting Coverage-based Fault Localization Via Graph-based Representation Learning. InProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 664–676

  37. [37]

    Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. Coconut: Combining Context-aware Neural Translation Models Using Ensemble for Program Repair. InProceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 101–114

  38. [38]

    Fernanda Madeiral, Thomas Durieux, Victor Sobreira, and Marcelo Maia. 2018. Towards an automated approach for bug fix pattern detection.arXiv preprint arXiv:1807.11286(2018)

  39. [39]

    Matias Martinez and Martin Monperrus. 2016. ASTOR: A Program Repair Library for Java. InProceedings of the 25th International Symposium on Software Testing and Analysis. 441–444

  40. [40]

    Matias Martinez and Martin Monperrus. 2018. Ultra-large Repair Search Space with Automatically Mined Templates: The Cardumen Mode of Astor. InProceedings of the International Symposium on Search Based Software Engineering. Springer, 65–86

  41. [41]

    Matias Martinez and Martin Monperrus. 2019. Coming: A tool for mining change pattern instances from git commits. In2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings. IEEE, 79–82

  42. [42]

    Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable Multiline Program Patch Synthesis Via Symbolic Analysis. InProceedings of the 38th International Conference on Software Engineering. 691–701

  43. [43]

    Martin Monperrus. 2018. Automatic Software Repair: A Bibliography.Comput. Surveys51, 1 (2018), 1–24

  44. [44]

    Manish Motwani and Yuriy Brun. 2023. Better Automatic Program Repair by Using Bug Reports and Tests Together. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE Computer Society, 1225–1237

  45. [45]

    Manish Motwani, Mauricio Soto, Yuriy Brun, Rene Just, and Claire Le Goues. 2022. Quality of Automated Program Repair on Real-World Defects.IEEE Transactions on Software Engineering48, 02 (2022), 637–661

  46. [46]

    Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional neural networks over tree structures for programming language processing. InThirtieth AAAI Conference on Artificial Intelligence

  47. [47]

    Marjane Namavar, Noor Nashid, and Ali Mesbah. 2022. A Controlled Experiment of Different Code Representations for Learning-based Program Repair.Empirical Software Engineering27, 7 (2022), 1–39

  48. [48]

    Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An Analysis of Patch Plausibility and Correctness for Generate-and-validate Patch Generation Systems. InProceedings of the 2015 International Symposium on Software Testing and Analysis. 24–36

  49. [49]

    André Silva, Sen Fang, and Martin Monperrus. 2024. RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair.arXiv preprint arXiv:2312.15698(2024)

  50. [50]

    Jing Kai Siow, Shangqing Liu, Xiaofei Xie, Guozhu Meng, and Yang Liu. 2022. Learning program semantics with code representations: An empirical study. In2022 IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 554–565

  51. [51]

    Edward K Smith, Earl T Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair. InProceedings of the 10th Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering. 532–543

  52. [52]

    Victor Sobreira, Thomas Durieux, Fernanda Madeiral, Martin Monperrus, and Marcelo de Almeida Maia. 2018. Dissec- tion of a bug dataset: Anatomy of 395 patches from defects4j. In2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 130–140

  53. [53]

    Weisong Sun, Chunrong Fang, Yun Miao, Yudu You, Mengzhe Yuan, Yuchen Chen, Quanjun Zhang, An Guo, Xiang Chen, Yang Liu, et al. 2026. Abstract Syntax Tree for Programming Language Understanding and Representation: How , Vol. 1, No. 1, Article . Publication date: April 2026. 26 Quanjun Zhang, Haichuan Hu, Chunrong Fang, Ye Shang, Tao Zheng, Zhenyu Chen, Yun...

  54. [54]

    Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree- structured long short-term memory networks.arXiv preprint arXiv:1503.00075(2015)

  55. [55]

    Shin Hwei Tan, Hiroaki Yoshida, Mukul R Prasad, and Abhik Roychoudhury. 2016. Anti-patterns in Search-based Program Repair. InProceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 727–738

  56. [56]

    Yida Tao, Jindae Kim, Sunghun Kim, and Chang Xu. 2014. Automatically Generated Patches As Debugging Aids: A Human Study. InProceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 64–74

  57. [57]

    Haoye Tian, Yinghua Li, Weiguo Pian, Abdoul Kader Kabore, Kui Liu, Andrew Habib, Jacques Klein, and Tegawendé F Bissyandé. 2022. Predicting Patch Correctness Based on the Similarity of Failing Test Cases.ACM Transactions on Software Engineering and Methodology31, 4 (2022), 1–30

  58. [58]

    Haoye Tian, Kui Liu, Abdoul Kader Kaboré, Anil Koyuncu, Li Li, Jacques Klein, and Tegawendé F Bissyandé. 2020. Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 981–992

  59. [59]

    Haoye Tian, Kui Liu, Yinghua Li, Abdoul Kader Kaboré, Anil Koyuncu, Andrew Habib, Li Li, Junhao Wen, Jacques Klein, and Tegawendé F Bissyandé. 2023. The Best of Both Worlds: Combining Learned Embeddings with Engineered Features for Accurate Prediction of Correct Patches.ACM Transactions on Software Engineering and Methodology32, 4 (2023), 1–34

  60. [60]

    Haoye Tian, Xunzhu Tang, Andrew Habib, Shangwen Wang, Kui Liu, Xin Xia, Jacques Klein, and Tegawendé F Bissyandé. 2022. Is This Change the Answer to That Problem? Correlating Descriptions of Bug and Code Changes for Evaluating Patch Correctness. In2022 37th IEEE/ACM International Conference on Automated Software Engineering

  61. [61]

    Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation.ACM Transactions on Software Engineering and Methodology28, 4 (2019), 1–29

  62. [62]

    Ilya Utkin, Egor Spirin, Egor Bogomolov, and Timofey Bryksin. 2022. Evaluating the Impact of Source Code Parsers on ML4SE Models.arXiv preprint arXiv:2206.08713(2022)

  63. [63]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. InAdvances in Neural Information Processing Systems. 5998–6008

  64. [64]

    Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks.STAT1050 (2017), 20

  65. [65]

    Minjie Yu Wang. 2019. Deep graph library: Towards efficient and scalable deep learning on graphs. InICLR Workshop on Representation Learning on Graphs and Manifolds

  66. [66]

    Ruixin Wang, Zhongkai Zhao, Le Fang, Nan Jiang, Yiling Lou, Lin Tan, and Tianyi Zhang. 2025. Show Me Why It’s Correct: Saving 1/3 of Debugging Time in Program Repair with Interactive Runtime Comparison.Proceedings of the ACM on Programming Languages9, OOPSLA1 (2025), 1831–1857

  67. [67]

    Shangwen Wang, Ming Wen, Bo Lin, Hongjun Wu, Yihao Qin, Deqing Zou, Xiaoguang Mao, and Hai Jin. 2020. Automated Patch Correctness Assessment: How Far Are We?. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 968–980

  68. [68]

    Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. 2022. A systematic literature review on the use of deep learning in software engineering research.ACM Transactions on Software Engineering and Methodology31, 2 (2022), 1–58

  69. [69]

    W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa. 2016. A Survey on Software Fault Localization.IEEE Transactions on Software Engineering42, 8 (Aug. 2016), 707–740

  70. [70]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying llm-based software engineer- ing agents.Proceedings of the ACM on Software Engineering2, FSE (2025), 801–824

  71. [71]

    Chunqiu Steven Xia and Lingming Zhang. 2022. Less Training, More Repairing Please: Revisiting Automated Program Repair Via Zero-shot Learning. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 959–971

  72. [72]

    Qi Xin and Steven P Reiss. 2017. Identifying Test-suite-overfitted Patches through Test Case Generation. InProceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. 226–236

  73. [73]

    Qi Xin and Steven P Reiss. 2017. Leveraging Syntax-related Code for Automated Program Repair. InProceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. 660–670

  74. [74]

    Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, and Gang Huang. 2018. Identifying Patch Correctness in Test-based Program Repair. InProceedings of the 40th IEEE/ACM International Conference on Software Engineering. 789–799. , Vol. 1, No. 1, Article . Publication date: April 2026. On the Effectiveness of Code Representation in Deep Learning-Based Automat...

  75. [75]

    Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise Condition Synthesis for Program Repair. InProceedings of the 39th IEEE/ACM International Conference on Software Engineering. IEEE, 416–426

  76. [76]

    Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In2014 IEEE Symposium on Security and Privacy. IEEE, 590–604

  77. [77]

    Bo Yang and Jinqiu Yang. 2020. Exploring the Differences between Plausible and Correct Patches at Fine-grained Level. InProceedings of the 2nd IEEE International Workshop on Intelligent Bug Fixing. IEEE, 1–8

  78. [78]

    Jun Yang, Yuehan Wang, Yiling Lou, Ming Wen, and Lingming Zhang. 2023. A Large-Scale Empirical Review of Patch Correctness Checking Approaches. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1203–1215

  79. [79]

    Yanming Yang, Xin Xia, David Lo, and John Grundy. 2022. A Survey on Deep Learning for Software Engineering. Comput. Surveys54, 10s (2022), 1–73

  80. [80]

    He Ye, Jian Gu, Matias Martinez, Thomas Durieux, and Martin Monperrus. 2022. Automated Classification of Overfitting Patches With Statically Extracted Code Features.IEEE Transactions on Software Engineering48, 8 (2022), 2920–2938

Showing first 80 references.