pith. machine review for the scientific record.

arxiv: 2605.07957 · v1 · submitted 2026-05-08 · 💻 cs.SE

Recognition: no theorem link

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:54 UTC · model grok-4.3

classification 💻 cs.SE
keywords test code fault localization · LLM-based debugging · retrieval-augmented generation · continuous integration · software testing · fault pattern matching · test script analysis

The pith

SPARK retrieves similar past test failures to annotate suspicious lines and guide LLMs toward more accurate fault locations in new failing tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SPARK to address test code fault localization, a problem where faults in large system test scripts are hard to find using only error messages and logs. It builds a corpus of fault-labeled cases from continuous integration runs and, for each new failure, pulls the most similar ones to mark likely buggy lines in the current script. These annotations steer the language model's reasoning process. The result is higher success in pinpointing faults, especially when several bugs exist in one test, without increasing token counts or inference time.
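The paper's code is not reproduced on this page, so the following is a minimal sketch of the retrieve-then-annotate loop described above. The token-overlap similarity, the 0.5 threshold, and every name in it are hypothetical stand-ins, not SPARK's actual design choices.

from dataclasses import dataclass

def token_overlap(a: str, b: str) -> float:
    # Jaccard overlap of whitespace tokens; a placeholder similarity metric.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

@dataclass
class FaultCase:
    script_lines: list[str]   # historical test script from the CI corpus
    error_message: str        # diagnostic signal captured at failure time
    faulty_lines: set[int]    # indices of lines labeled as the actual faults

def retrieve(corpus: list[FaultCase], script: list[str], error: str, k: int = 3):
    # Rank historical cases by similarity to the new failure; keep the top k.
    def score(case: FaultCase) -> float:
        return (token_overlap("\n".join(script), "\n".join(case.script_lines))
                + token_overlap(error, case.error_message)) / 2
    return sorted(corpus, key=score, reverse=True)[:k]

def annotate(script: list[str], retrieved: list[FaultCase], threshold: float = 0.5):
    # Mark lines of the new test that resemble labeled faulty lines in the
    # retrieved cases. Only short markers are added, never whole retrieved
    # tests, which is how prompt length stays close to the baseline.
    out = []
    for line in script:
        hit = any(token_overlap(line, case.script_lines[j]) >= threshold
                  for case in retrieved for j in case.faulty_lines)
        out.append(line + "  # SUSPICIOUS: matches a past fault pattern" if hit else line)
    return out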

Core claim

SPARK integrates accumulated debugging knowledge from CI environments into LLM-based TCFL by retrieving similar fault-labeled test cases from a knowledge corpus and selectively annotating suspicious lines of the failing test based on their similarity to previously observed fault patterns. These annotations guide the LLM's reasoning while maintaining scalability and avoiding prompt-length explosion. On three industrial datasets of real-world faulty Python test cases, SPARK identifies more correct faulty locations than the existing LLM-based baseline, particularly in complex multi-fault cases, while keeping inference cost and token usage comparable.

What carries the argument

The selective annotation step, which transfers fault labels from retrieved similar cases onto suspicious lines in the target failing test to focus the LLM's attention.

If this is right

  • More correct faulty locations are identified in complex tests that contain multiple faults.
  • Fault localization effectiveness rises while inference cost and token usage stay comparable to the unaugmented baseline.
  • The approach scales to large test suites without causing prompt-length problems that plague naive retrieval methods.
  • It works on real industrial Python test cases drawn from different software products.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-and-annotation idea could be tested on non-Python test languages if equivalent fault-labeled corpora are collected.
  • Combining the annotations with additional signals such as execution traces might further narrow the search space in black-box settings.
  • Maintaining an evolving CI corpus could allow the system to improve automatically as new faults are discovered and labeled.

Load-bearing premise

That cases retrieved from the CI corpus will share fault patterns accurate enough to annotate the new test without adding misleading noise that hurts the LLM's reasoning.
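To make the failure mode concrete, suppose, purely as an assumption (the material reviewed here does not specify the metric), that retrieval keys on error-message similarity. Two failures can then look identical at the surface while having unrelated root causes:

from difflib import SequenceMatcher

def msg_similarity(a: str, b: str) -> float:
    # Character-level similarity ratio; a stand-in for whatever metric SPARK uses.
    return SequenceMatcher(None, a, b).ratio()

# Identical surface errors, different causes: a wrong assertion constant in
# one test versus a stale fixture URL set up many lines earlier in the other.
err_new  = "AssertionError: expected status 200, got 404"
err_past = "AssertionError: expected status 200, got 404"

print(msg_similarity(err_new, err_past))  # 1.0, so the past case is retrieved
# The label transferred from the past case only helps if the causal fault
# pattern matches too, which is exactly the premise under test.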

What would settle it

Run the same three industrial datasets through the baseline LLM approach but with the annotation step removed or replaced by random line marks, then check whether the number of correctly localized faults drops, especially on the multi-fault subset.
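A sketch of that ablation harness, assuming a hypothetical wrapper localize(prompt_lines, error) around the LLM and reusing the equally hypothetical annotate and retrieve helpers sketched earlier; none of this is the paper's published evaluation code.

import random

def mark_random(script: list[str], n: int = 3):
    # Control condition: annotate n lines chosen uniformly at random.
    idxs = set(random.sample(range(len(script)), min(n, len(script))))
    return [line + "  # SUSPICIOUS" if i in idxs else line
            for i, line in enumerate(script)]

def run_condition(dataset, build_input):
    # Fraction of cases where the LLM's predicted lines hit a true fault.
    correct = 0
    for script, error, true_faults in dataset:
        predicted = localize(build_input(script, error), error)  # assumed LLM wrapper
        correct += bool(set(predicted) & set(true_faults))
    return correct / len(dataset)

# Same datasets, three prompt conditions:
#   acc_spark  = run_condition(data, lambda s, e: annotate(s, retrieve(corpus, s, e)))
#   acc_plain  = run_condition(data, lambda s, e: s)
#   acc_random = run_condition(data, lambda s, e: mark_random(s))
# If acc_spark barely exceeds acc_random, especially on the multi-fault
# subset, the retrieved labels carry no real signal.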

Figures

Figures reproduced from arXiv: 2605.07957 by Golnaz Gharachorlu, Lionel C. Briand, Mahsa Panahandeh, Ruifeng Gao, Ruiyuan Wan.

Figure 1
Figure 1: A synthetic example showing a faulty test case, its error message, and the list of lines ranked by their likelihood of being faulty.
Figure 2
Figure 2: The baseline TCFL’s prompt template for test code fault localization [59]. Text enclosed in curly braces ({ }) represents variable placeholders dynamically filled during the prompting process. Our experimental results (see Section 4) indicate that this single-modality guidance limits the effectiveness of the baseline TCFL approach across different datasets, especially for system-level test scripts. For exa…
Figure 3
Figure 3: Example workflow of SPARK.
Figure 4
Figure 4: Prompt template for SPARK with the annotator module disabled, in which faulty lines retrieved from similar test cases are included as a separate section. The highlighted text presents additional instructions and information included in this template that are not present in the baseline TCFL’s prompt template shown in Figure 2.
Figure 5
Figure 5: A directive prompt template for test code fault localization, explicitly instructing the LLM to start with investigating the …
Original abstract

Software failures remain a major challenge in modern software development, and identifying the code elements responsible for failures is a time-consuming debugging task. While extensive research has focused on fault localization in the system under test (SUT), failures can also originate from faulty system test scripts. This problem, known as Test Code Fault Localization (TCFL), has received significantly less attention despite its importance in continuous integration (CI) environments where large test suites are executed frequently. TCFL is particularly challenging because it typically operates under black-box conditions, relies on limited diagnostic signals such as error messages and partial logs, and involves large system-level test scripts that expand the fault localization search space. In this paper, we propose SPARK, a framework that integrates accumulated debugging knowledge from continuous integration (CI) environments into Large Language Model (LLM)-based TCFL. Given a newly observed failing test case, SPARK retrieves similar fault-labeled test cases from a debugging knowledge corpus and selectively annotates suspicious lines of the failing test based on their similarity to previously observed fault patterns. These annotations guide the LLM's reasoning while maintaining scalability and avoiding the prompt-length explosion common to naive retrieval-augmented approaches. We evaluate SPARK on three industrial datasets containing real-world faulty Python test cases from different software products. The results show that SPARK consistently improves fault localization effectiveness compared to the existing LLM-based TCFL baseline while maintaining comparable inference cost and token usage. In particular, the approach advances the state of the art by identifying more correct faulty locations in complex test cases containing multiple faults.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SPARK, a retrieval-augmented framework for LLM-based Test Code Fault Localization (TCFL). Given a failing test, SPARK retrieves similar fault-labeled cases from a CI debugging knowledge corpus and selectively annotates suspicious lines based on pattern similarity; these annotations are then provided to the LLM to improve localization. The approach is evaluated on three industrial datasets of real-world faulty Python test cases, with the central claim that SPARK yields consistent gains in fault-localization effectiveness over an existing LLM-based TCFL baseline, especially on complex multi-fault tests, while preserving comparable inference cost and token usage.

Significance. If the reported gains prove robust, SPARK would represent a practical advance in an under-studied area of test-script debugging within CI pipelines. By leveraging historical fault patterns without naive retrieval-induced prompt bloat, the method could improve LLM reasoning on large system-level tests where diagnostic signals are limited.

major comments (2)
  1. [Method / SPARK Framework] The method section provides no formal definition of the similarity metric, no pseudocode for the retrieval or line-selection procedure, and no explicit handling of partial matches when a test contains multiple independent faults. This directly underpins the central claim that retrieved annotations improve rather than degrade LLM output; without these details it is impossible to assess whether surface-level similarity (e.g., token overlap or error strings) reliably identifies causal fault locations.
  2. [Evaluation / Results] The evaluation reports “consistent improvement” and “more correct faulty locations in complex test cases containing multiple faults,” yet supplies no numerical metrics (e.g., Top-1/Top-5 accuracy, EXAM score), no statistical significance tests, no ablation on the annotation component, and no breakdown by number of faults per test. These omissions make it impossible to verify the robustness of the multi-fault claim or to compare effect sizes against the baseline.
minor comments (2)
  1. [Abstract] The abstract states that annotations “guide the LLM’s reasoning while maintaining scalability,” but the paper never quantifies prompt-length growth or token usage beyond the qualitative claim of “comparable” cost.
  2. [Preliminaries / Notation] Notation for the knowledge corpus, similarity function, and annotation mask is introduced without a dedicated notation table or consistent symbols across figures and text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to improve clarity in the method description and rigor in the evaluation. We address each major comment below and will incorporate the suggested changes in the revised manuscript.

Point-by-point responses
  1. Referee: [Method / SPARK Framework] The method section provides no formal definition of the similarity metric, no pseudocode for the retrieval or line-selection procedure, and no explicit handling of partial matches when a test contains multiple independent faults. This directly underpins the central claim that retrieved annotations improve rather than degrade LLM output; without these details it is impossible to assess whether surface-level similarity (e.g., token overlap or error strings) reliably identifies causal fault locations.

    Authors: We agree that a formal definition of the similarity metric, pseudocode, and explicit discussion of multi-fault handling would strengthen the method section and aid reproducibility. In the revised manuscript we will add a formal definition of the similarity metric (based on pattern similarity between the current failing test and historical fault-labeled cases), include pseudocode for the retrieval and selective annotation steps, and explain how partial matches are managed: each line is annotated independently according to its similarity to observed fault patterns, enabling the framework to surface multiple faults without requiring an exact overall test match. These additions will directly support the claim that the annotations improve LLM localization. revision: yes

  2. Referee: [Evaluation / Results] The evaluation reports “consistent improvement” and “more correct faulty locations in complex test cases containing multiple faults,” yet supplies no numerical metrics (e.g., Top-1/Top-5 accuracy, EXAM score), no statistical significance tests, no ablation on the annotation component, and no breakdown by number of faults per test. These omissions make it impossible to verify the robustness of the multi-fault claim or to compare effect sizes against the baseline.

    Authors: We acknowledge that the evaluation would benefit from greater quantitative detail and additional analyses. In the revised manuscript we will expand the results section to report Top-1 and Top-5 accuracy as well as EXAM scores for SPARK versus the baseline on all three datasets, include statistical significance testing, present an ablation study isolating the selective annotation component, and provide a breakdown of performance by number of faults per test (single-fault versus multi-fault cases). These changes will allow readers to verify the reported gains and effect sizes more rigorously. revision: yes
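For reference, the two headline metrics have standard definitions in the fault localization literature. A minimal sketch under those standard definitions, not the paper's evaluation code:

def top_k_accuracy(ranked: list[int], true_faults: set[int], k: int) -> bool:
    # True if any genuinely faulty line appears among the top-k suspects.
    return bool(set(ranked[:k]) & true_faults)

def exam_score(ranked: list[int], true_faults: set[int], total_lines: int) -> float:
    # Fraction of the test examined before reaching the first true fault
    # (lower is better); a fault never ranked counts as examined last.
    for position, line in enumerate(ranked, start=1):
        if line in true_faults:
            return position / total_lines
    return 1.0

ranked = [12, 3, 47, 8]   # suspect line numbers, most suspicious first
print(top_k_accuracy(ranked, {47}, k=1))         # False
print(top_k_accuracy(ranked, {47}, k=5))         # True
print(exam_score(ranked, {47}, total_lines=50))  # 0.06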

Circularity Check

0 steps flagged

No circularity: empirical retrieval method relies on external corpus and direct evaluation

Full rationale

The paper describes SPARK as a retrieval-augmented framework that pulls similar fault-labeled cases from an external CI debugging corpus, selectively annotates lines in a new failing test, and feeds the result to an LLM for localization. Evaluation is performed on three separate industrial datasets with real faulty Python tests, reporting improvements over an LLM baseline in effectiveness metrics while holding inference cost constant. No equations, fitted parameters, or first-principles derivations appear in the provided text; the central claim is supported by empirical comparison rather than any self-referential definition, uniqueness theorem, or ansatz smuggled via self-citation. The method is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that similarity between test cases reliably indicates transferable fault patterns and that LLM reasoning benefits from such annotations without prompt overload.

axioms (1)
  • domain assumption: Retrieved similar fault-labeled test cases provide useful and non-misleading annotations for guiding LLM fault localization in new tests
    This assumption underpins the selective annotation step and is not independently verified in the abstract.

pith-pipeline@v0.9.0 · 5595 in / 1272 out tokens · 55237 ms · 2026-05-11T02:54:22.083351+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 2 internal anchors

  1. [1]

    Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. 2006. An Evaluation of Similarity Coefficients for Software Fault Localization. In12th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2006), 18-20 December, 2006, University of California, Riverside, USA. IEEE Computer Society, 39–46. doi:10.1109/PRDC.2006.18

  2. [2]

    Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., USA

  3. [3]

    Qurat Ul Ain, Wasi Haider Butt, Muhammad Waseem Anwar, Farooque Azam, and Bilal Maqbool. 2019. A Systematic Review on Code Clone Detection.IEEE Access7 (2019), 86121–86144. doi:10.1109/ACCESS.2019.2918202

  4. [4]

    Benoit Baudry, Franck Fleurey, Jean-Marc Jézéquel, and Yves Le Traon. 2005. Automatic Test Case Optimization: A Bacteriologic Algorithm.IEEE Softw.22, 2 (2005), 76–82. doi:10.1109/MS.2005.30

  5. [5]

    Marcel Böhme, Ezekiel O. Soremekun, Sudipta Chattopadhyay, Emamurho Ugherughe, and Andreas Zeller. 2017. Where is the bug and how is it fixed? an experiment with practitioners. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (Paderborn, Germany) (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 117...

  6. [6]

    Yuriy Brun, Saikat Chakraborty, Claire Le Goues, Corina Păsăreanu, and Adish Singla. 2026. Automatically Engineering Trusted Software: A Research Roadmap.ACM Trans. Softw. Eng. Methodol.(March 2026). doi:10.1145/3779132 Just Accepted

  7. [7]

    Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhiming Ma, and Hang Li. 2009. Ranking measures and loss functions in learning to rank. In Proceedings of the 23rd International Conference on Neural Information Processing Systems (Vancouver, British Columbia, Canada) (NIPS’09). Curran Associates Inc., Red Hook, NY, USA, 315–323

  8. [8]

    Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. 2024. Comments as natural logic pivots: Improve code generation via comment perspective.arXiv preprint arXiv:2404.07549(2024)

  9. [9]

    Higor A de Souza, Marcos L Chaim, and Fabio Kon. 2016. Spectrum-based software fault localization: A survey of techniques, advances, and challenges.arXiv preprint arXiv:1607.04347(2016)

  10. [10]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024. A Survey on In-context Learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Comput...

  11. [11]

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. arXiv e-prints (2024). arXiv:2401.08281 [cs.LG]

  12. [12]

    Bin Du, Xiaolan Kang, Hexiang Xu, Yonghao Wu, and Yong Liu. 2025. Leveraging Retrieval Augmented Generation to Enhance LLM-Based Fault Localization for Novice Programs. In2025 25th International Conference on Software Quality, Reliability and Security (QRS). 46–56

  13. [13]

    Paul M Duvall, Steve Matyas, and Andrew Glover. 2007. Continuous integration: improving software quality and reducing risk. Pearson Education

  14. [14]

    Federico Errica, Giuseppe Siracusano, Davide Sanvito, and Roberto Bifulco. 2024. What did I do wrong? Quantifying LLMs’ sensitivity and consistency to prompt engineering. arXiv preprint arXiv:2406.12334 (2024)

  15. [15]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. InFindings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational L...

  16. [16]

    Ruslan Galiullin and Yegor Bugayenko. 2024. Code Quality Analysis: Exploring Blank Lines as Indicators of Increased Code Complexity.Zenodo (2024)

  17. [17]

    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. Unixcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850(2022)

  18. [18]

    Neha Gupta, Arun Sharma, and Manoj Kumar Pachariya. 2019. An Insight Into Test Case Optimization: Ideas and Trends With Future Perspectives. IEEE Access7 (2019), 22310–22327. doi:10.1109/ACCESS.2019.2899471

  19. [19]

    Ahmed E. Hassan. 2008. The road ahead for Mining Software Repositories . In2008 IEEE International Conference on Software Maintenance. IEEE Computer Society, Los Alamitos, CA, USA, 48–57. doi:10.1109/FOSM.2008.4659248

  20. [20]

    Hao Hu, Hongyu Zhang, Jifeng Xuan, and Weigang Sun. 2014. Effective Bug Triage Based on Historical Bug-Fix Information. In2014 IEEE 25th International Symposium on Software Reliability Engineering. 122–132. doi:10.1109/ISSRE.2014.17

  21. [21]

    Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, and Yao Qin. 2025. Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs.arXiv preprint arXiv:2509.01790(2025)

  22. [22]

    Xuan Huo, Ming Li, and Zhi-Hua Zhou. 2016. Learning Unified Features from Natural and Programming Languages for Locating Buggy Source Code. InInternational Joint Conference on Artificial Intelligence. https://api.semanticscholar.org/CorpusID:18198301

  23. [23]

    Aaron Imani, Mohammad Moshirpour, and Iftekhar Ahmed. 2025. Inside Out: Uncovering How Comment Internalization Steers LLMs for Better or Worse.arXiv preprint arXiv:2512.16790(2025)

  24. [24]

    Darryl Jarman, Jeffrey Berry, Riley Smith, Ferdian Thung, and David Lo. 2022. Legion: Massively Composing Rankers for Improved Bug Localization at Adobe.IEEE Transactions on Software Engineering48, 8 (2022), 3010–3024. doi:10.1109/TSE.2021.3075215

  25. [25]

    Suhwan Ji, Sanghwa Lee, Changsup Lee, Yo-Sub Han, and Hyeonseung Im. 2025. Impact of Large Language Models of Code on Fault Localization. In 2025 IEEE Conference on Software Testing, Verification and Validation (ICST). 302–313. doi:10.1109/ICST62969.2025.10989036

  26. [26]

    James A. Jones, Mary Jean Harrold, and John T. Stasko. 2002. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA, Will Tracz, Michal Young, and Jeff Magee (Eds.). ACM, 467–477. doi:10.1145/581339.581397

  27. [27]

    René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (San Jose, CA, USA) (ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 437–440. doi:10.1145/2610384.2628055

  28. [28]

    Sonia K Katyal. 2018. The paradox of source code secrecy.Cornell L. Rev.104 (2018), 1183

  29. [29]

    Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, Xin Zhou, Enzhi Wang, and Xiaohang Dong. 2024. Better Zero-Shot Reasoning with Role-Play Prompting. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico C...

  30. [30]

    Tuan Manh Lai, Trung Bui, and Sheng Li. 2018. A Review on Deep Learning Techniques Applied to Answer Selection. InProceedings of the 27th International Conference on Computational Linguistics, Emily M. Bender, Leon Derczynski, and Pierre Isabelle (Eds.). Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2132–2144. https://aclanthology....

  31. [31]

    An Ngoc Lam, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2017. Bug Localization with Combination of Deep Learning and Information Retrieval. In2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). 218–229. doi:10.1109/ICPC.2017.24

  32. [32]

    V. I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals.Soviet Physics Doklady10 (Feb. 1966), 707

  33. [33]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. InProceedings of the 34th International Conference on Neural Information Processing Systems(Van...

  34. [34]

    Hongyan Li, Weifeng Sun, Meng Yan, Ling Xu, Qiang Li, Xiaohong Zhang, and Hongyu Zhang. 2025. Retrieval-Augmented Fine-Tuning for Improving Retrieve-and-Edit Based Assertion Generation.IEEE Transactions on Software Engineering51, 5 (2025), 1591–1614. doi:10.1109/TSE.2025.3558403

  35. [35]

    Xia Li, Jiajun Jiang, Samuel Benton, Yingfei Xiong, and Lingming Zhang. 2021. A Large-scale Study on API Misuses in the Wild. In2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). 241–252. doi:10.1109/ICST49551.2021.00034

  36. [36]

    Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang. 2025. A Knowledge Enhanced Large Language Model for Bug Localization.Proc. ACM Softw. Eng.2, FSE, Article FSE086 (June 2025), 23 pages. doi:10.1145/3729356

  37. [37]

    Yi Li, Shaohua Wang, and Tien N. Nguyen. 2021. Fault Localization with Code Coverage Representation Learning. In43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 661–673. doi:10.1109/ICSE43902.2021.00067

  38. [38]

    Zheng Li, Xue Bai, Haifeng Wang, and Yong Liu. 2020. IRBFL: An Information Retrieval Based Fault Localization Approach. In 44th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2020, Madrid, Spain, July 13-17, 2020. IEEE, 991–996. doi:10.1109/COMPSAC48688.2020.00142

  39. [39]

    Hongliang Liang, Dengji Hang, and Xiangyu Li. 2022. Modeling function-level interactions for file-level bug localization.Empirical Software Engineering27, 7 (2022), 186

  40. [40]

    Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What Makes Good In-Context Examples for GPT-3?. InProceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Eneko Agirre, Marianna Apidianaki, and Ivan Vulić (Eds.). Association ...

  41. [41]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.ACM Comput. Surv.55, 9, Article 195 (Jan. 2023), 35 pages. doi:10.1145/3560815

  42. [42]

    Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-BERT: Enabling Language Representation with Knowledge Graph.Proceedings of the AAAI Conference on Artificial Intelligence34, 03 (Apr. 2020), 2901–2908. doi:10.1609/aaai.v34i03.5681

  43. [43]

    Yiling Lou, Qihao Zhu, Jinhao Dong, Xia Li, Zeyu Sun, Dan Hao, Lu Zhang, and Lingming Zhang. 2021. Boosting coverage-based fault localization via graph-based representation learning. InESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, Diomidis...

  44. [44]

    Yu A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 42, 4 (April 2020), 824–836. doi:10.1109/TPAMI.2018.2889473

  45. [45]

    Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK. http://nlp.stanford.edu/IR-book/information-retrieval-book.html

  46. [46]

    Elijah Mansur, Johnson Chen, Muhammad Anas Raza, and Mohammad Wardat. 2024. RAGFix: Enhancing LLM Code Repair Using RAG and Stack Overflow Posts. In2024 IEEE International Conference on Big Data (BigData). 7491–7496. doi:10.1109/BigData62323.2024.10825785

  47. [47]

    Glenford J. Myers, Corey Sandler, and Tom Badgett. 2011. The Art of Software Testing (3rd ed.). Wiley Publishing

  48. [48]

    Noor Nashid, Mifta Sintaha, and Ali Mesbah. 2023. Retrieval-Based Prompt Selection for Code-Related Few-Shot Learning. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2450–2462. doi:10.1109/ICSE48619.2023.00205

  49. [49]

    Srinivas Nidhra and Jagruthi Dondeti. 2012. Black box and white box testing techniques-a literature review.International Journal of Embedded Systems and Applications (IJESA)2, 2 (2012), 29–50

  50. [50]

    Owain Parry, Gregory Kapfhammer, Michael Hilton, and Phil McMinn. 2025. Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE ’25). Association for Computing Machinery, New York, NY, USA, 476–487. doi:10.1145/3756681.3756945

  51. [51]

    Francisco Ponce, Roberto Verdecchia, Breno Miranda, and Jacopo Soldani. 2025. Microservices testing: A systematic literature review.Information and Software Technology188 (2025), 107870. doi:10.1016/j.infsof.2025.107870

  52. [52]

    F. Pukelsheim. 1994. The Three Sigma Rule. The American Statistician 48, 2 (1994), 88–91

  53. [53]

    Yihao Qin, Shangwen Wang, Yiling Lou, Jinhao Dong, Kaixin Wang, Xiaoling Li, and Xiaoguang Mao. 2024. AgentFL: Scaling LLM-based Fault Localization to Project-Level Context.CoRRabs/2403.16362 (2024). arXiv:2403.16362 doi:10.48550/ARXIV.2403.16362

  54. [54]

    Yihao Qin, Shangwen Wang, Yiling Lou, Jinhao Dong, Kaixin Wang, Xiaoling Li, and Xiaoguang Mao. 2025. SoapFL: A Standard Operating Procedure for LLM-Based Method-Level Fault Localization.IEEE Transactions on Software Engineering51, 4 (2025), 1173–1187. doi:10.1109/TSE.2025.3543187

  55. [55]

    Steven Raemaekers, Arie van Deursen, and Joost Visser. 2011. Exploring risks in the usage of third-party libraries. In Proceedings of the BElgian-NEtherlands software eVOLution seminar, Vol. 31

  56. [56]

    Moeketsi Raselimo and Bernd Fischer. 2024. Spectrum-based rule- and item-level localization of faults in context-free grammars.J. Syst. Softw.215 (2024), 112067. doi:10.1016/J.JSS.2024.112067

  57. [57]

    Michael Rath, David Lo, and Patrick Mäder. 2018. Analyzing requirements and traceability information to improve bug localization. In Proceedings of the 15th International Conference on Mining Software Repositories (Gothenburg, Sweden) (MSR ’18). Association for Computing Machinery, New York, NY, USA, 442–453. doi:10.1145/3196398.3196415

  58. [58]

    S.E. Robertson and K. Spärck Jones. 1994. Simple, proven approaches to text retrieval. Technical Report UCAM-CL-TR-356. University of Cambridge, Computer Laboratory. doi:10.48456/tr-356

  59. [59]

    Ahmadreza Saboor Yaraghi, Golnaz Gharachorlu, Sakina Fatima, Lionel C Briand, Ruiyuan Wan, and Ruifeng Gao. 2025. Black-Box Test Code Fault Localization Driven by Large Language Models and Execution Estimation.arXiv e-prints(2025), arXiv–2506

  60. [60]

    David Saff and Michael D Ernst. 2003. Reducing wasted development time via continuous testing. In14th International Symposium on Software Reliability Engineering, 2003. ISSRE 2003.IEEE, 281–292

  61. [61]

    Qusay Idrees Sarhan and Árpád Beszédes. 2022. A Survey of Challenges in Spectrum-Based Software Fault Localization.IEEE Access10 (2022), 10618–10639. doi:10.1109/ACCESS.2022.3144079

  62. [62]

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation.IEEE Transactions on Software Engineering50, 1 (2023), 85–105

  63. [64]

    Xinyu Shi, Zhenhao Li, and An Ran Chen. 2025. Enhancing LLM-based Fault Localization with a Functionality-Aware Retrieval-Augmented Generation Framework. arXiv preprint arXiv:2509.20552 (2025)

  64. [65]

    Sheldon Smith, Ethan Robinson, Timmy Frederiksen, Trae Stevens, Tomas Cerny, Miroslav Bures, and Davide Taibi. 2023. Benchmarks for End-to-End Microservices Testing. In2023 IEEE International Conference on Service-Oriented System Engineering (SOSE). 60–66. doi:10.1109/SOSE58276.2023.00013

  65. [66]

    Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, et al. 2024. Code needs comments: Enhancing code llms with comment augmentation.arXiv preprint arXiv:2402.13013(2024)

  66. [67]

    Xuezhi Song, Yun Lin, Siang Hwee Ng, Yijian Wu, Xin Peng, Jin Song Dong, and Hong Mei. 2022. RegMiner: towards constructing a large regression dataset from code evolution history. InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis(Virtual, South Korea)(ISSTA 2022). Association for Computing Machinery, New York, ...

  67. [68]

    Daniela Steidl, Benjamin Hummel, and Elmar Juergens. 2013. Quality analysis of source code comments. In2013 21st International Conference on Program Comprehension (ICPC). 83–92. doi:10.1109/ICPC.2013.6613836

  68. [69]

    Hanzhuo Tan, Qi Luo, Ling Jiang, Zizheng Zhan, Jing Li, Haotian Zhang, and Yuqun Zhang. 2025. Prompt-based Code Completion via Multi-Retrieval Augmented Generation.ACM Trans. Softw. Eng. Methodol.(March 2025). doi:10.1145/3725812 Just Accepted

  69. [70]

    David A. Tomassi, Naji Dmeiri, Yichen Wang, Antara Bhowmick, Yen-Chuan Liu, Premkumar T. Devanbu, Bogdan Vasilescu, and Cindy Rubio-González. 2019. BugSwarm: Mining and Continuously Growing a Dataset of Reproducible Failures and Fixes. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 339–349. doi:10.1109/ICSE.2019.00048

  70. [71]

    Arash Vahabzadeh, Amin Milani Fard, and Ali Mesbah. 2015. An empirical study of bugs in test code. In2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). 101–110. doi:10.1109/ICSM.2015.7332456

  71. [72]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.Advances in neural information processing systems30 (2017)

  72. [73]

    Bei Wang, Ling Xu, Meng Yan, Chao Liu, and Ling Liu. 2022. Multi-Dimension Convolutional Neural Network for Bug Localization.IEEE Transactions on Services Computing15, 3 (2022), 1649–1663. doi:10.1109/TSC.2020.3006214

  73. [74]

    Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li. 2024. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs.NPJ digital medicine7, 1 (2024), 41

  74. [75]

    Shaowei Wang, David Lo, and Julia Lawall. 2014. Compositional Vector Space Models for Improved Bug Localization. In2014 IEEE International Conference on Software Maintenance and Evolution. 171–180. doi:10.1109/ICSME.2014.39

  75. [76]

    Weishi Wang, Yue Wang, Shafiq Joty, and Steven C.H. Hoi. 2023. RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(San Francisco, CA, USA)(ESEC/FSE 2023). Association for Computing Machin...

  76. [77]

    Ying Wang, Bihuan Chen, Kaifeng Huang, Bowen Shi, Congying Xu, Xin Peng, Yijian Wu, and Yang Liu. 2020. An Empirical Study of Usages, Updates and Risks of Third-Party Libraries in Java Projects. InIEEE International Conference on Software Maintenance and Evolution, ICSME 2020, Adelaide, Australia, September 28 - October 2, 2020. IEEE, 35–45. doi:10.1109/I...

  77. [78]

    Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni. 2021. Generalizing from a Few Examples: A Survey on Few-shot Learning. ACM Comput. Surv. 53, 3 (2021), 63:1–63:34. doi:10.1145/3386252

  78. [79]

    Ming Wen, Junjie Chen, Yongqiang Tian, Rongxin Wu, Dan Hao, Shi Han, and Shing-Chi Cheung. 2021. Historical Spectrum Based Fault Localization. IEEE Trans. Software Eng.47, 11 (2021), 2348–2368. doi:10.1109/TSE.2019.2948158

  79. [80]

    Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. CoRR abs/2302.11382 (2023). arXiv:2302.11382 doi:10.48550/ARXIV.2302.11382

  80. [81]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

Showing first 80 references.