Efficient Black-Box Fault Localization for System-Level Test Code Using Large Language Models

arxiv: 2506.19045 · v4 · submitted 2025-06-23 · 💻 cs.SE

Ahmadreza Saboor Yaraghi, Golnaz Gharachorlu, Sakina Fatima, Lionel C. Briand, Ruiyuan Wan, Ruifeng Gao

Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3

classification 💻 cs.SE

keywords: fault localization, large language models, system-level test code, black-box debugging, static analysis, execution trace pruning, test code debugging

The pith

Pruned traces from one failure log let LLMs rank faulty statements in system-level test code without repeated executions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a fully static LLM-driven method can localize faults in complex system-level test code by estimating a pruned execution trace from a single failure log. Three novel algorithms identify only the statements likely involved in the failure, and this pruned trace plus the error message is fed to the LLM to rank potential faulty locations at function, block, and line levels. A sympathetic reader would care because traditional fault localization depends on repeated test runs that become impractical for non-deterministic failures or high-cost executions, and many real failures stem from errors in the test code itself rather than the system under test. The black-box design requires no access to the system-under-test source code, making the technique usable on large industrial Python test suites. Evaluation on faulty test cases shows estimated traces match actual ones at roughly 90% F1 while cutting LLM inference time by up to 34% and delivering equal or better accuracy with over 85% less time and 93% fewer tokens than prior LLM-guided approaches.

Core claim

The central claim is that a black-box, execution-free technique for system-level test code fault localization can match or exceed prior LLM-guided accuracy by using three novel algorithms to build a pruned trace estimate from one failure log. This trace, combined with the error message, supplies the LLM with enough context to rank faulty statements correctly. The method works on complex test scripts without needing the system-under-test source code and was evaluated on an industrial dataset of faulty Python test cases not seen in LLM pre-training.

What carries the argument

Three novel algorithms that identify statements likely involved in the failure to produce a pruned execution trace estimate from a single failure log.
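To make the idea of reading a trace out of a single log concrete, here is a deliberately naive sketch. It is not one of the paper's three algorithms, whose decision rules are not described on this page. It assumes that the test code emits literal messages (via `print` or a logger) and that those messages appear verbatim in the failure log. The function name `logged_lines` and the sample test are hypothetical.

```python
import ast


def logged_lines(test_source: str, failure_log: str) -> set[int]:
    """Return line numbers of calls whose literal first argument appears in the log.

    Illustrative assumption: a call such as print("msg") or logger.info("msg")
    counts as executed when "msg" occurs verbatim in the failure log.
    """
    hits = set()
    for node in ast.walk(ast.parse(test_source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            if first.value and first.value in failure_log:
                hits.add(node.lineno)
    return hits


# Hypothetical test code and failure log, for illustration only.
TEST_SOURCE = '''\
def test_login(client):
    print("opening session")
    if client.ready:
        print("client ready")
    else:
        print("client not ready")
    client.login()
'''
FAILURE_LOG = "opening session\nclient not ready\nTimeoutError: login timed out\n"

print(sorted(logged_lines(TEST_SOURCE, FAILURE_LOG)))  # [2, 6]
```

The matched lines show which branch was taken (the `else` branch here), which is the kind of evidence a static method can use to drop statements that did not run. A real estimator would also need to handle formatted messages, loops, and calls into helper functions.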

If this is right

Estimated traces match actual execution traces with an F1 score of around 90%.
Pruning reduces LLM inference time by up to 34% with no loss in fault localization performance.
The approach works on complex test scripts that assess full system behavior without access to system-under-test source code.
It achieves equal or higher accuracy than the latest LLM-guided method while using over 85% less average inference time and 93% fewer tokens per test case.
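The trace-overlap figure above is an F1 score. A minimal sketch of how such a score could be computed follows, assuming both traces are represented as sets of executed line numbers; the paper's exact unit of comparison is not stated on this page, and the sample data is made up.

```python
def trace_f1(estimated: set[int], actual: set[int]) -> float:
    """F1 overlap between an estimated trace and the actual trace.

    Assumption: each trace is a set of executed line numbers.
    """
    true_positives = len(estimated & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(estimated)  # share of estimated lines that really ran
    recall = true_positives / len(actual)        # share of executed lines that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: nine of ten lines agree in each direction.
actual = set(range(1, 11))                     # lines 1..10 actually executed
estimated = {1, 2, 3, 4, 5, 6, 7, 8, 9, 11}    # misses line 10, wrongly adds line 11

print(round(trace_f1(estimated, actual), 3))  # 0.9
```

An F1 near 0.9 says the two sets largely agree, but it does not say which lines are missing, which is why the referee report asks separately about retention of the faulty statement.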

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The single-log pruning approach could extend to handling intermittent or non-deterministic test failures more reliably than methods that require multiple runs.
The technique might apply to test scripts in languages other than Python when similar execution logs are available.
Integration into test maintenance tools could automate initial fault ranking for engineers debugging large system test suites.

Load-bearing premise

The three novel algorithms can produce a pruned trace from a single failure log that, together with the error message, supplies the LLM with enough context to correctly identify and rank the actual faulty statements in the test code.

What would settle it

The claim would be undermined if, on the industrial dataset of faulty Python test cases, LLM rankings built from the pruned traces showed lower accuracy than the latest LLM-guided method at the line, block, or function level.
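One common way to compare ranked fault-localization output is top-k accuracy: the fraction of test cases whose true faulty element appears among the first k ranked candidates. The sketch below assumes that metric and a single faulty element per test case; this page does not state which accuracy metric the paper uses, and the identifiers and data are hypothetical.

```python
def top_k_accuracy(rankings: list[list[str]], faulty: list[str], k: int) -> float:
    """Fraction of cases whose faulty element is within the top-k ranked candidates."""
    if not faulty or len(rankings) != len(faulty):
        raise ValueError("need one ranking per test case and at least one case")
    hits = sum(1 for ranked, truth in zip(rankings, faulty) if truth in ranked[:k])
    return hits / len(faulty)


# Hypothetical line-level rankings for four failing tests.
rankings = [
    ["L10", "L4", "L7"],
    ["L3", "L8", "L2"],
    ["L5", "L1", "L9"],
    ["L6", "L2", "L4"],
]
faulty = ["L10", "L8", "L9", "L12"]

print(top_k_accuracy(rankings, faulty, k=1))  # 0.25
print(top_k_accuracy(rankings, faulty, k=3))  # 0.75
```

Running the same function on the rankings of both methods, at each granularity, would give the head-to-head comparison this section describes.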

Figures

Figures reproduced from arXiv: 2506.19045 by Ahmadreza Saboor Yaraghi, Golnaz Gharachorlu, Lionel C. Briand, Ruifeng Gao, Ruiyuan Wan, Sakina Fatima.

**Figure 2.** A faulty test code and its corresponding execution log.

**Figure 4.** For all levels, the output elements are expected to …

**Figure 4.** Examples of the requested output format for function, …

**Figure 3.** Our prompt template for test code fault localization. Text …

Original abstract

Fault localization (FL) is a critical step in debugging, which typically relies on repeated executions to pinpoint faulty code regions. However, repeated executions can be impractical in the presence of non-deterministic failures or high execution costs. While recent efforts have leveraged Large Language Models (LLMs) to aid execution-free FL, these have primarily focused on identifying faults in the system-under-test (SUT) rather than in the often complex system-level test code. However, the latter is also important, as in practice, many failures are triggered by faulty test code. To overcome these challenges, we introduce a fully static, LLM-driven approach for system-level test code fault localization (TCFL) that does not require executing the test case. Our method uses a single failure execution log to estimate the test's execution trace through three novel algorithms that identify only code statements likely involved in the failure. This pruned trace, combined with the error message, is used to prompt the LLM to rank potential faulty locations. Our black-box, system-level approach requires no access to the SUT source code and is applicable to complex test scripts that assess full system behavior. We evaluate our technique at the function, block, and line levels using an industrial dataset of faulty Python test cases that were not used in pre-training LLMs. Results show that our best-estimated traces closely match the actual traces, with an F1 score of around 90%. Additionally, pruning the complex system-level test code reduces the LLM's inference time by up to 34% without any loss in FL performance. Our method achieves equal or higher FL accuracy, requiring over 85% less average inference time per test case and 93% fewer tokens than the latest LLM-guided FL method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a static black-box pipeline for localizing faults in system-level test code via single-log trace estimation and LLM ranking, with concrete efficiency gains on industrial data.

The main thing here is a static method that localizes faults inside complex test scripts rather than the system under test. It takes one failure log, runs three new algorithms to estimate and prune the execution trace, then prompts an LLM with the pruned trace plus error message to rank likely faulty statements at function, block, and line level. No repeated executions or SUT source code needed. On an industrial Python dataset the best traces hit roughly 90% F1 against ground truth, and pruning cuts LLM inference time by up to 34% with no reported drop in localization accuracy; overall the method matches or beats the latest LLM-guided baseline while using 85% less time and 93% fewer tokens per case.
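The prompting step described in this letter can be pictured with a short sketch. The wording below is illustrative only and is not the paper's prompt template (Figure 3); the function name `build_prompt`, the trace representation, and the sample statements are assumptions.

```python
def build_prompt(pruned_trace: list[tuple[int, str]], error_message: str,
                 level: str = "line") -> str:
    """Combine a pruned trace and an error message into one ranking request.

    Assumption: the trace is a list of (line number, source text) pairs.
    """
    if level not in {"function", "block", "line"}:
        raise ValueError(f"unsupported level: {level}")
    numbered = "\n".join(f"{lineno}: {code}" for lineno, code in pruned_trace)
    return (
        "The following test code statements were likely executed before the failure:\n"
        f"{numbered}\n\n"
        f"Error message:\n{error_message}\n\n"
        f"Rank the most likely faulty locations at the {level} level."
    )


# Hypothetical pruned trace and error message.
trace = [(3, "session = open_session(host)"), (12, "client.login(session)")]
prompt = build_prompt(trace, "TimeoutError: login timed out", level="line")
print(prompt)
```

Because only the pruned statements are sent, the prompt shrinks with the trace, which is where the reported token and inference-time savings would come from.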

Referee Report

3 major / 2 minor

Summary. The paper introduces a static, black-box LLM-based method for fault localization in complex system-level test code (TCFL). Given only a single failure log and error message, three novel pruning algorithms estimate a reduced execution trace; this trace plus the error message is fed to an LLM to rank faulty statements at function, block, and line granularity. No SUT source code or repeated executions are required. On an industrial dataset of faulty Python tests, the best pruned traces achieve ~90% F1 overlap with ground-truth traces, LLM inference time drops up to 34% with no accuracy loss, and the method matches or exceeds prior LLM-guided FL accuracy while cutting average inference time by >85% and tokens by 93%.

Significance. If the pruning step reliably retains the actual faulty statements and the LLM ranking is robust, the technique offers a practical route to low-cost, execution-free debugging of non-deterministic system tests. The use of real industrial faults outside LLM pre-training data and the concrete efficiency gains are strengths that would be valuable to the SE community if the evaluation protocol is fully documented.

major comments (3)

[Evaluation / Results] The central claim that the pruned trace plus error message suffices for correct LLM ranking rests on the assumption that the three pruning algorithms never drop the root-cause statements. The reported ~90% F1 measures trace overlap but does not quantify how often the omitted 10% contains the actual fault; this must be shown explicitly (e.g., by reporting the fraction of cases where the ground-truth faulty line is retained after pruning).
[Evaluation / Results] Statistical significance, confidence intervals, and the exact number of test cases are not reported for the FL accuracy, time, and token reductions. Without these, it is impossible to assess whether the claimed 85% time and 93% token savings are reliable or could be due to selection effects in the industrial dataset.
[Approach / Trace Pruning Algorithms] The manuscript should detail the precise decision rules inside the three novel pruning algorithms (thresholds, heuristics for identifying 'likely involved' statements) and demonstrate that they preserve statements executed only on the failing path; otherwise the subsequent LLM prompt may systematically lack the root cause.
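The retention check requested in the first major comment could be computed along the following lines. This is a sketch under the assumption that each pruned trace is a set of line numbers and each test case has one ground-truth faulty line; the data shown is invented for illustration.

```python
def retention_rate(pruned_traces: list[set[int]], faulty_lines: list[int]) -> float:
    """Fraction of test cases whose ground-truth faulty line survives pruning."""
    if not faulty_lines or len(pruned_traces) != len(faulty_lines):
        raise ValueError("need one pruned trace per test case and at least one case")
    kept = sum(1 for trace, line in zip(pruned_traces, faulty_lines) if line in trace)
    return kept / len(faulty_lines)


# Hypothetical data: the second case loses its faulty line (6) during pruning.
pruned_traces = [{1, 2, 3}, {4, 5}, {7, 8, 9}, {10}]
faulty_lines = [2, 6, 9, 10]

print(retention_rate(pruned_traces, faulty_lines))  # 0.75
```

A retention rate below 1.0 bounds the accuracy any downstream ranking can reach, since the LLM cannot rank a statement it never sees.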

minor comments (2)

[Related Work / Evaluation] Add a table or paragraph explicitly comparing the new method against the 'latest LLM-guided FL method' on identical test cases, including the exact baseline name and citation.
[Evaluation] Clarify whether the industrial dataset will be released (even in anonymized form) to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our paper. We address each of the major comments below and outline the revisions we intend to make to strengthen the manuscript.

Referee: [Evaluation / Results] The central claim that the pruned trace plus error message suffices for correct LLM ranking rests on the assumption that the three pruning algorithms never drop the root-cause statements. The reported ~90% F1 measures trace overlap but does not quantify how often the omitted 10% contains the actual fault; this must be shown explicitly (e.g., by reporting the fraction of cases where the ground-truth faulty line is retained after pruning).

Authors: We agree that explicitly demonstrating the retention of root-cause statements is crucial for validating our approach. Although the ~90% F1 score suggests strong overall agreement between pruned and ground-truth traces, it does not isolate the retention rate for the specific faulty elements. In the revised manuscript, we will add a new table or subsection in the evaluation that reports the fraction of test cases where the ground-truth faulty line, block, and function are preserved in the pruned traces for each of the three algorithms. This analysis will be performed on our industrial dataset and will directly address the concern regarding potential loss of the root cause.

revision: yes

Referee: [Evaluation / Results] Statistical significance, confidence intervals, and the exact number of test cases are not reported for the FL accuracy, time, and token reductions. Without these, it is impossible to assess whether the claimed 85% time and 93% token savings are reliable or could be due to selection effects in the industrial dataset.

Authors: We thank the referee for this observation. We will revise the manuscript to include the exact number of test cases in the industrial dataset, along with statistical significance tests and confidence intervals for the FL accuracy, time, and token reduction metrics. Specifically, we plan to use bootstrap resampling to compute 95% confidence intervals and report p-values for comparisons against baseline methods. This will allow readers to better assess the robustness of our efficiency claims.

revision: yes

Referee: [Approach / Trace Pruning Algorithms] The manuscript should detail the precise decision rules inside the three novel pruning algorithms (thresholds, heuristics for identifying 'likely involved' statements) and demonstrate that they preserve statements executed only on the failing path; otherwise the subsequent LLM prompt may systematically lack the root cause.

Authors: We appreciate the referee's call for greater precision in describing our pruning algorithms. In the current manuscript, the three algorithms are outlined at a high level in Section 3. We will expand this section to provide the exact decision rules, including any thresholds and heuristics employed to determine 'likely involved' statements. Furthermore, we will include a formal argument or empirical demonstration showing that the algorithms are designed to retain statements on the failing execution path, based on the information available in the failure log. Pseudocode for each algorithm will be added to facilitate understanding and reproducibility.

revision: yes
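The bootstrap resampling mentioned in the second response can be sketched as a percentile bootstrap over per-test-case values. This is a generic illustration, not the authors' analysis: the function name, the resample count, and the sample outcomes are assumptions.

```python
import random


def bootstrap_ci(values: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-test-case outcomes: 1 = fault ranked first, 0 = not.
outcomes = [1.0] * 7 + [0.0] * 3
low, high = bootstrap_ci(outcomes)
print(f"mean={sum(outcomes) / len(outcomes):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The same function applies to per-case inference times or token counts; with few test cases the interval will be wide, which is the point of reporting it alongside the headline percentages.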

Circularity Check

0 steps flagged

No circularity; empirical evaluation is self-contained

The paper describes a static LLM-based fault localization technique for system-level test code that relies on three novel pruning algorithms applied to a single failure log plus error message. All load-bearing claims—~90% F1 trace overlap, equal-or-better FL accuracy, 85% lower inference time, and 93% fewer tokens—are presented as outcomes of an external industrial evaluation on previously unseen faulty Python test cases rather than any derivation, fitted parameter, or self-citation chain that reduces to the method’s own inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked; the reported performance metrics are measured against ground-truth traces and prior LLM-guided baselines, making the results falsifiable outside the paper’s own definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on domain assumptions about log sufficiency and introduces algorithmic parameters whose exact values are not detailed in the abstract.

free parameters (1)

trace pruning thresholds
Parameters inside the three novel algorithms that decide which statements are likely involved in the failure.

axioms (1)

domain assumption: A single failure execution log contains enough information to estimate the relevant execution trace for fault localization.
Central premise invoked when the method uses one log to build the pruned trace fed to the LLM.

pith-pipeline@v0.9.0 · 5870 in / 1273 out tokens · 34652 ms · 2026-05-19T07:28:43.124792+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Our method uses a single failure execution log to estimate the test's execution trace through three novel algorithms that identify only code statements likely involved in the failure. This pruned trace, combined with the error message, is used to prompt the LLM to rank potential faulty locations.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We propose three novel algorithms that can estimate the execution trace of a faulty test case with sufficient accuracy, while pruning information irrelevant to the failure

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 2 internal anchors

[1]

An empirical study of fault localization families and their combinations,

D. Zou, J. Liang, Y . Xiong, M. D. Ernst, and L. Zhang, “An empirical study of fault localization families and their combinations,” IEEE Trans. Software Eng. , vol. 47, no. 2, pp. 332–347, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2892102

work page doi:10.1109/tse.2019.2892102 2021
[2]

A survey on software fault localization,

W. E. Wong, R. Gao, Y . Li, R. Abreu, and F. Wotawa, “A survey on software fault localization,” IEEE Transactions on Software Engineering, vol. 42, no. 8, pp. 707–740, 2016

work page 2016
[3]

Evaluating and improving fault localization,

S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller, “Evaluating and improving fault localization,” in Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 , S. Uchitel, A. Orso, and M. P. Robillard, Eds. IEEE / ACM, 2017, pp. 609–620. [Online]. A...

work page doi:10.1109/icse.2017.62 2017
[4]

Agentfl: Scaling llm-based fault localization to project-level context,

Y . Qin, S. Wang, Y . Lou, J. Dong, K. Wang, X. Li, and X. Mao, “Agentfl: Scaling llm-based fault localization to project-level context,” CoRR, vol. abs/2403.16362, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.16362

work page doi:10.48550/arxiv.2403.16362 2024
[5]

Xueying Du et al

A. Z. H. Yang, C. Le Goues, R. Martins, and V . J. Hellendoorn, “Large language models for test-free fault localization,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 2024, pp. 17:1– 17:12. [Online]. Available: https://doi.org/10.1145/3597503.3623342

work page doi:10.1145/3597503.3623342 2024
[6]

Flexfl: Flexible and effective fault localization with open-source large language models,

C. Xu, Z. Liu, X. Ren, G. Zhang, M. Liang, and D. Lo, “Flexfl: Flexible and effective fault localization with open-source large language models,” CoRR, vol. abs/2411.10714, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2411.10714

work page doi:10.48550/arxiv.2411.10714 2024
[7]

System-level test: State of the art and challenges,

D. Appello, H. H. Chen, M. Sauer, I. Polian, P. Bernardi, and M. S. Reorda, “System-level test: State of the art and challenges,” in27th IEEE International Symposium on On-Line Testing and Robust System Design, IOLTS 2021, Torino, Italy, June 28-30, 2021 . IEEE, 2021, pp. 1–7. [Online]. Available: https://doi.org/10.1109/IOLTS52814.2021.9486708

work page doi:10.1109/iolts52814.2021.9486708 2021
[8]

An empirical study of bugs in test code,

A. Vahabzadeh, A. M. Fard, and A. Mesbah, “An empirical study of bugs in test code,” in 2015 IEEE international conference on software maintenance and evolution (ICSME) . IEEE, 2015, pp. 101–110

work page 2015
[9]

Flakyfix: Using large language models for predicting flaky test fix categories and test code repair,

S. Fatima, H. Hemmati, and L. C. Briand, “Flakyfix: Using large language models for predicting flaky test fix categories and test code repair,” IEEE Trans. Software Eng. , vol. 50, no. 12, pp. 3146–3171,

work page
[10]

Available: https://doi.org/10.1109/TSE.2024.3472476

[Online]. Available: https://doi.org/10.1109/TSE.2024.3472476

work page doi:10.1109/tse.2024.3472476 2024
[11]

Niodebugger: A novel approach to repair non-idempotent- outcome tests with llm-based agent,

K. Ke, “Niodebugger: A novel approach to repair non-idempotent- outcome tests with llm-based agent,” in 2025 IEEE/ACM 47th Interna- tional Conference on Software Engineering (ICSE) . IEEE Computer Society, 2025, pp. 762–762

work page 2025
[12]

Automated test case repair using language models,

A. Saboor Yaraghi, D. Holden, N. Kahani, and L. Briand, “Automated test case repair using language models,” IEEE Transactions on Software Engineering, vol. 51, no. 4, pp. 1104–1133, 2025

work page 2025
[13]

Utfix: Change aware unit test repairing using llm,

S. Rahman, S. Kuhar, B. Cirisci, P. Garg, S. Wang, X. Ma, A. Deoras, and B. Ray, “Utfix: Change aware unit test repairing using llm,” Proceedings of the ACM on Programming Languages , vol. 9, no. OOPSLA1, pp. 143–168, 2025

work page 2025
[14]

Boosting spectrum-based fault localization via multi-correct programs in online programming,

W. Zheng, H. Hu, T. Chen, F. Yang, X. Fan, and P. Xiao, “Boosting spectrum-based fault localization via multi-correct programs in online programming,” IEICE Trans. Inf. Syst. , vol. 107, no. 4, pp. 525–536,

work page
[15]

Available: https://doi.org/10.1587/transinf.2023edp7164

[Online]. Available: https://doi.org/10.1587/transinf.2023edp7164

work page doi:10.1587/transinf.2023edp7164
[16]

Spectrum-based rule- and item- level localization of faults in context-free grammars,

M. Raselimo and B. Fischer, “Spectrum-based rule- and item- level localization of faults in context-free grammars,” J. Syst. Softw., vol. 215, p. 112067, 2024. [Online]. Available: https: //doi.org/10.1016/j.jss.2024.112067

work page doi:10.1016/j.jss.2024.112067 2024
[17]

A survey of challenges in spectrum-based software fault localization,

Q. I. Sarhan and ´A. Besz ´edes, “A survey of challenges in spectrum-based software fault localization,” IEEE Access , vol. 10, pp. 10 618–10 639, 2022. [Online]. Available: https://doi.org/10.1109/ ACCESS.2022.3144079

work page arXiv 2022
[18]

Spectrum-based Software Fault Localization: A Survey of Techniques, Advances, and Challenges

H. A. de Souza, M. L. Chaim, and F. Kon, “Spectrum-based software fault localization: A survey of techniques, advances, and challenges,” arXiv preprint arXiv:1607.04347 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

Spectrum-based fault localization techniques application on multiple- fault programs: A review,

A. Zakari, S. Abdullahi, N. Shagari, A. B. Tambawal, N. M. Shanono, J. Z. Maitama, R. A. Rasheed, A. Adamu, and S. M. Abdulrahman, “Spectrum-based fault localization techniques application on multiple- fault programs: A review,” Global Journal of Computer Science and Technology, vol. 20, pp. 41–48, 2020

work page 2020
[20]

Isolating failure-inducing thread schedules,

J. Choi and A. Zeller, “Isolating failure-inducing thread schedules,” in Proceedings of the International Symposium on Software Testing and Analysis, ISSTA 2002, Roma, Italy, July 22-24, 2002 , P. G. Frankl, Ed. ACM, 2002, pp. 210–220. [Online]. Available: https://doi.org/10.1145/566172.566211

work page doi:10.1145/566172.566211 2002
[21]

Do system test cases grow old?

R. Feldt, “Do system test cases grow old?” in Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014, March 31 2014-April 4, 2014, Cleveland, Ohio, USA. IEEE Computer Society, 2014, pp. 343–352. [Online]. Available: https://doi.org/10.1109/ICST.2014.47

work page doi:10.1109/icst.2014.47 2014
[22]

Abstract execution: A technique for efficiently tracing programs,

J. R. Larus, “Abstract execution: A technique for efficiently tracing programs,” Softw. Pract. Exp., vol. 20, no. 12, pp. 1241–1258, 1990. [Online]. Available: https://doi.org/10.1002/spe.4380201205

work page doi:10.1002/spe.4380201205 1990
[23]

Combining code and requirements coverage with execution cost for test suite reduction,

A. Marchetto, G. Scanniello, and A. Susi, “Combining code and requirements coverage with execution cost for test suite reduction,” IEEE Trans. Software Eng., vol. 45, no. 4, pp. 363–390, 2019. [Online]. Available: https://doi.org/10.1109/TSE.2017.2777831

work page doi:10.1109/tse.2017.2777831 2019
[24]

Analysis of overhead in dynamic java performance monitoring,

V . Hork´y, J. Kotrc, P. Libic, and P. Tuma, “Analysis of overhead in dynamic java performance monitoring,” in Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016, Delft, The Netherlands, March 12-16, 2016 , A. Avritzer, A. Iosup, X. Zhu, and S. Becker, Eds. ACM, 2016, pp. 275–286. [Online]. Available: https://do...

work page doi:10.1145/2851553.2851569 2016
[25]

doi: 10.18653/v1/N19-1423

J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short ...

work page doi:10.18653/v1/n19-1423 2019
[26]

A survey on automated driving system testing: Landscapes and trends.ACM Trans

S. Tang, Z. Zhang, Y . Zhang, J. Zhou, Y . Guo, S. Liu, S. Guo, Y . Li, L. Ma, Y . Xue, and Y . Liu, “A survey on automated driving system testing: Landscapes and trends,” ACM Trans. Softw. Eng. Methodol., vol. 32, no. 5, pp. 124:1–124:62, 2023. [Online]. Available: https://doi.org/10.1145/3579642

work page doi:10.1145/3579642 2023
[27]

Black box and white box testing techniques- a literature review,

S. Nidhra and J. Dondeti, “Black box and white box testing techniques- a literature review,” International Journal of Embedded Systems and Applications (IJESA), vol. 2, no. 2, pp. 29–50, 2012

work page 2012
[28]

The paradox of source code secrecy,

S. K. Katyal, “The paradox of source code secrecy,” Cornell L. Rev., vol. 104, p. 1183, 2018

work page 2018
[29]

Exploring risks in the usage of third-party libraries,

S. Raemaekers, A. van Deursen, and J. Visser, “Exploring risks in the usage of third-party libraries,” in of the BElgian-NEtherlands software eVOLution seminar, vol. 31, 2011

work page 2011
[30]

An empirical study of usages, updates and risks of third-party libraries in java projects,

Y . Wang, B. Chen, K. Huang, B. Shi, C. Xu, X. Peng, Y . Wu, and Y . Liu, “An empirical study of usages, updates and risks of third-party libraries in java projects,” in IEEE International Conference on Software Maintenance and Evolution, ICSME 2020, Adelaide, Australia, September 28 - October 2, 2020 . IEEE, 2020, pp. 35–45. [Online]. Available: https://...

work page doi:10.1109/icsme46990.2020.00014 2020
[31]

Test-case reduction for C compiler bugs,

J. Regehr, Y . Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, “Test-case reduction for C compiler bugs,” in ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’12, Beijing, China - June 11 - 16, 2012 , J. Vitek, H. Lin, and F. Tip, Eds. ACM, 2012, pp. 335–346. [Online]. Available: https://doi.org/10.1145/2254064.2254104

work page doi:10.1145/2254064.2254104 2012
[32]

Automatic test case optimization: A bacteriologic algorithm,

B. Baudry, F. Fleurey, J. J´ez´equel, and Y . L. Traon, “Automatic test case optimization: A bacteriologic algorithm,” IEEE Softw., vol. 22, no. 2, pp. 76–82, 2005. [Online]. Available: https://doi.org/10.1109/MS.2005.30

work page doi:10.1109/ms.2005.30 2005
[33]

An insight into test case optimization: Ideas and trends with future perspectives,

N. Gupta, A. Sharma, and M. K. Pachariya, “An insight into test case optimization: Ideas and trends with future perspectives,” IEEE Access , vol. 7, pp. 22 310–22 327, 2019. [Online]. Available: https://doi.org/10.1109/ACCESS.2019.2899471

work page doi:10.1109/access.2019.2899471 2019
[34]

Tctracer: Establishing test-to-code traceability links using dynamic and static techniques,

R. White and J. Krinke, “Tctracer: Establishing test-to-code traceability links using dynamic and static techniques,” Empir. Softw. Eng. , vol. 27, no. 3, p. 67, 2022. [Online]. Available: https://doi.org/10.1007/s10664-021-10079-1

work page doi:10.1007/s10664-021-10079-1 2022
[35]

Towards optimizing the costs of LLM usage,

S. Shekhar, T. Dubey, K. Mukherjee, A. Saxena, A. Tyagi, and N. Kotla, “Towards optimizing the costs of LLM usage,” CoRR, vol. abs/2402.01742, 2024. [Online]. Available: https: //doi.org/10.48550/arXiv.2402.01742

work page doi:10.48550/arxiv.2402.01742 2024
[36]

Chatunitest: a chatgpt- based automated unit test generation tool,

Z. Xie, Y . Chen, C. Zhi, S. Deng, and J. Yin, “Chatunitest: a chatgpt- based automated unit test generation tool,” CoRR, vol. abs/2305.04764,

work page arXiv
[37]

Chatunitest: a chatgpt- based automated unit test generation tool,

[Online]. Available: https://doi.org/10.48550/arXiv.2305.04764

work page doi:10.48550/arxiv.2305.04764
[38]

A3test: Assertion-augmented automated test case generation,

S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, “A3test: Assertion-augmented automated test case generation,” Inf. Softw. Technol., vol. 176, p. 107565, 2024. [Online]. Available: https: //doi.org/10.1016/j.infsof.2024.107565

work page doi:10.1016/j.infsof.2024.107565 2024
[39]

Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,

C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 919–931. 21

work page 2023
[40]

A systematic literature review of test breakage prevention and repair techniques,

J. Imtiaz, S. Sherin, M. U. Khan, and M. Z. Iqbal, “A systematic literature review of test breakage prevention and repair techniques,” Information and Software Technology , vol. 113, pp. 1–19, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ S0950584919300990

work page 2019
[41]

Resumption strategies for interrupted programming tasks,

C. Parnin and S. Rugaber, “Resumption strategies for interrupted programming tasks,” Softw. Qual. J. , vol. 19, no. 1, pp. 5–34, 2011. [Online]. Available: https://doi.org/10.1007/s11219-010-9104-9

work page doi:10.1007/s11219-010-9104-9 2011
[42]

Machine learning-based network status detection and fault localization,

A. R. Mohammed, S. A. Mohammed, D. C ˆot´e, and S. Shirmohammadi, “Machine learning-based network status detection and fault localization,” IEEE Trans. Instrum. Meas. , vol. 70, pp. 1–10, 2021. [Online]. Available: https://doi.org/10.1109/TIM.2021.3094223

work page doi:10.1109/tim.2021.3094223 2021
[43]

M. Wardat, W. Le, and H. Rajan, “Deeplocalize: Fault localization for deep neural networks,” in 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 2021, pp. 251–262. [Online]. Available: https://doi.org/10.1109/ICSE43902.2021.00034

[44]

J. Wu, Z. Zhang, D. Yang, X. Meng, J. He, X. Mao, and Y. Lei, “Fault localization for hardware design code with time-aware program spectrum,” in IEEE 40th International Conference on Computer Design, ICCD 2022, Olympic Valley, CA, USA, October 23-26, 2022. IEEE, 2022, pp. 537–544. [Online]. Available: https://doi.org/10.1109/ICCD56317.2022.00085

[45]

F. Liu, W. Chao, N. Tan, and H. Liu, “Bag of tricks for inference-time computation of llm reasoning,” arXiv preprint arXiv:2502.07191, 2025

[46]

P. Jain, “Efficient and elastic llms,” Google Research India, Tech. Rep. [Online]. Available: http://www.prateekjain.org/publications/slides/inference_efficient_llms.pdf

[48]

Y. Wu, Z. Li, J. M. Zhang, and Y. Liu, “Condefects: A new dataset to address the data leakage concern for llm-based fault localization and program repair,” CoRR, vol. abs/2310.16253, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.16253

[49]

S. Balloccu, P. Schmidtová, M. Lango, and O. Dusek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Volume 1: Long Papers, St. Julian’s, Malta, March 17-22, 2024, Y. Graham and M. Purv...

[50]

A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools (2nd Edition). USA: Addison-Wesley Longman Publishing Co., Inc., 2006

[51]

M. A. Francel and S. Rugaber, “Fault localization using execution traces,” in Proceedings of the 30th Annual Southeast Regional Conference, 1992, Raleigh, North Carolina, USA, April 8-10, 1992, M. A. Vouk, D. S. Reeves, and C. M. Pancake, Eds. ACM, 1992, pp. 69–76. [Online]. Available: https://doi.org/10.1145/503720.503747

[52]

H. Agrawal, J. R. Horgan, S. London, and W. E. Wong, “Fault localization using execution slices and dataflow tests,” in Sixth International Symposium on Software Reliability Engineering, ISSRE 1995, Toulouse, France, October 24-27, 1995. IEEE Computer Society, 1995, pp. 143–151. [Online]. Available: https://doi.org/10.1109/ISSRE.1995.497652

[53]

J. A. Jones, M. J. Harrold, and J. T. Stasko, “Visualization of test information to assist fault localization,” in Proceedings of the 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA, W. Tracz, M. Young, and J. Magee, Eds. ACM, 2002, pp. 467–477. [Online]. Available: https://doi.org/10.1145/581339.581397

[54]

J. R. Ruthruff, M. M. Burnett, and G. Rothermel, “Interactive fault localization techniques in a spreadsheet environment,” IEEE Trans. Software Eng., vol. 32, no. 4, pp. 213–239, 2006. [Online]. Available: https://doi.org/10.1109/TSE.2006.37

[55]

M. Renieris and S. P. Reiss, “Fault localization with nearest neighbor queries,” in 18th IEEE International Conference on Automated Software Engineering (ASE 2003), 6-10 October 2003, Montreal, Canada. IEEE Computer Society, 2003, pp. 30–39. [Online]. Available: https://doi.org/10.1109/ASE.2003.1240292

[56]

H. Pan and E. Spafford, “Heuristics for automatic localization of software faults,” Software Engineering Research Center, Purdue University, Tech. Rep. SERC-TR-116-P, 7 1992

[57]

D. Schipper, M. F. Aniche, and A. van Deursen, “Tracing back log data to its log statement: from research to practice,” in Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 26-27 May 2019, Montreal, Canada, M. D. Storey, B. Adams, and S. Haiduc, Eds. IEEE / ACM, 2019, pp. 545–549. [Online]. Available: https://doi...

[58]

V. Bushong, R. Sanders, J. Curtis, M. Du, T. Cerný, K. Frajták, M. Bures, P. Tisnovsky, and D. Shin, “On matching log analysis to source code: A systematic mapping study,” in RACS ’20: International Conference on Research in Adaptive and Convergent Systems, Gwangju, Korea, October 13-16, 2020, T. Cerný and J. W. Park, Eds. ACM, 2020, pp. 181–187. ...

[59]

J. P. Jürgensen, “Trace reconstruction in system logs for processing with process mining,” in Proceedings of the 2nd International Conference on Industry 4.0 and Smart Manufacturing (ISM 2020), Virtual Event, Austria, 23-25 November 2020, ser. Procedia Computer Science, F. Longo, M. Affenzeller, and A. Padovano, Eds., vol. 180. Elsevier, 2020, pp. 352–...

[60]

A. K. Singh, Y. Yang, K. Tirumala, M. Elhoushi, and A. S. Morcos, “Brevity is the soul of wit: Pruning long files for code generation,” CoRR, vol. abs/2407.00434, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.00434

[61]

A. Agrawal, N. Kedia, J. Mohan, A. Panwar, N. Kwatra, B. S. Gulavani, R. Ramjee, and A. Tumanov, “VIDUR: A large-scale simulation framework for LLM inference,” in Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024, P. B. Gibbons, G. Pekhimenko, and C. D. Sa, Eds. mlsys.org, 2024...

[62]

C. C. Le, H. N. Phan, H. N. Phan, T. N. Nguyen, and N. D. Q. Bui, “Learning to predict program execution by modeling dynamic dependency on code graphs,” CoRR, vol. abs/2408.02816, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.02816

[63]

J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, and H. Hemmati, “Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks,” arXiv preprint arXiv:2310.10508, 2023

[64]

A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun, X. Zhou, E. Wang, and X. Dong, “Better zero-shot reasoning with role-play prompting,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024...

[65]

Z. Han and Z. Wang, “Rethinking the role-play prompting in mathematical reasoning tasks,” in Proceedings of the 1st Workshop on Efficiency, Security, and Generalization of Multimedia Foundation Models, 2024, pp. 13–17

[66]

J. Yao, K. Ning, Z. Liu, M. Ning, and L. Yuan, “LLM lies: Hallucinations are not bugs, but features as adversarial examples,” CoRR, vol. abs/2310.01469, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.01469

[67]

F. Zhang, Z. Zhang, J. W. Keung, X. Tang, Z. Yang, X. Yu, and W. Hu, “Data preparation for deep learning based code smell detection: A systematic literature review,” J. Syst. Softw., vol. 216, p. 112131, 2024. [Online]. Available: https://doi.org/10.1016/j.jss.2024.112131

[69]

R. Galiullin and Y. Bugayenko, “Code quality analysis: Exploring blank lines as indicators of increased code complexity,” Nov. 2024. [Online]. Available: https://doi.org/10.5281/zenodo.14132684

[70]

D. Steidl, B. Hummel, and E. Juergens, “Quality analysis of source code comments,” in 2013 21st International Conference on Program Comprehension (ICPC), 2013, pp. 83–92

[71]

A. Bagheri and P. Hegedüs, “Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,” in 19th IEEE/ACM International Conference on Mining Software Repositories, MSR 2022, Pittsburgh, PA, USA, May 23-24, 2022. ACM, 2022, pp. 117–121. [Online]. Available: https://doi.org/10.1145/3524842.3528034

[72]

F. Pukelsheim, “The three sigma rule,” The American Statistician, vol. 48, no. 2, pp. 88–91, 1994

[73]

R. Singh and N. S. Mangat, Stratified Sampling. Dordrecht: Springer Netherlands, 1996, pp. 102–144. [Online]. Available: https://doi.org/10.1007/978-94-017-1404-4_5

[74]

M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance,” Journal of the American Statistical Association, vol. 32, no. 200, pp. 675–701, 1937

[75]

F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968

[76]

N. Miryeganeh, S. Hashtroudi, and H. Hemmati, “Globug: Using global data in fault localization,” J. Syst. Softw., vol. 177, p. 110961, 2021. [Online]. Available: https://doi.org/10.1016/j.jss.2021.110961

[78]

M. Wen, J. Chen, Y. Tian, R. Wu, D. Hao, S. Han, and S. Cheung, “Historical spectrum based fault localization,” IEEE Trans. Software Eng., vol. 47, no. 11, pp. 2348–2368, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2948158

[79]

Z. Li, X. Bai, H. Wang, and Y. Liu, “IRBFL: an information retrieval based fault localization approach,” in 44th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2020, Madrid, Spain, July 13-17, 2020. IEEE, 2020, pp. 991–996. [Online]. Available: https://doi.org/10.1109/COMPSAC48688.2020.0-142

[80]

V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, p. 707, Feb. 1966

[38] [38]

A3test: Assertion-augmented automated test case generation,

S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, “A3test: Assertion-augmented automated test case generation,” Inf. Softw. Technol., vol. 176, p. 107565, 2024. [Online]. Available: https: //doi.org/10.1016/j.infsof.2024.107565

work page doi:10.1016/j.infsof.2024.107565 2024

[39] [39]

Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,

C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 919–931. 21

work page 2023

[40] [40]

A systematic literature review of test breakage prevention and repair techniques,

J. Imtiaz, S. Sherin, M. U. Khan, and M. Z. Iqbal, “A systematic literature review of test breakage prevention and repair techniques,” Information and Software Technology , vol. 113, pp. 1–19, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ S0950584919300990

work page 2019

[41] [41]

Resumption strategies for interrupted programming tasks,

C. Parnin and S. Rugaber, “Resumption strategies for interrupted programming tasks,” Softw. Qual. J. , vol. 19, no. 1, pp. 5–34, 2011. [Online]. Available: https://doi.org/10.1007/s11219-010-9104-9

work page doi:10.1007/s11219-010-9104-9 2011

[42] [42]

Machine learning-based network status detection and fault localization,

A. R. Mohammed, S. A. Mohammed, D. C ˆot´e, and S. Shirmohammadi, “Machine learning-based network status detection and fault localization,” IEEE Trans. Instrum. Meas. , vol. 70, pp. 1–10, 2021. [Online]. Available: https://doi.org/10.1109/TIM.2021.3094223

work page doi:10.1109/tim.2021.3094223 2021

[43] [43]

AUTOTRAINER: an automatic DNN training problem detection and repair system,

M. Wardat, W. Le, and H. Rajan, “Deeplocalize: Fault localization for deep neural networks,” in 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021 . IEEE, 2021, pp. 251–262. [Online]. Available: https://doi.org/10.1109/ICSE43902.2021.00034

work page doi:10.1109/icse43902.2021.00034 2021

[44] [44]

Fault localization for hardware design code with time-aware program spectrum,

J. Wu, Z. Zhang, D. Yang, X. Meng, J. He, X. Mao, and Y . Lei, “Fault localization for hardware design code with time-aware program spectrum,” in IEEE 40th International Conference on Computer Design, ICCD 2022, Olympic Valley, CA, USA, October 23-26, 2022 . IEEE, 2022, pp. 537–544. [Online]. Available: https://doi.org/10.1109/ICCD56317.2022.00085

work page doi:10.1109/iccd56317.2022.00085 2022

[45] [45]

Bag of tricks for inference-time computation of llm reasoning,

F. Liu, W. Chao, N. Tan, and H. Liu, “Bag of tricks for inference-time computation of llm reasoning,” arXiv preprint arXiv:2502.07191 , 2025

work page arXiv 2025

[46] [46]

Efficient and elastic llms,

P. Jain, “Efficient and elastic llms,” Google Research India, Tech. Rep.,

work page

[47] [47]

Available: http://www.prateekjain.org/publications/ slides/inference efficient llms.pdf

[Online]. Available: http://www.prateekjain.org/publications/ slides/inference efficient llms.pdf

work page

[48] [48]

Condefects: A new dataset to address the data leakage concern for llm-based fault localization and program repair,

Y . Wu, Z. Li, J. M. Zhang, and Y . Liu, “Condefects: A new dataset to address the data leakage concern for llm-based fault localization and program repair,” CoRR, vol. abs/2310.16253, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.16253

work page doi:10.48550/arxiv.2310.16253 2023

[49] [49]

Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms,

S. Balloccu, P. Schmidtov ´a, M. Lango, and O. Dusek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Volume 1: Long Papers, St. Julian’s, Malta, March 17-22, 2024 , Y . Graham and M. Purv...

work page 2024

[50] [50]

A. V . Aho, M. S. Lam, R. Sethi, and J. D. Ullman,Compilers: Principles, Techniques, and Tools (2nd Edition) . USA: Addison-Wesley Longman Publishing Co., Inc., 2006

work page 2006

[51] [51]

Fault localization using execution traces,

M. A. Francel and S. Rugaber, “Fault localization using execution traces,” in Proceedings of the 30th Annual Southeast Regional Conference, 1992, Raleigh, North Carolina, USA, April 8-10, 1992 , M. A. V ouk, D. S. Reeves, and C. M. Pancake, Eds. ACM, 1992, pp. 69–76. [Online]. Available: https://doi.org/10.1145/503720.503747

work page doi:10.1145/503720.503747 1992

[52] [52]

Fault localization using execution slices and dataflow tests,

H. Agrawal, J. R. Horgan, S. London, and W. E. Wong, “Fault localization using execution slices and dataflow tests,” in Sixth International Symposium on Software Reliability Engineering, ISSRE 1995, Toulouse, France, October 24-27, 1995 . IEEE Computer Society, 1995, pp. 143–151. [Online]. Available: https: //doi.org/10.1109/ISSRE.1995.497652

work page doi:10.1109/issre.1995.497652 1995

[53] [53]

Visualization of test information to assist fault localization,

J. A. Jones, M. J. Harrold, and J. T. Stasko, “Visualization of test information to assist fault localization,” in Proceedings of the 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA , W. Tracz, M. Young, and J. Magee, Eds. ACM, 2002, pp. 467–477. [Online]. Available: https://doi.org/10.1145/581339.581397

work page doi:10.1145/581339.581397 2002

[54] [54]

Interactive fault localization techniques in a spreadsheet environment,

J. R. Ruthruff, M. M. Burnett, and G. Rothermel, “Interactive fault localization techniques in a spreadsheet environment,” IEEE Trans. Software Eng., vol. 32, no. 4, pp. 213–239, 2006. [Online]. Available: https://doi.org/10.1109/TSE.2006.37

work page doi:10.1109/tse.2006.37 2006

[55] [55]

Fault localization with nearest neighbor queries,

M. Renieris and S. P. Reiss, “Fault localization with nearest neighbor queries,” in 18th IEEE International Conference on Automated Software Engineering (ASE 2003), 6-10 October 2003, Montreal, Canada . IEEE Computer Society, 2003, pp. 30–39. [Online]. Available: https://doi.org/10.1109/ASE.2003.1240292

work page doi:10.1109/ase.2003.1240292 2003

[56] [56]

Heuristics for automatic localization of soft- ware faults,

H. Pan and E. Spafford, “Heuristics for automatic localization of soft- ware faults,” Software Engineering Research Center, Purdue University, Tech. Rep. SERC-TR-116-P, 7 1992

work page 1992

[57] [57]

Tracing back log data to its log statement: from research to practice,

D. Schipper, M. F. Aniche, and A. van Deursen, “Tracing back log data to its log statement: from research to practice,” in Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 26-27 May 2019, Montreal, Canada, M. D. Storey, B. Adams, and S. Haiduc, Eds. IEEE / ACM, 2019, pp. 545–549. [Online]. Available: https://doi...

work page doi:10.1109/msr.2019.00081 2019

[58] [58]

On matching log analysis to source code: A systematic mapping study,

V. Bushong, R. Sanders, J. Curtis, M. Du, T. Cerný, K. Frajták, M. Bures, P. Tisnovsky, and D. Shin, “On matching log analysis to source code: A systematic mapping study,” in RACS ’20: International Conference on Research in Adaptive and Convergent Systems, Gwangju, Korea, October 13-16, 2020, T. Cerný and J. W. Park, Eds. ACM, 2020, pp. 181–187. ...

work page doi:10.1145/3400286.3418262 2020

[59] [59]

Trace reconstruction in system logs for processing with process mining,

J. P. Jürgensen, “Trace reconstruction in system logs for processing with process mining,” in Proceedings of the 2nd International Conference on Industry 4.0 and Smart Manufacturing (ISM 2020), Virtual Event, Austria, 23-25 November 2020, ser. Procedia Computer Science, F. Longo, M. Affenzeller, and A. Padovano, Eds., vol. 180. Elsevier, 2020, pp. 352–...

work page doi:10.1016/j.procs.2021.01.173 2020

[60] [60]

Brevity is the soul of wit: Pruning long files for code generation,

A. K. Singh, Y. Yang, K. Tirumala, M. Elhoushi, and A. S. Morcos, “Brevity is the soul of wit: Pruning long files for code generation,” CoRR, vol. abs/2407.00434, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.00434

work page doi:10.48550/arxiv.2407.00434 2024

[61] [61]

VIDUR: A large-scale simulation framework for LLM inference,

A. Agrawal, N. Kedia, J. Mohan, A. Panwar, N. Kwatra, B. S. Gulavani, R. Ramjee, and A. Tumanov, “VIDUR: A large-scale simulation framework for LLM inference,” in Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024, P. B. Gibbons, G. Pekhimenko, and C. D. Sa, Eds. mlsys.org, 2024...

work page 2024

[62] [62]

Learning to predict program execution by modeling dynamic dependency on code graphs,

C. C. Le, H. N. Phan, H. N. Phan, T. N. Nguyen, and N. D. Q. Bui, “Learning to predict program execution by modeling dynamic dependency on code graphs,” CoRR, vol. abs/2408.02816, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.02816

work page doi:10.48550/arxiv.2408.02816 2024

[63] [63]

Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks,

J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, and H. Hemmati, “Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks,” arXiv preprint arXiv:2310.10508, 2023

work page arXiv 2023

[64] [64]

Better zero-shot reasoning with role-play prompting,

A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun, X. Zhou, E. Wang, and X. Dong, “Better zero-shot reasoning with role-play prompting,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024...

work page doi:10.18653/v1/2024.naacl-long.228 2024

[65] [65]

Rethinking the role-play prompting in mathematical reasoning tasks,

Z. Han and Z. Wang, “Rethinking the role-play prompting in mathematical reasoning tasks,” in Proceedings of the 1st Workshop on Efficiency, Security, and Generalization of Multimedia Foundation Models, 2024, pp. 13–17

work page 2024

[66] [66]

LLM lies: Hallucinations are not bugs, but features as adversarial examples,

J. Yao, K. Ning, Z. Liu, M. Ning, and L. Yuan, “LLM lies: Hallucinations are not bugs, but features as adversarial examples,” CoRR, vol. abs/2310.01469, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.01469

work page doi:10.48550/arxiv.2310.01469 2023

[67] [67]

Data preparation for deep learning based code smell detection: A systematic literature review,

F. Zhang, Z. Zhang, J. W. Keung, X. Tang, Z. Yang, X. Yu, and W. Hu, “Data preparation for deep learning based code smell detection: A systematic literature review,” J. Syst. Softw., vol. 216, p. 112131, 2024. [Online]. Available: https://doi.org/10.1016/j.jss.2024.112131

work page doi:10.1016/j.jss.2024.112131 2024

[69] [69]

Code quality analysis: Exploring blank lines as indicators of increased code complexity,

R. Galiullin and Y. Bugayenko, “Code quality analysis: Exploring blank lines as indicators of increased code complexity,” Nov. 2024. [Online]. Available: https://doi.org/10.5281/zenodo.14132684

work page doi:10.5281/zenodo.14132684 2024

[70] [70]

Quality analysis of source code comments,

D. Steidl, B. Hummel, and E. Juergens, “Quality analysis of source code comments,” in 2013 21st International Conference on Program Comprehension (ICPC), 2013, pp. 83–92

work page 2013

[71] [71]

Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,

A. Bagheri and P. Hegedüs, “Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,” in 19th IEEE/ACM International Conference on Mining Software Repositories, MSR 2022, Pittsburgh, PA, USA, May 23-24, 2022. ACM, 2022, pp. 117–121. [Online]. Available: https://doi.org/10.1145/3524842.3528034

work page doi:10.1145/3524842.3528034 2022

[72] [72]

The three sigma rule,

F. Pukelsheim, “The three sigma rule,” The American Statistician, vol. 48, no. 2, pp. 88–91, 1994

work page 1994

[73] [73]

Stratified Sampling,

R. Singh and N. S. Mangat, Stratified Sampling. Dordrecht: Springer Netherlands, 1996, pp. 102–144. [Online]. Available: https://doi.org/10.1007/978-94-017-1404-4_5

work page doi:10.1007/978-94-017-1404-4_5 1996

[74] [74]

The use of ranks to avoid the assumption of normality implicit in the analysis of variance,

M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance,” Journal of the American Statistical Association, vol. 32, no. 200, pp. 675–701, 1937

work page 1937

[75] [75]

Individual comparisons by ranking methods,

F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968

work page 1945

[76] [76]

Globug: Using global data in fault localization,

N. Miryeganeh, S. Hashtroudi, and H. Hemmati, “Globug: Using global data in fault localization,” J. Syst. Softw., vol. 177, p. 110961, 2021. [Online]. Available: https://doi.org/10.1016/j.jss.2021.110961

work page doi:10.1016/j.jss.2021.110961 2021

[78] [78]

Historical spectrum based fault localization,

M. Wen, J. Chen, Y. Tian, R. Wu, D. Hao, S. Han, and S. Cheung, “Historical spectrum based fault localization,” IEEE Trans. Software Eng., vol. 47, no. 11, pp. 2348–2368, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2948158

work page doi:10.1109/tse.2019.2948158 2021

[79] [79]

IRBFL: an information retrieval based fault localization approach,

Z. Li, X. Bai, H. Wang, and Y. Liu, “IRBFL: an information retrieval based fault localization approach,” in 44th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2020, Madrid, Spain, July 13-17, 2020. IEEE, 2020, pp. 991–996. [Online]. Available: https://doi.org/10.1109/COMPSAC48688.2020.0-142

work page doi:10.1109/compsac48688.2020.0-142 2020

[80] [80]

Binary Codes Capable of Correcting Deletions, Insertions and Reversals,

V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, p. 707, Feb. 1966

work page 1966