Efficient Black-Box Fault Localization for System-Level Test Code Using Large Language Models
Pith reviewed 2026-05-19 07:28 UTC · model grok-4.3
The pith
Pruned traces from one failure log let LLMs rank faulty statements in system-level test code without repeated executions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a black-box, execution-free technique for system-level test code fault localization can match or exceed prior LLM-guided accuracy by using three novel algorithms to build a pruned trace estimate from one failure log. This trace, combined with the error message, supplies the LLM with enough context to rank faulty statements correctly. The method works on complex test scripts without needing the system-under-test source code and was evaluated on an industrial dataset of faulty Python test cases not seen in LLM pre-training.
What carries the argument
Three novel algorithms that identify statements likely involved in the failure to produce a pruned execution trace estimate from a single failure log.
If this is right
- Estimated traces match actual execution traces with an F1 score of around 90%.
- Pruning reduces LLM inference time by up to 34% with no loss in fault localization performance.
- The approach works on complex test scripts that assess full system behavior without access to system-under-test source code.
- It achieves equal or higher accuracy than the latest LLM-guided method while using over 85% less average inference time and 93% fewer tokens per test case.
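To make the trace-overlap figure concrete, here is a minimal sketch of how an F1 score between an estimated and an actual trace could be computed. It assumes each trace is reduced to a set of executed line numbers; the paper may define overlap differently (for example over ordered statements), and the function name `trace_f1` and the sample data are hypothetical.

```python
def trace_f1(estimated: set[int], actual: set[int]) -> tuple[float, float, float]:
    """Return (precision, recall, F1), treating each trace as a set of line numbers.

    Assumption: set-based overlap; this is an illustration, not the paper's metric code.
    """
    if not estimated or not actual:
        return 0.0, 0.0, 0.0
    true_positives = len(estimated & actual)
    precision = true_positives / len(estimated)
    recall = true_positives / len(actual)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical traces: nine of ten lines agree in each direction.
actual_trace = {1, 2, 3, 5, 8, 9, 10, 12, 13, 14}
estimated_trace = {1, 2, 3, 5, 8, 9, 10, 12, 13, 20}

precision, recall, f1 = trace_f1(estimated_trace, actual_trace)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

A high F1 on this measure says the two traces largely agree; it does not by itself say whether the faulty statement is among the lines that agree.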
Where Pith is reading between the lines
- The single-log pruning approach could extend to handling intermittent or non-deterministic test failures more reliably than methods that require multiple runs.
- The technique might apply to test scripts in languages other than Python when similar execution logs are available.
- Integration into test maintenance tools could automate initial fault ranking for engineers debugging large system test suites.
Load-bearing premise
The three novel algorithms can produce a pruned trace from a single failure log that, together with the error message, supplies the LLM with enough context to correctly identify and rank the actual faulty statements in the test code.
What would settle it
The claim would be undermined if, on the industrial dataset of faulty Python test cases, LLM rankings built from the pruned traces showed lower accuracy than the latest LLM-guided method at the line, block, or function level.
Original abstract
Fault localization (FL) is a critical step in debugging, which typically relies on repeated executions to pinpoint faulty code regions. However, repeated executions can be impractical in the presence of non-deterministic failures or high execution costs. While recent efforts have leveraged Large Language Models (LLMs) to aid execution-free FL, these have primarily focused on identifying faults in the system-under-test (SUT) rather than in the often complex system-level test code. However, the latter is also important, as in practice, many failures are triggered by faulty test code. To overcome these challenges, we introduce a fully static, LLM-driven approach for system-level test code fault localization (TCFL) that does not require executing the test case. Our method uses a single failure execution log to estimate the test's execution trace through three novel algorithms that identify only code statements likely involved in the failure. This pruned trace, combined with the error message, is used to prompt the LLM to rank potential faulty locations. Our black-box, system-level approach requires no access to the SUT source code and is applicable to complex test scripts that assess full system behavior. We evaluate our technique at the function, block, and line levels using an industrial dataset of faulty Python test cases that were not used in pre-training LLMs. Results show that our best-estimated traces closely match the actual traces, with an F1 score of around 90%. Additionally, pruning the complex system-level test code reduces the LLM's inference time by up to 34% without any loss in FL performance. Our method achieves equal or higher FL accuracy, requiring over 85% less average inference time per test case and 93% fewer tokens than the latest LLM-guided FL method.
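The prompting step described in the abstract (pruned trace plus error message) might be sketched as follows. This is an illustrative assumption, not the authors' prompt: the function `build_prompt`, the prompt wording, and the sample test script are all hypothetical, and the pruned line numbers are taken as given rather than produced by the paper's three algorithms.

```python
def build_prompt(test_source: str, pruned_lines: list[int], error_message: str) -> str:
    """Assemble an LLM prompt from the statements kept after pruning and the error message.

    Assumption: pruned_lines holds 1-based line numbers of statements
    estimated to be involved in the failure.
    """
    source_lines = test_source.splitlines()
    kept = [
        f"{n}: {source_lines[n - 1]}"
        for n in sorted(set(pruned_lines))
        if 1 <= n <= len(source_lines)
    ]
    return (
        "The statements below are estimated to have run before a test failure.\n"
        "Rank the line numbers most likely to contain the fault, most suspicious first.\n\n"
        "Estimated trace:\n" + "\n".join(kept) + "\n\n"
        "Error message:\n" + error_message.strip() + "\n"
    )


# Hypothetical six-line system-level test script.
test_source = "\n".join([
    'device = connect("10.0.0.1")',
    'device.configure(mode="fast")',
    'log("configured")',
    "result = device.measure()",
    'assert result.status == "OK"',
    "cleanup(device)",
])

prompt = build_prompt(test_source, [1, 2, 4, 5], "AssertionError: status was 'TIMEOUT'")
print(prompt)
```

Only the kept statements reach the model, which is where the token savings reported above would come from; lines 3 and 6 of the sample script never appear in the prompt.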
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a static, black-box LLM-based method for fault localization in complex system-level test code (TCFL). Given only a single failure log and error message, three novel pruning algorithms estimate a reduced execution trace; this trace plus the error message is fed to an LLM to rank faulty statements at function, block, and line granularity. No SUT source code or repeated executions are required. On an industrial dataset of faulty Python tests, the best pruned traces achieve ~90% F1 overlap with ground-truth traces, LLM inference time drops up to 34% with no accuracy loss, and the method matches or exceeds prior LLM-guided FL accuracy while cutting average inference time by >85% and tokens by 93%.
Significance. If the pruning step reliably retains the actual faulty statements and the LLM ranking is robust, the technique offers a practical route to low-cost, execution-free debugging of non-deterministic system tests. The use of real industrial faults outside LLM pre-training data and the concrete efficiency gains are strengths that would be valuable to the SE community if the evaluation protocol is fully documented.
major comments (3)
- [Evaluation / Results] The central claim that the pruned trace plus error message suffices for correct LLM ranking rests on the assumption that the three pruning algorithms never drop the root-cause statements. The reported ~90% F1 measures trace overlap but does not quantify how often the omitted 10% contains the actual fault; this must be shown explicitly (e.g., by reporting the fraction of cases where the ground-truth faulty line is retained after pruning).
- [Evaluation / Results] Statistical significance, confidence intervals, and the exact number of test cases are not reported for the FL accuracy, time, and token reductions. Without these, it is impossible to assess whether the claimed 85% time and 93% token savings are reliable or could be due to selection effects in the industrial dataset.
- [Approach / Trace Pruning Algorithms] The manuscript should detail the precise decision rules inside the three novel pruning algorithms (thresholds, heuristics for identifying 'likely involved' statements) and demonstrate that they preserve statements executed only on the failing path; otherwise the subsequent LLM prompt may systematically lack the root cause.
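The retention check requested in the first major comment can be sketched as below. The `Case` structure, its field names, and the sample data are hypothetical; the sketch assumes that ground-truth faulty lines and pruned traces are both available as sets of line numbers.

```python
from dataclasses import dataclass


@dataclass
class Case:
    faulty_lines: set[int]  # ground-truth faulty line numbers (assumed known)
    pruned_trace: set[int]  # line numbers kept after pruning


def retention_rate(cases: list[Case]) -> float:
    """Fraction of cases in which every ground-truth faulty line survives pruning."""
    if not cases:
        raise ValueError("at least one case is required")
    retained = sum(1 for c in cases if c.faulty_lines <= c.pruned_trace)
    return retained / len(cases)


# Hypothetical data: the second case loses its faulty line to pruning.
cases = [
    Case(faulty_lines={5}, pruned_trace={1, 2, 5, 9}),
    Case(faulty_lines={7}, pruned_trace={1, 2, 3}),
    Case(faulty_lines={3, 4}, pruned_trace={3, 4, 8}),
    Case(faulty_lines={2}, pruned_trace={2}),
]

rate = retention_rate(cases)
print(f"retention rate: {rate:.2f}")  # 3 of 4 cases retain the fault -> 0.75
```

Reporting this number alongside F1 separates two questions that the overlap score merges: how close the estimated trace is overall, and whether the fault is still reachable by the LLM at all.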
minor comments (2)
- [Related Work / Evaluation] Add a table or paragraph explicitly comparing the new method against the 'latest LLM-guided FL method' on identical test cases, including the exact baseline name and citation.
- [Evaluation] Clarify whether the industrial dataset will be released (even in anonymized form) to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our paper. We address each of the major comments below and outline the revisions we intend to make to strengthen the manuscript.
- Referee: [Evaluation / Results] The central claim that the pruned trace plus error message suffices for correct LLM ranking rests on the assumption that the three pruning algorithms never drop the root-cause statements. The reported ~90% F1 measures trace overlap but does not quantify how often the omitted 10% contains the actual fault; this must be shown explicitly (e.g., by reporting the fraction of cases where the ground-truth faulty line is retained after pruning).
  Authors: We agree that explicitly demonstrating the retention of root-cause statements is crucial for validating our approach. Although the ~90% F1 score suggests strong overall agreement between pruned and ground-truth traces, it does not isolate the retention rate for the specific faulty elements. In the revised manuscript, we will add a new table or subsection in the evaluation that reports the fraction of test cases where the ground-truth faulty line, block, and function are preserved in the pruned traces for each of the three algorithms. This analysis will be performed on our industrial dataset and will directly address the concern regarding potential loss of the root cause.
  Revision: yes
- Referee: [Evaluation / Results] Statistical significance, confidence intervals, and the exact number of test cases are not reported for the FL accuracy, time, and token reductions. Without these, it is impossible to assess whether the claimed 85% time and 93% token savings are reliable or could be due to selection effects in the industrial dataset.
  Authors: We thank the referee for this observation. We will revise the manuscript to include the exact number of test cases in the industrial dataset, along with statistical significance tests and confidence intervals for the FL accuracy, time, and token reduction metrics. Specifically, we plan to use bootstrap resampling to compute 95% confidence intervals and report p-values for comparisons against baseline methods. This will allow readers to better assess the robustness of our efficiency claims.
  Revision: yes
- Referee: [Approach / Trace Pruning Algorithms] The manuscript should detail the precise decision rules inside the three novel pruning algorithms (thresholds, heuristics for identifying 'likely involved' statements) and demonstrate that they preserve statements executed only on the failing path; otherwise the subsequent LLM prompt may systematically lack the root cause.
  Authors: We appreciate the referee's call for greater precision in describing our pruning algorithms. In the current manuscript, the three algorithms are outlined at a high level in Section 3. We will expand this section to provide the exact decision rules, including any thresholds and heuristics employed to determine 'likely involved' statements. Furthermore, we will include a formal argument or empirical demonstration showing that the algorithms are designed to retain statements on the failing execution path, based on the information available in the failure log. Pseudocode for each algorithm will be added to facilitate understanding and reproducibility.
  Revision: yes
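The bootstrap confidence intervals mentioned in the second response could be computed along these lines. This is a sketch under stated assumptions: it uses a percentile bootstrap over per-test-case outcomes, and the function name `bootstrap_ci`, its defaults, and the sample outcomes are hypothetical rather than taken from the paper.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-case values."""
    if not values:
        raise ValueError("at least one value is required")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample the cases with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lower, upper


# Hypothetical outcomes: 1 if the faulty line was ranked first, else 0 (20 cases).
hits = [1] * 14 + [0] * 6

lower, upper = bootstrap_ci(hits)
print(f"top-1 accuracy = {statistics.fmean(hits):.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The same routine applies to per-case inference times or token counts; with a small dataset the interval will be wide, which is exactly the information the referee asks for.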
Circularity Check
No circularity; empirical evaluation is self-contained
The paper describes a static LLM-based fault localization technique for system-level test code that relies on three novel pruning algorithms applied to a single failure log plus error message. All load-bearing claims (~90% F1 trace overlap, equal-or-better FL accuracy, over 85% lower inference time, and 93% fewer tokens) are presented as outcomes of an external industrial evaluation on previously unseen faulty Python test cases, not as the result of any derivation, fitted parameter, or self-citation chain that reduces to the method's own inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked; the reported performance metrics are measured against ground-truth traces and prior LLM-guided baselines, making the results falsifiable outside the paper's own definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- trace pruning thresholds
axioms (1)
- domain assumption: A single failure execution log contains enough information to estimate the relevant execution trace for fault localization.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our method uses a single failure execution log to estimate the test's execution trace through three novel algorithms that identify only code statements likely involved in the failure. This pruned trace, combined with the error message, is used to prompt the LLM to rank potential faulty locations."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose three novel algorithms that can estimate the execution trace of a faulty test case with sufficient accuracy, while pruning information irrelevant to the failure"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
An empirical study of fault localization families and their combinations,
D. Zou, J. Liang, Y . Xiong, M. D. Ernst, and L. Zhang, “An empirical study of fault localization families and their combinations,” IEEE Trans. Software Eng. , vol. 47, no. 2, pp. 332–347, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2892102
-
[2]
A survey on software fault localization,
W. E. Wong, R. Gao, Y . Li, R. Abreu, and F. Wotawa, “A survey on software fault localization,” IEEE Transactions on Software Engineering, vol. 42, no. 8, pp. 707–740, 2016
work page 2016
-
[3]
Evaluating and improving fault localization,
S. Pearson, J. Campos, R. Just, G. Fraser, R. Abreu, M. D. Ernst, D. Pang, and B. Keller, “Evaluating and improving fault localization,” in Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 , S. Uchitel, A. Orso, and M. P. Robillard, Eds. IEEE / ACM, 2017, pp. 609–620. [Online]. A...
-
[4]
Agentfl: Scaling llm-based fault localization to project-level context,
Y . Qin, S. Wang, Y . Lou, J. Dong, K. Wang, X. Li, and X. Mao, “Agentfl: Scaling llm-based fault localization to project-level context,” CoRR, vol. abs/2403.16362, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2403.16362
-
[5]
A. Z. H. Yang, C. Le Goues, R. Martins, and V . J. Hellendoorn, “Large language models for test-free fault localization,” in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 2024, pp. 17:1– 17:12. [Online]. Available: https://doi.org/10.1145/3597503.3623342
-
[6]
Flexfl: Flexible and effective fault localization with open-source large language models,
C. Xu, Z. Liu, X. Ren, G. Zhang, M. Liang, and D. Lo, “Flexfl: Flexible and effective fault localization with open-source large language models,” CoRR, vol. abs/2411.10714, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2411.10714
-
[7]
System-level test: State of the art and challenges,
D. Appello, H. H. Chen, M. Sauer, I. Polian, P. Bernardi, and M. S. Reorda, “System-level test: State of the art and challenges,” in27th IEEE International Symposium on On-Line Testing and Robust System Design, IOLTS 2021, Torino, Italy, June 28-30, 2021 . IEEE, 2021, pp. 1–7. [Online]. Available: https://doi.org/10.1109/IOLTS52814.2021.9486708
-
[8]
An empirical study of bugs in test code,
A. Vahabzadeh, A. M. Fard, and A. Mesbah, “An empirical study of bugs in test code,” in 2015 IEEE international conference on software maintenance and evolution (ICSME) . IEEE, 2015, pp. 101–110
work page 2015
-
[9]
Flakyfix: Using large language models for predicting flaky test fix categories and test code repair,
S. Fatima, H. Hemmati, and L. C. Briand, “Flakyfix: Using large language models for predicting flaky test fix categories and test code repair,” IEEE Trans. Software Eng. , vol. 50, no. 12, pp. 3146–3171,
-
[10]
Available: https://doi.org/10.1109/TSE.2024.3472476
[Online]. Available: https://doi.org/10.1109/TSE.2024.3472476
-
[11]
Niodebugger: A novel approach to repair non-idempotent- outcome tests with llm-based agent,
K. Ke, “Niodebugger: A novel approach to repair non-idempotent- outcome tests with llm-based agent,” in 2025 IEEE/ACM 47th Interna- tional Conference on Software Engineering (ICSE) . IEEE Computer Society, 2025, pp. 762–762
work page 2025
-
[12]
Automated test case repair using language models,
A. Saboor Yaraghi, D. Holden, N. Kahani, and L. Briand, “Automated test case repair using language models,” IEEE Transactions on Software Engineering, vol. 51, no. 4, pp. 1104–1133, 2025
work page 2025
-
[13]
Utfix: Change aware unit test repairing using llm,
S. Rahman, S. Kuhar, B. Cirisci, P. Garg, S. Wang, X. Ma, A. Deoras, and B. Ray, “Utfix: Change aware unit test repairing using llm,” Proceedings of the ACM on Programming Languages , vol. 9, no. OOPSLA1, pp. 143–168, 2025
work page 2025
-
[14]
Boosting spectrum-based fault localization via multi-correct programs in online programming,
W. Zheng, H. Hu, T. Chen, F. Yang, X. Fan, and P. Xiao, “Boosting spectrum-based fault localization via multi-correct programs in online programming,” IEICE Trans. Inf. Syst. , vol. 107, no. 4, pp. 525–536,
-
[15]
Available: https://doi.org/10.1587/transinf.2023edp7164
[Online]. Available: https://doi.org/10.1587/transinf.2023edp7164
-
[16]
Spectrum-based rule- and item- level localization of faults in context-free grammars,
M. Raselimo and B. Fischer, “Spectrum-based rule- and item- level localization of faults in context-free grammars,” J. Syst. Softw., vol. 215, p. 112067, 2024. [Online]. Available: https: //doi.org/10.1016/j.jss.2024.112067
-
[17]
A survey of challenges in spectrum-based software fault localization,
Q. I. Sarhan and ´A. Besz ´edes, “A survey of challenges in spectrum-based software fault localization,” IEEE Access , vol. 10, pp. 10 618–10 639, 2022. [Online]. Available: https://doi.org/10.1109/ ACCESS.2022.3144079
-
[18]
Spectrum-based Software Fault Localization: A Survey of Techniques, Advances, and Challenges
H. A. de Souza, M. L. Chaim, and F. Kon, “Spectrum-based software fault localization: A survey of techniques, advances, and challenges,” arXiv preprint arXiv:1607.04347 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[19]
Spectrum-based fault localization techniques application on multiple- fault programs: A review,
A. Zakari, S. Abdullahi, N. Shagari, A. B. Tambawal, N. M. Shanono, J. Z. Maitama, R. A. Rasheed, A. Adamu, and S. M. Abdulrahman, “Spectrum-based fault localization techniques application on multiple- fault programs: A review,” Global Journal of Computer Science and Technology, vol. 20, pp. 41–48, 2020
work page 2020
-
[20]
Isolating failure-inducing thread schedules,
J. Choi and A. Zeller, “Isolating failure-inducing thread schedules,” in Proceedings of the International Symposium on Software Testing and Analysis, ISSTA 2002, Roma, Italy, July 22-24, 2002 , P. G. Frankl, Ed. ACM, 2002, pp. 210–220. [Online]. Available: https://doi.org/10.1145/566172.566211
-
[21]
Do system test cases grow old?
R. Feldt, “Do system test cases grow old?” in Seventh IEEE International Conference on Software Testing, Verification and Validation, ICST 2014, March 31 2014-April 4, 2014, Cleveland, Ohio, USA. IEEE Computer Society, 2014, pp. 343–352. [Online]. Available: https://doi.org/10.1109/ICST.2014.47
-
[22]
Abstract execution: A technique for efficiently tracing programs,
J. R. Larus, “Abstract execution: A technique for efficiently tracing programs,” Softw. Pract. Exp., vol. 20, no. 12, pp. 1241–1258, 1990. [Online]. Available: https://doi.org/10.1002/spe.4380201205
-
[23]
Combining code and requirements coverage with execution cost for test suite reduction,
A. Marchetto, G. Scanniello, and A. Susi, “Combining code and requirements coverage with execution cost for test suite reduction,” IEEE Trans. Software Eng., vol. 45, no. 4, pp. 363–390, 2019. [Online]. Available: https://doi.org/10.1109/TSE.2017.2777831
-
[24]
Analysis of overhead in dynamic java performance monitoring,
V . Hork´y, J. Kotrc, P. Libic, and P. Tuma, “Analysis of overhead in dynamic java performance monitoring,” in Proceedings of the 7th ACM/SPEC International Conference on Performance Engineering, ICPE 2016, Delft, The Netherlands, March 12-16, 2016 , A. Avritzer, A. Iosup, X. Zhu, and S. Becker, Eds. ACM, 2016, pp. 275–286. [Online]. Available: https://do...
-
[25]
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short ...
-
[26]
A survey on automated driving system testing: Landscapes and trends.ACM Trans
S. Tang, Z. Zhang, Y . Zhang, J. Zhou, Y . Guo, S. Liu, S. Guo, Y . Li, L. Ma, Y . Xue, and Y . Liu, “A survey on automated driving system testing: Landscapes and trends,” ACM Trans. Softw. Eng. Methodol., vol. 32, no. 5, pp. 124:1–124:62, 2023. [Online]. Available: https://doi.org/10.1145/3579642
-
[27]
Black box and white box testing techniques- a literature review,
S. Nidhra and J. Dondeti, “Black box and white box testing techniques- a literature review,” International Journal of Embedded Systems and Applications (IJESA), vol. 2, no. 2, pp. 29–50, 2012
work page 2012
-
[28]
The paradox of source code secrecy,
S. K. Katyal, “The paradox of source code secrecy,” Cornell L. Rev., vol. 104, p. 1183, 2018
work page 2018
-
[29]
Exploring risks in the usage of third-party libraries,
S. Raemaekers, A. van Deursen, and J. Visser, “Exploring risks in the usage of third-party libraries,” in of the BElgian-NEtherlands software eVOLution seminar, vol. 31, 2011
work page 2011
-
[30]
An empirical study of usages, updates and risks of third-party libraries in java projects,
Y . Wang, B. Chen, K. Huang, B. Shi, C. Xu, X. Peng, Y . Wu, and Y . Liu, “An empirical study of usages, updates and risks of third-party libraries in java projects,” in IEEE International Conference on Software Maintenance and Evolution, ICSME 2020, Adelaide, Australia, September 28 - October 2, 2020 . IEEE, 2020, pp. 35–45. [Online]. Available: https://...
-
[31]
Test-case reduction for C compiler bugs,
J. Regehr, Y . Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, “Test-case reduction for C compiler bugs,” in ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’12, Beijing, China - June 11 - 16, 2012 , J. Vitek, H. Lin, and F. Tip, Eds. ACM, 2012, pp. 335–346. [Online]. Available: https://doi.org/10.1145/2254064.2254104
-
[32]
Automatic test case optimization: A bacteriologic algorithm,
B. Baudry, F. Fleurey, J. J´ez´equel, and Y . L. Traon, “Automatic test case optimization: A bacteriologic algorithm,” IEEE Softw., vol. 22, no. 2, pp. 76–82, 2005. [Online]. Available: https://doi.org/10.1109/MS.2005.30
-
[33]
An insight into test case optimization: Ideas and trends with future perspectives,
N. Gupta, A. Sharma, and M. K. Pachariya, “An insight into test case optimization: Ideas and trends with future perspectives,” IEEE Access , vol. 7, pp. 22 310–22 327, 2019. [Online]. Available: https://doi.org/10.1109/ACCESS.2019.2899471
-
[34]
Tctracer: Establishing test-to-code traceability links using dynamic and static techniques,
R. White and J. Krinke, “Tctracer: Establishing test-to-code traceability links using dynamic and static techniques,” Empir. Softw. Eng. , vol. 27, no. 3, p. 67, 2022. [Online]. Available: https://doi.org/10.1007/s10664-021-10079-1
-
[35]
Towards optimizing the costs of LLM usage,
S. Shekhar, T. Dubey, K. Mukherjee, A. Saxena, A. Tyagi, and N. Kotla, “Towards optimizing the costs of LLM usage,” CoRR, vol. abs/2402.01742, 2024. [Online]. Available: https: //doi.org/10.48550/arXiv.2402.01742
-
[36]
Chatunitest: a chatgpt- based automated unit test generation tool,
Z. Xie, Y . Chen, C. Zhi, S. Deng, and J. Yin, “Chatunitest: a chatgpt- based automated unit test generation tool,” CoRR, vol. abs/2305.04764,
-
[37]
Chatunitest: a chatgpt- based automated unit test generation tool,
[Online]. Available: https://doi.org/10.48550/arXiv.2305.04764
-
[38]
A3test: Assertion-augmented automated test case generation,
S. Alagarsamy, C. Tantithamthavorn, and A. Aleti, “A3test: Assertion-augmented automated test case generation,” Inf. Softw. Technol., vol. 176, p. 107565, 2024. [Online]. Available: https: //doi.org/10.1016/j.infsof.2024.107565
-
[39]
Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,
C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 919–931. 21
work page 2023
-
[40]
A systematic literature review of test breakage prevention and repair techniques,
J. Imtiaz, S. Sherin, M. U. Khan, and M. Z. Iqbal, “A systematic literature review of test breakage prevention and repair techniques,” Information and Software Technology , vol. 113, pp. 1–19, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ S0950584919300990
work page 2019
-
[41]
Resumption strategies for interrupted programming tasks,
C. Parnin and S. Rugaber, “Resumption strategies for interrupted programming tasks,” Softw. Qual. J. , vol. 19, no. 1, pp. 5–34, 2011. [Online]. Available: https://doi.org/10.1007/s11219-010-9104-9
-
[42]
Machine learning-based network status detection and fault localization,
A. R. Mohammed, S. A. Mohammed, D. C ˆot´e, and S. Shirmohammadi, “Machine learning-based network status detection and fault localization,” IEEE Trans. Instrum. Meas. , vol. 70, pp. 1–10, 2021. [Online]. Available: https://doi.org/10.1109/TIM.2021.3094223
-
[43]
AUTOTRAINER: an automatic DNN training problem detection and repair system,
M. Wardat, W. Le, and H. Rajan, “Deeplocalize: Fault localization for deep neural networks,” in 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021 . IEEE, 2021, pp. 251–262. [Online]. Available: https://doi.org/10.1109/ICSE43902.2021.00034
-
[44]
Fault localization for hardware design code with time-aware program spectrum,
J. Wu, Z. Zhang, D. Yang, X. Meng, J. He, X. Mao, and Y . Lei, “Fault localization for hardware design code with time-aware program spectrum,” in IEEE 40th International Conference on Computer Design, ICCD 2022, Olympic Valley, CA, USA, October 23-26, 2022 . IEEE, 2022, pp. 537–544. [Online]. Available: https://doi.org/10.1109/ICCD56317.2022.00085
-
[45]
Bag of tricks for inference-time computation of llm reasoning,
F. Liu, W. Chao, N. Tan, and H. Liu, “Bag of tricks for inference-time computation of llm reasoning,” arXiv preprint arXiv:2502.07191 , 2025
-
[46]
P. Jain, “Efficient and elastic llms,” Google Research India, Tech. Rep.,
-
[47]
Available: http://www.prateekjain.org/publications/ slides/inference efficient llms.pdf
[Online]. Available: http://www.prateekjain.org/publications/ slides/inference efficient llms.pdf
-
[48]
Y . Wu, Z. Li, J. M. Zhang, and Y . Liu, “Condefects: A new dataset to address the data leakage concern for llm-based fault localization and program repair,” CoRR, vol. abs/2310.16253, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.16253
-
[49]
Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms,
S. Balloccu, P. Schmidtov ´a, M. Lango, and O. Dusek, “Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - Volume 1: Long Papers, St. Julian’s, Malta, March 17-22, 2024 , Y . Graham and M. Purv...
work page 2024
-
[50]
A. V . Aho, M. S. Lam, R. Sethi, and J. D. Ullman,Compilers: Principles, Techniques, and Tools (2nd Edition) . USA: Addison-Wesley Longman Publishing Co., Inc., 2006
work page 2006
-
[51]
Fault localization using execution traces,
M. A. Francel and S. Rugaber, “Fault localization using execution traces,” in Proceedings of the 30th Annual Southeast Regional Conference, 1992, Raleigh, North Carolina, USA, April 8-10, 1992 , M. A. V ouk, D. S. Reeves, and C. M. Pancake, Eds. ACM, 1992, pp. 69–76. [Online]. Available: https://doi.org/10.1145/503720.503747
-
[52]
Fault localization using execution slices and dataflow tests,
H. Agrawal, J. R. Horgan, S. London, and W. E. Wong, “Fault localization using execution slices and dataflow tests,” in Sixth International Symposium on Software Reliability Engineering, ISSRE 1995, Toulouse, France, October 24-27, 1995 . IEEE Computer Society, 1995, pp. 143–151. [Online]. Available: https: //doi.org/10.1109/ISSRE.1995.497652
-
[53]
Visualization of test information to assist fault localization,
J. A. Jones, M. J. Harrold, and J. T. Stasko, “Visualization of test information to assist fault localization,” in Proceedings of the 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA , W. Tracz, M. Young, and J. Magee, Eds. ACM, 2002, pp. 467–477. [Online]. Available: https://doi.org/10.1145/581339.581397
-
[54]
Interactive fault localization techniques in a spreadsheet environment,
J. R. Ruthruff, M. M. Burnett, and G. Rothermel, “Interactive fault localization techniques in a spreadsheet environment,” IEEE Trans. Software Eng., vol. 32, no. 4, pp. 213–239, 2006. [Online]. Available: https://doi.org/10.1109/TSE.2006.37
-
[55]
Fault localization with nearest neighbor queries,
M. Renieris and S. P. Reiss, “Fault localization with nearest neighbor queries,” in 18th IEEE International Conference on Automated Software Engineering (ASE 2003), 6-10 October 2003, Montreal, Canada . IEEE Computer Society, 2003, pp. 30–39. [Online]. Available: https://doi.org/10.1109/ASE.2003.1240292
-
[56]
Heuristics for automatic localization of soft- ware faults,
H. Pan and E. Spafford, “Heuristics for automatic localization of soft- ware faults,” Software Engineering Research Center, Purdue University, Tech. Rep. SERC-TR-116-P, 7 1992
work page 1992
-
[57]
Tracing back log data to its log statement: from research to practice,
D. Schipper, M. F. Aniche, and A. van Deursen, “Tracing back log data to its log statement: from research to practice,” in Proceedings of the 16th International Conference on Mining Software Repositories, MSR 2019, 26-27 May 2019, Montreal, Canada , M. D. Storey, B. Adams, and S. Haiduc, Eds. IEEE / ACM, 2019, pp. 545–549. [Online]. Available: https://doi...
-
[58]
On matching log analysis to source code: A systematic mapping study,
V . Bushong, R. Sanders, J. Curtis, M. Du, T. Cern ´y, K. Frajt ´ak, M. Bures, P. Tisnovsky, and D. Shin, “On matching log analysis to source code: A systematic mapping study,” in RACS ’20: International Conference on Research in Adaptive and Convergent Systems, Gwangju, Korea, October 13-16, 2020 , T. Cern ´y and J. W. Park, Eds. ACM, 2020, pp. 181–187. ...
-
[59]
Trace reconstruction in system logs for processing with process mining,
J. P. J ¨urgensen, “Trace reconstruction in system logs for processing with process mining,” in Proceedings of the 2nd International Conference on Industry 4.0 and Smart Manufacturing (ISM 2020), Virtual Event, Austria, 23-25 November 2020 , ser. Procedia Computer Science, F. Longo, M. Affenzeller, and A. Padovano, Eds., vol. 180. Elsevier, 2020, pp. 352–...
-
[60]
Brevity is the soul of wit: Pruning long files for code generation,
A. K. Singh, Y . Yang, K. Tirumala, M. Elhoushi, and A. S. Morcos, “Brevity is the soul of wit: Pruning long files for code generation,” CoRR, vol. abs/2407.00434, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.00434
-
[61]
VIDUR: A large-scale simulation framework for LLM inference,
A. Agrawal, N. Kedia, J. Mohan, A. Panwar, N. Kwatra, B. S. Gulavani, R. Ramjee, and A. Tumanov, “VIDUR: A large-scale simulation framework for LLM inference,” in Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024, P. B. Gibbons, G. Pekhimenko, and C. D. Sa, Eds. mlsys.org, 2024...
-
[62]
Learning to predict program execution by modeling dynamic dependency on code graphs,
C. C. Le, H. N. Phan, H. N. Phan, T. N. Nguyen, and N. D. Q. Bui, “Learning to predict program execution by modeling dynamic dependency on code graphs,” CoRR, vol. abs/2408.02816, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.02816
-
[63]
J. Shin, C. Tang, T. Mohati, M. Nayebi, S. Wang, and H. Hemmati, “Prompt engineering or fine tuning: An empirical assessment of large language models in automated software engineering tasks,” arXiv preprint arXiv:2310.10508, 2023
-
[64]
Better zero-shot reasoning with role-play prompting,
A. Kong, S. Zhao, H. Chen, Q. Li, Y. Qin, R. Sun, X. Zhou, E. Wang, and X. Dong, “Better zero-shot reasoning with role-play prompting,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024...
-
[65]
Rethinking the role-play prompting in mathematical reasoning tasks,
Z. Han and Z. Wang, “Rethinking the role-play prompting in mathematical reasoning tasks,” in Proceedings of the 1st Workshop on Efficiency, Security, and Generalization of Multimedia Foundation Models, 2024, pp. 13–17
-
[66]
LLM lies: Hallucinations are not bugs, but features as adversarial examples,
J. Yao, K. Ning, Z. Liu, M. Ning, and L. Yuan, “LLM lies: Hallucinations are not bugs, but features as adversarial examples,” CoRR, vol. abs/2310.01469, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.01469
-
[67]
Data preparation for deep learning based code smell detection: A systematic literature review,
F. Zhang, Z. Zhang, J. W. Keung, X. Tang, Z. Yang, X. Yu, and W. Hu, “Data preparation for deep learning based code smell detection: A systematic literature review,” J. Syst. Softw., vol. 216, p. 112131, 2024. [Online]. Available: https://doi.org/10.1016/j.jss.2024.112131
-
[69]
Code quality analysis: Exploring blank lines as indicators of increased code complexity,
R. Galiullin and Y. Bugayenko, “Code quality analysis: Exploring blank lines as indicators of increased code complexity,” Nov. 2024. [Online]. Available: https://doi.org/10.5281/zenodo.14132684
-
[70]
Quality analysis of source code comments,
D. Steidl, B. Hummel, and E. Juergens, “Quality analysis of source code comments,” in 2013 21st International Conference on Program Comprehension (ICPC), 2013, pp. 83–92
-
[71]
Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,
A. Bagheri and P. Hegedüs, “Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,” in 19th IEEE/ACM International Conference on Mining Software Repositories, MSR 2022, Pittsburgh, PA, USA, May 23-24, 2022. ACM, 2022, pp. 117–121. [Online]. Available: https://doi.org/10.1145/3524842.3528034
-
[72]
F. Pukelsheim, “The three sigma rule,” The American Statistician, vol. 48, no. 2, pp. 88–91, 1994
-
[73]
R. Singh and N. S. Mangat, Stratified Sampling. Dordrecht: Springer Netherlands, 1996, pp. 102–144. [Online]. Available: https://doi.org/10.1007/978-94-017-1404-4_5
-
[74]
The use of ranks to avoid the assumption of normality implicit in the analysis of variance,
M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance,” Journal of the American Statistical Association, vol. 32, no. 200, pp. 675–701, 1937
-
[75]
Individual comparisons by ranking methods,
F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968
-
[76]
Globug: Using global data in fault localization,
N. Miryeganeh, S. Hashtroudi, and H. Hemmati, “Globug: Using global data in fault localization,” J. Syst. Softw., vol. 177, p. 110961, 2021. [Online]. Available: https://doi.org/10.1016/j.jss.2021.110961
-
[78]
Historical spectrum based fault localization,
M. Wen, J. Chen, Y. Tian, R. Wu, D. Hao, S. Han, and S. Cheung, “Historical spectrum based fault localization,” IEEE Trans. Software Eng., vol. 47, no. 11, pp. 2348–2368, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2948158
-
[79]
IRBFL: an information retrieval based fault localization approach,
Z. Li, X. Bai, H. Wang, and Y. Liu, “IRBFL: an information retrieval based fault localization approach,” in 44th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2020, Madrid, Spain, July 13-17, 2020. IEEE, 2020, pp. 991–996. [Online]. Available: https://doi.org/10.1109/COMPSAC48688.2020.0-142
-
[80]
Binary Codes Capable of Correcting Deletions, Insertions and Reversals,
V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, p. 707, Feb. 1966