Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics

Alessio Bucaioni; Anton Ununger; Pernilla Hall; Riccardo Rubei

arxiv: 2604.25700 · v2 · submitted 2026-04-28 · 💻 cs.SE · cs.LG

Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics

Pernilla Hall , Anton Ununger , Riccardo Rubei , Alessio Bucaioni This is my paper

Pith reviewed 2026-05-14 21:57 UTC · model grok-4.3

classification 💻 cs.SE cs.LG

keywords fault localizationbug reportstext classificationmachine learningsoftware maintenancelanguage modelsdata augmentation

0 comments

The pith

Traditional machine learning models outperform fine-tuned transformers when localizing faults using only bug report text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether natural language bug reports contain enough information to locate software faults without any code or runtime data. Researchers applied both classical machine learning classifiers and fine-tuned language models to five years of real bug reports from an industrial robotics setting, each tied to its confirmed fix. Classical models using term frequency features performed better overall, and adding data augmentation boosted the Random Forest results. Such a method could help maintenance teams in complex systems diagnose issues faster by leveraging the text they already have.

Core claim

The central discovery is that framing fault localization as text classification on bug reports allows traditional models with TF-IDF to exceed the performance of fine-tuned RoBERTa variants on industrial data, with data augmentation aiding the top classical model. This establishes that historical bug reports can be leveraged for effective, text-only AI support in fault finding, providing a scalable addition to industrial debugging without extra dependencies.

What carries the argument

A text classification system that learns to map bug report descriptions to fault locations in the code using term frequency-inverse document frequency features and supervised classifiers trained on historical reports paired with verified fixes.

If this is right

Developers can use existing bug report archives to train fault prediction tools immediately.
Data augmentation offers a practical way to enhance performance on smaller or imbalanced industrial datasets.
Transformer-based models require careful evaluation before deployment in specialized domains rather than default use.
The approach integrates directly into current maintenance processes at low cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the text-only method works well, it could extend to predicting fault types or severity in addition to locations.
Testing across multiple companies' bug report corpora would show how general the outperformance of classical models is.
Over time, as more data accumulates, performance could improve, making periodic model updates valuable.
Exploring the linguistic patterns in bug reports that enable accurate localization might inform better report writing guidelines.

Load-bearing premise

The wording in bug reports provides enough unique clues to identify the precise code changes that resolved each reported issue.

What would settle it

A test on previously unseen bug reports from the same environment where none of the models correctly identify the fault location for a majority of cases, or where accuracy falls below that of a naive keyword-matching baseline.

Figures

Figures reproduced from arXiv: 2604.25700 by Alessio Bucaioni, Anton Ununger, Pernilla Hall, Riccardo Rubei.

**Figure 1.** Figure 1: Visualization of label imbalance in the dataset. view at source ↗

**Figure 2.** Figure 2: MAP comparison: TF–IDF vs. sentence embeddings view at source ↗

read the original abstract

Software quality assurance remains a major challenge in industrial environments, where large-scale and long-lived systems inevitably accumulate defects. Identifying the location of a fault is often time-consuming and costly, particularly during maintenance phases when developers must rely primarily on textual bug reports rather than complete runtime or code-level context. In this study, we investigated if artificial intelligence can support fault localization using only the natural-language content of bug reports. By relying only on textual information, our approach requires no access to source code, execution traces, or static analysis artifacts, making it directly deployable within existing industrial maintenance workflows. We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa). Our evaluation used proprietary data from ABB Robotics in Sweden, comprising five years of resolved industrial bug reports, each linked to its verified code fix. This setting allowed us to assess model effectiveness under realistic industrial constraints. Our results showed that traditional models using term frequency-inverse document features consistently outperformed the fine-tuned language models on this dataset, while data augmentation improved Random Forest performance. These findings challenge the assumption that transformer-based models universally outperform classical approaches in industrial contexts with domain-specific data. We demonstrated that historical bug reports can be systematically used for text-based, artificial intelligence-assisted fault localization, providing a scalable, low-cost, and empirically grounded complement to common debugging practices in industry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

On ABB's five years of bug reports, TF-IDF models beat fine-tuned RoBERTa variants for text-only fault localization, but the transformer training setup lacks the details needed to trust the gap.

read the letter

The central result is that logistic regression, SVM, and random forest with TF-IDF features beat fine-tuned RoBERTa-Base and Distil-RoBERTa on five years of resolved ABB Robotics bug reports, with data augmentation helping the random forest. The paper treats fault localization as supervised text classification using only the bug report text, no code or traces required. That setup matches real maintenance work where developers often start with just the report description and a link to the eventual fix. The finding that classical models win here is a useful counter-example to the default assumption that transformers will dominate once fine-tuned. The industrial data and the direct comparison give the claim some grounding that synthetic or open-source benchmarks often lack. The authors are right that this approach is low-cost and deployable inside existing workflows. The weakest part is the transformer side. The manuscript calls the models “fine-tuned” but gives no learning-rate schedule, epoch count, batch size, or hyper-parameter search procedure. On domain-specific industrial text that may be short or jargon-heavy, those choices matter a lot; an under-tuned run can easily lose to a well-tuned bag-of-words baseline. The abstract also omits concrete accuracy numbers, confidence intervals, or statistical tests, so the size of the reported gap is hard to judge from the summary alone. Proprietary data further limits external checks. This work is for people who build or evaluate fault-localization tools in large, long-lived codebases. A reader already working on text-based prediction in software engineering would get value from the setup and the caution about model choice. It is worth sending to referees, but only with a request for the missing training details and the actual performance figures so the comparison can be assessed properly.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that traditional machine learning models using TF-IDF features (Logistic Regression, SVM, and Random Forest) outperform fine-tuned transformer models (RoBERTa-Base and Distil-RoBERTa) for fault localization based solely on the natural language content of bug reports. Evaluated on five years of proprietary resolved bug reports from ABB Robotics, each linked to verified code fixes, the study frames the task as supervised text classification and finds that data augmentation further enhances Random Forest performance. This challenges the assumption that transformer-based models are superior in industrial contexts with domain-specific data and demonstrates the feasibility of text-only AI-assisted fault localization in maintenance workflows.

Significance. If the findings hold, the work is significant for its industrial relevance and empirical challenge to transformer dominance in software engineering tasks. It offers a practical, low-cost method deployable without source code or traces, using historical data for training. The benchmarking on real ABB data provides valuable lessons on when classical methods may be preferable, contributing to more realistic expectations for ML adoption in industry.

major comments (1)

[Transformer Fine-Tuning Procedure] The paper describes the transformer models as 'fine-tuned' but provides no details on learning rate, epochs, batch size, weight decay, or hyperparameter tuning. This is a load-bearing issue for the central claim that these models underperformed the TF-IDF baselines, as inadequate optimization could explain the result on limited domain-specific data rather than a fundamental limitation of the approach.

minor comments (1)

[Abstract] Specific quantitative results, such as accuracy or F1 scores, dataset statistics, and any statistical tests, are not mentioned in the abstract, which would help readers assess the magnitude of the performance differences.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The point raised about the transformer fine-tuning procedure is valid and directly relevant to the reproducibility and interpretation of our central claim. We address it below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Transformer Fine-Tuning Procedure] The paper describes the transformer models as 'fine-tuned' but provides no details on learning rate, epochs, batch size, weight decay, or hyperparameter tuning. This is a load-bearing issue for the central claim that these models underperformed the TF-IDF baselines, as inadequate optimization could explain the result on limited domain-specific data rather than a fundamental limitation of the approach.

Authors: We agree that the original manuscript omitted the necessary hyperparameters for the fine-tuning of RoBERTa-Base and Distil-RoBERTa, which limits the ability to fully assess whether the observed performance gap reflects a genuine limitation or suboptimal optimization. In the revised manuscript we will add a dedicated paragraph in the Experimental Setup section that reports the exact procedure: both models were fine-tuned using the AdamW optimizer with a learning rate of 2e-5 and linear decay, for 3 epochs, batch size 16, weight decay 0.01, and maximum sequence length 512. These values were selected after a modest grid search over learning rates {1e-5, 2e-5, 5e-5} and epoch counts {2, 3, 4} on a 10 % validation split of the training data; the final configuration was the one that maximized macro-F1 on the validation set. We will also state that no extensive hyperparameter optimization beyond this grid was performed, given the industrial data constraints. With these additions the comparison becomes fully transparent and readers can judge the adequacy of the optimization themselves. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

The manuscript frames fault localization as a standard supervised text classification task and reports performance of classical ML models versus fine-tuned transformers on a held-out industrial dataset. No equations, derivations, or self-referential definitions appear; results are obtained by training on labeled historical bug reports and evaluating on separate test instances. This setup does not reduce any claimed prediction to its own inputs by construction. The evaluation follows conventional ML practice with no load-bearing self-citations or ansatz smuggling that would create circularity. Minor self-citation of prior work, if present, is not used to justify the central empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions and the domain assumption that bug report text alone is predictive of fault location; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption Bug report text contains sufficient information to classify fault locations via supervised learning
Invoked when framing fault localization as a text classification problem using only natural language content.

pith-pipeline@v0.9.0 · 5580 in / 1229 out tokens · 52345 ms · 2026-05-14T21:57:31.155050+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

[1]

Rui Abreu and Arjan J. C. van Gemund. 2009. A Low-Cost Approximate Minimal Hitting Set Algorithm and its Application to Model-Based Diagnosis. InEighth Symposium on Abstraction, Reformulation, and Approximation, SARA 2009, Lake Arrowhead, California, USA, 8-10 August 2009, Vadim Bulitko and J. Christopher Bug-Report–Driven Fault Localization: Industrial B...

work page 2009
[2]

Hafiza Anisa Ahmed, Narmeen Zakaria Bawany, and Jawwad Ahmed Shamsi

work page
[3]

doi:10.1109/ACCESS.2021.3069248

CaPBug-A Framework for Automatic Bug Categorization and Prioritization Using NLP and Machine Learning Algorithms.IEEE Access9 (2021), 50496–50512. doi:10.1109/ACCESS.2021.3069248

work page doi:10.1109/access.2021.3069248 2021
[4]

Renato Andrade, César Teixeira, Nuno Laranjeiro, and Marco Vieira. [n. d.]. An Empirical Study on the Classification of Bug Reports with Machine Learning. arXiv. Accessed: April 27, 2025. [Online]. Available: https://arxiv.org/abs/2503. 00660

work page 2025
[5]

Ascari, Lucilia Yoshie Araki, Aurora T

Luciano C. Ascari, Lucilia Yoshie Araki, Aurora T. R. Pozo, and Silvia R. Vergilio

work page
[6]

In10th Latin American Test Workshop, LATW 2009, Rio de Janeiro, Brazil, March 2-5, 2009

Exploring machine learning techniques for fault localization. In10th Latin American Test Workshop, LATW 2009, Rio de Janeiro, Brazil, March 2-5, 2009. IEEE, 1–6. doi:10.1109/LATW.2009.4813783

work page doi:10.1109/latw.2009.4813783 2009
[7]

Briand, Yvan Labiche, and Xuetao Liu

Lionel C. Briand, Yvan Labiche, and Xuetao Liu. 2007. Using Machine Learning to Support Debugging with Tarantula. InISSRE 2007, The 18th IEEE International Symposium on Software Reliability, Trollhättan, Sweden, 5-9 November 2007. IEEE Computer Society, 137–146. doi:10.1109/ISSRE.2007.31

work page doi:10.1109/issre.2007.31 2007
[8]

Ing-Xiang Chen, Cheng-Zen Yang, Ting-Kun Lu, and Hojun Jaygarl. 2008. Implicit Social Network Model for Predicting and Tracking the Location of Faults. In2008 32nd Annual IEEE International Computer Software and Applications Conference. 136–143. doi:10.1109/COMPSAC.2008.162

work page doi:10.1109/compsac.2008.162 2008
[9]

Holger Cleve and Andreas Zeller. 2005. Locating causes of program failures. In 27th International Conference on Software Engineering (ICSE 2005), 15-21 May 2005, St. Louis, Missouri, USA, Gruia-Catalin Roman, William G. Griswold, and Bashar Nuseibeh (Eds.). ACM, 342–351. doi:10.1145/1062455.1062522

work page doi:10.1145/1062455.1062522 2005
[10]

Eric Wong

Vidroha Debroy and W. Eric Wong. 2009. Insights on Fault Interference for Programs with Multiple Bugs. InISSRE 2009, 20th International Symposium on Software Reliability Engineering, Mysuru, Karnataka, India, 16-19 November 2009. IEEE Computer Society, 165–174. doi:10.1109/ISSRE.2009.14

work page doi:10.1109/issre.2009.14 2009
[11]

Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. 1999. Model-Based Diagnosis of Hardware Designs.Artif. Intell.111, 1-2 (1999), 3–39. doi:10.1016/ S0004-3702(99)00034-X

work page 1999
[12]

Meng Gao, Pengyu Li, Congcong Chen, and Yunsong Jiang. 2018. Research on Software Multiple Fault Localization Method Based on Machine Learning. MATEC Web of Conferences232 (Nov. 2018). doi:10.1051/matecconf/201823201060

work page doi:10.1051/matecconf/201823201060 2018
[13]

Reza Gharibi, Amir Hossein Rasekh, Mohammad Hadi Sadreddini, and Seyed Mostafa Fakhrahmad. 2018. Leveraging textual properties of bug reports to localize relevant source files.Information Processing and Management54, 6 (2018), 1058–1076. doi:10.1016/j.ipm.2018.07.004

work page doi:10.1016/j.ipm.2018.07.004 2018
[14]

Thomas Hirsch and Birgit Hofer. 2022. Using textual bug reports to predict the fault category of software bugs.Array15 (Sept. 2022). doi:10.1016/j.array.2022. 100189

work page doi:10.1016/j.array.2022 2022
[15]

Maliheh Izadi. 2022. CatIss: An Intelligent Tool for Categorizing Issues Re- ports using Transformers. In2022 IEEE/ACM 1st International Workshop on Natu- ral Language-Based Software Engineering (NLBSE). 44–47. doi:10.1145/3528588. 3528662

work page doi:10.1145/3528588 2022
[16]

Sungmin Kang, Gabin An, and Shin Yoo. 2024. A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization.Proc. ACM Softw. Eng. 1, Article 64 (Jul. 2024), 23 pages. Issue FSE. doi:10.1145/3660771

work page doi:10.1145/3660771 2024
[17]

Omer Koksal and Bedir Tekinerdogan. 2022. Automated Classification of Un- structured Bilingual Software Bug Reports: An Industrial Case Study Research. Applied Sciences12, 1 (2022). doi:10.3390/app12010338

work page doi:10.3390/app12010338 2022
[18]

Zhengliang Li, Zhiwei Jiang, Xiang Chen, Kaibo Cao, and Qing Gu. 2020. Laprob: A Label propagation-Based software bug localization method.Information and Software Technology130 (Oct. 2020). doi:10.1016/j.infsof.2020.106410

work page doi:10.1016/j.infsof.2020.106410 2020
[19]

Zheng, Alex Aiken, and Michael I

Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan

work page
[20]

InProceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, Vivek Sarkar and Mary W

Scalable statistical bug isolation. InProceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, Vivek Sarkar and Mary W. Hall (Eds.). ACM, 15–26. doi:10.1145/1065010.1065014

work page doi:10.1145/1065010.1065014 2005
[21]

Chao Liu, Long Fei, Xifeng Yan, Jiawei Han, and Samuel P. Midkiff. 2006. Statistical Debugging: A Hypothesis Testing-Based Approach.IEEE Trans. Software Eng.32, 10 (2006), 831–848. doi:10.1109/TSE.2006.105

work page doi:10.1109/tse.2006.105 2006
[22]

Edward Ogheneovo. 2014. On the Relationship between Software Complexity and Maintenance Costs.Journal of Computer and Communications02 (Jan. 2014), 1–16. doi:10.4236/jcc.2014.214001

work page doi:10.4236/jcc.2014.214001 2014
[23]

Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. InChi conference on human factors in computing systems extended abstracts. 1–7

work page 2022
[24]

Neil Walkinshaw and Leandro Minku. 2018. Are 20% of files responsible for 80% of defects?. InProceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement(Oulu, Finland)(ESEM ’18). Association for Computing Machinery, New York, NY, USA, Article 2, 10 pages. doi:10.1145/3239235.3239244

work page doi:10.1145/3239235.3239244 2018
[25]

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision.IEEE Transactions on Software Engineering50, 4 (2024), 911–936

work page 2024
[26]

Eric Wong, Vidroha Debroy, and Dianxiang Xu

W. Eric Wong, Vidroha Debroy, and Dianxiang Xu. 2012. Towards Better Fault Localization: A Crosstab-Based Statistical Approach.IEEE Trans. Syst. Man Cybern. Part C42, 3 (2012), 378–396. doi:10.1109/TSMCC.2011.2118751

work page doi:10.1109/tsmcc.2011.2118751 2012
[27]

Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa

W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A Survey on Software Fault Localization.IEEE Transactions on Software Engineering 42, 8 (Aug. 2016), 707–740. doi:10.1109/TSE.2016.2521368

work page doi:10.1109/tse.2016.2521368 2016
[28]

Eric Wong and Yu Qi

W. Eric Wong and Yu Qi. 2009. Bp Neural Network-Based Effective Fault Lo- calization.Int. J. Softw. Eng. Knowl. Eng.19, 4 (2009), 573–597. doi:10.1142/ S021819400900426X

work page 2009
[29]

Chunqiu Steven Xia and Lingming Zhang. 2023. Conversational automated program repair.arXiv preprint arXiv:2301.13246(2023)

work page arXiv 2023
[30]

Chuyang Xu, Zhongxin Liu, Xiaoxue Ren, Gehao Zhang, Ming Liang, and David Lo. 2025. FlexFL: Flexible and Effective Fault Localization With Open-Source Large Language Models.IEEE Transactions on Software Engineering51, 5 (2025), 1455–1471. doi:10.1109/TSE.2025.3553363

work page doi:10.1109/tse.2025.3553363 2025
[31]

Aidan Z. H. Yang, Claire Le Goues, Ruben Martins, and Vincent Hellendoorn

work page
[32]

Dataflow analysis-inspired deep learning for efficient vulnerability detection

Large Language Models for Test-Free Fault Localization. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 17, 12 pages. doi:10.1145/3597503.3623342

work page doi:10.1145/3597503.3623342
[33]

Haiyang Yang, Yulu Zhou, Tian Liang, and Li Kuang. 2025. ChatDL: An LLM- Based Defect Localization Approach for Software in IIoT Flexible Manufacturing. IEEE Internet of Things Journal12, 16 (2025), 32333–32343. doi:10.1109/JIOT.2025. 3531512

work page doi:10.1109/jiot.2025 2025
[34]

2009.Why Programs Fail - A Guide to Systematic Debug- ging, 2nd Edition

Andreas Zeller. 2009.Why Programs Fail - A Guide to Systematic Debug- ging, 2nd Edition. Academic Press. http://store.elsevier.com/product.jsp?isbn= 9780123745156&pagename=search

work page 2009
[35]

Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure- Inducing Input.IEEE Trans. Software Eng.28, 2 (2002), 183–200. doi:10.1109/32. 988498

work page doi:10.1109/32 2002
[36]

Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2022. Repairing bugs in python assignments using large language models.arXiv preprint arXiv:2209.14876(2022)

work page arXiv 2022
[37]

Hu Zhengdong, Polpini Jantima, and Gamgarn Somprasertsri. 2025. A Compar- ative Study of Multilabel Classification Techniques for Analyzing Bug Report Dependencies.Journal of Information Systems Engineering and Management10 (2025). doi:10.52783/jisem.v10i26s.4253

work page doi:10.52783/jisem.v10i26s.4253 2025

[1] [1]

Rui Abreu and Arjan J. C. van Gemund. 2009. A Low-Cost Approximate Minimal Hitting Set Algorithm and its Application to Model-Based Diagnosis. InEighth Symposium on Abstraction, Reformulation, and Approximation, SARA 2009, Lake Arrowhead, California, USA, 8-10 August 2009, Vadim Bulitko and J. Christopher Bug-Report–Driven Fault Localization: Industrial B...

work page 2009

[2] [2]

Hafiza Anisa Ahmed, Narmeen Zakaria Bawany, and Jawwad Ahmed Shamsi

work page

[3] [3]

doi:10.1109/ACCESS.2021.3069248

CaPBug-A Framework for Automatic Bug Categorization and Prioritization Using NLP and Machine Learning Algorithms.IEEE Access9 (2021), 50496–50512. doi:10.1109/ACCESS.2021.3069248

work page doi:10.1109/access.2021.3069248 2021

[4] [4]

Renato Andrade, César Teixeira, Nuno Laranjeiro, and Marco Vieira. [n. d.]. An Empirical Study on the Classification of Bug Reports with Machine Learning. arXiv. Accessed: April 27, 2025. [Online]. Available: https://arxiv.org/abs/2503. 00660

work page 2025

[5] [5]

Ascari, Lucilia Yoshie Araki, Aurora T

Luciano C. Ascari, Lucilia Yoshie Araki, Aurora T. R. Pozo, and Silvia R. Vergilio

work page

[6] [6]

In10th Latin American Test Workshop, LATW 2009, Rio de Janeiro, Brazil, March 2-5, 2009

Exploring machine learning techniques for fault localization. In10th Latin American Test Workshop, LATW 2009, Rio de Janeiro, Brazil, March 2-5, 2009. IEEE, 1–6. doi:10.1109/LATW.2009.4813783

work page doi:10.1109/latw.2009.4813783 2009

[7] [7]

Briand, Yvan Labiche, and Xuetao Liu

Lionel C. Briand, Yvan Labiche, and Xuetao Liu. 2007. Using Machine Learning to Support Debugging with Tarantula. InISSRE 2007, The 18th IEEE International Symposium on Software Reliability, Trollhättan, Sweden, 5-9 November 2007. IEEE Computer Society, 137–146. doi:10.1109/ISSRE.2007.31

work page doi:10.1109/issre.2007.31 2007

[8] [8]

Ing-Xiang Chen, Cheng-Zen Yang, Ting-Kun Lu, and Hojun Jaygarl. 2008. Implicit Social Network Model for Predicting and Tracking the Location of Faults. In2008 32nd Annual IEEE International Computer Software and Applications Conference. 136–143. doi:10.1109/COMPSAC.2008.162

work page doi:10.1109/compsac.2008.162 2008

[9] [9]

Holger Cleve and Andreas Zeller. 2005. Locating causes of program failures. In 27th International Conference on Software Engineering (ICSE 2005), 15-21 May 2005, St. Louis, Missouri, USA, Gruia-Catalin Roman, William G. Griswold, and Bashar Nuseibeh (Eds.). ACM, 342–351. doi:10.1145/1062455.1062522

work page doi:10.1145/1062455.1062522 2005

[10] [10]

Eric Wong

Vidroha Debroy and W. Eric Wong. 2009. Insights on Fault Interference for Programs with Multiple Bugs. InISSRE 2009, 20th International Symposium on Software Reliability Engineering, Mysuru, Karnataka, India, 16-19 November 2009. IEEE Computer Society, 165–174. doi:10.1109/ISSRE.2009.14

work page doi:10.1109/issre.2009.14 2009

[11] [11]

Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. 1999. Model-Based Diagnosis of Hardware Designs.Artif. Intell.111, 1-2 (1999), 3–39. doi:10.1016/ S0004-3702(99)00034-X

work page 1999

[12] [12]

Meng Gao, Pengyu Li, Congcong Chen, and Yunsong Jiang. 2018. Research on Software Multiple Fault Localization Method Based on Machine Learning. MATEC Web of Conferences232 (Nov. 2018). doi:10.1051/matecconf/201823201060

work page doi:10.1051/matecconf/201823201060 2018

[13] [13]

Reza Gharibi, Amir Hossein Rasekh, Mohammad Hadi Sadreddini, and Seyed Mostafa Fakhrahmad. 2018. Leveraging textual properties of bug reports to localize relevant source files.Information Processing and Management54, 6 (2018), 1058–1076. doi:10.1016/j.ipm.2018.07.004

work page doi:10.1016/j.ipm.2018.07.004 2018

[14] [14]

Thomas Hirsch and Birgit Hofer. 2022. Using textual bug reports to predict the fault category of software bugs.Array15 (Sept. 2022). doi:10.1016/j.array.2022. 100189

work page doi:10.1016/j.array.2022 2022

[15] [15]

Maliheh Izadi. 2022. CatIss: An Intelligent Tool for Categorizing Issues Re- ports using Transformers. In2022 IEEE/ACM 1st International Workshop on Natu- ral Language-Based Software Engineering (NLBSE). 44–47. doi:10.1145/3528588. 3528662

work page doi:10.1145/3528588 2022

[16] [16]

Sungmin Kang, Gabin An, and Shin Yoo. 2024. A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization.Proc. ACM Softw. Eng. 1, Article 64 (Jul. 2024), 23 pages. Issue FSE. doi:10.1145/3660771

work page doi:10.1145/3660771 2024

[17] [17]

Omer Koksal and Bedir Tekinerdogan. 2022. Automated Classification of Un- structured Bilingual Software Bug Reports: An Industrial Case Study Research. Applied Sciences12, 1 (2022). doi:10.3390/app12010338

work page doi:10.3390/app12010338 2022

[18] [18]

Zhengliang Li, Zhiwei Jiang, Xiang Chen, Kaibo Cao, and Qing Gu. 2020. Laprob: A Label propagation-Based software bug localization method.Information and Software Technology130 (Oct. 2020). doi:10.1016/j.infsof.2020.106410

work page doi:10.1016/j.infsof.2020.106410 2020

[19] [19]

Zheng, Alex Aiken, and Michael I

Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan

work page

[20] [20]

InProceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, Vivek Sarkar and Mary W

Scalable statistical bug isolation. InProceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, Vivek Sarkar and Mary W. Hall (Eds.). ACM, 15–26. doi:10.1145/1065010.1065014

work page doi:10.1145/1065010.1065014 2005

[21] [21]

Chao Liu, Long Fei, Xifeng Yan, Jiawei Han, and Samuel P. Midkiff. 2006. Statistical Debugging: A Hypothesis Testing-Based Approach.IEEE Trans. Software Eng.32, 10 (2006), 831–848. doi:10.1109/TSE.2006.105

work page doi:10.1109/tse.2006.105 2006

[22] [22]

Edward Ogheneovo. 2014. On the Relationship between Software Complexity and Maintenance Costs.Journal of Computer and Communications02 (Jan. 2014), 1–16. doi:10.4236/jcc.2014.214001

work page doi:10.4236/jcc.2014.214001 2014

[23] [23]

Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. InChi conference on human factors in computing systems extended abstracts. 1–7

work page 2022

[24] [24]

Neil Walkinshaw and Leandro Minku. 2018. Are 20% of files responsible for 80% of defects?. InProceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement(Oulu, Finland)(ESEM ’18). Association for Computing Machinery, New York, NY, USA, Article 2, 10 pages. doi:10.1145/3239235.3239244

work page doi:10.1145/3239235.3239244 2018

[25] [25]

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision.IEEE Transactions on Software Engineering50, 4 (2024), 911–936

work page 2024

[26] [26]

Eric Wong, Vidroha Debroy, and Dianxiang Xu

W. Eric Wong, Vidroha Debroy, and Dianxiang Xu. 2012. Towards Better Fault Localization: A Crosstab-Based Statistical Approach.IEEE Trans. Syst. Man Cybern. Part C42, 3 (2012), 378–396. doi:10.1109/TSMCC.2011.2118751

work page doi:10.1109/tsmcc.2011.2118751 2012

[27] [27]

Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa

W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A Survey on Software Fault Localization.IEEE Transactions on Software Engineering 42, 8 (Aug. 2016), 707–740. doi:10.1109/TSE.2016.2521368

work page doi:10.1109/tse.2016.2521368 2016

[28] [28]

Eric Wong and Yu Qi

W. Eric Wong and Yu Qi. 2009. Bp Neural Network-Based Effective Fault Lo- calization.Int. J. Softw. Eng. Knowl. Eng.19, 4 (2009), 573–597. doi:10.1142/ S021819400900426X

work page 2009

[29] [29]

Chunqiu Steven Xia and Lingming Zhang. 2023. Conversational automated program repair.arXiv preprint arXiv:2301.13246(2023)

work page arXiv 2023

[30] [30]

Chuyang Xu, Zhongxin Liu, Xiaoxue Ren, Gehao Zhang, Ming Liang, and David Lo. 2025. FlexFL: Flexible and Effective Fault Localization With Open-Source Large Language Models.IEEE Transactions on Software Engineering51, 5 (2025), 1455–1471. doi:10.1109/TSE.2025.3553363

work page doi:10.1109/tse.2025.3553363 2025

[31] [31]

Aidan Z. H. Yang, Claire Le Goues, Ruben Martins, and Vincent Hellendoorn

work page

[32] [32]

Dataflow analysis-inspired deep learning for efficient vulnerability detection

Large Language Models for Test-Free Fault Localization. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 17, 12 pages. doi:10.1145/3597503.3623342

work page doi:10.1145/3597503.3623342

[33] [33]

Haiyang Yang, Yulu Zhou, Tian Liang, and Li Kuang. 2025. ChatDL: An LLM- Based Defect Localization Approach for Software in IIoT Flexible Manufacturing. IEEE Internet of Things Journal12, 16 (2025), 32333–32343. doi:10.1109/JIOT.2025. 3531512

work page doi:10.1109/jiot.2025 2025

[34] [34]

2009.Why Programs Fail - A Guide to Systematic Debug- ging, 2nd Edition

Andreas Zeller. 2009.Why Programs Fail - A Guide to Systematic Debug- ging, 2nd Edition. Academic Press. http://store.elsevier.com/product.jsp?isbn= 9780123745156&pagename=search

work page 2009

[35] [35]

Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure- Inducing Input.IEEE Trans. Software Eng.28, 2 (2002), 183–200. doi:10.1109/32. 988498

work page doi:10.1109/32 2002

[36] [36]

Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2022. Repairing bugs in python assignments using large language models.arXiv preprint arXiv:2209.14876(2022)

work page arXiv 2022

[37] [37]

Hu Zhengdong, Polpini Jantima, and Gamgarn Somprasertsri. 2025. A Compar- ative Study of Multilabel Classification Techniques for Analyzing Bug Report Dependencies.Journal of Information Systems Engineering and Management10 (2025). doi:10.52783/jisem.v10i26s.4253

work page doi:10.52783/jisem.v10i26s.4253 2025