Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics
Pith reviewed 2026-05-14 21:57 UTC · model grok-4.3
The pith
Traditional machine learning models outperform fine-tuned transformers when localizing faults using only bug report text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that framing fault localization as text classification on bug reports allows traditional models with TF-IDF to exceed the performance of fine-tuned RoBERTa variants on industrial data, with data augmentation aiding the top classical model. This establishes that historical bug reports can be leveraged for effective, text-only AI support in fault finding, providing a scalable addition to industrial debugging without extra dependencies.
What carries the argument
A text classification system that learns to map bug report descriptions to fault locations in the code using term frequency-inverse document frequency features and supervised classifiers trained on historical reports paired with verified fixes.
If this is right
- Developers can use existing bug report archives to train fault prediction tools immediately.
- Data augmentation offers a practical way to enhance performance on smaller or imbalanced industrial datasets.
- Transformer-based models require careful evaluation before deployment in specialized domains rather than default use.
- The approach integrates directly into current maintenance processes at low cost.
Where Pith is reading between the lines
- If the text-only method works well, it could extend to predicting fault types or severity in addition to locations.
- Testing across multiple companies' bug report corpora would show how general the outperformance of classical models is.
- Over time, as more data accumulates, performance could improve, making periodic model updates valuable.
- Exploring the linguistic patterns in bug reports that enable accurate localization might inform better report writing guidelines.
Load-bearing premise
The wording in bug reports provides enough unique clues to identify the precise code changes that resolved each reported issue.
What would settle it
A test on previously unseen bug reports from the same environment where none of the models correctly identify the fault location for a majority of cases, or where accuracy falls below that of a naive keyword-matching baseline.
Figures
read the original abstract
Software quality assurance remains a major challenge in industrial environments, where large-scale and long-lived systems inevitably accumulate defects. Identifying the location of a fault is often time-consuming and costly, particularly during maintenance phases when developers must rely primarily on textual bug reports rather than complete runtime or code-level context. In this study, we investigated if artificial intelligence can support fault localization using only the natural-language content of bug reports. By relying only on textual information, our approach requires no access to source code, execution traces, or static analysis artifacts, making it directly deployable within existing industrial maintenance workflows. We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa). Our evaluation used proprietary data from ABB Robotics in Sweden, comprising five years of resolved industrial bug reports, each linked to its verified code fix. This setting allowed us to assess model effectiveness under realistic industrial constraints. Our results showed that traditional models using term frequency-inverse document features consistently outperformed the fine-tuned language models on this dataset, while data augmentation improved Random Forest performance. These findings challenge the assumption that transformer-based models universally outperform classical approaches in industrial contexts with domain-specific data. We demonstrated that historical bug reports can be systematically used for text-based, artificial intelligence-assisted fault localization, providing a scalable, low-cost, and empirically grounded complement to common debugging practices in industry.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that traditional machine learning models using TF-IDF features (Logistic Regression, SVM, and Random Forest) outperform fine-tuned transformer models (RoBERTa-Base and Distil-RoBERTa) for fault localization based solely on the natural language content of bug reports. Evaluated on five years of proprietary resolved bug reports from ABB Robotics, each linked to verified code fixes, the study frames the task as supervised text classification and finds that data augmentation further enhances Random Forest performance. This challenges the assumption that transformer-based models are superior in industrial contexts with domain-specific data and demonstrates the feasibility of text-only AI-assisted fault localization in maintenance workflows.
Significance. If the findings hold, the work is significant for its industrial relevance and empirical challenge to transformer dominance in software engineering tasks. It offers a practical, low-cost method deployable without source code or traces, using historical data for training. The benchmarking on real ABB data provides valuable lessons on when classical methods may be preferable, contributing to more realistic expectations for ML adoption in industry.
major comments (1)
- [Transformer Fine-Tuning Procedure] The paper describes the transformer models as 'fine-tuned' but provides no details on learning rate, epochs, batch size, weight decay, or hyperparameter tuning. This is a load-bearing issue for the central claim that these models underperformed the TF-IDF baselines, as inadequate optimization could explain the result on limited domain-specific data rather than a fundamental limitation of the approach.
minor comments (1)
- [Abstract] Specific quantitative results, such as accuracy or F1 scores, dataset statistics, and any statistical tests, are not mentioned in the abstract, which would help readers assess the magnitude of the performance differences.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The point raised about the transformer fine-tuning procedure is valid and directly relevant to the reproducibility and interpretation of our central claim. We address it below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Transformer Fine-Tuning Procedure] The paper describes the transformer models as 'fine-tuned' but provides no details on learning rate, epochs, batch size, weight decay, or hyperparameter tuning. This is a load-bearing issue for the central claim that these models underperformed the TF-IDF baselines, as inadequate optimization could explain the result on limited domain-specific data rather than a fundamental limitation of the approach.
Authors: We agree that the original manuscript omitted the necessary hyperparameters for the fine-tuning of RoBERTa-Base and Distil-RoBERTa, which limits the ability to fully assess whether the observed performance gap reflects a genuine limitation or suboptimal optimization. In the revised manuscript we will add a dedicated paragraph in the Experimental Setup section that reports the exact procedure: both models were fine-tuned using the AdamW optimizer with a learning rate of 2e-5 and linear decay, for 3 epochs, batch size 16, weight decay 0.01, and maximum sequence length 512. These values were selected after a modest grid search over learning rates {1e-5, 2e-5, 5e-5} and epoch counts {2, 3, 4} on a 10 % validation split of the training data; the final configuration was the one that maximized macro-F1 on the validation set. We will also state that no extensive hyperparameter optimization beyond this grid was performed, given the industrial data constraints. With these additions the comparison becomes fully transparent and readers can judge the adequacy of the optimization themselves. revision: yes
Circularity Check
No significant circularity in empirical benchmarking study
full rationale
The manuscript frames fault localization as a standard supervised text classification task and reports performance of classical ML models versus fine-tuned transformers on a held-out industrial dataset. No equations, derivations, or self-referential definitions appear; results are obtained by training on labeled historical bug reports and evaluating on separate test instances. This setup does not reduce any claimed prediction to its own inputs by construction. The evaluation follows conventional ML practice with no load-bearing self-citations or ansatz smuggling that would create circularity. Minor self-citation of prior work, if present, is not used to justify the central empirical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Bug report text contains sufficient information to classify fault locations via supervised learning
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa).
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Rui Abreu and Arjan J. C. van Gemund. 2009. A Low-Cost Approximate Minimal Hitting Set Algorithm and its Application to Model-Based Diagnosis. InEighth Symposium on Abstraction, Reformulation, and Approximation, SARA 2009, Lake Arrowhead, California, USA, 8-10 August 2009, Vadim Bulitko and J. Christopher Bug-Report–Driven Fault Localization: Industrial B...
work page 2009
-
[2]
Hafiza Anisa Ahmed, Narmeen Zakaria Bawany, and Jawwad Ahmed Shamsi
-
[3]
doi:10.1109/ACCESS.2021.3069248
CaPBug-A Framework for Automatic Bug Categorization and Prioritization Using NLP and Machine Learning Algorithms.IEEE Access9 (2021), 50496–50512. doi:10.1109/ACCESS.2021.3069248
-
[4]
Renato Andrade, César Teixeira, Nuno Laranjeiro, and Marco Vieira. [n. d.]. An Empirical Study on the Classification of Bug Reports with Machine Learning. arXiv. Accessed: April 27, 2025. [Online]. Available: https://arxiv.org/abs/2503. 00660
work page 2025
-
[5]
Ascari, Lucilia Yoshie Araki, Aurora T
Luciano C. Ascari, Lucilia Yoshie Araki, Aurora T. R. Pozo, and Silvia R. Vergilio
-
[6]
In10th Latin American Test Workshop, LATW 2009, Rio de Janeiro, Brazil, March 2-5, 2009
Exploring machine learning techniques for fault localization. In10th Latin American Test Workshop, LATW 2009, Rio de Janeiro, Brazil, March 2-5, 2009. IEEE, 1–6. doi:10.1109/LATW.2009.4813783
-
[7]
Briand, Yvan Labiche, and Xuetao Liu
Lionel C. Briand, Yvan Labiche, and Xuetao Liu. 2007. Using Machine Learning to Support Debugging with Tarantula. InISSRE 2007, The 18th IEEE International Symposium on Software Reliability, Trollhättan, Sweden, 5-9 November 2007. IEEE Computer Society, 137–146. doi:10.1109/ISSRE.2007.31
-
[8]
Ing-Xiang Chen, Cheng-Zen Yang, Ting-Kun Lu, and Hojun Jaygarl. 2008. Implicit Social Network Model for Predicting and Tracking the Location of Faults. In2008 32nd Annual IEEE International Computer Software and Applications Conference. 136–143. doi:10.1109/COMPSAC.2008.162
-
[9]
Holger Cleve and Andreas Zeller. 2005. Locating causes of program failures. In 27th International Conference on Software Engineering (ICSE 2005), 15-21 May 2005, St. Louis, Missouri, USA, Gruia-Catalin Roman, William G. Griswold, and Bashar Nuseibeh (Eds.). ACM, 342–351. doi:10.1145/1062455.1062522
-
[10]
Vidroha Debroy and W. Eric Wong. 2009. Insights on Fault Interference for Programs with Multiple Bugs. InISSRE 2009, 20th International Symposium on Software Reliability Engineering, Mysuru, Karnataka, India, 16-19 November 2009. IEEE Computer Society, 165–174. doi:10.1109/ISSRE.2009.14
-
[11]
Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. 1999. Model-Based Diagnosis of Hardware Designs.Artif. Intell.111, 1-2 (1999), 3–39. doi:10.1016/ S0004-3702(99)00034-X
work page 1999
-
[12]
Meng Gao, Pengyu Li, Congcong Chen, and Yunsong Jiang. 2018. Research on Software Multiple Fault Localization Method Based on Machine Learning. MATEC Web of Conferences232 (Nov. 2018). doi:10.1051/matecconf/201823201060
-
[13]
Reza Gharibi, Amir Hossein Rasekh, Mohammad Hadi Sadreddini, and Seyed Mostafa Fakhrahmad. 2018. Leveraging textual properties of bug reports to localize relevant source files.Information Processing and Management54, 6 (2018), 1058–1076. doi:10.1016/j.ipm.2018.07.004
-
[14]
Thomas Hirsch and Birgit Hofer. 2022. Using textual bug reports to predict the fault category of software bugs.Array15 (Sept. 2022). doi:10.1016/j.array.2022. 100189
-
[15]
Maliheh Izadi. 2022. CatIss: An Intelligent Tool for Categorizing Issues Re- ports using Transformers. In2022 IEEE/ACM 1st International Workshop on Natu- ral Language-Based Software Engineering (NLBSE). 44–47. doi:10.1145/3528588. 3528662
-
[16]
Sungmin Kang, Gabin An, and Shin Yoo. 2024. A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization.Proc. ACM Softw. Eng. 1, Article 64 (Jul. 2024), 23 pages. Issue FSE. doi:10.1145/3660771
-
[17]
Omer Koksal and Bedir Tekinerdogan. 2022. Automated Classification of Un- structured Bilingual Software Bug Reports: An Industrial Case Study Research. Applied Sciences12, 1 (2022). doi:10.3390/app12010338
-
[18]
Zhengliang Li, Zhiwei Jiang, Xiang Chen, Kaibo Cao, and Qing Gu. 2020. Laprob: A Label propagation-Based software bug localization method.Information and Software Technology130 (Oct. 2020). doi:10.1016/j.infsof.2020.106410
-
[19]
Zheng, Alex Aiken, and Michael I
Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan
-
[20]
Scalable statistical bug isolation. InProceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005, Vivek Sarkar and Mary W. Hall (Eds.). ACM, 15–26. doi:10.1145/1065010.1065014
-
[21]
Chao Liu, Long Fei, Xifeng Yan, Jiawei Han, and Samuel P. Midkiff. 2006. Statistical Debugging: A Hypothesis Testing-Based Approach.IEEE Trans. Software Eng.32, 10 (2006), 831–848. doi:10.1109/TSE.2006.105
-
[22]
Edward Ogheneovo. 2014. On the Relationship between Software Complexity and Maintenance Costs.Journal of Computer and Communications02 (Jan. 2014), 1–16. doi:10.4236/jcc.2014.214001
-
[23]
Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. InChi conference on human factors in computing systems extended abstracts. 1–7
work page 2022
-
[24]
Neil Walkinshaw and Leandro Minku. 2018. Are 20% of files responsible for 80% of defects?. InProceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement(Oulu, Finland)(ESEM ’18). Association for Computing Machinery, New York, NY, USA, Article 2, 10 pages. doi:10.1145/3239235.3239244
-
[25]
Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision.IEEE Transactions on Software Engineering50, 4 (2024), 911–936
work page 2024
-
[26]
Eric Wong, Vidroha Debroy, and Dianxiang Xu
W. Eric Wong, Vidroha Debroy, and Dianxiang Xu. 2012. Towards Better Fault Localization: A Crosstab-Based Statistical Approach.IEEE Trans. Syst. Man Cybern. Part C42, 3 (2012), 378–396. doi:10.1109/TSMCC.2011.2118751
-
[27]
Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa
W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A Survey on Software Fault Localization.IEEE Transactions on Software Engineering 42, 8 (Aug. 2016), 707–740. doi:10.1109/TSE.2016.2521368
-
[28]
W. Eric Wong and Yu Qi. 2009. Bp Neural Network-Based Effective Fault Lo- calization.Int. J. Softw. Eng. Knowl. Eng.19, 4 (2009), 573–597. doi:10.1142/ S021819400900426X
work page 2009
- [29]
-
[30]
Chuyang Xu, Zhongxin Liu, Xiaoxue Ren, Gehao Zhang, Ming Liang, and David Lo. 2025. FlexFL: Flexible and Effective Fault Localization With Open-Source Large Language Models.IEEE Transactions on Software Engineering51, 5 (2025), 1455–1471. doi:10.1109/TSE.2025.3553363
-
[31]
Aidan Z. H. Yang, Claire Le Goues, Ruben Martins, and Vincent Hellendoorn
-
[32]
Dataflow analysis-inspired deep learning for efficient vulnerability detection
Large Language Models for Test-Free Fault Localization. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 17, 12 pages. doi:10.1145/3597503.3623342
-
[33]
Haiyang Yang, Yulu Zhou, Tian Liang, and Li Kuang. 2025. ChatDL: An LLM- Based Defect Localization Approach for Software in IIoT Flexible Manufacturing. IEEE Internet of Things Journal12, 16 (2025), 32333–32343. doi:10.1109/JIOT.2025. 3531512
-
[34]
2009.Why Programs Fail - A Guide to Systematic Debug- ging, 2nd Edition
Andreas Zeller. 2009.Why Programs Fail - A Guide to Systematic Debug- ging, 2nd Edition. Academic Press. http://store.elsevier.com/product.jsp?isbn= 9780123745156&pagename=search
work page 2009
-
[35]
Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure- Inducing Input.IEEE Trans. Software Eng.28, 2 (2002), 183–200. doi:10.1109/32. 988498
work page doi:10.1109/32 2002
- [36]
-
[37]
Hu Zhengdong, Polpini Jantima, and Gamgarn Somprasertsri. 2025. A Compar- ative Study of Multilabel Classification Techniques for Analyzing Bug Report Dependencies.Journal of Information Systems Engineering and Management10 (2025). doi:10.52783/jisem.v10i26s.4253
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.