arxiv: 2604.25700 · v2 · submitted 2026-04-28 · 💻 cs.SE · cs.LG

Bug-Report-Driven Fault Localization: Industrial Benchmarking and Lesson Learned at ABB Robotics

Pith reviewed 2026-05-14 21:57 UTC · model grok-4.3

classification: 💻 cs.SE · cs.LG
keywords: fault localization, bug reports, text classification, machine learning, software maintenance, language models, data augmentation

The pith

Traditional machine learning models outperform fine-tuned transformers when localizing faults using only bug report text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether natural language bug reports contain enough information to locate software faults without any code or runtime data. Researchers applied both classical machine learning classifiers and fine-tuned language models to five years of real bug reports from an industrial robotics setting, each tied to its confirmed fix. Classical models using TF-IDF (term frequency-inverse document frequency) features performed better overall, and data augmentation improved the Random Forest results. Such a method could help maintenance teams working on complex systems diagnose issues faster using text they already have.

Core claim

The central discovery is that framing fault localization as text classification on bug reports allows traditional models with TF-IDF to exceed the performance of fine-tuned RoBERTa variants on industrial data, with data augmentation aiding the top classical model. This establishes that historical bug reports can be leveraged for effective, text-only AI support in fault finding, providing a scalable addition to industrial debugging without extra dependencies.

What carries the argument

A text classification system that learns to map bug report descriptions to fault locations in the code using term frequency-inverse document frequency features and supervised classifiers trained on historical reports paired with verified fixes.
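The mapping described above can be sketched with the standard library alone. This is a minimal illustration, not the authors' pipeline: the paper uses Logistic Regression, SVM, and Random Forest, whereas this sketch swaps in a simple nearest-centroid classifier to stay dependency-free. The toy reports, the label names (`motion_control`, `auth_service`), and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def fit_idf(documents):
    """Smoothed inverse document frequency for every term in the training set."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train_centroids(reports, labels, idf):
    """Average the TF-IDF vectors of each label (nearest-centroid stand-in)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for text, label in zip(reports, labels):
        for term, weight in tfidf_vector(text, idf).items():
            sums[label][term] += weight
    return {lab: {t: w / counts[lab] for t, w in vec.items()} for lab, vec in sums.items()}


def rank_locations(text, idf, centroids):
    """Return candidate fault locations sorted by dot-product similarity."""
    vec = tfidf_vector(text, idf)
    scores = {
        lab: sum(w * centroid.get(t, 0.0) for t, w in vec.items())
        for lab, centroid in centroids.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: report text paired with the component that was fixed.
reports = [
    "robot arm stops when gripper closes",
    "gripper fails to release part",
    "login page rejects valid password",
    "password reset email never arrives",
]
labels = ["motion_control", "motion_control", "auth_service", "auth_service"]

idf = fit_idf(reports)
centroids = train_centroids(reports, labels, idf)
ranking = rank_locations("gripper does not close on the part", idf, centroids)
print(ranking)
```

Returning a ranked list rather than a single label is a design choice made here so the output can be scored with ranking metrics; a single prediction is just the first element.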

If this is right

  • Developers can use existing bug report archives to train fault prediction tools immediately.
  • Data augmentation offers a practical way to enhance performance on smaller or imbalanced industrial datasets.
  • Transformer-based models require careful evaluation before deployment in specialized domains rather than default use.
  • The approach integrates directly into current maintenance processes at low cost.
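To make the data augmentation point concrete, here is a minimal sketch of one common way to rebalance a small text dataset: adding noisy copies of minority-class reports. This page does not say which augmentation technique the paper uses, so random word deletion is an assumption chosen for illustration, and the toy reports, labels, and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def drop_words(text, rng, drop_prob=0.1):
    """Create a noisy copy of a report by randomly deleting words."""
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    return " ".join(kept) if kept else text


def augment_to_balance(reports, labels, seed=0):
    """Add noisy copies of minority-class reports until all classes match the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(reports, labels):
        by_label[label].append(text)
    target = max(len(texts) for texts in by_label.values())

    out_reports, out_labels = list(reports), list(labels)
    for label, texts in by_label.items():
        for _ in range(target - len(texts)):
            out_reports.append(drop_words(rng.choice(texts), rng))
            out_labels.append(label)
    return out_reports, out_labels


# Hypothetical imbalanced toy set: three reports for one component, one for another.
reports = [
    "controller reboots during calibration",
    "calibration routine hangs at step two",
    "controller loses calibration after restart",
    "teach pendant screen flickers on startup",
]
labels = ["controller", "controller", "controller", "pendant"]

aug_reports, aug_labels = augment_to_balance(reports, labels)
print(Counter(aug_labels))
```

Augmentation of this kind should be applied to the training split only; augmenting before the split would leak near-duplicates of training reports into the test set.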

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the text-only method works well, it could extend to predicting fault types or severity in addition to locations.
  • Testing across multiple companies' bug report corpora would show how general the outperformance of classical models is.
  • Over time, as more data accumulates, performance could improve, making periodic model updates valuable.
  • Exploring the linguistic patterns in bug reports that enable accurate localization might inform better report writing guidelines.

Load-bearing premise

The wording in bug reports provides enough unique clues to identify the precise code changes that resolved each reported issue.

What would settle it

A test on previously unseen bug reports from the same environment where none of the models correctly identify the fault location for a majority of cases, or where accuracy falls below that of a naive keyword-matching baseline.
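The naive keyword-matching baseline named in this test could look like the sketch below. It is an illustrative assumption, not something the paper specifies: each word votes for the locations whose past reports contained it. The `is_refuted` helper encodes one reading of the criterion above (no model gets a majority of cases right, or the best model falls below the baseline). All data and names are hypothetical.

```python
from collections import Counter, defaultdict


def build_keyword_index(train_reports, train_labels):
    """Count how often each word appears in reports fixed at each location."""
    index = defaultdict(Counter)
    for text, label in zip(train_reports, train_labels):
        for word in set(text.lower().split()):
            index[word][label] += 1
    return index


def keyword_predict(text, index, fallback):
    """Vote for the location whose past reports share the most words with this one."""
    votes = Counter()
    for word in set(text.lower().split()):
        votes.update(index.get(word, {}))
    return votes.most_common(1)[0][0] if votes else fallback


def accuracy(predictions, truths):
    """Fraction of reports whose predicted location matches the verified fix."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


def is_refuted(model_accuracies, baseline_accuracy):
    """One reading of the criterion: no majority, or the best model loses to the baseline."""
    best = max(model_accuracies.values())
    return best <= 0.5 or best < baseline_accuracy


# Hypothetical toy data standing in for historical and unseen bug reports.
train_reports = [
    "motor overheats under load",
    "drive motor vibration at high speed",
    "camera image blurry after update",
    "vision camera fails to detect part",
]
train_labels = ["drive", "drive", "vision", "vision"]
test_reports = ["motor vibration under heavy load", "camera cannot detect part"]
test_labels = ["drive", "vision"]

index = build_keyword_index(train_reports, train_labels)
fallback = Counter(train_labels).most_common(1)[0][0]
baseline_preds = [keyword_predict(t, index, fallback) for t in test_reports]
baseline_acc = accuracy(baseline_preds, test_labels)
print(baseline_preds, baseline_acc)
```

In practice `model_accuracies` would hold the measured accuracy of each trained model on the same unseen reports, so the comparison with the baseline is like for like.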

Figures

Figures reproduced from arXiv: 2604.25700 by Alessio Bucaioni, Anton Ununger, Pernilla Hall, Riccardo Rubei.

Figure 1: Visualization of label imbalance in the dataset.
Figure 2: MAP comparison: TF-IDF vs. sentence embeddings.
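Figure 2 reports MAP (mean average precision). This page does not give the paper's exact formulation, so the sketch below uses the standard information-retrieval definition: for each bug report, average the precision at every rank where a correct location appears, then take the mean over reports. The rankings and relevant sets are hypothetical.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, relevants):
    """Average the per-report AP scores over all bug reports."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


# Hypothetical model output: one ranked list of candidate locations per report.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
# Hypothetical ground truth: the locations touched by each verified fix.
relevants = [{"a", "c"}, {"y"}]

map_score = mean_average_precision(rankings, relevants)
print(round(map_score, 4))
```

When every report has exactly one correct location, average precision reduces to the reciprocal of that location's rank, so MAP coincides with mean reciprocal rank.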
Original abstract

Software quality assurance remains a major challenge in industrial environments, where large-scale and long-lived systems inevitably accumulate defects. Identifying the location of a fault is often time-consuming and costly, particularly during maintenance phases when developers must rely primarily on textual bug reports rather than complete runtime or code-level context. In this study, we investigated whether artificial intelligence can support fault localization using only the natural-language content of bug reports. By relying only on textual information, our approach requires no access to source code, execution traces, or static analysis artifacts, making it directly deployable within existing industrial maintenance workflows. We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa). Our evaluation used proprietary data from ABB Robotics in Sweden, comprising five years of resolved industrial bug reports, each linked to its verified code fix. This setting allowed us to assess model effectiveness under realistic industrial constraints. Our results showed that traditional models using term frequency-inverse document frequency features consistently outperformed the fine-tuned language models on this dataset, while data augmentation improved Random Forest performance. These findings challenge the assumption that transformer-based models universally outperform classical approaches in industrial contexts with domain-specific data. We demonstrated that historical bug reports can be systematically used for text-based, artificial intelligence-assisted fault localization, providing a scalable, low-cost, and empirically grounded complement to common debugging practices in industry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that traditional machine learning models using TF-IDF features (Logistic Regression, SVM, and Random Forest) outperform fine-tuned transformer models (RoBERTa-Base and Distil-RoBERTa) for fault localization based solely on the natural language content of bug reports. Evaluated on five years of proprietary resolved bug reports from ABB Robotics, each linked to verified code fixes, the study frames the task as supervised text classification and finds that data augmentation further enhances Random Forest performance. This challenges the assumption that transformer-based models are superior in industrial contexts with domain-specific data and demonstrates the feasibility of text-only AI-assisted fault localization in maintenance workflows.

Significance. If the findings hold, the work is significant for its industrial relevance and empirical challenge to transformer dominance in software engineering tasks. It offers a practical, low-cost method deployable without source code or traces, using historical data for training. The benchmarking on real ABB data provides valuable lessons on when classical methods may be preferable, contributing to more realistic expectations for ML adoption in industry.

major comments (1)
  1. [Transformer Fine-Tuning Procedure] The paper describes the transformer models as 'fine-tuned' but provides no details on learning rate, epochs, batch size, weight decay, or hyperparameter tuning. This is a load-bearing issue for the central claim that these models underperformed the TF-IDF baselines, as inadequate optimization could explain the result on limited domain-specific data rather than a fundamental limitation of the approach.
minor comments (1)
  1. [Abstract] Specific quantitative results, such as accuracy or F1 scores, dataset statistics, and any statistical tests, are not mentioned in the abstract, which would help readers assess the magnitude of the performance differences.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. The point raised about the transformer fine-tuning procedure is valid and directly relevant to the reproducibility and interpretation of our central claim. We address it below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Transformer Fine-Tuning Procedure] The paper describes the transformer models as 'fine-tuned' but provides no details on learning rate, epochs, batch size, weight decay, or hyperparameter tuning. This is a load-bearing issue for the central claim that these models underperformed the TF-IDF baselines, as inadequate optimization could explain the result on limited domain-specific data rather than a fundamental limitation of the approach.

    Authors: We agree that the original manuscript omitted the necessary hyperparameters for the fine-tuning of RoBERTa-Base and Distil-RoBERTa, which limits the ability to fully assess whether the observed performance gap reflects a genuine limitation or suboptimal optimization. In the revised manuscript we will add a dedicated paragraph in the Experimental Setup section that reports the exact procedure: both models were fine-tuned using the AdamW optimizer with a learning rate of 2e-5 and linear decay, for 3 epochs, batch size 16, weight decay 0.01, and maximum sequence length 512. These values were selected after a modest grid search over learning rates {1e-5, 2e-5, 5e-5} and epoch counts {2, 3, 4} on a 10% validation split of the training data; the final configuration was the one that maximized macro-F1 on the validation set. We will also state that no extensive hyperparameter optimization beyond this grid was performed, given the industrial data constraints. With these additions the comparison becomes fully transparent and readers can judge the adequacy of the optimization themselves.

    revision: yes
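The selection procedure quoted in the simulated rebuttal (a 3 x 3 grid over learning rate and epoch count, scored by macro-F1 on a validation split) can be sketched as follows. The actual fine-tuning step is left out: `train_and_score` is a hypothetical callback, and `dummy_scorer` is a made-up function used only to exercise the loop. It does not represent real results.

```python
from itertools import product

# Values quoted in the simulated rebuttal above.
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
EPOCHS = [2, 3, 4]
FIXED = {
    "optimizer": "AdamW",
    "lr_schedule": "linear",
    "batch_size": 16,
    "weight_decay": 0.01,
    "max_seq_length": 512,
}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def grid_search(train_and_score):
    """Try every (learning rate, epochs) pair and keep the best validation score.

    `train_and_score` is a hypothetical callback that would fine-tune a model
    with the given config and return macro-F1 on the validation split.
    """
    best_config, best_score = None, float("-inf")
    for lr, epochs in product(LEARNING_RATES, EPOCHS):
        config = {**FIXED, "learning_rate": lr, "epochs": epochs}
        score = train_and_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def dummy_scorer(config):
    """Made-up stand-in for fine-tuning; NOT real results, only exercises the loop."""
    return 1.0 - abs(config["learning_rate"] - 2e-5) * 1e4 - abs(config["epochs"] - 3) * 0.1


best_config, best_score = grid_search(dummy_scorer)
print(best_config["learning_rate"], best_config["epochs"])
print(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"]))
```

Macro-F1 is a reasonable selection criterion when labels are imbalanced, since plain accuracy can be dominated by the most frequent fault locations.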

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

The manuscript frames fault localization as a standard supervised text classification task and reports performance of classical ML models versus fine-tuned transformers on a held-out industrial dataset. No equations, derivations, or self-referential definitions appear; results are obtained by training on labeled historical bug reports and evaluating on separate test instances. This setup does not reduce any claimed prediction to its own inputs by construction. The evaluation follows conventional ML practice with no load-bearing self-citations or ansatz smuggling that would create circularity. Minor self-citation of prior work, if present, is not used to justify the central empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard supervised learning assumptions and the domain assumption that bug report text alone is predictive of fault location; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • Domain assumption: Bug report text contains sufficient information to classify fault locations via supervised learning.
    Invoked when framing fault localization as a text classification problem using only natural language content.

pith-pipeline@v0.9.0 · 5580 in / 1229 out tokens · 52345 ms · 2026-05-14T21:57:31.155050+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We framed fault localization as a supervised text classification problem and evaluated three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) and two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa).

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
