pith. machine review for the scientific record.

arxiv: 2604.18862 · v1 · submitted 2026-04-20 · 💻 cs.SE · cs.AI

Recognition: unknown

Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:43 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords bug report identification · active learning · neural language models · human-machine collaboration · GitHub repositories · software quality · effort reduction · mutualistic learning

The pith

Mutualistic neural active learning reduces human effort in identifying bug reports by making them more readable while improving model performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MNAL, a cross-project framework that combines neural language models with active learning to identify bug reports from GitHub repositories. It creates a mutualistic relationship: the model selects informative reports that are also readable for humans to label, and those labels, together with pseudo-labels derived from them, update the model more effectively. This setup aims to handle the growing volume of bug reports with less manual work and better accuracy than existing methods. Evaluations show large reductions in labeling effort and performance gains, and the approach works with different neural models. A user study with developers confirms perceived savings in time and resources.
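
To make the loop concrete, here is a minimal sketch of one MNAL-style timestep as the review reads it. It is illustrative only: the least-confidence scoring, the alpha-weighted readability term, and the helpers `readability_score` and `ask_human` are assumptions standing in for the paper's actual components, not its implementation.

```python
import numpy as np

def readability_score(text):
    # Crude placeholder: shorter reports count as more readable.
    return 1.0 / (1.0 + len(text.split()))

def ask_human(text):
    # Stand-in for a developer labeling the report (1 = bug, 0 = non-bug).
    return int("bug" in text.lower())

def mutualistic_round(model, unlabeled, labeled, k=10, s=1, alpha=0.5):
    """One illustrative timestep: effort-aware query, human labels, pseudo-labels, retrain."""
    X = np.array([vec for _, vec in unlabeled])
    uncertainty = 1.0 - model.predict_proba(X).max(axis=1)   # least confidence
    readability = np.array([readability_score(t) for t, _ in unlabeled])
    readability = readability / (readability.max() + 1e-9)

    # Effort-aware selection: informative for the model AND easy for humans.
    score = alpha * uncertainty + (1.0 - alpha) * readability
    for i in np.argsort(-score)[:k]:
        text, vec = unlabeled[i]
        y = ask_human(text)                                  # human label
        labeled.append((vec, y))
        sims = X @ vec                                       # similarity in latent space
        for j in np.argsort(-sims)[1:s + 1]:                 # skip the report itself
            labeled.append((unlabeled[j][1], y))             # pseudo-label its neighbors

    vecs, ys = zip(*labeled)
    model.fit(np.array(vecs), np.array(ys))                  # update on old + new labels
    return model, labeled                                    # (a real loop also shrinks the pool)
```

Any scikit-learn-style classifier warm-started on a small seed set would do for `model`; the point is the shape of the loop, not a specific learner.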

Core claim

MNAL establishes a mutualistic neural active learning process in which the most informative human-labeled reports and their pseudo-labeled counterparts update the neural language model, while the reports requiring human labeling are rendered more readable and identifiable, resulting in up to 95.8% effort reduction for readability and 196.0% for identifiability, along with improved bug report identification performance across projects.
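
A reduction above 100% cannot be a fraction of baseline effort, so the identifiability figure implies a different normalization. The abstract does not define the metric; the snippet below shows one plausible definition, relative to MNAL's own effort rather than the baseline's, under which 196.0% is arithmetically possible. The numbers are made up to reproduce the reported scale.

```python
# Made-up effort values; the paper's actual measurements and metric may differ.
baseline_effort, mnal_effort = 74.0, 25.0

reduction_vs_baseline = (baseline_effort - mnal_effort) / baseline_effort
reduction_vs_mnal = (baseline_effort - mnal_effort) / mnal_effort

print(f"{reduction_vs_baseline:.1%}")  # 66.2% -- bounded above by 100%
print(f"{reduction_vs_mnal:.1%}")      # 196.0% -- can exceed 100%
```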

What carries the argument

The purposely crafted mutualistic relation between the neural language model and human labelers within the active learning framework.

If this is right

  • Up to 95.8% reduction in human effort for readability during labeling.
  • Up to 196.0% reduction in human effort for identifiability during labeling.
  • Improved performance in bug report identification over state-of-the-art and baseline methods.
  • Model-agnostic gains when paired with various neural language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The mutualistic loop could extend to other software engineering classification tasks that mix automated prediction with human review.
  • Large-scale adoption in open-source projects might lower overall triage costs by shifting human focus to fewer, clearer cases.
  • Over time the approach may create feedback cycles that make the underlying models more robust to cross-project variations without extra tuning.

Load-bearing premise

The assumption that the mutualistic design will consistently make reports more readable for humans and lead to more effective model updates from the selected labels.
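
The reference list includes Farr, Jenkins, and Paterson's simplified Flesch reading-ease formula [25], a plausible candidate for the readability score, although the abstract never names the metric. A sketch under that assumption, with an admittedly crude vowel-group syllable heuristic:

```python
import re

def syllables(word):
    # Rough heuristic: count runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def farr_jenkins_paterson(text):
    """Simplified Flesch reading ease (Farr et al., 1951); higher means easier.

    FJP = 1.599 * (one-syllable words per 100 words)
          - 1.015 * (average sentence length in words) - 31.517
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    mono = sum(1 for w in words if syllables(w) == 1)
    return (1.599 * (100.0 * mono / len(words))
            - 1.015 * (len(words) / len(sentences))
            - 31.517)

print(farr_jenkins_paterson("The app crashes when I click save. It happens every time."))
```

On the example report this prints roughly 50, mid-range on the Flesch-style scale; the heuristic syllable counter is the weakest link and is flagged as such.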

What would settle it

A replication study on a different set of GitHub projects where the effort reductions or performance gains are not observed compared to standard active learning baselines.

Figures

Figures reproduced from arXiv: 2604.18862 by Guoming Long, Hui Fang, Shihai Wang, Tao Chen.

Figure 1. An example of the submitted reports from GitHub.
Figure 2. Illustrating the difference among learning with a neural language model, classic active learning, and …
Figure 3. The workflow of MNAL that supports enhanced human-machine teaming. The serial numbers refer to the order of the workflow.
Figure 4. Excerpt of the exampled reports with different scores of readability and identifiability.
Figure 5. Excerpt of the exampled reports that are the most similar and with the same label. The red texts …
Figure 6. Illustrating the key steps of the pseudo-labeling process in …
Figure 7. Comparing the effort-aware uncertainty sampling in …
Figure 8. Comparing EDA (data augmentation) against …
Figure 9. Comparing the F1-score of different neural language models with and without pairing …
Figure 10. Comparing the readability of different neural language models with and without pairing …
Figure 11. Comparing the identifiability of different neural language models with and without pairing …
Figure 12. Comparing MNAL with state-of-the-art active learning approaches for bug report identification over all 10 timesteps (10 runs each). The plots show the mean and standard deviation. MNAL achieves superior results on nearly all performance metrics. Although the compared state-of-the-art approaches also rely on active learning, they do not consider identifiability, reliability, or the pseudo-labeling process. …
Figure 13. The significant reduction in time and cost is a direct result of the effort-aware sampling, …
Figure 13. Comparing MNAL and MNAL_ran with respect to every individual of the 10 human participants (each labels 30 reports). For each participant, the pairwise comparison shows p < 0.001. …
Figure 14. The changing sampling landscapes considered by …
Figure 15. Excerpt of the human-friendly vs. model-beneficial exampled reports.
Figure 16. Comparing MNAL under different numbers of pseudo-labeled reports used at each timestep over all 10 timesteps (10 runs each). The plots show the mean and standard deviation. Here s is the number of pseudo-labeled reports assigned to each human-labeled report (by default s = 1), so the total number of pseudo-labeled reports per timestep is s × k, where k is the query size.
Figure 17. Performance upper bound experiment result of …
Original abstract

Bug reports, encompassing a wide range of bug types, are crucial for maintaining software quality. However, the increasing complexity and volume of bug reports pose a significant challenge in sole manual identification and assignment to the appropriate teams for resolution, as dealing with all the reports is time-consuming and resource-intensive. In this paper, we introduce a cross-project framework, dubbed Mutualistic Neural Active Learning (MNAL), designed for automated and more effective identification of bug reports from GitHub repositories boosted by human-machine collaboration. MNAL utilizes a neural language model that learns and generalizes reports across different projects, coupled with active learning to form neural active learning. A distinctive feature of MNAL is the purposely crafted mutualistic relation between the machine learners (neural language model) and human labelers (developers) when enriching the knowledge learned. That is, the most informative human-labeled reports and their corresponding pseudo-labeled ones are used to update the model while those reports that need to be labeled by developers are more readable and identifiable, thereby enhancing the human-machine teaming therein. We evaluate MNAL using a large scale dataset against the SOTA approaches, baselines, and different variants. The results indicate that MNAL achieves up to 95.8% and 196.0% effort reduction in terms of readability and identifiability during human labeling, respectively, while resulting in a better performance in bug report identification. Additionally, our MNAL is model-agnostic since it is capable of improving the model performance with various underlying neural language models. To further verify the efficacy of our approach, we conducted a qualitative case study involving 10 human participants, who rate MNAL as being more effective while saving more time and monetary resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Mutualistic Neural Active Learning (MNAL), a cross-project framework combining neural language models with active learning for automated bug report identification from GitHub. It introduces a mutualistic human-machine relation in which the most informative human-labeled reports and their pseudo-labeled counterparts update the model while selected reports are optimized for human readability and identifiability. The authors report up to 95.8% and 196.0% effort reductions in readability and identifiability, superior bug-report identification performance versus SOTA and baselines, model-agnostic improvements across underlying NLMs, and positive outcomes from a 10-participant qualitative user study.

Significance. If the central claims are substantiated, the work could meaningfully advance automated software triage by showing how engineered human-AI collaboration reduces labeling effort while improving model accuracy across projects. The model-agnostic property and inclusion of a user study are practical strengths. However, the significance hinges on whether the reported gains can be attributed to the mutualistic mechanism rather than generic active learning or dataset characteristics.

major comments (3)
  1. [Experimental results] The evaluation compares MNAL to SOTA, baselines, and variants but provides no ablation that isolates the mutualistic loop (joint pseudo-label update plus readability/identifiability optimization) from standard uncertainty sampling. Without such a control, the 95.8% readability and 196.0% identifiability effort reductions cannot be causally attributed to the claimed mutualistic relation (see abstract and experimental results section). A minimal sketch of one such control appears after this report.
  2. [Dataset and evaluation setup] No dataset statistics (total reports, number of projects, class balance, train/test split sizes), statistical significance tests, or confidence intervals are reported for the quantitative performance and effort-reduction claims, leaving the central empirical assertions without visible supporting derivation.
  3. [User study] The user study with 10 participants is described only at a high level; the protocol, exact tasks, time/monetary metrics, rating scales, and any statistical analysis of the results are not detailed, weakening the qualitative validation of the human-side effort reductions.
minor comments (2)
  1. [Abstract] The abstract states that MNAL is evaluated against 'different variants' but does not enumerate them; this should be made explicit early in the paper.
  2. [Method] The mutualistic relation is introduced conceptually but would benefit from a formal diagram or pseudocode in the method section to clarify the exact feedback schedule between pseudo-labels and readability optimization.
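
On major comment 1: the missing control is cheap to specify. A minimal sketch, where `run_variant` is a hypothetical stand-in for a full active-learning run and the scores are fabricated for illustration:

```python
import numpy as np

def run_variant(mutualistic: bool, seed: int) -> float:
    """Placeholder for one full active-learning run; returns final F1.
    A real ablation would toggle ONLY the pseudo-label updates and the
    readability/identifiability term, leaving uncertainty sampling fixed."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(0.80 if mutualistic else 0.74, 0.02))  # fabricated

runs = range(10)  # 10 runs, matching the paper's figures
full = np.array([run_variant(True, s) for s in runs])
plain = np.array([run_variant(False, s) for s in runs])  # standard uncertainty sampling
print(f"mutualistic:      {full.mean():.3f} +/- {full.std():.3f}")
print(f"uncertainty-only: {plain.mean():.3f} +/- {plain.std():.3f}")
```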

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the manuscript to strengthen the presentation of our results and methodology.

point-by-point responses
  1. Referee: [Experimental results] The evaluation compares MNAL to SOTA, baselines, and variants but provides no ablation that isolates the mutualistic loop (joint pseudo-label update plus readability/identifiability optimization) from standard uncertainty sampling. Without such a control, the 95.8% readability and 196.0% identifiability effort reductions cannot be causally attributed to the claimed mutualistic relation (see abstract and experimental results section).

    Authors: We acknowledge the value of a more targeted ablation to isolate the mutualistic components. Although the original evaluation included multiple variants and baselines, we agree that an explicit control separating the joint pseudo-label update and readability/identifiability optimization from standard uncertainty sampling would better substantiate the causal contribution. In the revised manuscript, we have added this ablation study in the experimental results section, demonstrating that the mutualistic loop accounts for the majority of the reported effort reductions beyond generic active learning. revision: yes

  2. Referee: [Dataset and evaluation setup] No dataset statistics (total reports, number of projects, class balance, train/test split sizes), statistical significance tests, or confidence intervals are reported for the quantitative performance and effort-reduction claims, leaving the central empirical assertions without visible supporting derivation.

    Authors: We agree that these details are essential for transparency and reproducibility. The revised manuscript now includes a new subsection in the experimental setup that reports the full dataset statistics (total bug reports, number of projects, class balance, and train/test split sizes). We have also added statistical significance tests (paired t-tests with p-values) and 95% confidence intervals for all key performance metrics and effort-reduction percentages (a sketch of such an analysis appears after these responses). revision: yes

  3. Referee: [User study] The user study with 10 participants is described only at a high level; the protocol, exact tasks, time/monetary metrics, rating scales, and any statistical analysis of the results are not detailed, weakening the qualitative validation of the human-side effort reductions.

    Authors: We appreciate this feedback on the level of detail provided. The revised manuscript expands the user study section to include the full protocol, exact tasks performed by participants, time and monetary metrics collected, the specific rating scales employed (including Likert-scale questions on effectiveness, readability, and effort), and the statistical analysis of results (means, standard deviations, and significance testing). revision: yes
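
The second response promises paired t-tests and 95% confidence intervals. For concreteness, this is what such an analysis looks like over paired per-run scores; the arrays are fabricated stand-ins, not the paper's data.

```python
import numpy as np
from scipy import stats

# Fabricated paired per-run F1 scores: MNAL vs. the strongest baseline.
mnal     = np.array([0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.84, 0.79])
baseline = np.array([0.74, 0.73, 0.76, 0.75, 0.74, 0.72, 0.75, 0.73, 0.77, 0.74])

diff = mnal - baseline
t, p = stats.ttest_rel(mnal, baseline)            # paired t-test across runs
lo, hi = stats.t.interval(0.95, len(diff) - 1,
                          loc=diff.mean(),
                          scale=stats.sem(diff))  # 95% CI on the mean gain
print(f"mean gain {diff.mean():.3f}, t = {t:.2f}, p = {p:.2g}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Whether the paper's 10 runs give enough statistical power is exactly what such an interval would reveal.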

Circularity Check

0 steps flagged

No circularity: empirical evaluation of engineered MNAL framework

full rationale

The paper describes an algorithmic framework (MNAL) that combines neural language models, active learning, and a designed mutualistic human-machine interaction for bug report identification. All quantitative claims (effort reductions, performance gains, model-agnosticism) are presented as outcomes of experiments on a large-scale GitHub dataset, compared against SOTA methods, baselines, and internal variants, plus a separate human participant case study. No equations, parameter fittings, or derivations are shown that reduce results to inputs by construction. The mutualistic relation is an explicit design choice whose effects are measured externally rather than assumed or self-defined. Self-citations, if present, are not load-bearing for the central empirical results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the generalization power of neural language models across projects and on the effectiveness of the mutualistic design; both are domain assumptions introduced without independent verification in the abstract.

axioms (2)
  • domain assumption Neural language models can learn and generalize bug reports across different projects
    Invoked as the foundation for the cross-project framework in the abstract.
  • ad hoc to paper A mutualistic relation between model and human labelers will improve both readability for humans and model updates
    Presented as the distinctive feature of MNAL without external grounding.
invented entities (1)
  • Mutualistic relation between machine learners and human labelers · no independent evidence
    purpose: To create a feedback loop where human-labeled reports and pseudo-labels update the model while the model makes reports more readable for humans
    Introduced as the key innovation; no falsifiable external evidence is supplied in the abstract.

pith-pipeline@v0.9.0 · 5615 in / 1675 out tokens · 60019 ms · 2026-05-10T03:43:08.772815+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

110 extracted references · 84 canonical work pages · 1 internal anchor

  1. [1]

    Petar Afric, Davor Vukadin, Marin Silic, and Goran Delac. 2023. Empirical Study: How Issue Classification Influences Software Defect Prediction.IEEE Access11 (2023), 11732–11748. https://doi.org/10.1109/ACCESS.2023.3242045

  2. [2]

    Andrea Arcuri and Lionel C. Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. InProceedings of the 33rd International Conference on Software Engineering, ICSE 2011, Waikiki, Honolulu , HI, USA, May 21-28, 2011, Richard N. Taylor, Harald C. Gall, and Nenad Medvidovic (Eds.). ACM, 1–10. http...

  3. [3]

    Kwabena Ebo Bennin, Jacky Keung, Passakorn Phannachitta, Akito Monden, and Solomon Mensah. 2017. Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction.IEEE Transactions on Software Engineering44, 6 (2017), 534–550

  4. [4]

    Selasie Aformaley Brown, Benjamin Asubam Weyori, Adebayo Felix Adekoya, Patrick Kwaku Kudjo, Solomon Mensah, and Samuel Abedu. 2021. DeepLaBB: a deep learning framework for blocking bugs. In2021 International Conference on Cyber Security and Internet of Things (ICSIoT). IEEE, 22–25

  5. [5]

    Davide Cacciarelli and Murat Kulahci. 2023. Active learning for data streams: a survey.Machine Learning(2023), 1–55

  6. [6]

    Pengzhou Chen and Tao Chen. 2026. PromiseTune: Unveiling Causally Promising and Explainable Configuration Tuning. In48th IEEE/ACM International Conference on Software Engineering (ICSE). ACM

  7. [7]

    Pengzhou Chen, Tao Chen, and Miqing Li. 2024. MMO: Meta Multi-Objectivization for Software Configuration Tuning.IEEE Trans. Software Eng.50, 6 (2024), 1478–1504. https://doi.org/10.1109/TSE.2024.3388910

  8. [8]

    Pengzhou Chen, Jingzhi Gong, and Tao Chen. 2025. Accuracy Can Lie: On the Impact of Surrogate Model in Configuration Tuning.IEEE Trans. Software Eng.51, 2 (2025), 548–580. https://doi.org/10.1109/TSE.2025.3525955

  9. [9]

    Tao Chen. 2022. Lifelong Dynamic Optimization for Self-Adaptive Systems: Fact or Fiction?. InIEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, Honolulu, HI, USA, March 15-18, 2022. IEEE, 78–89. https://doi.org/10.1109/SANER53432.2022.00022

  10. [10]

    Tao Chen and Miqing Li. 2021. Multi-objectivizing software configuration tuning. InESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, Diomidis Spinellis, Georgios Gousios, Marsha Chechik, and Massimiliano Di Penta (Eds.). ACM, 453–465. https://...

  11. [11]

    Tao Chen and Miqing Li. 2023. Do Performance Aspirations Matter for Guiding Software Configuration Tuning? An Empirical Investigation under Dual Performance Objectives.ACM Trans. Softw. Eng. Methodol.32, 3 (2023), 68:1–68:41. https://doi.org/10.1145/3571853

  12. [12]

    Tao Chen and Miqing Li. 2023. The Weights Can Be Harmful: Pareto Search versus Weighted Search in Multi-objective Search-based Software Engineering. ACM Trans. Softw. Eng. Methodol. 32, 1 (2023), 5:1–5:40. https://doi.org/10.1145/3514233

  13. [13]

    Jian Cheng, Peisong Wang, Gang Li, Qinghao Hu, and Hanqing Lu. 2018. Recent advances in efficient computation of deep convolutional neural networks. Frontiers Inf. Technol. Electron. Eng. 19, 1 (2018), 64–77. https://doi.org/10.1631/FITEE.1700789

  14. [14]

    Heetae Cho, Seonah Lee, and Sungwon Kang. 2022. Classifying issue reports according to feature descriptions in a user manual based on a deep learning model. Inf. Softw. Technol. 142 (2022), 106743. https://doi.org/10.1016/J.INFSOF.2021.106743

  15. [15]

    Gabriele Ciravegna, Frédéric Precioso, Alessandro Betti, Kevin Mottin, and Marco Gori. 2023. Knowledge-Driven Active Learning. InMachine Learning and Knowledge Discovery in Databases: Research Track - European Conference, ECML PKDD 2023, Turin, Italy, September 18-22, 2023, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 14169), Danai Koutra,...

  16. [16]

    Dipok Chandra Das and Md Rayhanur Rahman. 2018. Security and performance bug reports identification with class-imbalance sampling and feature selection. In2018 Joint 7th International Conference on Informatics, Electronics & Vision (ICIEV) and 2018 2nd International Conference on Imaging, Vision & Pattern Recognition (icIVPR). IEEE, 316–321

  17. [17]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Vol...

  18. [18]

    Jianshu Ding, Guisheng Fan, Huiqun Yu, and Zijie Huang. 2021. Automatic Identification of High Impact Bug Report by Test Smells of Textual Similar Bug Reports. In21st IEEE International Conference on Software Quality, Reliability and Security, QRS 2021, Hainan, China, December 6-10, 2021. IEEE, 446–457. https://doi.org/10.1109/QRS54544.2021.00056

  19. [19]

    Chengwen Du and Tao Chen. 2025. Causally Perturbed Fairness Testing.ACM Trans. Softw. Eng. Methodol.(Oct. 2025). https://doi.org/10.1145/3773088 Just Accepted

  20. [20]

    Junwei Du, Xinshuang Ren, Haojie Li, Feng Jiang, and Xu Yu. 2023. Prediction of bug-fixing time based on distinguish- able sequences fusion in open source software.J. Softw. Evol. Process.35, 11 (2023). https://doi.org/10.1002/SMR.2443

  21. [21]

    Xiaoting Du, Zheng Zheng, Guanping Xiao, Zenghui Zhou, and Kishor S. Trivedi. 2022. DeepSIM: Deep Semantic Information-Based Automatic Mandelbug Classification. IEEE Trans. Reliab. 71, 4 (2022), 1540–1554. https://doi.org/10.1109/TR.2021.3110096

  22. [22]

    Yuanrui Fan, Xin Xia, David Lo, and Ahmed E. Hassan. 2020. Chaff from the Wheat: Characterizing and Determining Valid Bug Reports.IEEE Trans. Software Eng.46, 5 (2020), 495–525. https://doi.org/10.1109/TSE.2018.2864217

  23. [23]

    Fan Fang, John Wu, Yanyan Li, Xin Ye, Wajdi Aljedaani, and Mohamed Wiem Mkaouer. 2021. On the classification of bug reports to improve bug localization. Soft Comput. 25, 11 (2021), 7307–7323. https://doi.org/10.1007/s00500-021-05689-2

  24. [24]

    Sen Fang, Tao Zhang, Youshuai Tan, He Jiang, Xin Xia, and Xiaobing Sun. 2023. RepresentThemAll: A Universal Learning Representation of Bug Reports. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 602–614. https://doi.org/10.1109/ICSE48619.2023.00060

  25. [25]

    James N Farr, James J Jenkins, and Donald G Paterson. 1951. Simplification of Flesch reading ease formula.Journal of applied psychology35, 5 (1951), 333

  26. [26]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020), Trev...

  27. [27]

    Yifan Fu, Xingquan Zhu, and Bin Li. 2013. A survey on instance selection for active learning.Knowl. Inf. Syst.35, 2 (2013), 249–283. https://doi.org/10.1007/s10115-012-0507-8

  28. [28]

    Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, and Michael R. Lyu. 2024. Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024. ACM, 80:1–80:13. https://doi.or...

  29. [29]

    Xiuting Ge, Chunrong Fang, Meiyuan Qian, Yu Ge, and Mingshuang Qing. 2022. Locality-based security bug report identification via active learning.Inf. Softw. Technol.147 (2022), 106899. https://doi.org/10.1016/j.infsof.2022.106899

  30. [30]

    Luiz Alberto Ferreira Gomes, Ricardo da Silva Torres, and Mario Lúcio Côrtes. 2023. BERT- and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: A comparative study.Inf. Softw. Technol.160 (2023), 107217. https://doi.org/10.1016/J.INFSOF.2023.107217

  31. [31]

    Jingzhi Gong and Tao Chen. 2023. Predicting Software Performance with Divide-and-Learn. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blincoe, and Paolo Tonella (Eds.). ACM, 858–870. http...

  32. [32]

    Jingzhi Gong and Tao Chen. 2024. Predicting Configuration Performance in Multiple Environments with Sequential Meta-Learning.Proc. ACM Softw. Eng.1, FSE (2024), 359–382. https://doi.org/10.1145/3643743

  33. [33]

    Jingzhi Gong, Tao Chen, and Rami Bahsoon. 2025. Dividable Configuration Performance Learning.IEEE Trans. Software Eng.51, 1 (2025), 106–134. https://doi.org/10.1109/TSE.2024.3491945

  34. [34]

    Shikai Guo, Rong Chen, Hui Li, Tianlun Zhang, and Yaqing Liu. 2019. Identify Severity Bug Report with Distribution Imbalance by CR-SMOTE and ELM. Int. J. Softw. Eng. Knowl. Eng. 29, 2 (2019), 139–175. https://doi.org/10.1142/S0218194019500074

  35. [35]

    Som Gupta and Sanjai Gupta. 2021. Bug Reports and Deep Learning Models.International Journal of Computer Science and Mobile Computing10 (12 2021), 21–26. https://doi.org/10.47760/ijcsmc.2021.v10i12.003

  36. [36]

    Jianjun He, Ling Xu, Yuanrui Fan, Zhou Xu, Meng Yan, and Yan Lei. 2020. Deep Learning Based Valid Bug Reports Determination and Explanation. In31st IEEE International Symposium on Software Reliability Engineering, ISSRE 2020, Coimbra, Portugal, October 12-15, 2020, Marco Vieira, Henrique Madeira, Nuno Antunes, and Zheng Zheng (Eds.). IEEE, 184–194. https:...

  37. [37]

    Steffen Herbold, Alexander Trautsch, and Fabian Trautsch. 2020. On the feasibility of automated prediction of bug and non-bug issues.Empir. Softw. Eng.25, 6 (2020), 5333–5369. https://doi.org/10.1007/S10664-020-09885-W

  38. [38]

    Kim Herzig, Sascha Just, and Andreas Zeller. 2013. It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In 35th International Conference on Software Engineering, ICSE ’13, San Francisco, CA, USA, May 18-26, 2013, David Notkin, Betty H. C. Cheng, and Klaus Pohl (Eds.). IEEE Computer Society, 392–401. https://doi.org/10.1109/ICSE.20...

  39. [39]

    Abram Hindle, Neil A. Ernst, Michael W. Godfrey, and John Mylopoulos. 2011. Automated topic naming to support cross-project analysis of software maintenance activities. InProceedings of the 8th International Working Conference on Mining Software Repositories, MSR 2011 (Co-located with ICSE), Waikiki, Honolulu, HI, USA, May 21-28, 2011, Proceedings, Arie v...

  40. [40]

    Pieter Hooimeijer and Westley Weimer. 2007. Modeling bug report quality. In22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), November 5-9, 2007, Atlanta, Georgia, USA, R. E. Kurt Stirewalt, Alexander Egyed, and Bernd Fischer (Eds.). ACM, 34–43. https://doi.org/10.1145/1321631.1321639

  41. [41]

    Edwin T Jaynes. 1957. Information theory and statistical mechanics.Physical review106, 4 (1957), 620

  42. [42]

    Yuan Jiang, Pengcheng Lu, Xiaohong Su, and Tiantian Wang. 2020. LTRWES: A new framework for security bug report detection.Inf. Softw. Technol.124 (2020), 106314. https://doi.org/10.1016/J.INFSOF.2020.106314

  43. [43]

    Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomás Mikolov. 2016. FastText.zip: Compressing text classification models.CoRRabs/1612.03651 (2016). arXiv:1612.03651 http://arxiv.org/abs/1612.03651

  44. [44]

    Rafael Kallis, Maliheh Izadi, Luca Pascarella, Oscar Chaparro, and Pooja Rani. 2023. The NLBSE’23 Tool Competition. InProceedings of The 2nd International Workshop on Natural Language-based Software Engineering (NLBSE’23)

  45. [45]

    Rafael Kallis, Andrea Di Sorbo, Gerardo Canfora, and Sebastiano Panichella. 2019. Ticket Tagger: Machine Learning Driven Issue Classification. In2019 IEEE International Conference on Software Maintenance and Evolution, ICSME 2019, Cleveland, OH, USA, September 29 - October 4, 2019. IEEE, 406–409. https://doi.org/10.1109/ICSME.2019.00070

  46. [46]

    Jaweria Kanwal and Onaiza Maqbool. 2012. Bug Prioritization to Facilitate Bug Report Triage.J. Comput. Sci. Technol. 27, 2 (2012), 397–412. https://doi.org/10.1007/S11390-012-1230-3

  47. [47]

    Amy J. Ko, Thomas D. LaToza, and Margaret M. Burnett. 2015. A practical guide to controlled experiments of software engineering tools with human participants. Empir. Softw. Eng. 20, 1 (2015), 110–141. https://doi.org/10.1007/S10664-013-9279-3

  48. [48]

    Ashima Kukkar and Rajni Mohana. 2018. A supervised bug report classification with incorporate and textual field knowledge.Procedia computer science132 (2018), 352–361

  49. [49]

    C Kwak and S Lee. 2022. Issue report classification using a multimodal deep learning technique. (2022)

  50. [50]

    Quoc V. Le and Tomás Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014 (JMLR Workshop and Conference Proceedings, Vol. 32). JMLR.org, 1188–1196. http://proceedings.mlr.press/v32/le14.html

  51. [51]

    Dong-Gun Lee and Yeong-Seok Seo. 2020. Improving bug report triage performance using artificial intelligence based document generation model.Hum. centric Comput. Inf. Sci.10 (2020), 26. https://doi.org/10.1186/s13673-020-00229-7

  52. [52]

    Bin Li, Ying Wei, Xiaobing Sun, Lili Bo, Dingshan Chen, and Chuanqi Tao. 2022. Towards the identification of bug entities and relations in bug reports.Autom. Softw. Eng.29, 1 (2022), 24. https://doi.org/10.1007/s10515-022-00325-1

  53. [53]

    Hui Li, Yang Qu, Shikai Guo, Guofeng Gao, Rong Chen, and Chen Guo. 2020. Surprise Bug Report Prediction Utilizing Optimized Integration with Imbalanced Learning Strategy.Complex.2020 (2020), 8509821:1–8509821:14. https://doi.org/10.1155/2020/8509821

  54. [54]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach.CoRRabs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692

  55. [55]

    Guoming Long and Tao Chen. 2022. On Reporting Performance and Accuracy Bugs for Deep Learning Frameworks: An Exploratory Study from GitHub. InEASE 2022: The International Conference on Evaluation and Assessment in Software Engineering 2022, Gothenburg, Sweden, June 13 - 15, 2022, Miroslaw Staron, Christian Berger, Jocelyn Simmonds, and Rafael Prikladnicki...

  56. [56]

    Guoming Long, Tao Chen, and Georgina Cosma. 2022. Multifaceted Hierarchical Report Identification for Non- Functional Bugs in Deep Learning Frameworks. In29th Asia-Pacific Software Engineering Conference, APSEC 2022, Virtual Event, Japan, December 6-9, 2022. IEEE, 289–298. https://doi.org/10.1109/APSEC57359.2022.00041

  57. [57]

    Guoming Long, Jingzhi Gong, Hui Fang, and Tao Chen. 2025. Learning Software Bug Reports: A Systematic Literature Review.ACM Trans. Softw. Eng. Methodol.(July 2025). https://doi.org/10.1145/3750040 Just Accepted

  58. [58]

    Nicola Lunardon, Giovanna Menardi, and Nicola Torelli. 2014. ROSE: a package for binary imbalanced learning.R journal6, 1 (2014)

  59. [59]

    Youpeng Ma, Tao Chen, and Ke Li. 2025. Faster Configuration Performance Bug Testing with Neural Dual-Level Prioritization. In47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025. IEEE, 988–1000. https://doi.org/10.1109/ICSE55347.2025.00201

  60. [60]

    Walid Maalej and Hadeer Nabil. 2015. Bug report, feature request, or simply praise? On automatically classifying app reviews. In 23rd IEEE International Requirements Engineering Conference, RE 2015, Ottawa, ON, Canada, August 24-28, 2015, Didar Zowghi, Vincenzo Gervasi, and Daniel Amyot (Eds.). IEEE Computer Society, 116–125. https://doi.org/10.1109/RE.20...

  61. [61]

    Senthil Mani, Rose Catherine, Vibha Singhal Sinha, and Avinava Dubey. 2012. AUSUM: approach for unsupervised bug report summarization. In20th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-20), SIGSOFT/FSE’12, Cary, NC, USA - November 11 - 16, 2012, Will Tracz, Martin P. Robillard, and Tevfik Bultan (Eds.). ACM, 11. https://doi.org/...

  62. [62]

    Jyoti Prakash Meher, Sourav Biswas, and Rajib Mall. 2024. Deep learning-based software bug classification.Inf. Softw. Technol.166 (2024), 107350. https://doi.org/10.1016/J.INFSOF.2023.107350

  63. [63]

    Fanqi Meng, Xuesong Wang, Jingdong Wang, and Peifang Wang. 2022. Automatic Classification of Bug Reports Based on Multiple Text Information and Reports’ Intention. InTheoretical Aspects of Software Engineering - 16th International Symposium, TASE 2022, Cluj-Napoca, Romania, July 8-10, 2022, Proceedings (Lecture Notes in Computer Science, Vol. 13299), Yami...

  64. [64]

    Nikolaos Mittas and Lefteris Angelis. 2013. Ranking and Clustering Software Cost Estimation Models through a Multiple Comparisons Algorithm.IEEE Trans. Software Eng.(2013)

  65. [65]

    Anas Nadeem, Muhammad Usman Sarwar, and Muhammad Zubair Malik. 2022. Automatic Issue Classifier: A Transfer Learning Framework for Classifying Issue Reports. CoRR abs/2202.06149 (2022). arXiv:2202.06149 https://arxiv.org/abs/2202.06149

  66. [66]

    Behzad Soleimani Neysiani and Seyed Morteza Babamir. 2019. New methodology for contextual features usage in duplicate bug reports detection: dimension expansion based on manhattan distance similarity of topics. In2019 5th international conference on web research (ICWR). IEEE, 178–183

  67. [67]

    Behzad Soleimani Neysiani, Seyed Morteza Babamir, and Masayoshi Aritsugi. 2020. Efficient feature extraction model for validation performance improvement of duplicate bug report detection in software bug triage systems.Inf. Softw. Technol.126 (2020), 106344. https://doi.org/10.1016/J.INFSOF.2020.106344

  68. [68]

    Nitish Pandey, Debarshi Kumar Sanyal, Abir Hudait, and Amitava Sen. 2017. Automated classification of software issue reports using machine learning techniques: an empirical study.Innov. Syst. Softw. Eng.13, 4 (2017), 279–297. https://doi.org/10.1007/S11334-017-0294-1

  69. [69]

    Sebastiano Panichella, Gerardo Canfora, and Andrea Di Sorbo. 2021. "Won’t We Fix this Issue?" Qualitative characterization and automated identification of wontfix issues on GitHub. Inf. Softw. Technol. 139 (2021), 106665. https://doi.org/10.1016/J.INFSOF.2021.106665

  70. [70]

    Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Dael...

  71. [71]

    Quentin Perez, Pierre-Antoine Jean, Christelle Urtado, and Sylvain Vauttier. 2021. Bug or not bug? That is the question. In29th IEEE/ACM International Conference on Program Comprehension, ICPC 2021, Madrid, Spain, May 20-21, 2021. IEEE, 47–58. https://doi.org/10.1109/ICPC52881.2021.00014

  72. [72]

    Jantima Polpinij. 2021. A method of non-bug report identification from bug report repository.Artif. Life Robotics26, 3 (2021), 318–328. https://doi.org/10.1007/S10015-021-00681-3

  73. [73]

    Ketan Rathor, Jaspreet Kaur, Ullal Akshatha Nayak, S Kaliappan, Ramya Maranan, and V Kalpana. 2023. Technological Evaluation and Software Bug Training using Genetic Algorithm and Time Convolution Neural Network (GA-TCN). In2023 Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS). IEEE, 7–12

  74. [74]

    Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. 2022. A Survey of Deep Active Learning.ACM Comput. Surv.54, 9 (2022), 180:1–180:40. https://doi.org/10.1145/3472291

  75. [75]

    Gema Rodríguez-Pérez, Jesús M. González-Barahona, Gregorio Robles, Dorealda Dalipaj, and Nelson Sekitoleko. 2016. BugTracking: A Tool to Assist in the Identification of Bug Reports. InOpen Source Systems: Integrating Communities - 12th IFIP WG 2.13 International Conference, OSS 2016, Gothenburg, Sweden, May 30 - June 2, 2016, Proceedings (IFIP Advances in...

  76. [76]

    Tommaso Dal Sasso, Andrea Mocci, and Michele Lanza. 2016. What Makes a Satisficing Bug Report?. In2016 IEEE International Conference on Software Quality, Reliability and Security, QRS 2016, Vienna, Austria, August 1-3, 2016. IEEE, 164–174. https://doi.org/10.1109/QRS.2016.28

  77. [77]

    Andrew Jhon Scott and M Knott. 1974. A cluster analysis method for grouping means in the analysis of variance. Biometrics(1974), 507–512

  78. [78]

    Lin Shi, Fangwen Mu, Yumin Zhang, Ye Yang, Junjie Chen, Xiao Chen, Hanzhi Jiang, Ziyou Jiang, and Qing Wang. 2022. BugListener: Identifying and Synthesizing Bug Reports from Collaborative Live Chats. In 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 299–311. https://doi.org/10.1145/3510003.3510108

  80. [80]

    Rui Shu, Tianpei Xia, Laurie A. Williams, and Tim Menzies. 2019. Better Security Bug Report Classification via Hyperparameter Optimization.CoRRabs/1905.06872 (2019). arXiv:1905.06872 http://arxiv.org/abs/1905.06872

Showing first 80 references.