
arxiv: 1907.11031 · v1 · pith:BBVBUMR3new · submitted 2019-07-25 · 💻 cs.SE

Not All Bugs Are the Same: Understanding, Characterizing, and Classifying the Root Cause of Bugs

Pith reviewed 2026-05-24 16:03 UTC · model grok-4.3

classification 💻 cs.SE
keywords root cause analysis · bug classification · bug reports · software bugs · taxonomy · empirical study · machine learning · bug triage

The pith

Analysis of 1,280 bug reports from 119 projects identifies nine common root causes, which a classifier using report text alone predicts at 64% F-Measure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors manually review bug reports across Mozilla, Apache, and Eclipse projects to create a taxonomy of why bugs occur. This produces nine recurring root cause categories that appear in the studied systems. They then build a machine learning model that reads the text of a report and assigns it to one of the nine categories. The model reaches 64% F-Measure and 74% AUC-ROC overall on the collected data. If the approach holds, developers could receive an immediate suggestion of the likely cause before beginning any investigation or triage.

Core claim

Examination of 1,280 bug reports from 119 projects in three ecosystems shows nine main root causes that are common across the systems. A classification model trained on the textual content of the reports is able to assign new bugs to these categories, achieving 64% F-Measure and 74% AUC-ROC overall.
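To make the two headline metrics concrete, the following sketch computes F-Measure and AUC-ROC for a single category treated as a binary (one-vs-rest) problem. The labels and scores are invented toy values, not data from the paper, and the function names are illustrative only.

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = report belongs to the category, 0 = it does not.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

print(f_measure(y_true, y_pred))
print(auc_roc(y_true, scores))
```

F-Measure depends on the chosen decision threshold, while AUC-ROC summarizes ranking quality across all thresholds, which is why the two numbers can differ for the same model.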

What carries the argument

A taxonomy of nine root cause categories derived from manual inspection of bug report text, used to label data and train a supervised text classifier.
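As an illustration of what "train a supervised text classifier on labeled reports" can look like, here is a minimal sketch using a multinomial naive Bayes model with add-one smoothing. This is a swapped-in technique chosen for brevity; this summary does not state which learner or features the paper uses. The training sentences are invented, and the two labels are borrowed from the figure axis labels purely as examples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over bag-of-words counts (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        tokens = [t for t in tokenize(text) if t in self.vocab]
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled reports standing in for the manually labeled dataset.
train_texts = [
    "button label is truncated in the settings dialog",
    "dialog window renders with wrong font",
    "connection timeout when fetching remote resource",
    "socket closed unexpectedly during http request",
]
train_labels = ["gui-issue", "gui-issue", "network-issue", "network-issue"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("http connection drops after timeout")
print(prediction)
```

The point of the sketch is the dependency it makes visible: the classifier can only be as reliable as the manually assigned labels it is trained on.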

If this is right

  • Bug triage can begin with an automatic suggestion of root cause type rather than starting from raw text.
  • The nine categories supply a shared language for comparing bug patterns across different projects and ecosystems.
  • Fixing effort or prevention techniques can be studied separately for each root cause type.
  • The same labeling process can be repeated on new projects to extend or refine the taxonomy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the nine categories prove stable, they could serve as a target for static analysis tools that detect likely causes before code is committed.
  • Projects outside the three ecosystems might require only a small number of additional categories rather than an entirely new taxonomy.
  • Accuracy might rise if the model also receives metadata such as component or reporter experience in addition to report text.

Load-bearing premise

The text in a bug report is enough for analysts to agree on which of the nine root cause categories applies, and these nine categories describe bugs beyond the 119 projects examined.

What would settle it

A replication in which multiple independent analysts label the same 1,280 reports and obtain low agreement on categories, or a new collection of bug reports from additional projects where many cases fall outside the nine categories.
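One way to quantify the agreement such a replication would measure is Cohen's kappa for two annotators. The sketch below is an assumption about how the check could be run; the statistic is not named in the text above, and the labels are invented toy values.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Expected agreement if each annotator labeled at random with their own
    # marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two independent annotators for six reports.
annotator_a = ["gui", "gui", "network", "network", "test", "test"]
annotator_b = ["gui", "network", "network", "network", "test", "gui"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa)
```

A value near 0 would indicate agreement no better than chance, while a value near 1 would support the premise that report text suffices for consistent labeling. With more than two annotators, a multi-rater statistic would be needed instead.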

Figures

Figures reproduced from arXiv: 1907.11031 by Andy Zaidman, Fabio Palomba, Filomena Ferrucci, Gemma Catolino.

Figure 1: Bug reported and reopened in Apache HBase.
Figure 2: Diffusion of root causes extracted from the 1,139 analyzed bug reports. The most frequent is the Functional Issue, which covers almost half of the dataset (41.3%). The authors consider this expected, since most reported problems plausibly relate to developers actively implementing new features or enhancing existing ones.
Figure 3: RQ2 - Box plots reporting the Delay Before Response (DBR) for each identified bug root cause. Category labels: conf.-issue, network-issue, db-issue, gui-issue, perf.-issue, perm.-depr.-issue, sec.-issue, program-issue, test-issue.
Figure 4: RQ2 - Box plots reporting the Delay Before Assigned (DBA) for each identified bug root cause.
Figure 5: RQ2 - Box plots reporting the Delay Before Change (DBC) for each identified bug root cause. Category labels: conf.-issue, network-issue, db-issue, gui-issue, perf.-issue, perm.-depr.-issue, sec.-issue, program-issue, test-issue.
Figure 6: RQ2 - Box plots reporting the Duration of Bug Fixing (DBF) for each identified bug root cause.
Figure 7: RQ2 - Box plots reporting the Delay After Change (DAC) for each identified bug root cause.
Original abstract

Modern version control systems such as Git or SVN include bug tracking mechanisms, through which developers can highlight the presence of bugs through bug reports, i.e., textual descriptions reporting the problem and what are the steps that led to a failure. In past and recent years, the research community deeply investigated methods for easing bug triage, that is, the process of assigning the fixing of a reported bug to the most qualified developer. Nevertheless, only a few studies have reported on how to support developers in the process of understanding the type of a reported bug, which is the first and most time-consuming step to perform before assigning a bug-fix operation. In this paper, we target this problem in two ways: first, we analyze 1,280 bug reports of 119 popular projects belonging to three ecosystems such as Mozilla, Apache, and Eclipse, with the aim of building a taxonomy of the root causes of reported bugs; then, we devise and evaluate an automated classification model able to classify reported bugs according to the defined taxonomy. As a result, we found nine main common root causes of bugs over the considered systems. Moreover, our model achieves high F-Measure and AUC-ROC (64% and 74% on overall, respectively).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper manually analyzes 1,280 bug reports from 119 projects across Mozilla, Apache, and Eclipse to derive a taxonomy of nine root causes of bugs, then trains and evaluates a classifier on textual features of the reports, claiming overall F-Measure of 64% and AUC-ROC of 74%.

Significance. If the taxonomy proves reproducible and the classifier metrics hold under proper validation, the work would offer a concrete, multi-ecosystem taxonomy and a practical starting point for automated root-cause classification to support bug triage. The scale of the manual analysis across three ecosystems is a strength that could support broader applicability claims.

major comments (3)
  1. [Abstract] Abstract: performance numbers (64% F-Measure, 74% AUC-ROC) are stated without any description of the labeling procedure, number of annotators, inter-rater agreement, disagreement resolution, feature engineering, train-test split, or baseline comparisons. These omissions make the numbers unverifiable and directly undermine the central claim that the model achieves the reported performance on the derived taxonomy.
  2. [Taxonomy construction / manual analysis] Taxonomy construction section (manual analysis of 1,280 reports): no information is supplied on annotator count, annotation guidelines, or inter-rater agreement. Because the nine categories are defined from these labels and then used as ground truth for the classifier, the absence of reliability metrics is load-bearing for both the taxonomy and all downstream results.
  3. [Evaluation / results] Evaluation section: the claim that the nine categories are 'main common root causes' across the considered systems requires evidence that the categories are stable and not artifacts of individual annotator interpretation; without agreement statistics or a reproducibility check, the generalization statement in the abstract cannot be assessed.
minor comments (2)
  1. [Evaluation] Clarify whether the nine categories are mutually exclusive or allow multi-label assignment, and report per-category performance to show whether the overall F-Measure is driven by a few dominant classes.
  2. [Taxonomy presentation] The manuscript should include a table listing the nine root-cause categories with brief definitions and example bug-report excerpts to make the taxonomy concrete for readers.
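The concern in minor comment 1, that an overall F-Measure may be driven by a few dominant classes, can be illustrated with a short sketch comparing per-class, macro-averaged, and support-weighted F1. The data is an invented, deliberately imbalanced toy set; the two label names are borrowed from the figure axis labels as examples, and which averaging scheme the paper uses is not stated here.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in the truth or the predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    result = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        result[label] = 2 * tp / denom if denom else 0.0
    return result


def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 treats classes equally; weighted F1 weights by class support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in scores) / len(y_true)
    return macro, weighted


# Hypothetical imbalanced data: a model that always predicts the majority class.
y_true = ["program-issue"] * 8 + ["sec.-issue"] * 2
y_pred = ["program-issue"] * 10

scores = per_class_f1(y_true, y_pred)
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(scores)
print(macro, weighted)
```

In this toy case the weighted average looks respectable even though the minority class is never predicted correctly, which is the situation per-category reporting would expose.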

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed feedback on methodological transparency. We address each major comment below and will revise the manuscript accordingly where the original study design permits.

Point-by-point responses
  1. Referee: [Abstract] Abstract: performance numbers (64% F-Measure, 74% AUC-ROC) are stated without any description of the labeling procedure, number of annotators, inter-rater agreement, disagreement resolution, feature engineering, train-test split, or baseline comparisons. These omissions make the numbers unverifiable and directly undermine the central claim that the model achieves the reported performance on the derived taxonomy.

    Authors: We agree that the abstract omits key methodological details. In the revised version we will expand the abstract to briefly describe the labeling procedure, annotator involvement, train-test split, feature engineering, and baseline comparisons, while retaining full details in the body. This directly addresses verifiability of the reported metrics. revision: yes

  2. Referee: [Taxonomy construction / manual analysis] Taxonomy construction section (manual analysis of 1,280 reports): no information is supplied on annotator count, annotation guidelines, or inter-rater agreement. Because the nine categories are defined from these labels and then used as ground truth for the classifier, the absence of reliability metrics is load-bearing for both the taxonomy and all downstream results.

    Authors: We acknowledge the omission in the taxonomy construction section. The revision will add a description of the annotation guidelines and annotator count (primarily one author with co-author review and disagreement resolution via discussion). A formal inter-rater agreement statistic was not computed in the original study and therefore cannot be supplied. revision: partial

  3. Referee: [Evaluation / results] Evaluation section: the claim that the nine categories are 'main common root causes' across the considered systems requires evidence that the categories are stable and not artifacts of individual annotator interpretation; without agreement statistics or a reproducibility check, the generalization statement in the abstract cannot be assessed.

    Authors: We agree that stability evidence would strengthen the generalization claim. The revised evaluation section will add discussion of how the nine categories emerged consistently across the three ecosystems and any available reproducibility considerations from the manual analysis. revision: partial

standing simulated objections not resolved
  • Absence of computed inter-rater agreement statistics for the manual labeling, which prevents supplying quantitative reliability metrics for the taxonomy.

Circularity Check

0 steps flagged

No circularity: taxonomy and classifier derived from independent manual labeling process

Full rationale

The paper's derivation consists of manual analysis of 1,280 bug reports to induce a 9-category taxonomy, followed by training and evaluating a supervised classifier on those labels. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the provided text. The taxonomy construction and model evaluation are standard empirical steps that do not reduce to each other by construction; the central claims rest on the (unreported) labeling process and performance metrics rather than any definitional loop or self-referential citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study relying on manual labeling of bug-report text; no mathematical derivations or invented physical entities.

axioms (1)
  • domain assumption: Bug report text contains enough information for accurate root-cause labeling by human readers
    The entire taxonomy and subsequent classifier rest on this premise.


