Is this Build Failure Related to my Patch? An Empirical Study of Unrelated Build Failures in Continuous Integration

Andie Huang; Daniel Alencar da Costa; Grant Dick; Mariam El Mezouar

arxiv: 2605.05564 · v1 · submitted 2026-05-07 · 💻 cs.SE

Pith reviewed 2026-05-08 09:22 UTC · model grok-4.3

classification 💻 cs.SE

keywords continuous integration · build failures · unrelated failures · PU learning · semi-supervised learning · empirical study · Apache projects · feature importance

The pith

Semi-supervised learning models predict whether a CI build failure is unrelated to the triggering code change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines cases in continuous integration where a build fails for reasons other than the recent code push that triggered it. Developers currently spend a median of four hours determining if each failure requires action on their change. The authors analyze thousands of failures across seven Apache projects and sample hundreds of confirmed unrelated cases to understand patterns, including that unrelated test failures account for twenty percent of developer-classified unrelated builds. They extract thirty-three features from issue reports, comments, and commits, then train semi-supervised Positive and Unlabeled models to predict unrelated failures. If effective, the models would let developers and CI tools quickly set aside failures that do not need investigation of the current patch.

Core claim

The authors extract 33 features from issue reports, issue comments, and commits associated with the triggering push. They build semi-supervised Positive and Unlabeled learning models for each of seven Apache projects. These models predict unrelated build failures and achieve precision from 0.70 to 0.88, recall from 0.30 to 1.00, F1-score from 0.44 to 0.91, and AUC from 0.63 to 0.97. Feature importance analysis identifies CI latency, repeated error messages, and the number of preceding comments as useful indicators.
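The F1 range quoted above follows from how F1 combines precision and recall. The sketch below is illustrative only: the (precision, recall) pairs are hypothetical values picked from inside the reported ranges, not per-project results from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs inside the reported ranges.
# These are NOT per-project numbers taken from the paper.
examples = [(0.80, 0.30), (0.88, 0.95)]

for p, r in examples:
    print(f"precision={p:.2f} recall={r:.2f} -> F1={f1_score(p, r):.2f}")
# precision=0.80 recall=0.30 -> F1=0.44
# precision=0.88 recall=0.95 -> F1=0.91
```

Because F1 is a harmonic mean, a recall of 0.30 pulls it down sharply even when precision is high, which is why the low end of the F1 range sits so close to the low end of the recall range.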

What carries the argument

Semi-supervised Positive and Unlabeled (PU) learning models trained on 33 features drawn from issue reports, comments, and commits.
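This page does not say which PU algorithm the authors use. As a minimal sketch of the general idea, the code below shows the score correction from one well-known PU technique (Elkan and Noto, 2008), which assumes labeled positives are a random sample of all positives. Here "positive" would mean "unrelated build failure". The function names and all scores are hypothetical.

```python
from statistics import mean


def estimate_label_frequency(scores_on_heldout_positives):
    """Estimate c = P(labeled | positive).

    Elkan-Noto estimator: average the score of a classifier trained to
    separate labeled from unlabeled examples over held-out labeled positives.
    """
    c = mean(scores_on_heldout_positives)
    if c <= 0:
        raise ValueError("label frequency estimate must be positive")
    return c


def corrected_probability(score, c):
    """Turn P(labeled | x) into an estimate of P(positive | x), capped at 1."""
    return min(score / c, 1.0)


# Hypothetical classifier scores, for illustration only.
heldout_positive_scores = [0.5, 0.4, 0.6]
c = estimate_label_frequency(heldout_positive_scores)

for raw in (0.2, 0.45, 0.6):
    print(f"raw={raw:.2f} -> corrected={corrected_probability(raw, c):.2f}")
```

The correction is only valid under the random-labeling assumption, which is the kind of PU assumption the referee report asks the authors to validate.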

If this is right

Developers could receive automatic signals that a failure is unlikely to stem from their patch and skip unnecessary debugging.
CI pipelines could prioritize or route only probable related failures for immediate attention.
Repeated error messages and build latency emerge as practical signals that teams can monitor without full model retraining.
The models demonstrate that partial labeling of data suffices for useful prediction across multiple projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same approach might be combined with automated root-cause tools to suggest the actual source of an unrelated failure.
Teams could use early predictions to pause or reconfigure flaky test suites before full builds complete.
Feature sets focused on timing and repetition could transfer to other environments where build noise is common.

Load-bearing premise

The 33 features from issue reports, comments, and commits plus the sampled labeling of unrelated failures are representative enough to train models that generalize across builds and projects.

What would settle it

Applying the trained models to a new collection of build failures from the same Apache projects and measuring whether precision, recall, and AUC stay within the reported ranges or drop sharply.

Figures

Figures reproduced from arXiv: 2605.05564 by Andie Huang, Daniel Alencar da Costa, Grant Dick, Mariam El Mezouar.

**Figure 1.** An example of a CI bot comment in a historical issue report after the comple…

**Figure 2.** Approach of Heuristic-based Labeling (HL)

**Figure 3.** The overview of the process in our study.

**Figure 4.** Time misspent on identifying unrelated build failures

**Figure 5.** An overview of the document analysis process

**Figure 6.** Dendrogram of hierarchical clustering based on Spearman correlation coef…

**Figure 7.** The Process Flow for Constructing the P, Q, and N Datasets.

**Figure 8.** The distribution of representative samples (371) across each theme and…

**Figure 9.** Performance metrics of the four selected models across the seven studied…

**Figure 10.** Example JIRA Issue Report showing how Priority, Is Blocker, Is Dependened, and Number of Parallel Issues are extracted. Number of Parallel Issues: for each issue report, we calculate the number of issues that were opened on the same day, based on the report's creation date, and define this as the number of parallel opened issues. Is Cross Projects: As shown in…

**Figure 11.** Example of Calculating the Number of Prior Comments

**Figure 12.** Example of Retrieving the Failed Classes from Build Logs

**Figure 13.** Example of Retrieving the Has Code Patch and CI Latency
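The Figure 10 caption defines Number of Parallel Issues as the number of issues opened on the same day as a given report. A minimal sketch of that computation follows. It assumes the count includes the issue itself, which the caption does not settle. The issue keys, timestamps, and function name are hypothetical.

```python
from collections import Counter
from datetime import datetime


def parallel_issue_counts(issues):
    """Map each issue key to the number of issues created on the same day.

    `issues` maps an issue key to an ISO 8601 creation timestamp.
    Assumption: the count includes the issue itself.
    """
    day_of = {key: datetime.fromisoformat(ts).date() for key, ts in issues.items()}
    per_day = Counter(day_of.values())
    return {key: per_day[day] for key, day in day_of.items()}


# Hypothetical issue reports, for illustration only.
issues = {
    "DEMO-1": "2024-01-05T09:15:00",
    "DEMO-2": "2024-01-05T17:40:00",
    "DEMO-3": "2024-01-06T08:00:00",
}
print(parallel_issue_counts(issues))
# {'DEMO-1': 2, 'DEMO-2': 2, 'DEMO-3': 1}
```

Subtracting one from each value gives the variant that excludes the issue itself.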

Original abstract

Continuous Integration (CI) systems often run many builds concurrently. In this setting, a legitimate build failure may not be caused by the code push that triggered it. Such unrelated build failures can waste developer effort because developers must determine whether the failure is actionable for their current change. We study 77,354 CI build failures from seven open source Apache projects to understand and predict unrelated build failures. We find that developers spend a median of 4 hours identifying whether a failure is related or unrelated to their push. We also perform a document analysis of 371 confirmed unrelated build failures sampled from 10,316 potentially unrelated failures. The analysis shows that unrelated test failures account for 20% of the cases in which developers classify build failures as unrelated. To predict unrelated build failures, we extract 33 features from issue reports, issue comments, and commits associated with the triggering push. We build semi-supervised Positive and Unlabeled (PU) learning models for seven Apache projects. The models achieve precision from 0.70 to 0.88, recall from 0.30 to 1.00, F1-score from 0.44 to 0.91, and AUC from 0.63 to 0.97. Feature importance analysis shows that CI latency, repeated error messages, and the number of preceding comments are useful indicators of unrelated build failures. These results show that PU learning can help developers identify build failures that are unlikely to be caused by their current push.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This study quantifies time wasted on unrelated CI build failures and tests PU learning for prediction, but high variance in recall limits how far the results generalize.

The punchline is that unrelated build failures cost developers real time (a median of four hours to diagnose), and the authors give new numbers on how often test failures are the culprit. They also show PU learning can flag some of these with usable precision on Apache projects.

What the paper does well is collect a large set of 77k failures and do a document analysis on a sample of 371 confirmed unrelated cases. The 20% statistic on unrelated test failures is a fresh data point. Extracting 33 features from issues and commits and running per-project PU models produces concrete metrics, with feature importance highlighting CI latency and error repetition as useful signals. The approach matches the semi-supervised nature of the data.

The soft spots are in the results themselves. Recall varies from 0.30 to 1.00, which is a wide gap and suggests the fixed feature set or the limited positives (371 total) do not produce stable models everywhere. Because everything is done within each project, there is no evidence the predictors would work on a new project without retraining. The abstract and stress test note leave open questions about how the 371 cases were sampled from 10k and whether PU assumptions were checked.

This paper is for software engineering researchers who study CI practices or build tools to reduce developer friction. A reader looking for empirical baselines on build failure diagnosis would get value from the time and percentage figures. It deserves a serious referee because the scale of the data collection is substantial and the modeling is a reasonable fit for the problem, even with the variance. I would recommend sending it to peer review, with the expectation that reviewers will ask for more on cross-project testing and labeling details.
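The letter asks how the 371 cases were drawn from roughly 10k candidates. This page does not state the sampling parameters. As a hedged check, the sketch below applies a common convention for representative samples (Cochran's formula with a finite population correction, 95% confidence, 5% margin of error) to the 10,316 candidates. The parameters and function name are assumptions, not something the paper is known to have used.

```python
import math


def representative_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with finite population correction.

    z=1.96 (95% confidence), margin=0.05 and p=0.5 are assumed defaults,
    not parameters confirmed by the paper.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


print(representative_sample_size(10_316))  # 371
```

The result matches the reported 371, which is consistent with a 95%/5% sample but does not confirm it. Labeling criteria and rater agreement remain open questions either way.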

Referee Report

3 major / 1 minor

Summary. The paper empirically studies unrelated build failures in CI systems across seven Apache projects, analyzing 77,354 build failures to show that developers spend a median of 4 hours determining relatedness. It performs document analysis on 371 sampled unrelated failures (from 10,316 candidates), finding that unrelated test failures comprise 20% of cases. Using 33 features from issue reports, comments, and commits, it trains per-project semi-supervised PU learning models that achieve precision 0.70-0.88, recall 0.30-1.00, F1 0.44-0.91, and AUC 0.63-0.97, with feature importance analysis identifying CI latency, repeated error messages, and preceding comments as useful predictors.

Significance. If the models hold, the work offers a practical way to reduce wasted developer time on non-actionable CI failures by leveraging observable artifacts and PU learning to handle limited labels. Strengths include concrete metrics on real Apache project data, identification of actionable features, and addressing a common CI pain point with semi-supervised methods. The per-project evaluation and variance analysis provide a starting point for tool support, though broader impact depends on addressing generalizability.

major comments (3)

[Evaluation] The evaluation of the PU learning models reports recall ranging from 0.30 to 1.00 across projects (with F1 as low as 0.44). This variance indicates that the fixed 33-feature set may fail to capture project-specific failure patterns, especially since models are trained and evaluated separately per project without any cross-project transfer or held-out validation experiments.
[Methodology] The sampling of 371 confirmed unrelated build failures from 10,316 potentially unrelated ones, combined with the PU learning setup, requires explicit details on labeling criteria, inter-rater agreement, and validation of PU assumptions (e.g., the positive-unlabeled distribution). Without these, the representativeness of the labeled set and potential sampling bias cannot be assessed, directly affecting the reliability of the reported metrics.
[Results] The central claim that the approach can help developers identify unrelated failures rests on per-project models; however, the absence of cross-project generalization tests means the headline performance numbers do not establish applicability to new projects or varying CI conditions, as noted by the wide metric ranges.

minor comments (1)

[Abstract] The abstract mentions the 33 features and 371 cases but provides limited transparency on the exact feature extraction process or sampling strategy; expanding this would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our approach where appropriate and indicating planned revisions to improve clarity and completeness.

Referee: [Evaluation] The evaluation of the PU learning models reports recall ranging from 0.30 to 1.00 across projects (with F1 as low as 0.44). This variance indicates that the fixed 33-feature set may fail to capture project-specific failure patterns, especially since models are trained and evaluated separately per project without any cross-project transfer or held-out validation experiments.

Authors: We agree that the observed variance in recall and F1 scores reflects real differences in project-specific CI failure patterns, which motivated our per-project modeling strategy rather than a single global model. The 33 features were selected as commonly available signals across Apache projects to enable practical application. We will revise the evaluation section to include an explicit discussion of this variance as a key finding and its implications. Additionally, we will add a cross-project leave-one-out experiment to quantify transfer performance and report it in the revised manuscript. revision: yes
Referee: [Methodology] The sampling of 371 confirmed unrelated build failures from 10,316 potentially unrelated ones, combined with the PU learning setup, requires explicit details on labeling criteria, inter-rater agreement, and validation of PU assumptions (e.g., the positive-unlabeled distribution). Without these, the representativeness of the labeled set and potential sampling bias cannot be assessed, directly affecting the reliability of the reported metrics.

Authors: We will expand the methodology and document analysis sections to provide explicit labeling criteria, describing how unrelatedness was determined from issue reports, comments, and commit context for the sampled failures. We will also report the author review process used for the 371 cases and any agreement measures obtained. For the PU learning setup, we will add a dedicated subsection validating the assumptions by discussing the selection of positives from the 10,316 candidates and the nature of the unlabeled set, following established PU learning practices. These additions will allow readers to better assess representativeness and bias. revision: yes
Referee: [Results] The central claim that the approach can help developers identify unrelated failures rests on per-project models; however, the absence of cross-project generalization tests means the headline performance numbers do not establish applicability to new projects or varying CI conditions, as noted by the wide metric ranges.

Authors: We acknowledge that the wide metric ranges (particularly recall 0.30-1.00) indicate limited direct generalizability, and our claims are scoped to the seven studied projects where per-project models can be trained on historical data. The central contribution is demonstrating that PU learning on observable CI artifacts can reduce wasted effort within such projects. We will revise the results and threats-to-validity sections to more prominently state this scope as a limitation and to discuss the conditions under which the approach is expected to apply. No claim of universal applicability is made in the current manuscript. revision: yes
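The cross-project leave-one-out experiment promised in the first response could be organized as in the sketch below: each project is held out in turn, and a model is trained on the remaining projects. This is a generic split, not the authors' protocol. The record layout, project names, and function name are hypothetical.

```python
def leave_one_project_out(records):
    """Yield (held_out_project, train, test) splits.

    Each split tests on exactly one project and trains on all the others,
    so no build from the test project leaks into training.
    """
    projects = sorted({r["project"] for r in records})
    for held_out in projects:
        train = [r for r in records if r["project"] != held_out]
        test = [r for r in records if r["project"] == held_out]
        yield held_out, train, test


# Hypothetical build-failure records, for illustration only.
records = [
    {"project": "A", "build": 1},
    {"project": "A", "build": 2},
    {"project": "B", "build": 3},
    {"project": "C", "build": 4},
]

for held_out, train, test in leave_one_project_out(records):
    print(held_out, len(train), len(test))
# A 2 2
# B 3 1
# C 3 1
```

Reporting metrics per held-out project, rather than pooled, would show whether the wide recall range persists under transfer.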

Circularity Check

0 steps flagged

No circularity: empirical PU models report performance on collected CI data without tautological reduction.

full rationale

The paper collects 77,354 build failures, manually confirms 371 unrelated cases sampled from 10,316 potentially unrelated failures, extracts 33 observable features from issues/commits/comments, and trains per-project PU classifiers whose precision/recall/F1/AUC values are measured directly on that labeled data. No derivation step equates a claimed prediction to its own fitted inputs by construction, no uniqueness theorem or ansatz is smuggled via self-citation, and the central results remain independent empirical measurements rather than definitional restatements. The observed metric variance reflects data characteristics, not circular logic.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on assumptions about data representativeness and feature relevance typical in empirical ML studies for software engineering.

free parameters (1)

PU learning model hyperparameters
Semi-supervised models require tuning parameters that are fitted to the project data.

axioms (1)

domain assumption: The sampled 371 unrelated failures and 10,316 potentially unrelated cases are representative of all CI build failures.
Invoked in the document analysis and model training sections.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages

[1]

Empirical analysis of practitioners’ perceptions of test flakiness factors

AHMAD, A., LEIFLER, O.,ANDSANDAHL, K. Empirical analysis of practitioners’ perceptions of test flakiness factors. Software Testing, Verificationand Reliability 31, 8 (2021), e1791

work page 2021
[2]

A., COGO, F

AJIBODE, A., BANGASH, A. A., COGO, F. R., ADAMS, B.,ANDHASSAN, A. E. Towards se- mantic versioning of open pre-trained language model releases on hugging face. Empirical Software Engineering 30, 3 (2025), 1–63

work page 2025
[3]

Continuous integration and continuous delivery pipeline automa- tion for agile software project management

ARACHCHI, S.,ANDPERERA, I. Continuous integration and continuous delivery pipeline automa- tion for agile software project management. In 2018 Moratuwa Engineering Research Conference (MERCon) (2018), IEEE, pp. 156–161

work page 2018
[4]

Deflaker: Automatically detecting flaky tests

BELL, J., LEGUNSEN, O., HILTON, M., ELOUSSI, L., YUNG, T.,ANDMARINOV, D. Deflaker: Automatically detecting flaky tests. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE) (2018), IEEE, pp. 433–444

work page 2018
[5]

H.,DACOSTA, D

BERNARDO, J. H.,DACOSTA, D. A.,ANDKULESZA, U. Studying the impact of adopting con- tinuous integration on the delivery time of pull requests. In Proceedings of the 15th International Conference on Mining Software Repositories (2018), pp. 131–141

work page 2018
[6]

BOWEN, G. A. Document analysis as a qualitative research method. Qualitative research journal (2009)

work page 2009
[7]

Random forests

BREIMAN, L. Random forests. Machine learning 45, 1 (2001), 5–32

work page 2001
[8]

Buildfast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration

CHEN, B., CHEN, L., ZHANG, C.,ANDPENG, X. Buildfast: History-aware build outcome prediction for fast feedback and reduced cost in continuous integration. In Proceedings of the 35th IEEE/ACM international conference on automated software engineering (2020), pp. 42–53

work page 2020
[9]

K.,ANDAKSEL, G

C ¸ORBACIO ˘GLU, S ¸ . K.,ANDAKSEL, G. Receiver operating characteristic curve analysis in diag- nostic accuracy studies: A guide to interpreting the area under the curve value. Turkish Journal of Emergency Medicine 23, 4 (2023), 195

work page 2023
[10]

P., MISAROS, M., GOTA, D.,ANDMICLEA, L

DONCA, I.-C., STAN, O. P., MISAROS, M., GOTA, D.,ANDMICLEA, L. Method for continuous integration and deployment using a pipeline generator for agile software projects. Sensors 22, 12 (2022), 4637

work page 2022
[11]

M., MATYAS, S.,ANDGLOVER, A

DUVALL, P. M., MATYAS, S.,ANDGLOVER, A. Continuous integration: improving software quality and reducing risk. Pearson Education, 2007

work page 2007
[12]

Strength of evidence in systematic reviews in software engineering

DYB ˚A, T.,ANDDINGSØYR, T. Strength of evidence in systematic reviews in software engineering. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement (2008), pp. 178–187

work page 2008
[13]

Understanding flaky tests: The developer’s perspective

ECK, M., PALOMBA, F., CASTELLUCCIO, M.,ANDBACCHELLI, A. Understanding flaky tests: The developer’s perspective. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2019), pp. 830–840

work page 2019
[14]

E.,ANDZOU, Y

EHSAN, O., HASSAN, S., MEZOUAR, M. E.,ANDZOU, Y. An empirical study of developer discus- sions in the gitter platform. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 1 (2020), 1–39. 38 Yonghui (Andie) Huang · Daniel Alencar da Costa · Grant Dick · Mariam El Mezouar

work page 2020
[15]

A., GERMAN, D

ELMEZOUAR, M.,DACOSTA, D. A., GERMAN, D. M.,ANDZOU, Y. Exploring the use of chatrooms by developers: An empirical study on slack and gitter. IEEE Transactions on Software Engineering 48, 10 (2021), 3988–4001

work page 2021
[16]

S., LOWLIND, D., ERNST, N

ELAZHARY, O., WERNER, C., LI, Z. S., LOWLIND, D., ERNST, N. A.,ANDSTOREY, M.-A. Uncovering the benefits and challenges of continuous integration practices. IEEE Transactions on Software Engineering 48, 7 (2021), 2570–2583

work page 2021
[17]

Techniques for improving regression testing in continuous integration development environments

ELBAUM, S., ROTHERMEL, G.,ANDPENIX, J. Techniques for improving regression testing in continuous integration development environments. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (2014), pp. 235–245

work page 2014
[18]

Learning classifiers from only positive and unlabeled data

ELKAN, C.,ANDNOTO, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008), pp. 213–220

work page 2008
[19]

Determining flaky tests from test failures

ELOUSSI, L. Determining flaky tests from test failures

work page
[20]

A., CARTAXO, B.,ANDPINTO, G

FELIDR ´E, W., FURTADO, L.,DACOSTA, D. A., CARTAXO, B.,ANDPINTO, G. Continuous in- tegration theater. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2019), IEEE, pp. 1–10

work page 2019
[21]

Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement

FORMAN, G.,ANDSCHOLZ, M. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. Acm Sigkdd Explorations Newsletter 12, 1 (2010), 49–57

work page 2010
[22]

Continuous integration, 2006

FOWLER, M.,ANDFOEMMEL, M. Continuous integration, 2006

work page 2006
[23]

H., MONTES-YG ´OMEZ, M., ROSSO, P.,ANDCABRERA, R

FUSILIER, D. H., MONTES-YG ´OMEZ, M., ROSSO, P.,ANDCABRERA, R. G. Detecting positive and negative deceptive opinions using pu-learning. Information processing & management 51, 4 (2015), 433–443

work page 2015
[24]

Improving the robustness and efficiency of continuous integration and deployment

GALLABA, K. Improving the robustness and efficiency of continuous integration and deployment. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2019), IEEE, pp. 619–623

work page 2019
[25]

A., DACOSTA, D

GHALEB, T. A., DACOSTA, D. A.,ANDZOU, Y. An empirical study of the long duration of continuous integration builds. Empirical Software Engineering 24, 4 (2019), 2102–2139

work page 2019
[26]

A.,DACOSTA, D

GHALEB, T. A.,DACOSTA, D. A., ZOU, Y.,ANDHASSAN, A. E. Studying the impact of noises in build breakage data. IEEE Transactions on Software Engineering 47, 9 (2019), 1998–2011

work page 2019
[27]

A., HASSAN, S.,ANDZOU, Y

GHALEB, T. A., HASSAN, S.,ANDZOU, Y. Studying the interplay between the durations and breakages of continuous integration builds. IEEE Transactions on Software Engineering 49, 4 (2022), 2476–2497

work page 2022
[28]

An exploratory study of the pull-based software development model

GOUSIOS, G., PINZGER, M.,ANDDEURSEN, A.V. An exploratory study of the pull-based software development model. In Proceedings of the 36th international conference on software engineering (2014), pp. 345–355

work page 2014
[29]

Are there socioeconomic differentials in under-reporting of smoking in pregnancy? Tobacco Control 12, 4 (2003), 434–434

GRAHAM, H.,ANDOWEN, L. Are there socioeconomic differentials in under-reporting of smoking in pregnancy? Tobacco Control 12, 4 (2003), 434–434

work page 2003
[30]

A.,ANDMCNEIL, B

HANLEY, J. A.,ANDMCNEIL, B. J. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology 143, 1 (1982), 29–36

work page 1982
[31]

Tackling build failures in continuous integration

HASSAN, F. Tackling build failures in continuous integration. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2019), IEEE, pp. 1242–1245

work page 2019
[32]

A comparative study to benchmark cross- project defect prediction approaches

HERBOLD, S., TRAUTSCH, A.,ANDGRABOWSKI, J. A comparative study to benchmark cross- project defect prediction approaches. In Proceedings of the 40th international conference on software engineering (2018), pp. 1063–1063

work page 2018
[33]

Motivation of software developers in open source projects: an internet-based survey of contributors to the linux kernel

HERTEL, G., NIEDNER, S.,ANDHERRMANN, S. Motivation of software developers in open source projects: an internet-based survey of contributors to the linux kernel. Research policy 32, 7 (2003), 1159–1177

work page 2003
[34]

Pu learning for matrix completion

HSIEH, C.-J., NATARAJAN, N.,ANDDHILLON, I. Pu learning for matrix completion. In International conference on machine learning (2015), PMLR, pp. 2445–2453

work page 2015
[35]

unrelated-build-failures-empirical-studies, 2024.https://github.com/ckeys/ unrelated-build-failures-empirical-studies(Accessed: 2024-11-26)

HUANG, A. unrelated-build-failures-empirical-studies, 2024.https://github.com/ckeys/ unrelated-build-failures-empirical-studies(Accessed: 2024-11-26)

work page 2024
[36]

A., ZHANG, F.,ANDZOU, Y

HUANG, Y.,DACOSTA, D. A., ZHANG, F.,ANDZOU, Y. An empirical study on the issue reports with questions raised during the issue resolving process.Empirical Software Engineering 24, 2 (2019), 718–750

work page 2019
[37]

Evaluating learning algorithms: a classification perspective

JAPKOWICZ, N.,ANDSHAH, M. Evaluating learning algorithms: a classification perspective. Cam- bridge University Press, 2011

work page 2011
[38]

An ex- tended study of syntactic breaking changes in the wild

JAYASURIYA, D., OU, S., HEGDE, S., TERRAGNI, V., DIETRICH, J.,ANDBLINCOE, K. An ex- tended study of syntactic breaking changes in the wild. Empirical Software Engineering 30, 2 (2025), 1–45. Title Suppressed Due to Excessive Length 39

work page 2025
[39]

The impact of automated feature selection techniques on the interpretation of defect models

JIARPAKDEE, J., TANTITHAMTHAVORN, C.,ANDTREUDE, C. The impact of automated feature selection techniques on the interpretation of defect models. Empirical Software Engineering 25, 5 (2020), 3590–3638

work page 2020
[40]

A cost-efficient approach to building in continuous integration

JIN, X.,ANDSERVANT, F. A cost-efficient approach to building in continuous integration. In Proceedings of the ACM/IEEE 42nd International conference on software engineering (2020), pp. 13– 25

work page 2020
[41]

A systematic review of systematic review process research in software engineering

KITCHENHAM, B.,ANDBRERETON, P. A systematic review of systematic review process research in software engineering. Information and software technology 55, 12 (2013), 2049–2075

work page 2013
[42]

Support- ing continuous integration by code-churn based test selection

KNAUSS, E., STARON, M., MEDING, W., S ¨ODER, O., NILSSON, A.,ANDCASTELL, M. Support- ing continuous integration by code-churn based test selection. In 2015 IEEE/ACM 2nd International Workshop on Rapid Continuous Software Engineering (2015), IEEE, pp. 19–25

work page 2015
[43]

Measuring the cost of regression testing in practice: A study of java projects using continuous integration

LABUSCHAGNE, A., INOZEMTSEVA, L.,ANDHOLMES, R. Measuring the cost of regression testing in practice: A study of java projects using continuous integration. In Proceedings of the 2017 11th joint meeting on foundations of software engineering (2017), pp. 821–830

work page 2017
[44]

When life gives you oranges: detecting and diagnosing intermittent job failures at mozilla

LAMPEL, J., JUST, S., APEL, S.,ANDZELLER, A. When life gives you oranges: detecting and diagnosing intermittent job failures at mozilla. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2021), pp. 1381–1392

work page 2021
[45]

Collaboration tools for global software engineering

LANUBILE, F., EBERT, C., PRIKLADNICKI, R.,ANDVIZCA ´INO, A. Collaboration tools for global software engineering. IEEE software 27, 2 (2010), 52

work page 2010
[46]

Random forests

LEO,ANDBREIMAN. Random forests. Machine Learning (2001)

work page 2001
[47]

Weighted reward for reinforcement learning based test case prioritization in continuous integration testing

LI, G., YANG, Y., WU, Z., CAO, T., LIU, Y.,ANDLI, Z. Weighted reward for reinforcement learning based test case prioritization in continuous integration testing. In 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC) (2021), IEEE, pp. 980–985

work page 2021
[48]

A positive and unlabeled learning algorithm for one-class clas- sification of remote-sensing data

LI, W., GUO, Q.,ANDELKAN, C. A positive and unlabeled learning algorithm for one-class clas- sification of remote-sensing data. IEEE transactions on geoscience and remote sensing 49, 2 (2010), 717–725

work page 2010
[49]

Learning to classify texts using positive and unlabeled data

LI, X.,ANDLIU, B. Learning to classify texts using positive and unlabeled data. In IJCAI (2003), vol. 3, Citeseer, pp. 587–592

work page 2003
[50]

X., AIKEN, A.,ANDJORDAN, M

LIBLIT, B., NAIK, M., ZHENG, A. X., AIKEN, A.,ANDJORDAN, M. I. Scalable statistical bug isolation. Acm Sigplan Notices 40, 6 (2005), 15–26

work page 2005
[51]

S., YU, P

LIU, B., LEE, W. S., YU, P. S.,ANDLI, X. Partially supervised classification of text documents. In ICML (2002), vol. 2, Sydney, NSW, pp. 387–394

work page 2002
[52]

M., BOURAFFA, A.,ANDMAALEJ, W

L ¨UDERS, C. M., BOURAFFA, A.,ANDMAALEJ, W. Beyond duplicates: Towards understanding and predicting link types in issue tracking systems. In Proceedings of the 19th International Conference on Mining Software Repositories (2022), pp. 48–60

work page 2022
[53]

M., PIETZ, T.,ANDMAALEJ, W

L ¨UDERS, C. M., PIETZ, T.,ANDMAALEJ, W. Automated detection of typed links in issue trackers. In 2022 IEEE 30th International Requirements Engineering Conference (RE) (2022), IEEE, pp. 26– 38

work page 2022
[54]

M., RAATIKAINEN, M., MOTGER, J.,ANDMAALEJ, W

L ¨UDERS, C. M., RAATIKAINEN, M., MOTGER, J.,ANDMAALEJ, W. Openreq issue link map: A tool to visualize issue links in jira. In 2019 IEEE 27th International Requirements Engineering Conference (RE) (2019), IEEE, pp. 492–493

work page 2019
[55]

An empirical analysis of flaky tests

LUO, Q., HARIRI, F., ELOUSSI, L.,ANDMARINOV, D. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering (2014), pp. 643–653

work page 2014
[56]

Predictive test selection

MACHALICA, M., SAMYLKIN, A., PORTH, M.,ANDCHANDRA, S. Predictive test selection. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (2019), IEEE, pp. 91–100

[57] MCINTOSH, S., AND KAMEI, Y. Are fix-inducing changes a moving target? a longitudinal case study of just-in-time defect prediction. In Proceedings of the 40th International Conference on Software Engineering (2018), pp. 560–560

[58] MCINTOSH, S., KAMEI, Y., ADAMS, B., AND HASSAN, A. E. An empirical study of the impact of modern code review practices on software quality. Empirical Software Engineering 21, 5 (2016), 2146–2189

[59] MEYER, M. Continuous integration and its tools. IEEE software 31, 3 (2014), 14–16

[60] NGUYEN, M. N., LI, X.-L., AND NG, S.-K. Positive unlabeled learning for time series classification. In Twenty-Second International Joint Conference on Artificial Intelligence (2011), Citeseer

[61] OLEWICKI, D., NAYROLLES, M., AND ADAMS, B. Towards language-independent brown build detection. In Proceedings of the 44th International Conference on Software Engineering (2022), pp. 2177–2188

[62] PALOMBA, F., AND ZAIDMAN, A. Notice of retraction: Does refactoring of test smells induce fixing flaky tests? In 2017 IEEE international conference on software maintenance and evolution (ICSME) (2017), IEEE, pp. 1–12

[63] PAN, C., AND PRADEL, M. Continuous test suite failure prediction. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (2021), pp. 553–565

[64] PERRY, D. E., SIY, H. P., AND VOTTA, L. G. Parallel changes in large-scale software development: an observational case study. ACM Transactions on Software Engineering and Methodology (TOSEM) 10, 3 (2001), 308–337

[65] RAJBAHADUR, G. K., WANG, S., KAMEI, Y., AND HASSAN, A. E. The impact of using regression models to build defect classifiers. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) (2017), IEEE, pp. 135–145

[66] RAUSCH, T., HUMMER, W., LEITNER, P., AND SCHULTE, S. An empirical analysis of build failures in the continuous integration workflows of java-based open-source software. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) (2017), IEEE, pp. 345–355

[67] SAIDANI, I., OUNI, A., CHOUCHEN, M., AND MKAOUER, M. W. On the prediction of continuous integration build failures using search-based software engineering. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion (2020), pp. 313–314

[68] SAIDANI, I., OUNI, A., CHOUCHEN, M., AND MKAOUER, M. W. Predicting continuous integration build failures using evolutionary search. Information and Software Technology 128 (2020), 106392

[69] SANTOLUCITO, M., ZHANG, J., ZHAI, E., CITO, J., AND PISKAC, R. Learning ci configuration correctness for early build feedback. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (2022), IEEE, pp. 1006–1017

[70] SANTOS, J., ALENCAR DA COSTA, D., AND KULESZA, U. Investigating the impact of continuous integration practices on the productivity and quality of open-source projects. In Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (2022), pp. 137–147

[71] SANTOS, J., DA COSTA, D. A., MCINTOSH, S., AND KULESZA, U. On the need to monitor continuous integration practices. Empirical Software Engineering 30, 5 (2025), 125

[72] SHAHIN, M., BABAR, M. A., AND ZHU, L. Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices. IEEE Access 5 (2017), 3909–3943

[73] SHI, A., ZHAO, P., AND MARINOV, D. Understanding and improving regression test selection in continuous integration. In 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE) (2019), IEEE, pp. 228–238

[74] SILVA, D., TEIXEIRA, L., AND D’AMORIM, M. Shake it! detecting flaky tests caused by concurrency with shaker. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2020), IEEE, pp. 301–311

[75] SOARES, E., SIZILIO, G., SANTOS, J., DA COSTA, D. A., AND KULESZA, U. The effects of continuous integration on software development: a systematic literature review. Empirical Software Engineering 27, 3 (2022), 78

[76] TANTITHAMTHAVORN, C., HASSAN, A. E., AND MATSUMOTO, K. The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. IEEE Transactions on Software Engineering 46, 11 (2018), 1200–1219

[77] TEISSEYRE, P., FURMAŃCZYK, K., AND MIELNICZUK, J. Verifying the selected completely at random assumption in positive-unlabeled learning. arXiv preprint arXiv:2404.00145 (2024)

[78] TERRAGNI, V., SALZA, P., AND FERRUCCI, F. A container-based infrastructure for fuzzy-driven root causing of flaky tests. In 2020 IEEE/ACM 42nd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER) (2020), IEEE, pp. 69–72

[79] THORVE, S., SRESHTHA, C., AND MENG, N. An empirical study of flaky tests in android apps. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2018), IEEE, pp. 534–538

[80] TURHAN, B., MENZIES, T., BENER, A. B., AND DI STEFANO, J. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14, 5 (2009), 540–578

