Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models

Ali Tizghadam; Francis Bordeleau; Henri Aïdasso

arxiv: 2601.22264 · v2 · submitted 2026-01-29 · 💻 cs.SE · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-16 09:14 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.LG

keywords few-shot learning · language models · intermittent failures · CI pipelines · failure diagnosis · log analysis · interpretability

The pith

FlaXifyer uses few-shot fine-tuned language models to predict categories of intermittent job failures from execution logs alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In continuous integration (CI) pipelines, intermittent job failures waste compute and developer time. The paper introduces FlaXifyer, which predicts the category of such a failure from the job's execution log alone, using only a few labeled examples. Fine-tuning a pre-trained language model on 12 examples per category reaches 84.3% Macro F1 and 92.0% Top-2 accuracy on 2,458 job failures from TELUS. The accompanying LogSift method highlights influential log statements in under one second, cutting review effort by 74.4%. Together they support automated triage and diagnosis of flaky jobs.

Core claim

FlaXifyer is introduced as a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.
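
To make the two headline metrics concrete, here is a minimal sketch in plain Python. Macro F1 averages per-category F1 with equal weight, so rare failure categories count as much as common ones; Top-2 accuracy counts a prediction as correct when the true category is among the two highest-scoring ones. The function names, the toy labels `a`, `b`, `c`, and the scores are illustrative assumptions, not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def top_k_accuracy(y_true, scores, k=2):
    """scores: one dict per job, mapping category -> model score."""
    hits = 0
    for truth, s in zip(y_true, scores):
        top = sorted(s, key=s.get, reverse=True)[:k]
        hits += truth in top
    return hits / len(y_true)


# Toy data (made up): four jobs, three hypothetical categories.
y_true = ["a", "a", "b", "c"]
scores = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.3, "b": 0.6, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.5, "b": 0.4, "c": 0.1},
]
y_pred = [max(s, key=s.get) for s in scores]  # top-1 prediction per job
```

On this toy data the top-1 prediction is right for two of the four jobs, while the true category is among the top two for three of them, which is the kind of gap the paper's 84.3% versus 92.0% figures describe.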

What carries the argument

Few-shot fine-tuning of pre-trained language models on job execution logs, paired with LogSift for identifying influential log statements.
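
The few-shot setting can be sketched as a data split: draw a fixed number of labeled logs per category for fine-tuning and hold out the rest for evaluation. This page does not describe the paper's actual sampling or resampling protocol, so the function below is an assumption for illustration only; `few_shot_split`, its parameters, and the `(log_text, category)` tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def few_shot_split(examples, shots=12, seed=0):
    """Split (log_text, category) pairs into a few-shot training set and a held-out set.

    Picks `shots` examples per category at random; everything else is held out.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for idx, (_, category) in enumerate(examples):
        by_category[category].append(idx)

    train_idx = set()
    for category, idxs in by_category.items():
        if len(idxs) < shots:
            raise ValueError(f"category {category!r} has fewer than {shots} examples")
        train_idx.update(rng.sample(idxs, shots))

    train = [examples[i] for i in sorted(train_idx)]
    held_out = [ex for i, ex in enumerate(examples) if i not in train_idx]
    return train, held_out
```

Repeating the split with different seeds gives a rough sense of how sensitive the reported scores are to which 12 examples happen to be chosen.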

Load-bearing premise

The 12 labeled examples per category sufficiently represent all variations of intermittent failures and the fine-tuned model generalizes beyond the TELUS dataset without overfitting.

What would settle it

Applying the model to job logs from a different organization with a different failure distribution and measuring a Macro F1 score below 60%.

Figures

Figures reproduced from arXiv: 2601.22264 by Ali Tizghadam, Francis Bordeleau, Henri Aïdasso.

**Figure 1.** Predicting intermittent job failure categories from job logs using a language model and classification head.

**Figure 3.** LogSift reduces a 359-line log to 2 lines, directly identifying the missing environment variable (IMAGE_NAME).

**Figure 4.** LogSift output for a misclassified job (true: external_file_invalid_format). Despite the incorrect prediction, the highlighted segment reveals the actual cause: a zip file read failure (curl: (26)) during upload, enabling developers to override the prediction.

**Figure 5.** Example of LogSift producing a non-relevant segment. Despite a high reduction, the identified statements contain no failure indicators or host resolution information. LogSift does not always succeed: 6% of its outputs were rated Not relevant.
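
The figures report a reduction from the original log to a handful of highlighted lines. This page does not explain how LogSift scores log statements, so the sketch below is not LogSift's algorithm. It shows a generic leave-one-line-out (occlusion) importance ranking as one way such a highlight could be produced, plus the reduction ratio arithmetic. `toy_score` is a hypothetical stand-in for a classifier's confidence in the predicted category, and the log lines are short examples modeled on the curl failure described in Figure 4.

```python
def reduction_ratio(n_original, n_kept):
    """Fraction of log lines a reviewer no longer has to read."""
    return 1.0 - n_kept / n_original


def occlusion_importance(lines, score_fn):
    """Score drop caused by removing each line in turn (larger = more influential)."""
    base = score_fn(lines)
    return [base - score_fn(lines[:i] + lines[i + 1:]) for i in range(len(lines))]


def top_lines(lines, score_fn, k=2):
    """Return the k most influential lines, kept in their original order."""
    importance = occlusion_importance(lines, score_fn)
    ranked = sorted(range(len(lines)), key=lambda i: importance[i], reverse=True)
    return [lines[i] for i in sorted(ranked[:k])]


def toy_score(lines):
    # Hypothetical scorer: counts lines mentioning "failed".
    # A real system would query the fine-tuned model instead.
    return sum("failed" in line.lower() for line in lines)


log = [
    "upload_zipfile_to_cloud_portal",
    "curl: (26) Failed to open/read local data from file/application",
    "zip file validate API is failed",
    "Publish message to PubSub success",
]
highlighted = top_lines(log, toy_score, k=2)
```

With this definition, the 359-line to 2-line case in Figure 3 corresponds to `reduction_ratio(359, 2)`, a reduction of roughly 99.4%. Occlusion needs one model call per line, so a sub-second technique on long logs would likely need something cheaper; that is a design consideration, not a claim about how LogSift works.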

Original abstract

In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FlaXifyer shows usable few-shot performance on categorizing TELUS CI failures, but the single-company dataset makes generalization the main open question.

The main point is that this work moves from detecting flaky CI jobs to actually categorizing the failure types using few-shot fine-tuned language models on execution logs. They get 84.3% macro F1 and 92% top-2 accuracy with only 12 examples per category on 2,458 TELUS cases, and add LogSift to surface key log lines fast, cutting review effort by 74% while keeping useful info in 87% of cases. That combination is new relative to prior detection-only papers and targets a real pain point in continuous integration.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FlaXifyer, a few-shot fine-tuned language model approach for predicting categories of intermittent (flaky) job failures in CI pipelines using only execution logs. It reports 84.3% Macro F1 and 92.0% Top-2 accuracy on 2,458 failures from the TELUS dataset with 12 labeled examples per category, and proposes LogSift for identifying influential log statements to aid diagnosis.

Significance. If the results hold under broader conditions, the work has practical significance for software engineering by enabling automated triage of CI failures, reducing wasted compute and developer diagnosis time. The few-shot setting and interpretability technique are strengths for real-world applicability where labels are scarce.

major comments (2)

[Evaluation] Evaluation section: All 2,458 failures are drawn from a single TELUS CI environment with no reported cross-company, cross-infrastructure, or domain-shift experiments. This directly undermines the central claim that the method 'requires only job execution logs' and generalizes for automated triage, as it leaves open overfitting to TELUS-specific log formats and failure distributions.
[Abstract] Abstract and results: No baselines, error analysis, or details on how the failure categories were defined and labeled are provided, making the reported 84.3% Macro F1 and 92.0% Top-2 accuracy difficult to interpret or compare.

minor comments (1)

[Abstract] Abstract could briefly note the dataset source and size to improve context for readers.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to improve clarity, add missing details, and acknowledge limitations where appropriate.

Referee: [Evaluation] Evaluation section: All 2,458 failures are drawn from a single TELUS CI environment with no reported cross-company, cross-infrastructure, or domain-shift experiments. This directly undermines the central claim that the method 'requires only job execution logs' and generalizes for automated triage, as it leaves open overfitting to TELUS-specific log formats and failure distributions.

Authors: We agree that the single-company nature of the TELUS dataset is a limitation for broad generalizability claims. The statement that the method 'requires only job execution logs' refers specifically to the model's input requirements (no additional features or metadata), not to proven performance across all environments. In the revised manuscript, we will expand the Threats to Validity section with a detailed discussion of potential overfitting to TELUS-specific log formats, failure distributions, and infrastructure characteristics. We will also add qualitative examples illustrating log statement variations. However, cross-company or cross-infrastructure experiments cannot be conducted without access to additional proprietary datasets, which is a common constraint in industrial CI research.
revision: partial
Referee: [Abstract] Abstract and results: No baselines, error analysis, or details on how the failure categories were defined and labeled are provided, making the reported 84.3% Macro F1 and 92.0% Top-2 accuracy difficult to interpret or compare.

Authors: We will revise the abstract to reference the added baselines and labeling details. In the Evaluation section, we will insert a new subsection 'Failure Category Definition and Labeling' that explains how the categories were collaboratively defined with TELUS reliability engineers based on observed failure modes in their CI pipelines, along with the labeling protocol and any inter-rater reliability measures. We will also add baseline comparisons (zero-shot prompting of the same language models and traditional supervised classifiers using TF-IDF features) and a dedicated error analysis subsection that breaks down misclassification patterns by category. These additions will make the performance numbers more interpretable and facilitate direct comparisons.
revision: yes
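
For readers unfamiliar with the kind of TF-IDF baseline the simulated rebuttal mentions, the following is a minimal sketch using only the standard library. The rebuttal does not say which classifier would sit on top of the TF-IDF features; the nearest-centroid rule used here is an assumption chosen for brevity, and the four training strings and the `network` / `resource` labels are made-up toy data, not TELUS logs or the paper's categories.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency per token."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(text, idf):
    """L2-normalized sparse TF-IDF vector; tokens unseen at fit time are ignored."""
    tf = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    """Mean TF-IDF vector per category."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for tok, v in tfidf(doc, idf).items():
            sums[label][tok] += v
    return {y: {tok: v / counts[y] for tok, v in vec.items()} for y, vec in sums.items()}


def predict(text, idf, centroids):
    """Assign the category whose centroid has the largest dot product with the log."""
    vec = tfidf(text, idf)

    def similarity(centroid):
        return sum(v * centroid.get(tok, 0.0) for tok, v in vec.items())

    return max(centroids, key=lambda y: similarity(centroids[y]))


# Toy training data (made up for illustration).
train_docs = [
    "connection timed out while pulling image",
    "network timeout reaching registry",
    "disk quota exceeded on runner",
    "no space left on device",
]
train_labels = ["network", "network", "resource", "resource"]
idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
```

A baseline like this depends on exact token overlap, which is one reason a comparison against a fine-tuned language model would help readers interpret the reported scores.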

standing simulated objections · not resolved

Cross-company, cross-infrastructure, or domain-shift experiments, as we do not have access to additional industrial CI datasets from other organizations.

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on held-out data

full rationale

The paper introduces FlaXifyer as a few-shot fine-tuned language model for classifying intermittent CI job failures and reports performance metrics (84.3% Macro F1, 92.0% Top-2 accuracy) on a held-out portion of the 2,458 TELUS failures. No equations, derivations, or parameter-fitting steps are present that would allow any claimed result to reduce to its own inputs by construction. The approach relies on standard pre-trained models, few-shot fine-tuning, and conventional train/test splits rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation. The central claims therefore remain independent empirical statements rather than tautological restatements of the input data or prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the assumption that pre-trained language models transfer effectively to log classification with minimal examples and that the proposed interpretability method surfaces causally relevant statements; no explicit free parameters beyond the reported shot count of 12 are described.

free parameters (1)

number of shots
Fixed at 12 labeled examples per category; performance numbers are reported for this specific choice.

axioms (1)

domain assumption: Pre-trained language models can be fine-tuned effectively for text classification with very few labeled examples
Invoked by the few-shot learning design of FlaXifyer

invented entities (2)

FlaXifyer · no independent evidence
purpose: Few-shot fine-tuned model for predicting intermittent failure categories
New named system introduced in the paper

LogSift · no independent evidence
purpose: Interpretability technique that identifies influential log statements
New named technique introduced in the paper


[1] [1]

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Frame- work. InProceedings of the 25th ACM SIGKDD International Conference on Knowl- edge Discovery & Data Mining (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2623–2631. doi:10.1145/3292500.3330701

work page doi:10.1145/3292500.3330701 2019

[2] [2]

Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, and Yves Le Traon

work page

[3] [3]

In 2023 IEEE/ACM International Conference on Automation of Software Test (AST)

FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning. In 2023 IEEE/ACM International Conference on Automation of Software Test (AST). 140–151. doi:10.1109/AST58925.2023.00018

work page doi:10.1109/ast58925.2023.00018 2023

[4] [4]

Abdulrahman Alshammari, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. FlakeFlagger: Predicting Flakiness Without Rerunning Tests. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 1572–

work page 2021

[5] [5]

doi:10.1109/ICSE43902.2021.00140

work page doi:10.1109/icse43902.2021.00140 2021

[6] [6]

Henri Aïdasso. 2025. FlakeRanker: Automated Identification and Prioritization of Flaky Job Failure Categories. doi:10.48550/arXiv.2503.12312 arXiv:2503.12312 [cs]

work page doi:10.48550/arxiv.2503.12312 2025

[7] [7]

Henri Aïdasso. 2026. Artifact for Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models. https://figshare. com/s/003070f1478ba8e87869?file=61272721

work page 2026

[8] [8]

Henri Aïdasso, Francis Bordeleau, and Ali Tizghadam. 2025. Efficient Detec- tion of Intermittent Job Failures Using Few-Shot Learning. In2025 IEEE In- ternational Conference on Software Maintenance and Evolution (ICSME). Insti- tute of Electrical and Electronics Engineers, Auckland, New Zealand, 632–643. doi:10.1109/ICSME64153.2025.00064 ISSN: 2576-3148

work page doi:10.1109/icsme64153.2025.00064 2025

[9] [9]

Henri Aïdasso, Francis Bordeleau, and Ali Tizghadam. 2025. On the Diagnosis of Flaky Job Failures: Understanding and Prioritizing Failure Categories. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engi- neering in Practice (ICSE-SEIP). Institute of Electrical and Electronics Engineers, Ottawa, Canada, 192–202. doi:10.1109/...

work page doi:10.1109/icse-seip66354.2025.00023 2025

[10] [10]

Henri Aïdasso, Francis Bordeleau, and Ali Tizghadam. 2025. On the Illusion of Success: An Empirical Study of Build Reruns and Silent Failures in Industrial CI. doi:10.48550/arXiv.2509.14347 arXiv:2509.14347 [cs]

work page doi:10.48550/arxiv.2509.14347 2025

[11] [11]

Henri Aïdasso, Francis Bordeleau, and Ali Tizghadam. 2025. Towards Build Opti- mization Using Digital Twins. InProceedings of the 21st International Conference on Predictive Models and Data Analytics in Software Engineering. ACM, Trondheim Norway, 95–98. doi:10.1145/3727582.3728684

work page doi:10.1145/3727582.3728684 2025

[12] [12]

Henri Aïdasso, Mohammed Sayagh, and Francis Bordeleau. 2025. Build Opti- mization: A Systematic Literature Review.ACM Comput. Surv.58, 1 (Aug. 2025), 1–38. doi:10.1145/3757912 Just Accepted

work page doi:10.1145/3757912 2025

[13] [13]

Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. DeFlaker: automatically detecting flaky tests. In Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). Association for Computing Machinery, New York, NY, USA, 433–444. doi:10. 1145/3180155.3180164

work page arXiv 2018

[14] [14]

Derya Birant. 2011. Data Mining Using RFM Analysis. InKnowledge-Oriented Applications in Data Mining, Kimito Funatsu (Ed.). InTech. doi:10.5772/13683

work page doi:10.5772/13683 2011

[15] [15]

Polyglot and Distributed Software Repository Mining with Crossflow

Carolin E. Brandt, Annibale Panichella, Andy Zaidman, and Moritz Beller. 2020. LogChunks: A Data Set for Build Log Analysis. InProceedings of the 17th Interna- tional Conference on Mining Software Repositories. ACM, Seoul Republic of Korea, 583–587. doi:10.1145/3379597.3387485

work page doi:10.1145/3379597.3387485 2020

[16] [16]

Thomas Durieux, Claire Le Goues, Michael Hilton, and Rui Abreu. 2020. Empirical Study of Restarted and Flaky Builds on Travis CI. InProceedings of the 17th International Conference on Mining Software Repositories. ACM, Seoul Republic of Korea, 254–264. doi:10.1145/3379597.3387460

work page doi:10.1145/3379597.3387460 2020

[17] [17]

Ghaleb, and Lionel Briand

Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. 2023. Flakify: A Black-Box, Language Model-Based Predictor for Flaky Tests.IEEE Transactions on Software Engineering49, 4 (April 2023), 1912–1927. doi:10.1109/TSE.2022.3201209

work page doi:10.1109/tse.2022.3201209 2023

[18] [18]

Sakina Fatima, Hadi Hemmati, and Lionel C. Briand. 2024. FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair.IEEE Transactions on Software Engineering50, 12 (Dec. 2024), 3146–3171. doi:10.1109/TSE.2024.3472476 Conference Name: IEEE Transactions on Software Engineering

work page doi:10.1109/tse.2024.3472476 2024

[19] [19]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. doi:10.48550/ arXiv.2002.08155 arXiv:2002.08155 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2020

[20] [20]

Michael Hilton, Nicholas Nelson, Timothy Tunnell, Darko Marinov, and Danny Dig. 2017. Trade-offs in continuous integration: assurance, security, and flexibility. InProceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 197–207. doi:10.1145/3106237.3106270

work page doi:10.1145/3106237.3106270 2017

[21] [21]

Michael Hilton, Timothy Tunnell, Kai Huang, Darko Marinov, and Danny Dig

work page

[22] [22]

InProceedings of the 31st IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE ’16)

Usage, costs, and benefits of continuous integration in open-source projects. InProceedings of the 31st IEEE/ACM International Conference on Automated Soft- ware Engineering (ASE ’16). Association for Computing Machinery, New York, NY, USA, 426–437. doi:10.1145/2970276.2970358

work page doi:10.1145/2970276.2970358

[23] [23]

2010.Continuous Delivery: Reliable Software Re- leases through Build, Test, and Deployment Automation(1st ed.)

Jez Humble and David Farley. 2010.Continuous Delivery: Reliable Software Re- leases through Build, Test, and Deployment Automation(1st ed.). Addison-Wesley Professional

[24]

Donald E. Knuth. 1998. The Art of Computer Programming: Sorting and Searching, Volume 3. Addison-Wesley Professional.

[25]

Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese Neural Networks for One-shot Image Recognition. In Proceedings of the 32nd International Conference on Machine Learning. Lille, France. https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf

[26]

Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A Framework for Detecting and Partially Classifying Flaky Tests. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). 312–322. doi:10.1109/ICST.2019.00038

[27]

Johannes Lampel, Sascha Just, Sven Apel, and Andreas Zeller. 2021. When life gives you oranges: detecting and diagnosing intermittent job failures at Mozilla. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, Athens, Greece, 1381–1392. doi:10.1145/3468264.3473931

[28]

B. W. Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405, 2 (Oct. 1975), 442–451. doi:10.1016/0005-2795(75)90109-9

[29]

Florent Moriconi, Raphael Troncy, Aurélien Francillon, and Jihane Zouaoui. 2022. Automated Identification of Flaky Builds using Knowledge Graphs. In Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management. Bozen-Bolzano, Italy.

[30]

Doriane Olewicki, Mathieu Nayrolles, and Bram Adams. 2022. Towards language-independent brown build detection. In Proceedings of the 44th International Conference on Software Engineering. ACM, Pittsburgh, Pennsylvania, 2177–2188. doi:10.1145/3510003.3510122

[31]

Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, Wenlong Zhao, Seungwook Han, Abhishek Bhandwaldar, Guangxuan Xu, Kai Xu, Ligong Han, Luke Inglis, and Akash Srivastava. 2024. Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs. doi:10.48550/arXiv.2412.13337 arXiv:2412.13337 [cs]

[32]

Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2020. Flake It ’Till You Make It: Using Automated Repair to Induce and Fix Latent Test Flakiness. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops (ICSEW’20). Association for Computing Machinery, New York, NY, USA, 11–12. doi:10.1145/3387940.3392177

[33]

Shanto Rahman and August Shi. 2024. FlakeSync: Automatically Repairing Async Flaky Tests. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/3597503.3639115

[34]

J. E. Ramos. 2003. Using TF-IDF to Determine Word Relevance in Document Queries. https://www.semanticscholar.org/paper/Using-TF-IDF-to-Determine-Word-Relevance-in-Queries-Ramos/b3bf6373ff41a115197cb5b30e57830c16130c2c

[35]

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). Association for Computing Machinery, New York, NY, USA, 1135–1144. doi:10.1145/2939672.2939778

[36]

Guogen Shan. 2022. Monte Carlo cross-validation for a study with binary outcome and limited sample size. BMC Medical Informatics and Decision Making 22, 1 (Oct. 2022), 270. doi:10.1186/s12911-022-02016-z

[37]

Richard Simon. 2007. Resampling Strategies for Model Assessment and Selection. In Fundamentals of Data Mining in Genomics and Proteomics, Werner Dubitzky, Martin Granzow, and Daniel Berrar (Eds.). Springer US, Boston, MA, 173–186. doi:10.1007/978-0-387-47509-7_8

[38]

Digital Report TELUS. 2021. TELUS: Keeping Canadians Connected. Digital Report (2021). https://www.juniper.net/content/dam/www/assets/articles/us/en/telus-keeping-canadians-connected.pdf

[39]

Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. 2022. Efficient Few-Shot Learning Without Prompts. http://arxiv.org/abs/2209.11055 arXiv:2209.11055

[40]

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. 2024. C-Pack: Packed Resources For General Chinese Embeddings. doi:10.48550/arXiv.2309.07597 arXiv:2309.07597 [cs]
