
arxiv: 2601.22264 · v2 · submitted 2026-01-29 · 💻 cs.SE · cs.AI · cs.CL · cs.LG

Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models

Pith reviewed 2026-05-16 09:14 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.LG
keywords few-shot learning · language models · intermittent failures · CI pipelines · failure diagnosis · log analysis · interpretability

The pith

FlaXifyer uses few-shot fine-tuned language models to predict categories of intermittent job failures from execution logs alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In continuous integration pipelines, intermittent job failures waste resources and developer time. The paper introduces FlaXifyer to predict their categories using only execution logs and a few labeled examples. It fine-tunes language models with 12 examples per category and reaches 84.3% Macro F1 and 92.0% Top-2 accuracy in an evaluation on 2,458 job failures from TELUS. An accompanying LogSift method highlights influential log lines in under one second, cutting review effort by 74.4%. This setup supports automated triage and diagnosis of flaky jobs.

Core claim

FlaXifyer is introduced as a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of job failures.
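For readers unfamiliar with the two headline metrics, the sketch below shows how Macro F1 and Top-2 accuracy are conventionally computed. It is a minimal illustration, not the paper's evaluation code: the function names, the category labels, and the toy predictions are all hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1, so rare categories count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A category that is never predicted and never correct scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def top_k_accuracy(y_true, ranked_preds, k=2):
    """Fraction of samples whose true category is among the k highest-ranked predictions."""
    hits = sum(1 for t, ranked in zip(y_true, ranked_preds) if t in ranked[:k])
    return hits / len(y_true)


# Hypothetical categories and rankings, for illustration only.
y_true = ["network", "network", "timeout", "oom"]
ranked = [
    ["network", "timeout"],
    ["timeout", "network"],
    ["timeout", "oom"],
    ["timeout", "oom"],
]
y_pred = [r[0] for r in ranked]  # top-1 prediction per sample

print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
print(f"Top-2 accuracy: {top_k_accuracy(y_true, ranked, k=2):.3f}")
```

In this toy case the Top-2 score is much higher than Macro F1 because the true category is often ranked second. That gap is why a Top-2 figure is useful for triage: a reviewer shown two candidate categories can still land on the right one.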

What carries the argument

Few-shot fine-tuning of pre-trained language models on job execution logs, paired with LogSift for identifying influential log statements.

Load-bearing premise

The 12 labeled examples per category sufficiently represent all variations of intermittent failures and the fine-tuned model generalizes beyond the TELUS dataset without overfitting.
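To make the premise concrete, here is a minimal sketch of a few-shot split: 12 labeled logs are drawn per category for fine-tuning and everything else is held out for evaluation. The paper's actual sampling protocol is not described on this page, so the function name, the seeding, and the toy data are assumptions.

```python
import random
from collections import defaultdict


def few_shot_split(samples, shots=12, seed=0):
    """Split (log_text, category) pairs into a few-shot training set and a held-out set.

    Each category contributes at most `shots` examples to the training set.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for sample in samples:
        by_category[sample[1]].append(sample)

    train, held_out = [], []
    for category in sorted(by_category):  # sorted for reproducibility
        items = by_category[category][:]
        rng.shuffle(items)
        train.extend(items[:shots])
        held_out.extend(items[shots:])
    return train, held_out


# Hypothetical pool: two categories with 20 and 15 labeled logs.
pool = [(f"log a{i}", "category_a") for i in range(20)]
pool += [(f"log b{i}", "category_b") for i in range(15)]

train, held_out = few_shot_split(pool, shots=12)
print(len(train), len(held_out))
```

Whether 12 examples cover the variation inside a category depends on which 12 are drawn, so repeating the split with different seeds is one way to probe how sensitive the reported numbers are to the sample.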

What would settle it

Applying the model to job logs from a different organization with a different failure distribution and measuring a Macro F1 score below 60%.

Figures

Figures reproduced from arXiv: 2601.22264 by Ali Tizghadam, Francis Bordeleau, Henri Aïdasso.

Figure 1: Predicting intermittent job failure categories from job logs using a language model and classification head.
Figure 3: LogSift reduces a 359-line log to 2 lines, directly identifying the missing environment variable (IMAGE_NAME).
Figure 4: LogSift output for a misclassified job (true: external_file_invalid_format). Despite the incorrect prediction, the highlighted segment reveals the actual cause: a zip file read failure (curl: (26)) during upload, enabling developers to override the prediction.
Figure 5: Example of LogSift producing a non-relevant segment. Despite a high reduction, the identified statements contain no failure indicators or host resolution information. LogSift does not always succeed, yielding 6% of outputs rated as Not relevant.
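The reduction reported in the Figure 3 caption can be expressed as a ratio. The sketch below assumes that "reduction" means the fraction of log lines a reviewer no longer has to read; this page does not give the paper's exact definition, and the function name is hypothetical.

```python
def reduction_ratio(original_lines, kept_lines):
    """Fraction of the original log that was filtered out."""
    if original_lines <= 0:
        raise ValueError("original_lines must be positive")
    if not 0 <= kept_lines <= original_lines:
        raise ValueError("kept_lines must be between 0 and original_lines")
    return 1 - kept_lines / original_lines


# Figure 3 caption: a 359-line log reduced to 2 lines.
print(f"{reduction_ratio(359, 2):.1%}")  # prints 99.4%
```

A single example like this sits well above the 74.4% average reported in the abstract. As Figure 5 shows, a high ratio alone does not guarantee that the surviving lines are the relevant ones.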
Original abstract

In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FlaXifyer, a few-shot fine-tuned language model approach for predicting categories of intermittent (flaky) job failures in CI pipelines using only execution logs. It reports 84.3% Macro F1 and 92.0% Top-2 accuracy on 2,458 failures from the TELUS dataset with 12 labeled examples per category, and proposes LogSift for identifying influential log statements to aid diagnosis.

Significance. If the results hold under broader conditions, the work has practical significance for software engineering by enabling automated triage of CI failures, reducing wasted compute and developer diagnosis time. The few-shot setting and interpretability technique are strengths for real-world applicability where labels are scarce.

major comments (2)
  1. [Evaluation] Evaluation section: All 2,458 failures are drawn from a single TELUS CI environment with no reported cross-company, cross-infrastructure, or domain-shift experiments. This directly undermines the central claim that the method 'requires only job execution logs' and generalizes for automated triage, as it leaves open overfitting to TELUS-specific log formats and failure distributions.
  2. [Abstract] Abstract and results: No baselines, error analysis, or details on how the failure categories were defined and labeled are provided, making the reported 84.3% Macro F1 and 92.0% Top-2 accuracy difficult to interpret or compare.
minor comments (1)
  1. [Abstract] Abstract could briefly note the dataset source and size to improve context for readers.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to improve clarity, add missing details, and acknowledge limitations where appropriate.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: All 2,458 failures are drawn from a single TELUS CI environment with no reported cross-company, cross-infrastructure, or domain-shift experiments. This directly undermines the central claim that the method 'requires only job execution logs' and generalizes for automated triage, as it leaves open overfitting to TELUS-specific log formats and failure distributions.

    Authors: We agree that the single-company nature of the TELUS dataset is a limitation for broad generalizability claims. The statement that the method 'requires only job execution logs' refers specifically to the model's input requirements (no additional features or metadata), not to proven performance across all environments. In the revised manuscript, we will expand the Threats to Validity section with a detailed discussion of potential overfitting to TELUS-specific log formats, failure distributions, and infrastructure characteristics. We will also add qualitative examples illustrating log statement variations. However, cross-company or cross-infrastructure experiments cannot be conducted without access to additional proprietary datasets, which is a common constraint in industrial CI research. revision: partial

  2. Referee: [Abstract] Abstract and results: No baselines, error analysis, or details on how the failure categories were defined and labeled are provided, making the reported 84.3% Macro F1 and 92.0% Top-2 accuracy difficult to interpret or compare.

    Authors: We will revise the abstract to reference the added baselines and labeling details. In the Evaluation section, we will insert a new subsection 'Failure Category Definition and Labeling' that explains how the categories were collaboratively defined with TELUS reliability engineers based on observed failure modes in their CI pipelines, along with the labeling protocol and any inter-rater reliability measures. We will also add baseline comparisons (zero-shot prompting of the same language models and traditional supervised classifiers using TF-IDF features) and a dedicated error analysis subsection that breaks down misclassification patterns by category. These additions will make the performance numbers more interpretable and facilitate direct comparisons. revision: yes
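The TF-IDF baseline mentioned in the response above could look roughly like the following. This is a sketch under assumptions: the rebuttal names TF-IDF features but not a classifier, so a nearest-centroid rule stands in here, and the whitespace tokenization, smoothed IDF, function names, and example logs are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def fit_idf(documents):
    """Smoothed inverse document frequency per term."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """L2-normalized sparse TF-IDF vector; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def train_centroids(logs, labels):
    """Average the TF-IDF vectors of each category into one centroid."""
    idf = fit_idf(logs)
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for log, label in zip(logs, labels):
        for term, weight in tfidf_vector(log, idf).items():
            sums[label][term] += weight
    centroids = {
        label: {t: w / counts[label] for t, w in terms.items()}
        for label, terms in sums.items()
    }
    return idf, centroids


def predict(log, idf, centroids):
    """Return the category whose centroid has the largest dot product with the log."""
    vec = tfidf_vector(log, idf)

    def score(label):
        return sum(w * centroids[label].get(t, 0.0) for t, w in vec.items())

    return max(sorted(centroids), key=score)


# Hypothetical training logs and categories.
logs = [
    "curl failed to resolve host",
    "could not resolve host registry",
    "container killed out of memory",
    "process killed memory limit exceeded",
]
labels = ["network", "network", "oom", "oom"]
idf, centroids = train_centroids(logs, labels)
print(predict("resolve host timeout", idf, centroids))
```

A lexical baseline of this kind helps show how much of the reported Macro F1 comes from the language model rather than from category-specific keywords in the logs.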

standing simulated objections not resolved
  • Cross-company, cross-infrastructure, or domain-shift experiments, as we do not have access to additional industrial CI datasets from other organizations.

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on held-out data

full rationale

The paper introduces FlaXifyer as a few-shot fine-tuned language model for classifying intermittent CI job failures and reports performance metrics (84.3% Macro F1, 92.0% Top-2 accuracy) on a held-out portion of the 2,458 TELUS failures. No equations, derivations, or parameter-fitting steps are present that would allow any claimed result to reduce to its own inputs by construction. The approach relies on standard pre-trained models, few-shot fine-tuning, and conventional train/test splits rather than any self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation. The central claims therefore remain independent empirical statements rather than tautological restatements of the input data or prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the assumption that pre-trained language models transfer effectively to log classification with minimal examples and that the proposed interpretability method surfaces causally relevant statements; no explicit free parameters beyond the reported shot count of 12 are described.

free parameters (1)
  • number of shots
    Fixed at 12 labeled examples per category; performance numbers are reported for this specific choice.
axioms (1)
  • [domain assumption] Pre-trained language models can be fine-tuned effectively for text classification with very few labeled examples
    Invoked by the few-shot learning design of FlaXifyer
invented entities (2)
  • FlaXifyer (no independent evidence)
    purpose: Few-shot fine-tuned model for predicting intermittent failure categories
    New named system introduced in the paper
  • LogSift (no independent evidence)
    purpose: Interpretability technique that identifies influential log statements
    New named technique introduced in the paper


