Recognition: 2 theorem links
NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
Pith reviewed 2026-05-13 02:04 UTC · model grok-4.3
The pith
NeuroFlake mines statistically significant source-code tokens and injects them into the LLM's attention to classify flaky tests at a 69.34 percent F1-score while resisting semantic-preserving perturbations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NeuroFlake integrates a Discriminative Token Mining module that automatically identifies high-fidelity, statistically significant source-code tokens such as concurrency primitives and async waits, then injects these signals directly into the LLM's attention mechanism to combine neural intuition with symbolic precision. On FlakeBench this yields an F1-score of 69.34 percent and stable performance under semantic-preserving augmentations where baselines degrade sharply.
What carries the argument
The Discriminative Token Mining (DTM) module, which automates discovery of high-fidelity source code tokens and injects them into the LLM attention mechanism to bridge neural pattern recognition with symbolic precision.
If this is right
- Flaky test classification becomes more reliable on highly imbalanced, real-world test suites.
- Models show greater stability when code is changed in ways that preserve semantics but alter surface features.
- Automated regression testing pipelines can reduce wasted effort on misclassified non-deterministic tests.
- Hybrid neuro-symbolic designs outperform both pure LLM and manual-rule baselines for this classification task.
Where Pith is reading between the lines
- The same token-mining and injection pattern could extend to related software engineering tasks such as defect prediction or security vulnerability detection.
- Further gains may come from routing the mined tokens into additional model components beyond attention.
- Widespread use could cut the developer time spent diagnosing and rerunning flaky tests in continuous integration systems.
Load-bearing premise
The tokens discovered by the mining module must be causally responsible for flakiness rather than merely correlated with it, so that direct injection into attention transfers usable symbolic precision without creating new overfitting routes.
What would settle it
Replace the Discriminative Token Mining module with random or non-informative tokens, retrain and evaluate on the same augmented test sets, and check whether the reported F1-score gain and robustness improvement both disappear.
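That swap experiment can be sketched as a small harness, assuming a caller-supplied `train_and_eval` routine; everything here is a hypothetical stand-in for the paper's pipeline, not its actual code:

```python
import random

def ablation_gap(train_and_eval, dataset, mined_tokens, vocab, seed=0):
    """Compare injecting mined DTM tokens vs. size-matched random tokens.

    train_and_eval(dataset, tokens) -> F1 score (caller-supplied).
    Returns the F1 gap attributable to the mined tokens; if the gap
    vanishes, the mined tokens carry no usable signal beyond chance.
    """
    rng = random.Random(seed)
    # Draw non-informative tokens from the rest of the vocabulary.
    random_tokens = rng.sample(sorted(set(vocab) - set(mined_tokens)),
                               len(mined_tokens))
    f1_mined = train_and_eval(dataset, mined_tokens)
    f1_random = train_and_eval(dataset, random_tokens)
    return f1_mined - f1_random
```

Running the same harness on the augmented test sets would separately probe whether the robustness advantage also depends on the mined tokens.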
read the original abstract
Flaky tests, which exhibit non-deterministic pass/fail behavior for the same version of code, pose significant challenges to reliable regression testing. While large language models (LLMs) promise for automated flaky test classification, they often fail to comprehend the actual logic behind test flakiness, instead overfitting to superficial textual artifacts (e.g., specific variable names). This semantic fragility leads to poor generalization on real-world imbalance dataset and vulnerability to perturbations. In this paper, we introduce NeuroFlake, a novel neuro-Symbolic framework for classifying flaky tests on highly imbalanced, real-world datasets (FlakeBench). Unlike prior approaches that rely on brittle manual rule and black box learning, NeuroFlake integrates a Discriminative Token Mining (DTM) module to automate the discovery of high-fidelity, statistically significant source code tokens (e.g., specific concurrency primitives or async waits). By injecting these strong latent signals directly into LLM's attention mechanism, we bridge the gap between neural intuition and symbolic precision. Our experiments demonstrate that neuro-symbolic fusion significantly improves classification performance by leveraging classification F1-score to 69.34% while prior state-of-art shows best F1-score 65.79%. However, we rigorously evaluate NeuroFlake's robustness through adversarial stress testing, introducing semantic preserving augmentations (e.g., dead code injection, variable renaming). While baseline models exhibit performance degradation of 8-18 percentage points (pp) on perturbed tests, NeuroFlake maintains performance stability on unseen augmentations dropping only 4-7 pp.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents NeuroFlake, a neuro-symbolic framework for flaky test classification on imbalanced real-world datasets such as FlakeBench. It introduces a Discriminative Token Mining (DTM) module to automatically identify high-fidelity source code tokens (e.g., concurrency primitives) and injects them directly into the LLM attention mechanism to combine neural and symbolic signals. Experiments report an F1-score improvement to 69.34% (vs. prior SOTA 65.79%) and greater robustness under semantic-preserving augmentations (4-7 pp drop vs. 8-18 pp for baselines).
Significance. If the reported gains are reproducible and attributable to the DTM injection rather than confounding factors, the work offers a practical advance in software engineering for reliable regression testing. The combination of real-world imbalance handling, explicit robustness evaluation via augmentations, and neuro-symbolic bridging addresses documented weaknesses of pure LLM approaches on code tasks. Reproducible code or detailed ablation tables would further strengthen its contribution.
major comments (3)
- [Methods/Experimental Setup] The abstract and results claim a 3.55 pp F1 lift and superior robustness, yet the manuscript supplies no information on train/validation/test splits, the exact LLM backbone and fine-tuning protocol, baseline re-implementations, or statistical significance testing (e.g., McNemar's test or bootstrap intervals). These details are load-bearing for evaluating whether the improvement is genuine or due to post-hoc selection.
- [Section 3.2] DTM Module: The claim that mined tokens are 'high-fidelity' and 'statistically significant' requires the precise mining procedure (feature selection metric, significance threshold, handling of class imbalance and multiple comparisons). Without an ablation that isolates the injection step (DTM tokens removed vs. present), it remains unclear whether the tokens carry causal flakiness information or merely correlated surface features.
- [Section 4.3] Robustness Evaluation: The reported 4-7 pp drop under 'unseen augmentations' (dead-code injection, variable renaming) is central to the generalization claim, but the paper must specify the number and distribution of augmentations, confirm label preservation, and provide a per-augmentation breakdown plus examples. Otherwise the stability advantage over baselines cannot be verified.
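The paired significance testing requested in the first major comment could take the form of a percentile bootstrap on the F1 gap; this minimal sketch assumes 0/1 label lists, and the helper names are illustrative rather than taken from the manuscript:

```python
import random

def f1(y_true, y_pred):
    """F1 for binary 0/1 lists; returns 0.0 when there are no true positives."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_gap(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """95% percentile-bootstrap CI for F1(model A) - F1(model B).

    Resamples test items with replacement, keeping the two models'
    predictions paired, so the interval reflects the same test set.
    """
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        gaps.append(f1(yt, [pred_a[i] for i in idx])
                    - f1(yt, [pred_b[i] for i in idx]))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)]
```

An interval excluding zero would support the claimed 3.55 pp lift; an interval straddling zero would suggest post-hoc selection cannot be ruled out.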
minor comments (3)
- [Abstract] The phrasing 'leveraging classification F1-score to 69.34%' is nonstandard; rephrase to 'achieving an F1-score of 69.34%'.
- [Section 3] Notation: Ensure DTM, LLM, and FlakeBench are defined at first use and that the attention-injection mechanism is given a concise formal description (e.g., modified attention weights equation).
- [References] References: Include the exact citation for the prior SOTA method reporting 65.79% F1 so readers can compare experimental conditions.
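One concise formalization of the kind the notation comment asks for, offered as a plausible reading rather than the paper's actual equation: treat the mined DTM tokens as an additive bias on the attention logits,

```latex
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + \lambda M \right) V,
\qquad
M_{ij} =
\begin{cases}
1 & \text{if token } j \text{ is in the mined DTM set,} \\
0 & \text{otherwise,}
\end{cases}
```

where the scalar \(\lambda \ge 0\) (learned or tuned) controls how strongly attention is steered toward mined tokens; the revised manuscript would need to state which variant it actually uses.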
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested details for improved reproducibility and clarity.
read point-by-point responses
Referee: [Methods/Experimental Setup] The abstract and results claim a 3.55 pp F1 lift and superior robustness, yet the manuscript supplies no information on train/validation/test splits, the exact LLM backbone and fine-tuning protocol, baseline re-implementations, or statistical significance testing (e.g., McNemar's test or bootstrap intervals). These details are load-bearing for evaluating whether the improvement is genuine or due to post-hoc selection.
Authors: We agree these details are critical for reproducibility and validating the reported gains. In the revised manuscript, we will add a dedicated Experimental Setup subsection specifying: the train/validation/test splits (stratified 70/15/15 to address imbalance), the exact LLM backbone and fine-tuning protocol (including hyperparameters, learning rate, epochs, and optimizer), baseline re-implementations with references to original implementations, and statistical significance results using McNemar's test and bootstrap confidence intervals to confirm the 3.55 pp F1 improvement is not attributable to post-hoc selection. revision: yes
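For concreteness, the McNemar computation the authors commit to can be sketched with the standard continuity-corrected statistic (stdlib only; the function name and interface are illustrative, not the authors' code):

```python
import math

def mcnemar_p(b, c):
    """McNemar's test with continuity correction on paired predictions.

    b: items model A classifies correctly and model B incorrectly.
    c: items model B classifies correctly and model A incorrectly.
    Returns the two-sided p-value from the chi-square(1) approximation.
    """
    if b + c == 0:
        return 1.0  # no discordant pairs: the models are indistinguishable
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # chi-square(1) survival function via the complementary error function
    return math.erfc(math.sqrt(stat / 2))
```

Only the discordant counts matter: items both models get right (or wrong) carry no information about which model is better.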
Referee: [Section 3.2] DTM Module: The claim that mined tokens are 'high-fidelity' and 'statistically significant' requires the precise mining procedure (feature selection metric, significance threshold, handling of class imbalance and multiple comparisons). Without an ablation that isolates the injection step (DTM tokens removed vs. present), it remains unclear whether the tokens carry causal flakiness information or merely correlated surface features.
Authors: We acknowledge the need for greater precision on the DTM procedure. We will expand Section 3.2 to detail the mining process, including the feature selection metric (chi-squared with class-weighted frequencies), significance threshold (p < 0.05 with Bonferroni correction), and imbalance handling (oversampling minority class during token selection). We will also add an ablation experiment comparing the full model to a no-DTM variant (standard LLM without token injection) to isolate the contribution and demonstrate that the tokens provide more than surface correlations. revision: yes
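Under the assumptions stated in this response (chi-squared selection with a Bonferroni-corrected threshold), the mining step might look roughly like this stdlib-only sketch; the data layout and function name are hypothetical, not the paper's implementation:

```python
import math
from collections import Counter

def mine_tokens(samples, alpha=0.05):
    """Rank tokens by chi-square(1) association with the flaky label.

    samples: list of (set_of_tokens, is_flaky) pairs, one per test case.
    Returns (token, statistic) pairs passing a Bonferroni-corrected
    threshold across all tested tokens, strongest association first.
    """
    n = len(samples)
    n_flaky = sum(lab for _, lab in samples)
    counts = Counter(tok for toks, _ in samples for tok in toks)
    flaky_counts = Counter(tok for toks, lab in samples if lab for tok in toks)
    threshold = alpha / max(len(counts), 1)  # Bonferroni correction
    selected = []
    for tok, total in counts.items():
        # 2x2 contingency table: token presence vs. flaky label
        a = flaky_counts[tok]      # flaky, has token
        b = total - a              # non-flaky, has token
        c = n_flaky - a            # flaky, no token
        d = n - n_flaky - b        # non-flaky, no token
        den = (a + b) * (c + d) * (a + c) * (b + d)
        if den == 0:
            continue  # degenerate table (e.g., token in every sample)
        stat = n * (a * d - b * c) ** 2 / den
        p = math.erfc(math.sqrt(stat / 2))  # chi-square(1) tail
        if p < threshold:
            selected.append((tok, stat))
    return sorted(selected, key=lambda x: -x[1])
```

Note the sketch does not handle class imbalance; the class-weighted frequencies and minority oversampling the authors describe would sit on top of this.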
Referee: [Section 4.3] Robustness Evaluation: The reported 4-7 pp drop under 'unseen augmentations' (dead-code injection, variable renaming) is central to the generalization claim, but the paper must specify the number and distribution of augmentations, confirm label preservation, and provide a per-augmentation breakdown plus examples. Otherwise the stability advantage over baselines cannot be verified.
Authors: We agree additional specifics are required to substantiate the robustness results. In the revision of Section 4.3, we will specify the number of augmentations (e.g., 500 samples per type across the test set), their distribution, explicit confirmation of label preservation (as all transformations are semantic-preserving and were manually validated on a subset), a per-augmentation performance table for NeuroFlake versus baselines, and illustrative examples of original and augmented tests placed in an appendix. revision: yes
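The two augmentation types named here can be sketched in a few lines; this toy version (regex-based renaming, a trivially unreachable block) illustrates the general technique and is not the paper's transformation pipeline:

```python
import re

def rename_variable(src, old, new):
    """Variable renaming: swap identifier `old` for `new` at word boundaries."""
    return re.sub(rf"\b{re.escape(old)}\b", new, src)

def inject_dead_code(src, marker="    if False:\n        _unused = 0\n"):
    """Dead-code injection: add an unreachable block after the def line."""
    lines = src.splitlines(keepends=True)
    return lines[0] + marker + "".join(lines[1:])

# Both transforms preserve the test's semantics while altering its surface.
test_src = "def test_sum():\n    total = 1 + 2\n    assert total == 3\n"
augmented = inject_dead_code(rename_variable(test_src, "total", "acc"))
```

Because the transforms are semantics-preserving by construction, the original label carries over, which is what makes the perturbed sets usable as a robustness benchmark.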
Circularity Check
No significant circularity detected
full rationale
The paper presents NeuroFlake as an empirical neuro-symbolic framework whose central claims (F1-score lift from 65.79% to 69.34% and robustness under augmentations) are reported as direct experimental outcomes measured on the external FlakeBench dataset. No derivation chain, equations, or first-principles results are supplied that reduce by construction to fitted parameters, self-definitions, or self-citations. The DTM module and attention-injection mechanism are described as architectural choices whose value is validated externally rather than assumed or renamed internally.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "Discriminative Token Mining (DTM) ... Chi-Square (χ²) Test of Independence ... Cross-Project Consistency Filter"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "neuro-symbolic fusion ... injecting these strong latent signals directly into LLM's attention mechanism"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, and Yves Le Traon
- [2] FlakyCat: Predicting flaky tests categories using few-shot learning. In 2023 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 140–151
- [3] Nauman Bin Ali, Emelie Engström, Masoumeh Taromirad, Mohammad Reza Mousavi, Nasir Mehmood Minhas, Daniel Helgesson, Sebastian Kunze, and Mahsa Varshosaz. 2019. On the search for industry-relevant regression testing research. Empirical Software Engineering 24, 4 (2019), 2020–2055
- [4] Abdulrahman Alshammari, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. FlakeFlagger: Predicting flakiness without rerunning tests. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1572–1584
- [5] Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In Proceedings of the 40th International Conference on Software Engineering. 433–444
- [6] Nghi D. Q. Bui, Yijun Yu, and Lingxiao Jiang. 2019. AutoFocus: Interpreting attention-based neural networks by code perturbation. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 38–41
- [7] Junkai Chen, Zhenhao Li, Xing Hu, and Xin Xia. 2024. NLPerturbator: Studying the robustness of code LLMs to natural language variations. ACM Transactions on Software Engineering and Methodology (2024)
- [8] Yang Chen and Reyhaneh Jabbarvand. 2024. Neurosymbolic repair of test flakiness. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1402–1414
- [9] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9268–9277
- [10] Hercules Dalianis. 2018. Evaluation metrics and evaluation. In Clinical Text Mining: Secondary Use of Electronic Patient Records. Springer, 45–53
- [11] Sakina Fatima. 2025. Detection, Categorization and Repair of Flaky Tests Using Large Language Models. Ph.D. Dissertation. Université d'Ottawa/University of Ottawa
- [12] Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. 2022. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering 49, 4 (2022), 1912–1927
- [13] Z. Feng. 2020. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)
- [14] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020)
- [15] Negar Hashemi, Amjed Tahir, Shawn Rasheed, August Shi, and Rachel Blagojevic
- [16] Detecting and evaluating order-dependent flaky tests in JavaScript. In 2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 13–24
- [17] Pascal Hitzler, Aaron Eberhart, Monireh Ebrahimi, Md Kamruzzaman Sarker, and Lu Zhou. 2022. Neuro-symbolic approaches in artificial intelligence. National Science Review 9, 6 (2022), nwac035
- [18] Imen Jaoua, Oussama Ben Sghaier, and Houari Sahraoui. 2025. Combining Large Language Models with Static Analyzers for Code Review Generation. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 174–186
- [19] Wing Lam, Patrice Godefroid, Suman Nath, Anirudh Santhiar, and Suresh Thummalapenta. 2019. Root causing flaky tests in a large-scale industrial setting. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 101–111
- [20] Wing Lam, Reed Oei, August Shi, Darko Marinov, and Tao Xie. 2019. iDFlakies: A framework for detecting and partially classifying flaky tests. In 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). IEEE, 312–322
- [21] Tanakorn Leesatapornwongsa, Xiang Ren, and Suman Nath. 2022. FlakeRepro: Automated and efficient reproduction of concurrency-related flaky tests. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1509–1520
- [22] Fabian Leinen, Daniel Elsner, Alexander Pretschner, Andreas Stahlbauer, Michael Sailer, and Elmar Jürgens. 2024. Cost of flaky tests in continuous integration: An industrial case study. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 329–340
- [23] Nate Levin, Chengpeng Li, Yule Zhang, August Shi, and Wing Lam. 2025. Takuan: Using Dynamic Invariants to Debug Order-Dependent Flaky Tests. In 2025 IEEE/ACM 47th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, 81–85
- [24] Chengpeng Li and August Shi. 2022. Evolution-aware detection of order-dependent flaky tests. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 114–125
- [25] Haonan Li, Yu Hao, Yizhuo Zhai, and Zhiyun Qian. 2023. Assisting static analysis with large language models: A ChatGPT experiment. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2107–2111
- [26]
- [27] Xinyue Liu, Zihe Song, Weike Fang, Wei Yang, and Weihang Wang. 2024. WeFix: Intelligent automatic generation of explicit waits for efficient web end-to-end flaky tests. In Proceedings of the ACM Web Conference 2024. 3043–3052
- [28] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
- [29] Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 643–653
- [30]
- [31] Christopher D. Manning. 2008. Introduction to Information Retrieval. Syngress Publishing
- [32] Amane Meibuki, Renshu Nanao, and Mugen Outa. 2024. Improving learning efficiency in large language models through shortcut learning. (2024)
- [33] Riddhi More and Jeremy S. Bradbury. 2025. An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification. In 2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 349–359
- [34] Gireen Naidu, Tranos Zuva, and Elias Mmbongeni Sibanda. 2023. A review of evaluation metrics in machine learning algorithms. In Computer Science On-line Conference. Springer, 15–25
- [35] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13
- [36] Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2021. A survey of flaky tests. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 1 (2021), 1–74
- [37] Owain Parry, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2023. Empirically evaluating flaky test detection techniques combining test case rerunning and machine learning models. Empirical Software Engineering 28, 3 (2023), 72
- [38] Yu Pei, Sarra Habchi, Renaud Rwemalika, Jeongju Sohn, and Mike Papadakis. 2022. An empirical study of async wait flakiness in front-end testing. In BENEVOL
- [39] Yu Pei, Jeongju Sohn, Sarra Habchi, and Mike Papadakis. 2025. Non-flaky and nearly optimal time-based treatment of asynchronous wait web tests. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–29
- [40] Shanto Rahman, Abdelrahman Baz, Sasa Misailovic, and August Shi. 2024. Quantizing large-language models for predicting flaky tests. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 93–104
- [41] Shanto Rahman, Bala Naren Chanumolu, Suzzana Rafi, August Shi, and Wing Lam. 2025. Ranking Relevant Tests for Order-Dependent Flaky Tests. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 715–715
- [42] Shanto Rahman, Saikat Dutta, and August Shi. 2025. Understanding and Improving Flaky Test Classification. Proceedings of the ACM on Programming Languages 9, OOPSLA2 (2025), 1345–1371
- [43] Shanto Rahman and August Shi. 2024. FlakeSync: Automatically repairing async flaky tests. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12
- [44] Denini Silva, Leopoldo Teixeira, and Marcelo d'Amorim. 2020. Shake It! Detecting flaky tests caused by concurrency with Shaker. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 301–311
- [45] Jiaguo Wang, Yan Lei, Maojin Li, Guanyu Ren, Huan Xie, Shifeng Jin, Junchao Li, and Jian Hu. 2024. FlakyRank: Predicting Flaky Tests Using Augmented Learning to Rank. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 872–883
- [46] Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi
- [47] CodeT5+: Open code large language models for code understanding and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1069–1088
- [48] Shin Yoo and Mark Harman. 2012. Regression testing minimization, selection and prioritization: a survey. Software Testing, Verification and Reliability 22, 2 (2012), 67–120
- [49]
discussion (0)