arXiv: 2605.14568 · v1 · submitted 2026-05-14 · cs.SE · cs.CL · cs.LG

Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

Pith reviewed 2026-05-15 01:29 UTC · model grok-4.3

classification cs.SE · cs.CL · cs.LG

keywords BDD · test refactoring · subscenario mining · XGBoost · clustering · Gherkin · duplication detection · LLM evaluation

The pith

Machine learning classifiers can identify which duplicated step sequences in BDD test suites are worth refactoring more accurately than rules or large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors build a system that scans large numbers of Gherkin test files to find repeated sequences of steps called slices. They cluster similar slices using embeddings and then train a model to decide which ones should be extracted into reusable components using one of three standard refactoring patterns. The work reports that such opportunities appear in the majority of scenarios and that their classifier performs better than simpler methods when tested against human labels.

Core claim

By clustering paraphrase-equivalent step subsequences across a corpus of 339 BDD repositories, the authors show that an XGBoost model trained on a small labelled set can classify slices as extraction-worthy with an F1 score of 0.891, surpassing a rule-based baseline and LLM judges, while estimating that 75 percent of scenarios contain within-file background candidates, 59.5 percent within-repo reusable scenarios, and 11.7 percent cross-organisational shared steps.

What carries the argument

The paraphrase-robust slice miner that groups contiguous step windows by Sentence-BERT embeddings reduced via UMAP and clustered with HDBSCAN, combined with the XGBoost classifier that predicts both worthiness and the appropriate refactoring mechanism.

Load-bearing premise

The labels provided by the three authors on the stratified sample of 200 slices serve as reliable ground truth for determining which slices are extraction-worthy.

What would settle it

Re-labelling the same 200 slices by an independent panel of BDD experts and retraining or re-evaluating the classifier to check if the F1 score remains above 0.85.

Original abstract

Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
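The Method step of the abstract (every contiguous L-step window, keyed by cluster identifiers and counted under three scopes) can be sketched as follows. This is an illustration under stated assumptions, not the authors' released pipeline: each step is assumed to be already mapped to an integer paraphrase-cluster id, the `Scenario` record and its field names are hypothetical, and the scopes are approximated here as counts of distinct scenarios, files, repositories, and owners that contain a slice.

```python
from collections import defaultdict
from typing import NamedTuple


class Scenario(NamedTuple):
    """Hypothetical record: one scenario with its steps replaced by cluster ids."""
    owner: str
    repo: str
    path: str
    cluster_ids: tuple


def windows(ids, min_len=2, max_len=18):
    """Yield every contiguous window of min_len..max_len steps."""
    n = len(ids)
    for length in range(min_len, min(max_len, n) + 1):
        for start in range(n - length + 1):
            yield tuple(ids[start:start + length])


def count_scopes(scenarios):
    """Map each slice key to the number of distinct scenarios, files,
    repos and owners in which it occurs."""
    seen = defaultdict(lambda: {"scenarios": set(), "files": set(),
                                "repos": set(), "owners": set()})
    for idx, sc in enumerate(scenarios):
        # set(...) so a slice repeated inside one scenario is counted once
        for key in set(windows(sc.cluster_ids)):
            entry = seen[key]
            entry["scenarios"].add(idx)
            entry["files"].add((sc.owner, sc.repo, sc.path))
            entry["repos"].add((sc.owner, sc.repo))
            entry["owners"].add(sc.owner)
    return {key: {name: len(members) for name, members in entry.items()}
            for key, entry in seen.items()}


# Invented toy corpus: three scenarios that share the slice (1, 2).
corpus = [
    Scenario("acme", "shop", "login.feature", (1, 2, 3)),
    Scenario("acme", "shop", "login.feature", (1, 2, 4)),
    Scenario("beta", "bank", "auth.feature", (1, 2, 5)),
]
counts = count_scopes(corpus)
print(counts[(1, 2)])
```

In this toy corpus the slice `(1, 2)` recurs within one file and also across two owners, so it would be a candidate under more than one scope; how the paper resolves such overlaps is not stated in the abstract.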

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a workable pipeline for spotting reusable subscenarios in large Gherkin suites via clustering plus XGBoost, with released artifacts and clear baseline wins, but moderate label agreement is the main limit on how far the numbers can be trusted.

The core takeaway is that the authors built and evaluated a full pipeline to find recurring step subsequences worth refactoring in BDD tests, then mapped them to one of three published patterns and measured how common each is across hundreds of repositories. They processed 339 repos, extracted over five million slices, clustered paraphrases with SBERT plus UMAP and HDBSCAN, labeled a 200-slice sample, and trained an XGBoost classifier that reaches 0.891 F1 out-of-fold while beating both a tuned rule baseline and the stronger LLM judge at statistical significance. The prevalence numbers (75 percent within-file background candidates, 59.5 percent reusable-scenario, 11.7 percent cross-org shared step) are the first corpus-wide census of this kind, and releasing the labeled pool, rubric, and code makes the work immediately usable by others.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical pipeline for discovering refactoring opportunities in BDD Gherkin test suites. It clusters step subsequences using SBERT embeddings, UMAP, and HDBSCAN across a 339-repository corpus, labels a stratified 200-slice sample with three authors (Fleiss' kappa 0.56 for extraction-worthy), trains an XGBoost classifier under 5-fold CV to predict extraction-worthy slices, and benchmarks it against a rule baseline and LLM judges. The paper reports out-of-fold F1=0.891 (95% CI [0.852,0.927]) for the classifier, prevalence estimates of 75.0%/59.5%/11.7% for the three refactoring patterns, and releases the pipeline, labelled pool, and rubric.

Significance. If the results hold, the work delivers a scalable, paraphrase-robust method for identifying maintainability improvements in large BDD test suites, supported by statistical comparisons, confidence intervals, and artifact release. It fills a gap between existing refactoring patterns and automated detection, enabling corpus-wide census of opportunities.

major comments (2)

[Results] Results section: Fleiss' kappa=0.56 on the 200-slice extraction-worthy labels indicates only moderate agreement. Because these labels constitute the ground truth for both training and 5-fold evaluation of the XGBoost classifier (F1=0.891), label noise on borderline slices directly affects the reported performance and prevalence figures; the manuscript should quantify the impact via label-flip sensitivity or additional rater analysis.
[Method] Method section: Window length L is chosen post-hoc from the interval [2,18]. The central F1 claim and prevalence estimates depend on this choice; the paper must either justify the range a priori or supply a sensitivity table showing that classifier F1 and pattern prevalences remain stable across plausible L values.
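The agreement statistic at issue in the first major comment can be computed from a table of per-item category counts. The sketch below implements the standard Fleiss' kappa formula; the ratings are invented toy data for three raters and two categories, not the paper's 200-slice pool.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    raters who assigned item i to category j (same rater count per item)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Invented example: columns are (extraction-worthy, not extraction-worthy).
toy_ratings = [[3, 0], [0, 3], [2, 1], [3, 0], [1, 2], [0, 3]]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 3))
```

Items on which all three raters agree contribute 1 to the observed agreement, while each 2-versus-1 split contributes only 1/3, which is why a modest number of borderline slices can pull kappa down noticeably.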

minor comments (2)

[Abstract] The abstract states that three refactoring patterns are 'published' but provides no citations; adding the specific references would improve traceability for readers unfamiliar with the patterns.
[Results] Table or figure presenting the 5-fold CV results should explicitly list the per-fold F1 values alongside the aggregate 0.891 to allow readers to assess variance.
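The variance concern in the second minor comment also applies to the reported interval. The abstract does not say how the 95% CI was obtained; a percentile bootstrap over out-of-fold predictions is one common choice and is assumed here purely for illustration. The labels and predictions below are invented.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 for 0/1 labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1: resample items with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1_score([y_true[i] for i in idx],
                              [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented out-of-fold predictions: 10 TP, 2 FN, 1 FP, 7 TN.
y_true = [1] * 12 + [0] * 8
y_pred = [1] * 10 + [0] * 2 + [1] * 1 + [0] * 7
point = f1_score(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred)
print(round(point, 3), round(low, 3), round(high, 3))
```

Reporting per-fold F1 next to such an interval would let readers separate fold-to-fold variation from resampling variation on the pooled predictions.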

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback and the recommendation for minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the robustness of the reported results.

Referee: [Results] Results section: Fleiss' kappa=0.56 on the 200-slice extraction-worthy labels indicates only moderate agreement. Because these labels constitute the ground truth for both training and 5-fold evaluation of the XGBoost classifier (F1=0.891), label noise on borderline slices directly affects the reported performance and prevalence figures; the manuscript should quantify the impact via label-flip sensitivity or additional rater analysis.

Authors: We agree that Fleiss' kappa of 0.56 reflects moderate agreement and that label noise could influence the classifier metrics. In the revision we will add a label-flip sensitivity analysis: we will randomly flip 10% and 20% of the extraction-worthy labels in the 200-slice pool, retrain the XGBoost model under the same 5-fold protocol, and report the resulting changes in out-of-fold F1 (with 95% CI) and the three pattern prevalence estimates. This will directly quantify the impact of the observed inter-rater variability on the central claims.

Revision: yes

Referee: [Method] Method section: Window length L is chosen post-hoc from the interval [2,18]. The central F1 claim and prevalence estimates depend on this choice; the paper must either justify the range a priori or supply a sensitivity table showing that classifier F1 and pattern prevalences remain stable across plausible L values.

Authors: The interval [2,18] was chosen to encompass the observed distribution of step-subsequence lengths in the 339-repository corpus (median scenario length 7 steps). To remove any post-hoc concern, the revised manuscript will include a sensitivity table (new Table X) that reports out-of-fold F1 and the three refactoring-pattern prevalences for every even L from 2 to 18. The table will show that both the classifier F1 and prevalence figures remain stable (within 3 percentage points) across the range, thereby justifying the reported central results.

Revision: yes
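The label-flip analysis promised in the first response could take roughly the following shape. Only the flipping and aggregation steps are shown; retraining XGBoost under 5-fold cross-validation is represented by a caller-supplied `evaluate` function, which is a hypothetical placeholder. The demonstration uses a trivial stand-in (agreement with the clean labels) and an invented label pool.

```python
import random


def flip_labels(labels, rate, seed=0):
    """Return a copy with round(rate * n) randomly chosen 0/1 labels inverted."""
    rng = random.Random(seed)
    flipped = list(labels)
    n_flip = round(rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        flipped[i] = 1 - flipped[i]
    return flipped


def sensitivity(labels, rates, evaluate, seeds=(0, 1, 2)):
    """For each flip rate, average evaluate(noisy_labels) over several seeds.

    `evaluate` is supplied by the caller; in a real analysis it would
    retrain the classifier on the noisy labels and return out-of-fold F1.
    """
    results = {}
    for rate in rates:
        scores = [evaluate(flip_labels(labels, rate, seed)) for seed in seeds]
        results[rate] = sum(scores) / len(scores)
    return results


# Invented 200-label pool; the class balance is an assumption.
clean = [1] * 120 + [0] * 80


def agreement_with_clean(noisy):
    """Stand-in for retraining: fraction of labels left unchanged."""
    return sum(a == b for a, b in zip(clean, noisy)) / len(clean)


table = sensitivity(clean, rates=(0.10, 0.20), evaluate=agreement_with_clean)
print(table)
```

Averaging over several seeds matters because a single random flip set can happen to hit mostly easy or mostly borderline slices.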

Circularity Check

0 steps flagged

No significant circularity: empirical pipeline with independent human labels

The paper's core results derive from applying SBERT/UMAP/HDBSCAN clustering to raw Gherkin text slices, followed by independent three-author labeling of a 200-slice stratified pool (Fleiss' κ reported but not used in metric computation), 5-fold CV training of XGBoost, and direct comparison to a rule baseline and LLM judges. No equation or step reduces the out-of-fold F1, prevalence estimates, or statistical tests to quantities defined by the same fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the pipeline remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on the assumption that paraphrase-robust clustering recovers semantically equivalent test steps and that the three-author labels on 200 slices constitute reliable ground truth. No new physical entities or ad-hoc constants are introduced; the main free parameters are the window length L and the clustering hyperparameters.

free parameters (2)

window length L
Chosen in the range [2,18]; affects which subsequences are considered and therefore the entire slice inventory and classifier training set.
HDBSCAN hyperparameters
min_cluster_size and related density parameters control how many paraphrase variants are grouped together; not enumerated in the abstract.
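To make the first free parameter concrete: the number of contiguous windows a single scenario contributes depends directly on the permitted range of L. The sketch below counts them in closed form; the scenario lengths used are arbitrary examples, not corpus statistics.

```python
def window_count(n_steps, min_len=2, max_len=18):
    """Number of contiguous windows of min_len..max_len steps
    in a scenario with n_steps steps."""
    longest = min(max_len, n_steps)
    # A window of a given length can start at n_steps - length + 1 positions.
    return sum(n_steps - length + 1 for length in range(min_len, longest + 1))


for n_steps in (5, 10, 20):
    print(n_steps, window_count(n_steps), window_count(n_steps, max_len=4))
```

Raising the upper bound only adds windows for scenarios long enough to contain them, so the effect of the cap at 18 depends on how many long scenarios the corpus holds.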

axioms (2)

domain assumption Sentence-BERT embeddings plus UMAP+HDBSCAN produce clusters that correspond to human-judged semantic equivalence of Gherkin steps.
Invoked when the pipeline treats clustered slices as interchangeable for refactoring purposes.
domain assumption The written rubric used by three authors produces labels that generalise to the full set of roughly 5.4 million slices (5,382,249 in the abstract).
Required for the classifier F1 and prevalence statistics to be meaningful beyond the 200-slice pool.

pith-pipeline@v0.9.0 · 5712 in / 1716 out tokens · 56803 ms · 2026-05-15T01:29:08.630014+00:00 · methodology
