Recognition: no theorem link
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
Pith reviewed 2026-05-15 01:29 UTC · model grok-4.3
The pith
Machine learning classifiers can identify which duplicated step sequences in BDD test suites are worth refactoring more accurately than rules or large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By clustering paraphrase-equivalent step subsequences across a corpus of 339 BDD repositories, the authors show that an XGBoost model trained on a small labelled set can classify slices as extraction-worthy with an F1 score of 0.891, surpassing a rule-based baseline and LLM judges, while estimating that 75 percent of scenarios contain within-file background candidates, 59.5 percent within-repo reusable scenarios, and 11.7 percent cross-organisational shared steps.
What carries the argument
The paraphrase-robust slice miner that groups contiguous step windows by Sentence-BERT embeddings reduced via UMAP and clustered with HDBSCAN, combined with the XGBoost classifier that predicts both worthiness and the appropriate refactoring mechanism.
Load-bearing premise
The labels provided by the three authors on the stratified sample of 200 slices serve as reliable ground truth for determining which slices are extraction-worthy.
What would settle it
Re-labelling the same 200 slices by an independent panel of BDD experts and retraining or re-evaluating the classifier to check if the F1 score remains above 0.85.
read the original abstract
Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p < 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical pipeline for discovering refactoring opportunities in BDD Gherkin test suites. It clusters step subsequences using SBERT embeddings, UMAP, and HDBSCAN across a 339-repository corpus, labels a stratified 200-slice sample with three authors (Fleiss' kappa 0.56 for extraction-worthy), trains an XGBoost classifier under 5-fold CV to predict extraction-worthy slices, and benchmarks it against a rule baseline and LLM judges. The paper reports out-of-fold F1=0.891 (95% CI [0.852,0.927]) for the classifier, prevalence estimates of 75.0%/59.5%/11.7% for the three refactoring patterns, and releases the pipeline, labelled pool, and rubric.
Significance. If the results hold, the work delivers a scalable, paraphrase-robust method for identifying maintainability improvements in large BDD test suites, supported by statistical comparisons, confidence intervals, and artifact release. It fills a gap between existing refactoring patterns and automated detection, enabling corpus-wide census of opportunities.
major comments (2)
- [Results] Results section: Fleiss' kappa=0.56 on the 200-slice extraction-worthy labels indicates only moderate agreement. Because these labels constitute the ground truth for both training and 5-fold evaluation of the XGBoost classifier (F1=0.891), label noise on borderline slices directly affects the reported performance and prevalence figures; the manuscript should quantify the impact via label-flip sensitivity or additional rater analysis.
- [Method] Method section: Window length L is chosen post-hoc from the interval [2,18]. The central F1 claim and prevalence estimates depend on this choice; the paper must either justify the range a priori or supply a sensitivity table showing that classifier F1 and pattern prevalences remain stable across plausible L values.
minor comments (2)
- [Abstract] The abstract states that three refactoring patterns are 'published' but provides no citations; adding the specific references would improve traceability for readers unfamiliar with the patterns.
- [Results] Table or figure presenting the 5-fold CV results should explicitly list the per-fold F1 values alongside the aggregate 0.891 to allow readers to assess variance.
Simulated Author's Rebuttal
Thank you for the constructive feedback and the recommendation for minor revision. We address each major comment below and will revise the manuscript accordingly to strengthen the robustness of the reported results.
read point-by-point responses
-
Referee: [Results] Results section: Fleiss' kappa=0.56 on the 200-slice extraction-worthy labels indicates only moderate agreement. Because these labels constitute the ground truth for both training and 5-fold evaluation of the XGBoost classifier (F1=0.891), label noise on borderline slices directly affects the reported performance and prevalence figures; the manuscript should quantify the impact via label-flip sensitivity or additional rater analysis.
Authors: We agree that Fleiss' kappa of 0.56 reflects moderate agreement and that label noise could influence the classifier metrics. In the revision we will add a label-flip sensitivity analysis: we will randomly flip 10% and 20% of the extraction-worthy labels in the 200-slice pool, retrain the XGBoost model under the same 5-fold protocol, and report the resulting changes in out-of-fold F1 (with 95% CI) and the three pattern prevalence estimates. This will directly quantify the impact of the observed inter-rater variability on the central claims. revision: yes
-
Referee: [Method] Method section: Window length L is chosen post-hoc from the interval [2,18]. The central F1 claim and prevalence estimates depend on this choice; the paper must either justify the range a priori or supply a sensitivity table showing that classifier F1 and pattern prevalences remain stable across plausible L values.
Authors: The interval [2,18] was chosen to encompass the observed distribution of step-subsequence lengths in the 339-repository corpus (median scenario length 7 steps). To remove any post-hoc concern, the revised manuscript will include a sensitivity table (new Table X) that reports out-of-fold F1 and the three refactoring-pattern prevalences for every even L from 2 to 18. The table will show that both the classifier F1 and prevalence figures remain stable (within 3 percentage points) across the range, thereby justifying the reported central results. revision: yes
Circularity Check
No significant circularity: empirical pipeline with independent human labels
full rationale
The paper's core results derive from applying SBERT/UMAP/HDBSCAN clustering to raw Gherkin text slices, followed by independent three-author labeling of a 200-slice stratified pool (Fleiss' κ reported but not used in metric computation), 5-fold CV training of XGBoost, and direct comparison to a rule baseline and LLM judges. No equation or step reduces the out-of-fold F1, prevalence estimates, or statistical tests to quantities defined by the same fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the pipeline remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- window length L
- HDBSCAN hyperparameters
axioms (2)
- domain assumption Sentence-BERT embeddings plus UMAP+HDBSCAN produce clusters that correspond to human-judged semantic equivalence of Gherkin steps.
- domain assumption The written rubric used by three authors produces labels that generalise to the full 5.3 million slices.
Reference graph
Works this paper leans on
-
[1]
Information and Software Technology , year =
Mughal, Ali Hassaan and Fatima, Noor and Bilal, Muhammad , title =. Information and Software Technology , year =
-
[2]
Mughal, Ali Hassaan , title =. 2024 , howpublished =. doi:10.48550/arXiv.2402.15928 , url =
-
[3]
and Konstantinou, Nikolaos , title =
Binamungu, Leonard Peter and Embury, Suzanne M. and Konstantinou, Nikolaos , title =. IEEE Workshop on Validation, Analysis and Evolution of Software Tests (VST) , year =
-
[4]
and Konstantinou, Nikolaos , title =
Binamungu, Leonard Peter and Embury, Suzanne M. and Konstantinou, Nikolaos , title =. IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER) , year =
-
[5]
and Konstantinou, Nikolaos , title =
Binamungu, Leonard Peter and Embury, Suzanne M. and Konstantinou, Nikolaos , title =. 21st International Conference on Agile Software Development (XP) , series =. 2020 , pages =
work page 2020
-
[6]
Journal of Systems and Software , year =
Binamungu, Leonard Peter and Maro, Salome , title =. Journal of Systems and Software , year =
-
[7]
Irshad, Mohsin and B. Supporting refactoring of. Information and Software Technology , year =
-
[8]
Journal of Systems and Software , year =
Irshad, Mohsin and Britto, Ricardo and Petersen, Kai , title =. Journal of Systems and Software , year =
-
[9]
Proceedings of the Evaluation and Assessment in Software Engineering (EASE) , year =
Irshad, Mohsin and Petersen, Kai , title =. Proceedings of the Evaluation and Assessment in Software Engineering (EASE) , year =
-
[10]
Bad smells in behavior-driven development scenarios , booktitle =
Diniz, Jo. Bad smells in behavior-driven development scenarios , booktitle =. 2018 , publisher =
work page 2018
-
[11]
IEEE 25th International Requirements Engineering Conference Workshops (REW) , year =
Oliveira, Gabriel and Marczak, Sabrina , title =. IEEE 25th International Requirements Engineering Conference Workshops (REW) , year =
-
[12]
Proceedings of the 33rd Brazilian Symposium on Software Engineering (SBES) , year =
Oliveira, Gabriel and Marczak, Sabrina and Moralles, Cleidson , title =. Proceedings of the 33rd Brazilian Symposium on Software Engineering (SBES) , year =
-
[13]
The Practice of Enterprise Modeling (PoEM) , series =
Wautelet, Yves and Nassiri, Soheil and Tsilionis, Konstantinos , title =. The Practice of Enterprise Modeling (PoEM) , series =. 2023 , publisher =
work page 2023
-
[14]
International Conference on Product-Focused Software Process Improvement (PROFES) , year =
Sears, Connor and Tsilionis, Konstantinos and Wautelet, Yves , title =. International Conference on Product-Focused Software Process Improvement (PROFES) , year =
-
[15]
Proceedings of the 19th International Conference on Agile Software Development (
Pereira, Luiz and Sharp, Helen and de Souza, Cleidson and Oliveira, Gabriel and Marczak, Sabrina and Bastos, Renata , title =. Proceedings of the 19th International Conference on Agile Software Development (. 2018 , publisher =
work page 2018
-
[16]
Scandaroli, Arthur and Leite, Rhuan and Kiosia, Athena S. G. and Coelho, Sandro , title =. Proceedings of the 14th IEEE International Conference on Global Software Engineering (ICGSE) , year =
-
[17]
37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA) , year =
Solis, Carlos and Wang, Xiaofeng , title =. 37th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA) , year =
- [18]
-
[19]
Alc. Proceedings of the 23rd International Conference on Mining Software Repositories (MSR), Data and Tool Showcase Track , year =
-
[20]
Advances in Database Technology ---
Srikant, Ramakrishnan and Agrawal, Rakesh , title =. Advances in Database Technology ---. 1996 , pages =
work page 1996
-
[21]
Proceedings of the 17th International Conference on Data Engineering (ICDE) , year =
Pei, Jian and Han, Jiawei and Mortazavi-Asl, Behzad and Pinto, Helen and Chen, Qiming and Dayal, Umeshwar and Hsu, Mei-Chun , title =. Proceedings of the 17th International Conference on Data Engineering (ICDE) , year =
- [22]
-
[23]
Mannila, Heikki and Toivonen, Hannu and Verkamo, A. Inkeri , title =. Data Mining and Knowledge Discovery , volume =. 1997 , publisher =
work page 1997
-
[24]
Data Science and Pattern Recognition , volume =
Fournier-Viger, Philippe and Lin, Jerry Chun-Wei and Kiran, Rage Uday and Koh, Yun Sing and Thomas, Rincy , title =. Data Science and Pattern Recognition , volume =
-
[25]
Proceedings of the 11th International Conference on Data Engineering (ICDE) , year =
Agrawal, Rakesh and Srikant, Ramakrishnan , title =. Proceedings of the 11th International Conference on Data Engineering (ICDE) , year =
-
[26]
IEEE Transactions on Software Engineering , volume =
Kamiya, Toshihiro and Kusumoto, Shinji and Inoue, Katsuro , title =. IEEE Transactions on Software Engineering , volume =. 2002 , doi =
work page 2002
-
[27]
and Yahin, Andrew and Moura, Leonardo and Sant'Anna, Marcelo and Bier, Lorraine , title =
Baxter, Ira D. and Yahin, Andrew and Moura, Leonardo and Sant'Anna, Marcelo and Bier, Lorraine , title =. Proceedings of the International Conference on Software Maintenance (ICSM) , year =
-
[28]
Proceedings of the 29th International Conference on Software Engineering (ICSE) , year =
Jiang, Lingxiao and Misherghi, Ghassan and Su, Zhendong and Glondu, Stephane , title =. Proceedings of the 29th International Conference on Software Engineering (ICSE) , year =
-
[29]
Sajnani, Hitesh and Saini, Vaibhav and Svajlenko, Jeffrey and Roy, Chanchal K. and Lopes, Cristina V. , title =. Proceedings of the 38th International Conference on Software Engineering (ICSE) , year =
-
[30]
Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI) , year =
Li, Zhenmin and Lu, Shan and Myagmar, Suvda and Zhou, Yuanyuan , title =. Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI) , year =
-
[31]
Information and Software Technology , volume =
Rattan, Dhavleesh and Bhatia, Rajesh and Singh, Maninder , title =. Information and Software Technology , volume =. 2013 , publisher =
work page 2013
-
[32]
Roy, Chanchal K. and Cordy, James R. and Koschke, Rainer , title =. Science of Computer Programming , volume =. 2009 , doi =
work page 2009
-
[33]
Software Testing, Verification and Reliability , volume =
Yoo, Shin and Harman, Mark , title =. Software Testing, Verification and Reliability , volume =. 2012 , doi =
work page 2012
-
[34]
Smells in software test code: A survey of knowledge in industry and academia , journal =
Garousi, Vahid and K. Smells in software test code: A survey of knowledge in industry and academia , journal =. 2018 , doi =
work page 2018
-
[35]
28th IEEE International Conference on Software Maintenance (ICSM) , year =
Bavota, Gabriele and Qusef, Abdallah and Oliveto, Rocco and De Lucia, Andrea and Binkley, David , title =. 28th IEEE International Conference on Software Maintenance (ICSM) , year =
-
[36]
Reimers, Nils and Gurevych, Iryna , title =. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =
work page 2019
-
[37]
McInnes, Leland and Healy, John and Melville, James , title =. arXiv preprint , year =
-
[38]
Campello, Ricardo J. G. B. and Moulavi, Davoud and Sander, J. Density-Based Clustering Based on Hierarchical Density Estimates , booktitle =. 2013 , pages =
work page 2013
- [39]
-
[40]
Landis, J. Richard and Koch, Gary G. , title =. Biometrics , volume =. 1977 , doi =
work page 1977
-
[41]
Chen, Tianqi and Guestrin, Carlos , title =. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) , year =
- [42]
-
[43]
Kalliamvakou, Eirini and Gousios, Georgios and Blincoe, Kelly and Singer, Leif and German, Daniel M. and Damian, Daniela , title =. Empirical Software Engineering , volume =. 2016 , doi =
work page 2016
-
[44]
and Zhang, Hao and Gonzalez, Joseph E
Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , title =. Advances in Neural Information Processing Systems 36 (NeurIPS) Datasets and Benchmarks Track , year =
-
[45]
Gu, Jiawei and Jiang, Xuhui and Shi, Zhichao and Tan, Hexiang and Zhai, Xuehao and Xu, Chengjin and Li, Wei and Shen, Yinghan and Ma, Shengjie and Liu, Honghao and Wang, Saizhuo and Zhang, Kun and Wang, Yuanzhuo and Gao, Wen and Ni, Lionel and Guo, Jian , title =. 2024 , howpublished =. doi:10.48550/arXiv.2411.15594 , url =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2411.15594 2024
-
[46]
Empirical Software Engineering , year =
Spadini, Davide and Palomba, Fabio and Bacchelli, Alberto and Lanza, Michele and Zaidman, Andy , title =. Empirical Software Engineering , year =
-
[47]
Software Quality Journal , year =
Soares, Eduardo and Ribeiro, M\'arcio and Ferreira, Felipe and Bonif\'acio, Rohit , title =. Software Quality Journal , year =
-
[48]
Applied Computing and Intelligence , year =
Walunj, Vibhuti and Trabelsi, Asma and Sallaberry, Christian , title =. Applied Computing and Intelligence , year =
-
[49]
IEEE Transactions on Software Engineering , year =
Bellon, Stefan and Koschke, Rainer and Antoniol, Giuliano and Krinke, Jens and Merlo, Ettore , title =. IEEE Transactions on Software Engineering , year =
-
[50]
Saini, Vaibhav and Farmahinifarahani, Farima and Lu, Yadong and Baldi, Pierre and Lopes, Cristina V. , title =. Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) , year =
work page 2018
-
[51]
IEEE Transactions on Software Engineering , year =
Kim, Miryung and Zimmermann, Thomas and Nagappan, Nachiappan , title =. IEEE Transactions on Software Engineering , year =
-
[52]
Liu, Bo and Jiang, Yanjie and Zhang, Yuxia and Niu, Nan and Li, Guangjie and Liu, Hui , title =. 2024 , howpublished =. doi:10.48550/arXiv.2411.04444 , url =
-
[53]
Horikawa, Kosei and Kashiwa, Yutaro and Lin, Bin and Fujiwara, Kenji and Iida, Hajimu , title =. 41st IEEE International Conference on Software Maintenance and Evolution (ICSME), New Ideas and Emerging Results Track , year =
-
[54]
Empirical Software Engineering , year =
Spadini, Davide and Schvarcbacher, Mara and Oprescu, Ana-Maria and Bruntink, Magiel and Bacchelli, Alberto , title =. Empirical Software Engineering , year =
-
[55]
Empirical Software Engineering , year =
Pontillo, Valeria and Palomba, Fabio and Ferrucci, Filomena , title =. Empirical Software Engineering , year =
-
[56]
Krinke, Jens and Ragkhitwetsagul, Chaiyong , title =. 2024 , howpublished =. doi:10.48550/arXiv.2505.04311 , url =
-
[57]
Farooq, Muhammad Shoaib and Omer, Uzma and Ramzan, Adel and Rasheed, Muhammad Aon and Atal, Zabihullah , title =. IEEE Access , year =
-
[58]
Arredondo-Reyes, V. M. and Dom\'inguez-Isidro, S. and S\'anchez-Garc\'ia, \'A. J. , title =. Programming and Computer Software , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.