
arxiv: 2606.27601 · v1 · pith:HIJJOLUBnew · submitted 2026-06-25 · 💻 cs.SE

Test Case Selection for Deep Neural Networks: A Replication Study on LLMs for Code

Pith reviewed 2026-06-29 00:45 UTC · model grok-4.3

classification 💻 cs.SE
keywords test case selection · LLMs for code · replication study · uncertainty-based features · representation-based features · accuracy estimation · early failure discovery · code classification tasks

The pith

Only a subset of test case selection findings from vision DNNs generalize to LLMs for code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This replication study applies test case selection (TCS) techniques previously tested on vision-based deep neural networks to large language models fine-tuned for code-related classification tasks. It evaluates thirteen selection strategies (twelve feature-aware strategies plus simple random sampling) over seven predictive features, three tasks (clone detection, vulnerability detection, and technical debt prediction), and seventeen model instances. The work measures effectiveness on two dimensions: how well selected test cases estimate overall model accuracy and how quickly they surface failures. Results show that uncertainty-based features support early failure discovery while representation-based features support more stable accuracy estimates, yet both vary substantially by task and model. The study therefore concludes that prior vision-DNN patterns transfer only partially and that TCS performance for code LLMs is context-dependent.

Core claim

The central claim is that only a subset of findings reported for vision-based DNNs generalize when TCS is applied to LLMs for code. In particular, uncertainty-based features are effective for early failure discovery, while representation-based features are more robust for accuracy estimation. At the same time, performance varies substantially across tasks and models, indicating that TCS effectiveness is context-dependent.
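To make the failure-discovery side of this claim concrete, the sketch below ranks unlabeled inputs by an uncertainty score and counts how many failures fall inside a labeling budget. It assumes the Gini impurity of the softmax output (a DeepGini-style score) as the uncertainty feature. This page does not list the paper's seven features, so that choice, the function names, and the toy data are illustrative assumptions rather than the authors' setup.

```python
from typing import List, Sequence


def gini_score(probs: Sequence[float]) -> float:
    """Gini impurity of a predicted class distribution: 1 - sum(p_i^2).

    Higher values mean the model is less certain. Assumed here as the
    uncertainty feature; the paper's own features may differ.
    """
    return 1.0 - sum(p * p for p in probs)


def select_most_uncertain(all_probs: List[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` inputs with the highest Gini score."""
    order = sorted(
        range(len(all_probs)),
        key=lambda i: gini_score(all_probs[i]),
        reverse=True,
    )
    return order[:budget]


def failures_found(selected: List[int], is_failure: List[bool]) -> int:
    """Count mispredictions among the selected (now labeled) inputs."""
    return sum(1 for i in selected if is_failure[i])


# Toy data (hypothetical): softmax outputs for five binary-classification inputs.
probs = [[0.99, 0.01], [0.55, 0.45], [0.90, 0.10], [0.51, 0.49], [0.80, 0.20]]
# Failure flags, which in practice are known only after labeling.
is_failure = [False, True, False, True, False]

selected = select_most_uncertain(probs, budget=2)
found = failures_found(selected, is_failure)
```

In this toy pool the two least confident predictions are also the two failures, which is the favorable case for uncertainty-based selection. The study's point is that how often this holds depends on the task and the model.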

What carries the argument

The comparison of uncertainty-based versus representation-based predictive features within thirteen selection strategies evaluated for both accuracy estimation and early failure discovery on code classification tasks.

If this is right

  • Uncertainty-based features can be prioritized when the operational goal is rapid discovery of failures under limited labeling budgets.
  • Representation-based features should be preferred when the goal is reliable estimation of overall model accuracy.
  • Selection strategies must be adapted to the specific task and model because effectiveness is strongly context-dependent.
  • Simple random sampling remains a viable baseline but is outperformed by informed strategies in targeted use cases.
  • Operational evaluation of LLMs for code benefits from task-specific choice of TCS rather than direct reuse of vision DNN defaults.
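The random-sampling baseline mentioned above can be sketched in a few lines. The example draws a simple random sample without replacement, estimates accuracy from it, and attaches a standard error with the finite-population correction. The function name, the seed, and the toy pool are assumptions for illustration; the paper's exact estimator and protocol are not described on this page.

```python
import math
import random
from typing import List, Tuple


def srs_accuracy_estimate(correct: List[bool], budget: int, seed: int = 0) -> Tuple[float, float]:
    """Estimate accuracy from a simple random sample drawn without replacement.

    Returns (estimate, standard_error). The variance estimate is the usual
    one for a proportion under SRS: (1 - n/N) * p(1 - p) / (n - 1).
    """
    n_pool = len(correct)
    if not 2 <= budget <= n_pool:
        raise ValueError("budget must be between 2 and the pool size")
    rng = random.Random(seed)
    sample = rng.sample(range(n_pool), budget)
    # In practice only the sampled inputs would be labeled; the full list
    # of outcomes is available here purely for illustration.
    p_hat = sum(correct[i] for i in sample) / budget
    variance = (1.0 - budget / n_pool) * p_hat * (1.0 - p_hat) / (budget - 1)
    return p_hat, math.sqrt(variance)


# Toy pool (hypothetical): 1000 inputs, 800 of them predicted correctly.
pool = [True] * 800 + [False] * 200
estimate, std_err = srs_accuracy_estimate(pool, budget=100)
# Approximate 95% interval under a normal approximation.
low, high = estimate - 1.96 * std_err, estimate + 1.96 * std_err
```

Feature-aware strategies aim to beat this baseline by shrinking the error of the estimate for the same budget, which is why SRS is a natural reference point for the accuracy-estimation dimension.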

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed split between feature types suggests that hybrid selection strategies combining uncertainty and representation signals could improve both goals simultaneously.
  • The context dependence implies that organizations maintaining multiple code LLMs will need per-task calibration of their testing pipelines.
  • Extending the same replication design to generation tasks or larger foundation models would test whether the partial-transfer pattern holds more broadly.
  • These results point to the value of maintaining separate test-case pools for failure hunting versus accuracy auditing in LLM development workflows.

Load-bearing premise

The three chosen code classification tasks and the seventeen fine-tuned model instances are sufficiently representative of LLM code models in general.

What would settle it

Finding the reverse pattern—representation features outperforming uncertainty features for early failure discovery and uncertainty features outperforming representation features for accuracy estimation—on additional code tasks or model families would falsify the reported generalization limits.

Figures

Figures reproduced from arXiv: 2606.27601 by Ali Asgari, Annibale Panichella, Mitchell Olsthoorn.

Figure 1: Friedman + post-hoc Nemenyi comparisons for error estimation (Top 10).
Figure 2: Friedman test with post-hoc Nemenyi analysis for failure detection (Top 10).
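For readers unfamiliar with the test named in these captions, the following is a minimal sketch of the Friedman statistic computed from per-block ranks. It assumes each block is one task/model configuration and each column is one selection strategy, with lower error being better. The data layout and function names are hypothetical, and no tie correction is applied.

```python
from typing import List, Tuple


def average_ranks(row: List[float]) -> List[float]:
    """Rank the values in one block (1 = lowest); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores: List[List[float]]) -> Tuple[float, List[float]]:
    """scores[b][s] is the error of strategy s on block b.

    Returns (chi-square statistic, mean rank per strategy).
    """
    n = len(scores)      # number of blocks
    k = len(scores[0])   # number of strategies
    rank_sums = [0.0] * k
    for row in scores:
        for s, r in enumerate(average_ranks(row)):
            rank_sums[s] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Toy errors (hypothetical): 4 blocks x 3 strategies.
errors = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.05, 0.1, 0.5], [0.3, 0.4, 0.6]]
chi2, mean_ranks = friedman_statistic(errors)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom. In the Nemenyi post-hoc step, two strategies are considered different when their mean ranks differ by more than the critical difference q_alpha * sqrt(k(k + 1) / (6n)), where q_alpha comes from the Studentized range table.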
Original abstract

Recently, test case selection (TCS) techniques have been explored to support the operational evaluation of deep neural networks (DNNs) under limited testing budgets, where labeling cost is a primary concern and uncovering model failures early is a key objective. Although prior studies report promising results, existing empirical evaluations focus almost exclusively on vision-based DNNs and datasets, leaving it unclear whether prior findings generalize to LLM code models. This paper presents a large-scale replication study of TCS techniques in the context of LLM code models. We re-examine established TCS strategies originally proposed for DNNs and complement them with statistical sampling strategies not previously evaluated for TCS. We assess their effectiveness on three code-related classification tasks: clone detection, vulnerability detection, and technical debt prediction. The study spans 17 task-specific fine-tuned model instances, 7 predictive features, and 13 selection strategies, including 12 feature-aware strategies and simple random sampling (SRS) as a feature-agnostic baseline. We evaluate performance along two dimensions: accuracy estimation and early failure discovery. The results indicate that only a subset of findings reported for vision-based DNNs generalize when TCS is applied to LLMs for code. In particular, uncertainty-based features are effective for early failure discovery, while representation-based features are more robust for accuracy estimation. At the same time, performance varies substantially across tasks and models, indicating that TCS effectiveness is context-dependent. Overall, this study provides empirical evidence on the replicability of TCS techniques beyond vision-based deep learning and offers insights into their use for the operational evaluation of LLMs for code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a replication study of test case selection (TCS) techniques originally developed for vision-based DNNs, now applied to LLMs for code. It evaluates 13 strategies (12 feature-aware plus random sampling) using 7 predictive features across 17 fine-tuned model instances on three classification tasks (clone detection, vulnerability detection, technical debt prediction). Effectiveness is measured on two axes: accuracy estimation and early failure discovery. The central claim is that only a subset of prior vision-DNN findings generalize, with uncertainty-based features effective for early failure discovery and representation-based features more robust for accuracy estimation, while overall performance is highly context-dependent across tasks and models.

Significance. If the results hold under rigorous statistical controls, the work supplies empirical evidence on the replicability of TCS methods outside vision domains and supplies concrete guidance for budgeted operational evaluation of code LLMs. The explicit comparison of uncertainty versus representation features and the documentation of task/model variance are useful contributions.

major comments (2)
  1. [Abstract] Abstract: the generalization statement that 'only a subset of findings reported for vision-based DNNs generalize when TCS is applied to LLMs for code' is grounded exclusively in three classification tasks. The manuscript does not examine generative tasks (completion, repair, summarization) that dominate many LLM-for-code applications; this scope limitation directly affects whether the reported feature rankings can be treated as informative for the broader class of models and tasks.
  2. [Abstract] Abstract (and implied Methods): the study reports differential effectiveness across 17 models, 7 features and 13 strategies yet supplies no information on data splits, cross-validation procedure, statistical significance tests, or error bars. Without these elements it is impossible to rule out post-hoc selection or to quantify the reliability of the claimed context-dependence.
minor comments (1)
  1. The abstract lists 'simple random sampling (SRS) as a feature-agnostic baseline' but does not state whether SRS performance is reported with the same evaluation protocol as the feature-aware strategies; adding an explicit comparison table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our replication study. We address each major comment below and indicate where revisions will be made to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the generalization statement that 'only a subset of findings reported for vision-based DNNs generalize when TCS is applied to LLMs for code' is grounded exclusively in three classification tasks. The manuscript does not examine generative tasks (completion, repair, summarization) that dominate many LLM-for-code applications; this scope limitation directly affects whether the reported feature rankings can be treated as informative for the broader class of models and tasks.

    Authors: We agree the study is limited to three classification tasks and does not cover generative tasks. Prior vision-DNN TCS work also focused on classification, so our replication preserves that scope. We will revise the abstract to explicitly state the scope (classification tasks only) and note that extension to generative tasks is left for future work, preventing overgeneralization of the feature rankings. revision: yes

  2. Referee: [Abstract] Abstract (and implied Methods): the study reports differential effectiveness across 17 models, 7 features and 13 strategies yet supplies no information on data splits, cross-validation procedure, statistical significance tests, or error bars. Without these elements it is impossible to rule out post-hoc selection or to quantify the reliability of the claimed context-dependence.

    Authors: The Methods section details dataset construction and model fine-tuning, including train/test splits. To strengthen the manuscript, we will add explicit reporting of any cross-validation used, apply and report statistical significance tests (e.g., paired Wilcoxon tests with p-values) for strategy comparisons, and include error bars or confidence intervals in figures/tables. This directly addresses concerns about reliability and post-hoc selection. revision: yes
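The rebuttal names paired Wilcoxon tests as one option for comparing strategies. As a hedged illustration of what such a comparison involves, the sketch below computes a two-sided Wilcoxon signed-rank test with a normal approximation. The function name and the toy numbers are assumptions, and the approximation omits tie and continuity corrections, so an established statistics library is preferable for real analyses.

```python
import math
from typing import List, Tuple


def wilcoxon_signed_rank(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W, p_value) with W = min(W+, W-). Zero differences are dropped
    and tied absolute differences share the mean rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = float(min(w_plus, w_minus))
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of N(0, 1)
    return w, p_value


# Toy data (hypothetical): failures found by two strategies on 8 configurations.
strategy_a = [10, 12, 15, 11, 14, 13, 16, 18]
strategy_b = [9, 10, 12, 7, 9, 7, 9, 10]
w_stat, p_val = wilcoxon_signed_rank(strategy_a, strategy_b)
```

Because the same configurations are evaluated under both strategies, a paired test of this kind is a reasonable fit; with many strategy pairs, a multiple-comparison correction would also be needed.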

Circularity Check

0 steps flagged

Empirical replication study with no derivation chain or self-referential definitions

full rationale

The paper is a purely empirical replication study evaluating existing TCS techniques on three code classification tasks using 17 fine-tuned models. It reports observed performance differences for accuracy estimation and early failure discovery without any equations, fitted parameters, predictions derived from inputs, or self-citations that serve as load-bearing premises. All claims rest on direct experimental results rather than quantities defined in terms of the study's own outputs. No patterns of self-definitional, fitted-input-as-prediction, or ansatz-smuggling circularity apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical replication study. No mathematical derivations, no new free parameters fitted to data, and no invented entities are introduced. All claims rest on experimental observations rather than axioms or postulates.


