arxiv: 2605.22175 · v1 · pith:TVKWGMVXnew · submitted 2026-05-21 · 💻 cs.SE · cs.AI

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

Pith reviewed 2026-05-22 05:15 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords test suite generation · LLM evaluation · mutation testing · software engineering benchmark · program repair · discriminative tests · multi-language code

The pith

Current LLMs generate test suites that miss most mutated solutions: even the strongest model tested reaches only about 10 percent verification and 36 percent detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-Mutation, a benchmark that evaluates LLM-generated test suites by running them against systematically mutated solutions designed to fool the suites and pass validation. It develops an agentic, language-agnostic framework that automatically produces complex mutants from original solutions, yielding 2,636 variants from 800 base instances, including a multilingual subset spanning nine programming languages. Experiments across seven LLMs show low performance: the strongest model reaches just 10.20 percent verification and 36.15 percent detection. The agentic approach also produces harder mutants than conventional mutation techniques, lowering the average detection rate from 71.04 percent to 39.81 percent, a drop of about 31 percentage points. These results indicate that LLM-generated test suites remain superficial and lack the discriminative power needed for reliable feedback in program repair or reinforcement learning.

Core claim

SWE-Mutation is a benchmark of 2,636 mutated variants derived from 800 original instances, with a multilingual subset spanning nine languages. The variants are generated by an agentic framework whose mutants are intended to fool test suites and pass validation. Across seven LLMs, even DeepSeek-V3.1 reaches only 10.20 percent verification and 36.15 percent detection, and the agentic strategy lowers average detection from 71.04 percent to 39.81 percent relative to conventional methods.
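
To make the two headline metrics concrete, the Python sketch below computes a detection rate and a verification rate from a small table of test outcomes. This summary does not give the paper's exact metric definitions, so the ones used here are assumptions: a mutant counts as detected when the generated suite passes on the golden solution and fails on the mutant, and an instance counts as verified when the suite passes on the golden solution and detects every one of its mutants. The names `InstanceResult`, `detection_rate`, and `verification_rate` are hypothetical, and the sample data is made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceResult:
    """Outcome of running one generated test suite (hypothetical record)."""
    passes_golden: bool          # suite passes on the original (golden) solution
    fails_on_mutant: List[bool]  # one flag per mutant: True if the suite fails on it


def detection_rate(results: List[InstanceResult]) -> float:
    """Fraction of all mutants that are detected (assumed definition)."""
    total = sum(len(r.fails_on_mutant) for r in results)
    # A suite that rejects the golden solution gets no credit for its mutants.
    detected = sum(sum(r.fails_on_mutant) for r in results if r.passes_golden)
    return detected / total if total else 0.0


def verification_rate(results: List[InstanceResult]) -> float:
    """Fraction of instances whose suite accepts golden and rejects every mutant (assumed)."""
    if not results:
        return 0.0
    verified = sum(
        1
        for r in results
        if r.passes_golden and r.fails_on_mutant and all(r.fails_on_mutant)
    )
    return verified / len(results)


# Illustrative data only; not taken from the benchmark.
results = [
    InstanceResult(True, [True, True, True]),     # verified: all mutants caught
    InstanceResult(True, [True, False]),          # one mutant slips through
    InstanceResult(False, [True, True]),          # suite is invalid on golden
    InstanceResult(True, [False, False, False]),  # suite catches nothing
]
print(detection_rate(results), verification_rate(results))
```

Under these assumed definitions the verification rate is the stricter metric, which is consistent with the reported verification figure being much lower than the detection figure.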

What carries the argument

The agentic language-agnostic framework for generating complex mutants that attempt to pass validation while evading detection by test suites.
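
For contrast with the agentic mutants, the following minimal Python sketch shows what a conventional rule-based mutation can look like and how a shallow test suite lets the mutant survive. It is not the paper's framework: the `RelaxBoundary` operator, the `is_adult` function, and both test suites are hypothetical, chosen only to illustrate the idea of a mutant that evades detection.

```python
import ast

# Hypothetical golden solution used only for this illustration.
GOLDEN = "def is_adult(age):\n    return age >= 18\n"


class RelaxBoundary(ast.NodeTransformer):
    """Rule-based mutation operator: replace every `>=` with `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(program, name):
    """Compile source text or an AST module and return the named function."""
    namespace = {}
    exec(compile(program, "<candidate>", "exec"), namespace)
    return namespace[name]


def mutate(source):
    """Apply the mutation operator and return the mutated AST."""
    tree = RelaxBoundary().visit(ast.parse(source))
    return ast.fix_missing_locations(tree)


def weak_suite(fn):
    """Shallow tests that never probe the boundary value."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Adds a boundary check, which distinguishes `>=` from `>`."""
    return weak_suite(fn) and fn(18) is True


golden = load(GOLDEN, "is_adult")
mutant = load(mutate(GOLDEN), "is_adult")

print("weak suite detects mutant:", not weak_suite(mutant))
print("strong suite detects mutant:", not strong_suite(mutant))
```

Both suites pass on the golden solution, but only the one with the boundary check fails on the mutant. A single-operator mutant like this is comparatively easy to catch; the summary's claim is that the agentic mutants are harder.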

If this is right

  • Reliable test suites from LLMs would enable better synthesis of program repair trajectories.
  • Discriminative test feedback would improve reinforcement learning signals for code models.
  • The multilingual setup allows evaluation of test generation quality across different programming languages.
  • Agentic mutation methods create more challenging benchmarks than conventional random or rule-based approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained or fine-tuned specifically against this benchmark could improve at producing tests that survive harder mutations.
  • The framework might extend naturally to generating test suites for security vulnerabilities or performance edge cases.
  • If the mutants prove representative, scaling current LLMs without new test-generation techniques will leave persistent gaps in software engineering capabilities.

Load-bearing premise

The agentically generated mutants are realistic enough to represent the kinds of errors that matter in actual software development.

What would settle it

A direct comparison in which human experts rate the realism of the mutants or measure how often real bugs from open-source repositories evade LLM-generated test suites at rates similar to or higher than the mutated variants.

Figures

Figures reproduced from arXiv: 2605.22175 by Jinbo Wang, Kai Zhang, Mengdi Zhang, Yao Du, Yufeng Wang, Yuxuan Sun, Yuze Zhao, Zhenya Huang, Zhiyuan Ma.

Figure 1: The pivotal role of test suites and high-quality …
Figure 2: The overview of our framework. Starting with the golden solution and golden test suite in a repository, we …
Figure 3: Performance gains achieved by switching from Mini-Swe-Agent to Claude Code.
Figure 4: RDR performance across 9 programming languages for different LLMs.
Figure 5: VRR performance across 9 programming languages for different LLMs.
Figure 6: The simplified system prompt for the Mutation Agent. The agent is tasked with introducing specific types …
Figure 7: The system prompt for the test generation task. The model is tasked with creating new test files from …
Figure 8: The system prompt for the Test Repair task. Unlike Test Generation, this task focuses on creating a …
Abstract

Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces SWE-Mutation, a benchmark for evaluating LLM-generated test suites via systematically mutated solutions designed to fool them. It proposes an agentic, language-agnostic framework for generating complex mutants, yielding 2,636 variants from 800 original instances (with a multilingual subset across nine languages). Experiments across seven LLMs report low performance, e.g., DeepSeek-V3.1 at 10.20% verification and 36.15% detection rates, and show the agentic strategy reduces average detection from 71.04% (conventional methods) to 39.81%, concluding that current LLMs produce superficial and non-discriminative test suites.

Significance. If the agentic mutants are shown to be realistic proxies for practical errors, the work would be significant for software engineering by pinpointing a key limitation in LLM test generation, a bottleneck for program repair and RL-based approaches. The concrete benchmark, multilingual scope, and internal comparison of mutation strategies provide a useful empirical tool and baseline for future LLM-SE research. The explicit numerical results and framework description are strengths that support reproducibility.

major comments (2)
  1. §3 (Agentic Mutation Framework): The central claim that LLMs are inadequate for reliable test suites rests on the mutated variants serving as realistic proxies for errors that matter in practice. The manuscript provides no external validation of this premise, such as correlation with real bug distributions from SWE-bench, Defects4J, or expert semantic-impact labeling. Without such grounding, the reported drop to 39.81% detection (and 36.15% for DeepSeek-V3.1) may reflect artificial constructs rather than genuine deficiencies.
  2. §4 (Experiments): The headline rates (10.20% verification, 36.15% detection for DeepSeek-V3.1) and the conventional-vs-agentic comparison are presented without error bars, confidence intervals, statistical significance tests, or details on how the 800 original instances were selected. These omissions undermine assessment of whether the inadequacy conclusion is robust or generalizable.
minor comments (3)
  1. Abstract: The seven evaluated LLMs are not named; listing them (with versions) would improve clarity and allow readers to contextualize the results immediately.
  2. §2 (Related Work): Consider adding citations to prior mutation-testing literature in software engineering to better situate the agentic framework relative to conventional approaches.
  3. Figures/Tables: Ensure all tables reporting detection/verification rates include sample sizes per language or per model to aid interpretation of the multilingual results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and statistical presentation that we address below.

  1. Referee: §3 (Agentic Mutation Framework): The central claim that LLMs are inadequate for reliable test suites rests on the mutated variants serving as realistic proxies for errors that matter in practice. The manuscript provides no external validation of this premise, such as correlation with real bug distributions from SWE-bench, Defects4J, or expert semantic-impact labeling. Without such grounding, the reported drop to 39.81% detection (and 36.15% for DeepSeek-V3.1) may reflect artificial constructs rather than genuine deficiencies.

    Authors: We agree that external validation against real bug distributions would provide stronger grounding for the mutants as proxies for practical errors. Our agentic framework was developed to generate more complex, semantically impactful mutants than conventional methods, as demonstrated by the substantial reduction in average detection rates from 71.04% to 39.81%. This work primarily introduces the benchmark and framework; a full correlation study with SWE-bench or Defects4J was outside its scope. We will add a new discussion subsection acknowledging this limitation and outlining future validation plans, including expert labeling and comparison to real bug reports. revision: partial

  2. Referee: §4 (Experiments): The headline rates (10.20% verification, 36.15% detection for DeepSeek-V3.1) and the conventional-vs-agentic comparison are presented without error bars, confidence intervals, statistical significance tests, or details on how the 800 original instances were selected. These omissions undermine assessment of whether the inadequacy conclusion is robust or generalizable.

    Authors: We concur that statistical details are necessary for assessing robustness. The 800 instances were sampled from SWE-bench to ensure coverage across diverse tasks and languages. We will revise the experiments section to report error bars, confidence intervals, and statistical significance tests for the key rates. We will also expand the description of the instance selection criteria and sampling procedure to support generalizability claims. revision: yes
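
As a rough illustration of the interval reporting discussed in the second response, the Python sketch below computes a Wilson score interval for a single proportion. The choice of the Wilson method is an assumption, not something the paper or the rebuttal specifies. The example also assumes that the 36.15% detection rate was measured over all 2,636 mutants, which this summary does not confirm, so the count of 953 is a back-calculated placeholder rather than a reported figure.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed inputs: 953 is roughly 36.15% of 2,636 (hypothetical count).
low, high = wilson_interval(successes=953, n=2636)
print(f"95% interval: [{low:.4f}, {high:.4f}]")
```

With a sample of this size the interval is only a few percentage points wide, so sampling noise in a single rate is unlikely to be the main concern. Per-language subsets would be smaller and their intervals correspondingly wider, which is where reporting sample sizes matters most.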

Circularity Check

0 steps flagged

Minor internal-comparison risk but no definitional or self-referential circularity

The paper introduces SWE-Mutation as a new benchmark and an agentic mutation framework, then reports empirical detection/verification rates on seven LLMs. These rates (e.g., DeepSeek-V3.1 at 10.20% verification) are direct experimental outputs rather than quantities derived by construction from the same inputs or fitted parameters. No equations, self-citations, or uniqueness theorems are invoked to force the central claim. The comparison of agentic vs. conventional mutation (71.04% to 39.81%) is an internal ablation, but the headline inadequacy conclusion rests on the benchmark results themselves and does not reduce to a tautology. The realism assumption is a methodological limitation, not a circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that mutated solutions can serve as effective proxies for real-world bugs that test suites should catch. No free parameters or invented entities are explicitly introduced in the abstract; the framework is described as language-agnostic but its internal parameters are not detailed.

axioms (1)
  • Domain assumption: Systematically mutated solutions can effectively fool and thereby measure the discriminative power of LLM-generated test suites.
    This assumption underpins the entire benchmark construction and the interpretation of low detection rates as evidence of LLM inadequacy.



