SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?
Pith reviewed 2026-05-22 05:15 UTC · model grok-4.3
The pith
Current LLMs produce test suites that fail to catch most mutated code errors, with even top models verifying only 10 percent of cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE-Mutation establishes a benchmark of 2,636 mutated variants derived from 800 original instances, including a multilingual subset spanning nine programming languages. The mutants are generated by an agentic framework and are designed to pass validation while fooling test suites. Across seven LLMs, even DeepSeek-V3.1 reaches only 10.20 percent verification and 36.15 percent detection rates on these variants, and the agentic strategy lowers average detection rates from 71.04 percent (conventional mutation methods) to 39.81 percent.
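To put the headline figures in proportion, the short calculation below derives the absolute and relative drop in average detection rate and the average number of mutants per instance. It uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures quoted in the core claim.
conventional_detection = 71.04  # percent, conventional mutation methods
agentic_detection = 39.81       # percent, agentic mutation strategy
mutants, instances = 2636, 800

# Absolute drop in percentage points and relative drop as a fraction.
absolute_drop = conventional_detection - agentic_detection
relative_drop = absolute_drop / conventional_detection

# Average number of mutated variants per original instance.
mutants_per_instance = mutants / instances

print(f"absolute drop: {absolute_drop:.2f} percentage points")
print(f"relative drop: {relative_drop:.1%}")
print(f"mutants per instance: {mutants_per_instance:.3f}")
```

In other words, the agentic mutants cut the average detection rate by a little under half, and each original instance contributes a little over three mutants on average.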
What carries the argument
The agentic language-agnostic framework for generating complex mutants that attempt to pass validation while evading detection by test suites.
If this is right
- Reliable test suites from LLMs would enable better synthesis of program repair trajectories.
- Discriminative test feedback would improve reinforcement learning signals for code models.
- The multilingual setup allows evaluation of test generation quality across different programming languages.
- Agentic mutation methods create more challenging benchmarks than conventional random or rule-based approaches.
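For readers unfamiliar with the conventional baseline, the sketch below shows what a rule-based mutation operator typically looks like. It is a minimal illustration, not the paper's baseline: the operator (`RelationalSwap`), the toy function `clamp_index`, and both test suites are hypothetical names introduced here. It also shows why a superficial suite lets a mutant through while a boundary-aware suite catches it.

```python
import ast


class RelationalSwap(ast.NodeTransformer):
    """Hypothetical rule-based operator: rewrite every `<` as `<=`."""

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node


def mutate(source: str) -> str:
    """Return the source text of the mutant produced by RelationalSwap."""
    tree = RelationalSwap().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


def load(source: str):
    """Execute a source string and return its clamp_index function."""
    namespace = {}
    exec(source, namespace)
    return namespace["clamp_index"]


ORIGINAL = '''
def clamp_index(i, n):
    return i if i < n else n - 1
'''
MUTANT = mutate(ORIGINAL)


def weak_suite(fn) -> bool:
    """Superficial suite: never probes the boundary i == n."""
    return fn(0, 5) == 0 and fn(9, 5) == 4


def boundary_suite(fn) -> bool:
    """Discriminative suite: checks the boundary case directly."""
    return fn(5, 5) == 4


original_fn, mutant_fn = load(ORIGINAL), load(MUTANT)
print(weak_suite(original_fn), weak_suite(mutant_fn))          # True True
print(boundary_suite(original_fn), boundary_suite(mutant_fn))  # True False
```

The weak suite accepts both versions, so the mutant survives; the boundary suite accepts the original and rejects the mutant. The agentic mutants described in the paper are meant to be harder to catch than single-operator rewrites of this kind.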
Where Pith is reading between the lines
- Models trained or fine-tuned specifically against this benchmark could improve at producing tests that survive harder mutations.
- The framework might extend naturally to generating test suites for security vulnerabilities or performance edge cases.
- If the mutants prove representative, scaling current LLMs without new test-generation techniques will leave persistent gaps in software engineering capabilities.
Load-bearing premise
The agentically generated mutants are realistic enough to represent the kinds of errors that matter in actual software development.
What would settle it
A direct comparison in which human experts rate the realism of the mutants or measure how often real bugs from open-source repositories evade LLM-generated test suites at rates similar to or higher than the mutated variants.
Original abstract
Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.
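The abstract's evaluation idea, scoring a test suite by whether mutated solutions slip past it, can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's harness: this page does not give the exact metric definitions, so the code assumes that a mutant counts as detected when a suite that accepts the reference solution rejects the mutant. The types, the `Instance` class, and the toy `abs` example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

Solution = Callable[[int], int]
Test = Callable[[Solution], bool]  # returns True when the test passes


@dataclass
class Instance:
    reference: Solution      # the correct solution
    mutants: List[Solution]  # mutated variants that try to fool the suite


def suite_passes(suite: List[Test], solution: Solution) -> bool:
    """A suite accepts a solution only if every test passes."""
    return all(test(solution) for test in suite)


def detection_rate(instances: List[Instance], suites: List[List[Test]]) -> float:
    """Assumed definition: share of mutants rejected by a suite that
    still accepts the reference solution."""
    detected = total = 0
    for instance, suite in zip(instances, suites):
        suite_is_valid = suite_passes(suite, instance.reference)
        for mutant in instance.mutants:
            total += 1
            if suite_is_valid and not suite_passes(suite, mutant):
                detected += 1
    return detected / total if total else 0.0


# Toy instance: the reference computes abs(x); the mutants do not.
toy = Instance(reference=lambda x: abs(x), mutants=[lambda x: x, lambda x: -x])

weak_suite: List[Test] = [lambda f: f(3) == 3]
strong_suite: List[Test] = [lambda f: f(3) == 3, lambda f: f(-3) == 3]

weak_rate = detection_rate([toy], [weak_suite])      # 0.5: identity mutant survives
strong_rate = detection_rate([toy], [strong_suite])  # 1.0: both mutants rejected
print(weak_rate, strong_rate)
```

The weak suite only exercises a positive input, so the identity mutant passes it; adding one negative input is enough to reject both mutants. The benchmark's actual verification and detection metrics may be defined differently.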
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWE-Mutation, a benchmark for evaluating LLM-generated test suites via systematically mutated solutions designed to fool them. It proposes an agentic, language-agnostic framework for generating complex mutants, yielding 2,636 variants from 800 original instances (with a multilingual subset across nine languages). Experiments across seven LLMs report low performance, e.g., DeepSeek-V3.1 at 10.20% verification and 36.15% detection rates, and show the agentic strategy reduces average detection from 71.04% (conventional methods) to 39.81%, concluding that current LLMs produce superficial and non-discriminative test suites.
Significance. If the agentic mutants are shown to be realistic proxies for practical errors, the work would be significant for software engineering by pinpointing a key limitation in LLM test generation, a bottleneck for program repair and RL-based approaches. The concrete benchmark, multilingual scope, and internal comparison of mutation strategies provide a useful empirical tool and baseline for future LLM-SE research. The explicit numerical results and framework description are strengths that support reproducibility.
major comments (2)
- [§3] §3 (Agentic Mutation Framework): The central claim that LLMs are inadequate for reliable test suites rests on the mutated variants serving as realistic proxies for errors that matter in practice. The manuscript provides no external validation of this premise, such as correlation with real bug distributions from SWE-bench, Defects4J, or expert semantic-impact labeling. Without such grounding, the reported drop to 39.81% detection (and 36.15% for DeepSeek-V3.1) may reflect artificial constructs rather than genuine deficiencies.
- [§4] §4 (Experiments): The headline rates (10.20% verification, 36.15% detection for DeepSeek-V3.1) and the conventional-vs-agentic comparison are presented without error bars, confidence intervals, statistical significance tests, or details on how the 800 original instances were selected. These omissions undermine assessment of whether the inadequacy conclusion is robust or generalizable.
minor comments (3)
- [Abstract] Abstract: The seven evaluated LLMs are not named; listing them (with versions) would improve clarity and allow readers to contextualize the results immediately.
- [§2] §2 (Related Work): Consider adding citations to prior mutation-testing literature in software engineering to better situate the agentic framework relative to conventional approaches.
- [Tables] Figures/Tables: Ensure all tables reporting detection/verification rates include sample sizes per language or per model to aid interpretation of the multilingual results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and statistical presentation that we address below.
Point-by-point responses
- Referee: [§3] §3 (Agentic Mutation Framework): The central claim that LLMs are inadequate for reliable test suites rests on the mutated variants serving as realistic proxies for errors that matter in practice. The manuscript provides no external validation of this premise, such as correlation with real bug distributions from SWE-bench, Defects4J, or expert semantic-impact labeling. Without such grounding, the reported drop to 39.81% detection (and 36.15% for DeepSeek-V3.1) may reflect artificial constructs rather than genuine deficiencies.
  Authors: We agree that external validation against real bug distributions would provide stronger grounding for the mutants as proxies for practical errors. Our agentic framework was developed to generate more complex, semantically impactful mutants than conventional methods, as demonstrated by the substantial reduction in average detection rates from 71.04% to 39.81%. This work primarily introduces the benchmark and framework; a full correlation study with SWE-bench or Defects4J was outside its scope. We will add a new discussion subsection acknowledging this limitation and outlining future validation plans, including expert labeling and comparison to real bug reports.
  Revision: partial
- Referee: [§4] §4 (Experiments): The headline rates (10.20% verification, 36.15% detection for DeepSeek-V3.1) and the conventional-vs-agentic comparison are presented without error bars, confidence intervals, statistical significance tests, or details on how the 800 original instances were selected. These omissions undermine assessment of whether the inadequacy conclusion is robust or generalizable.
  Authors: We concur that statistical details are necessary for assessing robustness. The 800 instances were sampled from SWE-bench to ensure coverage across diverse tasks and languages. We will revise the experiments section to report error bars, confidence intervals, and statistical significance tests for the key rates. We will also expand the description of the instance selection criteria and sampling procedure to support generalizability claims.
  Revision: yes
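As an illustration of the kind of interval the referee asks for, the sketch below computes a Wilson score interval for a reported rate. It rests on an assumption that this page does not confirm: that the 36.15% detection rate is measured over all 2,636 mutants, which gives about 953 detected. The function name and the choice of the Wilson method are illustrative, not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# ASSUMPTION: the denominator is the full set of 2,636 mutants.
n_mutants = 2636
n_detected = round(0.3615 * n_mutants)  # 953 under this assumption

low, high = wilson_interval(n_detected, n_mutants)
print(f"detection rate: {n_detected / n_mutants:.2%}")
print(f"approx. 95% interval: [{low:.2%}, {high:.2%}]")
```

Under that assumption the interval is a few percentage points wide, so sampling noise at this scale would be small next to the reported gap between 71.04% and 39.81%. Per-language or per-model subsets with fewer mutants would have wider intervals, which is why the referee's request for sample sizes matters.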
Circularity Check
Minor internal-comparison risk but no definitional or self-referential circularity
The paper introduces SWE-Mutation as a new benchmark and an agentic mutation framework, then reports empirical detection/verification rates on seven LLMs. These rates (e.g., DeepSeek-V3.1 at 10.20% verification) are direct experimental outputs rather than quantities derived by construction from the same inputs or fitted parameters. No equations, self-citations, or uniqueness theorems are invoked to force the central claim. The comparison of agentic vs. conventional mutation (71.04% to 39.81%) is an internal ablation, but the headline inadequacy conclusion rests on the benchmark results themselves and does not reduce to a tautology. The realism assumption is a methodological limitation, not a circularity pattern.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Systematically mutated solutions can effectively fool and thereby measure the discriminative power of LLM-generated test suites.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites... agentic, language-agnostic framework for automatically generating complex mutants."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.