SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

Jinbo Wang; Kai Zhang; Mengdi Zhang; Yao Du; Yufeng Wang; Yuxuan Sun; Yuze Zhao; Zhenya Huang; Zhiyuan Ma

arxiv: 2605.22175 · v1 · pith:TVKWGMVXnew · submitted 2026-05-21 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-22 05:15 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: test suite generation · LLM evaluation · mutation testing · software engineering benchmark · program repair · discriminative tests · multi-language code

The pith

Current LLMs produce test suites that fail to catch most mutated code errors, with even top models verifying only 10 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-Mutation, a benchmark that evaluates LLM-generated test suites by pitting them against systematically mutated solutions that try to fool the tests and pass validation. It develops an agentic, language-agnostic framework that automatically produces complex mutants from original solutions, yielding 2636 variants from 800 base instances, including a multilingual subset spanning nine programming languages. Experiments across seven LLMs show low performance: even DeepSeek-V3.1 reaches just 10.20 percent verification and 36.15 percent detection rates. The agentic approach creates harder challenges than conventional mutation techniques, lowering average detection from 71.04 percent to 39.81 percent. These results indicate that LLM test suites remain superficial and lack the discriminative power needed for reliable feedback in program repair or reinforcement learning.

Core claim

SWE-Mutation establishes a benchmark of 2636 mutated variants derived from 800 original instances, with a multilingual subset spanning nine languages. The variants come from an agentic framework whose mutants are intended to fool test suites and pass validation. Across seven LLMs, even DeepSeek-V3.1 reaches only 10.20 percent verification and 36.15 percent detection rates on these variants, while the agentic strategy lowers average detection rates from 71.04 percent to 39.81 percent relative to conventional methods.
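To put the headline figures in proportion, the short sketch below works through the arithmetic using only the numbers quoted above. The variable names are illustrative, and the per-instance average is a simple division that assumes nothing about how mutants are actually distributed across instances.

```python
# Figures quoted in the core claim above.
n_mutants = 2636
n_instances = 800
conventional_detection = 71.04  # percent, conventional mutation methods
agentic_detection = 39.81       # percent, agentic mutation strategy

# Average number of mutated variants per original instance.
mutants_per_instance = n_mutants / n_instances

# Drop in average detection rate, in percentage points and relative terms.
absolute_drop = conventional_detection - agentic_detection
relative_drop = absolute_drop / conventional_detection * 100

print(f"mutants per instance: {mutants_per_instance:.1f}")
print(f"absolute drop: {absolute_drop:.2f} percentage points")
print(f"relative drop: {relative_drop:.1f}%")
```

The drop of 31.23 percentage points corresponds to a relative reduction of roughly 44 percent in average detection.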

What carries the argument

The agentic language-agnostic framework for generating complex mutants that attempt to pass validation while evading detection by test suites.
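As a toy illustration of what it means for a mutant to evade a test suite, the sketch below pairs a reference function with a hand-written mutant and two small suites. Everything here (`clamp`, `clamp_mutant`, `weak_suite`, `strong_suite`) is hypothetical and far simpler than the repository-level mutants the paper describes; it only shows the mechanism.

```python
def clamp(value, low, high):
    """Reference ("golden") implementation: restrict value to [low, high]."""
    return max(low, min(value, high))


def clamp_mutant(value, low, high):
    """Hypothetical mutant: the lower bound is silently ignored."""
    return min(value, high)


def weak_suite(fn):
    """Superficial tests: only in-range and above-range inputs are checked."""
    return fn(5, 0, 10) == 5 and fn(15, 0, 10) == 10


def strong_suite(fn):
    """Adds a below-range input, which is where the mutant diverges."""
    return weak_suite(fn) and fn(-3, 0, 10) == 0


for name, suite in [("weak", weak_suite), ("strong", strong_suite)]:
    passes_golden = suite(clamp)             # a valid suite accepts the reference
    detects_mutant = not suite(clamp_mutant)  # a discriminative suite rejects the mutant
    print(name, "passes golden:", passes_golden, "detects mutant:", detects_mutant)
```

Both suites accept the reference implementation, but only the stronger one fails on the mutant. A benchmark built on mutants measures how often generated suites behave like the second suite rather than the first.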

If this is right

Reliable test suites from LLMs would enable better synthesis of program repair trajectories.
Discriminative test feedback would improve reinforcement learning signals for code models.
The multilingual setup allows evaluation of test generation quality across different programming languages.
Agentic mutation methods create more challenging benchmarks than conventional random or rule-based approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models trained or fine-tuned specifically against this benchmark could improve at producing tests that survive harder mutations.
The framework might extend naturally to generating test suites for security vulnerabilities or performance edge cases.
If the mutants prove representative, scaling current LLMs without new test-generation techniques will leave persistent gaps in software engineering capabilities.

Load-bearing premise

The agentically generated mutants are realistic enough to represent the kinds of errors that matter in actual software development.

What would settle it

A direct comparison in which human experts rate the realism of the mutants, or a measurement of whether real bugs from open-source repositories evade LLM-generated test suites at rates similar to or higher than those of the mutated variants.

Figures

Figures reproduced from arXiv: 2605.22175 by Jinbo Wang, Kai Zhang, Mengdi Zhang, Yao Du, Yufeng Wang, Yuxuan Sun, Yuze Zhao, Zhenya Huang, Zhiyuan Ma.

**Figure 2.** The overview of our framework. Starting with the golden solution and golden test suite in a repository, we …

**Figure 3.** Performance gains achieved by switching from Mini-Swe-Agent to Claude Code. Surrounding text from the paper: "… specialized “Edit” tools, which facilitate the generation and modification of large-scale files. Interestingly, as illustrated in …"

**Figure 4.** RDR performance across 9 programming languages for different LLMs. Surrounding text from the paper: "… and JS/TS. Our analysis identifies the root cause. Models encounter major hurdles when synthesizing tests involving memory management (typical in C/C++) and event-driven mechanisms (typical in JS/TS). In Appendix F, we provide an analysis of representative failure cases." Section 4.6, "Comparison between Mutation Strategies", begins: "We compare three mutation …"

**Figure 5.** VRR performance across 9 programming languages for different LLMs.

**Figure 6.** The simplified system prompt for the Mutation Agent. The agent is tasked with introducing specific types …

**Figure 7.** The system prompt for the test generation task. The model is tasked with creating new test files from …

**Figure 8.** The system prompt for the Test Repair task. Unlike Test Generation, this task focuses on creating a …

Original abstract

Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces SWE-Mutation, a benchmark for evaluating LLM-generated test suites via systematically mutated solutions designed to fool them. It proposes an agentic, language-agnostic framework for generating complex mutants, yielding 2,636 variants from 800 original instances (with a multilingual subset across nine languages). Experiments across seven LLMs report low performance, e.g., DeepSeek-V3.1 at 10.20% verification and 36.15% detection rates, and show the agentic strategy reduces average detection from 71.04% (conventional methods) to 39.81%, concluding that current LLMs produce superficial and non-discriminative test suites.

Significance. If the agentic mutants are shown to be realistic proxies for practical errors, the work would be significant for software engineering by pinpointing a key limitation in LLM test generation, a bottleneck for program repair and RL-based approaches. The concrete benchmark, multilingual scope, and internal comparison of mutation strategies provide a useful empirical tool and baseline for future LLM-SE research. The explicit numerical results and framework description are strengths that support reproducibility.

major comments (2)

§3 (Agentic Mutation Framework): The central claim that LLMs are inadequate for reliable test suites rests on the mutated variants serving as realistic proxies for errors that matter in practice. The manuscript provides no external validation of this premise, such as correlation with real bug distributions from SWE-bench, Defects4J, or expert semantic-impact labeling. Without such grounding, the reported drop to 39.81% detection (and 36.15% for DeepSeek-V3.1) may reflect artificial constructs rather than genuine deficiencies.
§4 (Experiments): The headline rates (10.20% verification, 36.15% detection for DeepSeek-V3.1) and the conventional-vs-agentic comparison are presented without error bars, confidence intervals, statistical significance tests, or details on how the 800 original instances were selected. These omissions undermine assessment of whether the inadequacy conclusion is robust or generalizable.
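To make the request for confidence intervals concrete, the sketch below computes a Wilson score interval for a single reported proportion. It is an illustration, not the paper's analysis: the denominator is an assumption (all 2,636 mutants are treated as independent trials, which the page does not confirm), and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# ASSUMPTION: the 36.15% detection rate is taken over all 2,636 mutants,
# each counted as one independent trial. The real denominator may differ.
n = 2636
detected = round(0.3615 * n)

low, high = wilson_interval(detected, n)
print(f"detection rate: {detected / n:.4f}, 95% interval: [{low:.4f}, {high:.4f}]")
```

Under that assumption the interval is a few percentage points wide, so sampling noise alone would not explain a gap as large as the one between 71.04% and 39.81%. Mutants derived from the same instance are unlikely to be independent, however, so a clustered or bootstrap estimate would be the more careful choice.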

minor comments (3)

Abstract: The seven evaluated LLMs are not named; listing them (with versions) would improve clarity and allow readers to contextualize the results immediately.
§2 (Related Work): Consider adding citations to prior mutation-testing literature in software engineering to better situate the agentic framework relative to conventional approaches.
Figures/Tables: Ensure all tables reporting detection/verification rates include sample sizes per language or per model to aid interpretation of the multilingual results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validation and statistical presentation that we address below.

Referee: §3 (Agentic Mutation Framework): The central claim that LLMs are inadequate for reliable test suites rests on the mutated variants serving as realistic proxies for errors that matter in practice. The manuscript provides no external validation of this premise, such as correlation with real bug distributions from SWE-bench, Defects4J, or expert semantic-impact labeling. Without such grounding, the reported drop to 39.81% detection (and 36.15% for DeepSeek-V3.1) may reflect artificial constructs rather than genuine deficiencies.

Authors: We agree that external validation against real bug distributions would provide stronger grounding for the mutants as proxies for practical errors. Our agentic framework was developed to generate more complex, semantically impactful mutants than conventional methods, as demonstrated by the substantial reduction in average detection rates from 71.04% to 39.81%. This work primarily introduces the benchmark and framework; a full correlation study with SWE-bench or Defects4J was outside its scope. We will add a new discussion subsection acknowledging this limitation and outlining future validation plans, including expert labeling and comparison to real bug reports.

Revision: partial

Referee: §4 (Experiments): The headline rates (10.20% verification, 36.15% detection for DeepSeek-V3.1) and the conventional-vs-agentic comparison are presented without error bars, confidence intervals, statistical significance tests, or details on how the 800 original instances were selected. These omissions undermine assessment of whether the inadequacy conclusion is robust or generalizable.

Authors: We concur that statistical details are necessary for assessing robustness. The 800 instances were sampled from SWE-bench to ensure coverage across diverse tasks and languages. We will revise the experiments section to report error bars, confidence intervals, and statistical significance tests for the key rates. We will also expand the description of the instance selection criteria and sampling procedure to support generalizability claims.

Revision: yes

Circularity Check

0 steps flagged

Minor internal-comparison risk but no definitional or self-referential circularity

The paper introduces SWE-Mutation as a new benchmark and an agentic mutation framework, then reports empirical detection/verification rates on seven LLMs. These rates (e.g., DeepSeek-V3.1 at 10.20% verification) are direct experimental outputs rather than quantities derived by construction from the same inputs or fitted parameters. No equations, self-citations, or uniqueness theorems are invoked to force the central claim. The comparison of agentic vs. conventional mutation (71.04% to 39.81%) is an internal ablation, but the headline inadequacy conclusion rests on the benchmark results themselves and does not reduce to a tautology. The realism assumption is a methodological limitation, not a circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that mutated solutions can serve as effective proxies for real-world bugs that test suites should catch. No free parameters or invented entities are explicitly introduced in the abstract; the framework is described as language-agnostic but its internal parameters are not detailed.

axioms (1)

Domain assumption: Systematically mutated solutions can effectively fool and thereby measure the discriminative power of LLM-generated test suites.
This assumption underpins the entire benchmark construction and the interpretation of low detection rates as evidence of LLM inadequacy.
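The assumption can be made concrete with a small scoring sketch. It uses the conventional mutation-score notion of detection (a mutant counts as detected when at least one test fails on it) and treats a suite that rejects the reference solution as detecting nothing. Both choices are assumptions for illustration: this page does not give the paper's exact definitions of its detection and verification metrics, and `detection_rate` is a hypothetical helper.

```python
def detection_rate(golden_results, mutant_results):
    """Fraction of mutants on which at least one test fails.

    golden_results: list of booleans, True where a test passes on the
        reference solution.
    mutant_results: dict mapping a mutant id to a list of booleans,
        True where a test passes on that mutant.

    ASSUMPTION: a suite that fails on the reference solution is invalid
    and is scored as detecting nothing.
    """
    if not all(golden_results) or not mutant_results:
        return 0.0
    detected = sum(1 for results in mutant_results.values() if not all(results))
    return detected / len(mutant_results)


# Hypothetical outcomes for one three-test suite run against four mutants.
golden = [True, True, True]
mutants = {
    "m1": [True, True, True],    # survives: every test still passes
    "m2": [True, False, True],   # detected by the second test
    "m3": [False, False, True],  # detected by two tests
    "m4": [True, True, True],    # survives
}

rate = detection_rate(golden, mutants)
print(f"detection rate: {rate:.2f}")
```

A rate like this measures discriminative power only to the extent that the surviving mutants are genuinely incorrect programs, which is the premise the axiom names.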

pith-pipeline@v0.9.0 · 5819 in / 1421 out tokens · 34321 ms · 2026-05-22T05:15:07.384457+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: "We introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites... agentic, language-agnostic framework for automatically generating complex mutants."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 9 internal anchors

[1]

and Li, Junnan and Hoi, Steven

Wang, Yue and Le, Hung and Gotmare, Akhilesh and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven. C ode T 5+: Open Code Large Language Models for Code Understanding and Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023

work page 2023
[2]

Advances in Neural Information Processing Systems , volume=

CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[3]

C ode T : Code Generation with Generated Tests

Chen, Bei and Zhang, Fengji and Nguyen, Anh and Zan, Daoguang and Lin, Zeqi and Lou, Jian-Guang and Chen, Weizhu. C ode T : Code Generation with Generated Tests. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

work page 2023
[4]

Not all steps are equal: Efficient Generation of Code with Large Language Models through Guided Verification

Shao, Etsuko and Wang, Yiyang. Not all steps are equal: Efficient Generation of Code with Large Language Models through Guided Verification. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023

work page 2023
[5]

Communications of the ACM , volume=

Symbolic execution for software testing: three decades later , author=. Communications of the ACM , volume=. 2013 , publisher=

work page 2013
[6]

Communications of the ACM , volume=

Symbolic execution and program testing , author=. Communications of the ACM , volume=. 1976 , publisher=

work page 1976
[7]

Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , year=

No more manual tests? Evaluating and improving chatgpt for unit test generation , author=. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , year=

work page
[8]

Self-Edit: Fault-Aware Code Editor for Code Generation

Zhang, Kechi and Li, Zhuo and Li, Jia and Li, Ge and Jin, Zhi. Self-Edit: Fault-Aware Code Editor for Code Generation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

work page 2023
[9]

The Twelfth International Conference on Learning Representations , year=

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? , author=. The Twelfth International Conference on Learning Representations , year=

work page
[10]

Is Your Code Generated by Chat

Jiawei Liu and Chunqiu Steven Xia and Yuyao Wang and Lingming Zhang , booktitle=. Is Your Code Generated by Chat

work page
[11]

Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering , pages=

Are mutants a valid substitute for real faults in software testing? , author=. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering , pages=

work page
[12]

Shunyu Yao and Jeffrey Zhao and Dian Yu and Nan Du and Izhak Shafran and Karthik Narasimhan and Yuan Cao , booktitle=. Re

work page
[13]

Cassano, Federico and Gouwar, John and Nguyen, Daniel and Nguyen, Sydney and Phipps-Costin, Luna and Pinckney, Donald and Yee, Ming-Ho and Zi, Yangtian and Anderson, Carolyn Jane and Feldman, Molly Q and Guha, Arjun and Greenberg, Michael and Jangda, Abhinav , journal=. Multi. 2023 , publisher=

work page 2023
[14]

Zheng, Qinkai and Xia, Xiao and Zou, Xu and Dong, Yuxiao and Wang, Shan and Xue, Yufei and Wang, Zihan and Shen, Lei and Wang, Andi and Li, Yang and others , booktitle=. Code

work page
[15]

C ode T : Code Generation with Generated Tests

Chen, Bei and Zhang, Fengji and Nguyen, Anh and Zan, Daoguang and Lin, Zeqi and Lou, Jian-Guang and Chen, Weizhu. C ode T : Code Generation with Generated Tests. Proceedings of the 11th International Conference on Learning Representations. 2023

work page 2023
[16]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. Advances in Neural Information Processing Systems 36. 2023

work page 2023
[18]

SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents , booktitle =

M. SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents , booktitle =. 2024 , url =

work page 2024
[19]

The Thirteenth International Conference on Learning Representations , year =

Jain, Kush and Synnaeve, Gabriel and Rozi\`ere, Baptiste , title =. The Thirteenth International Conference on Learning Representations , year =

work page
[20]

Proceedings of the 47th International Conference on Software Engineering (ICSE '25) , year =

Zhang, Quanjun and Shang, Ye and Fang, Chunrong and Gu, Siqi and Zhou, Jianyi and Chen, Zhenyu , title =. Proceedings of the 47th International Conference on Software Engineering (ICSE '25) , year =

work page
[21]

arXiv preprint arXiv:2505.05283 , year =

Wang, Kaixin and Li, Tianlin and Zhang, Xiaoyu and Wang, Chong and Sun, Weisong and Liu, Yang and Shi, Bin , title =. arXiv preprint arXiv:2505.05283 , year =

work page arXiv
[22]

and Spahar-McClure, Justin and Anderson, Carolyn Jane

Cassano, Federico and Gouwar, John and Huebner, Daniel and O'Toole, Kelley and Lee, Edward E. and Spahar-McClure, Justin and Anderson, Carolyn Jane. M ulti PL-E : A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering. 2023

work page 2023
[23]

and Liu, Kui , title =

Wang, Guancheng and Xu, Qinghua and Briand, Lionel C. and Liu, Kui , title =. Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE '25) , year =

work page
[24]

Mutation-Guided LLM-based Test Generation at Meta , booktitle =

Foster, Christopher and Gulati, Abhishek and Harman, Mark and Harper, Inna and Mao, Ke and Ritchey, Jillian and Robert, Herv. Mutation-Guided LLM-based Test Generation at Meta , booktitle =. 2025 , publisher =

work page 2025
[25]

Proceedings of the 39th International Conference on Software Engineering,

Thierry Titcheu Chekam and Mike Papadakis and Yves Le Traon and Mark Harman , title =. Proceedings of the 39th International Conference on Software Engineering,

work page
[26]

An Analysis and Survey of the Development of Mutation Testing

Yue Jia and Mark Harman. An Analysis and Survey of the Development of Mutation Testing. IEEE Transactions on Software Engineering

work page
[27]

naturalness

Matthieu Jimenez and Thierry Titcheu Chekam and Maxime Cordy and Mike Papadakis and Marinos Kintis and Yves Le Traon and Mark Harman , editor =. Are mutants really natural?: a study on how "naturalness" helps mutant selection , booktitle =. 2018 , url =. doi:10.1145/3239235.3240500 , timestamp =

work page doi:10.1145/3239235.3240500 2018
[29]

Proceedings of the 25th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM '25) , year =

Ibrahimzada, Ali Reza and Chen, Yang and Rong, Ryan and Jabbarvand, Reyhaneh , title =. Proceedings of the 25th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM '25) , year =

work page
[30]

2025 , eprint=

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving , author=. 2025 , eprint=

work page 2025
[31]

2025 , eprint=

SWE-smith: Scaling Data for Software Engineering Agents , author=. 2025 , eprint=

work page 2025
[32]

2025 , publisher =

Wang, Xingyao and Zhu, Boxuan and Odibat, Fangkai and Liu, Yian and Liu, Boyuan and Li, Zhuoer and Zhou, Shuyan and Neubig, Graham , booktitle =. 2025 , publisher =

work page 2025
[33]

Findings of the Association for Computational Linguistics: NAACL 2025 , year =

Wang, Wenhan and Yang, Chenyuan and Wang, Zhijie and Huang, Yuheng and Chu, Zhaoyang and Song, Da and Zhang, Lingming and Chen, An Ran and Ma, Lei , title =. Findings of the Association for Computational Linguistics: NAACL 2025 , year =

work page 2025
[34]

Large Language Models are Few-Shot Testers: Exploring

Kang, Sungmin and Yoon, Juyeon and Yoo, Shin , booktitle =. Large Language Models are Few-Shot Testers: Exploring. 2024 , url =

work page 2024
[35]

Yang, Zheyuan and Kuang, Zexi and Xia, Xue and Zhao, Yilun , booktitle =. Can. 2025 , publisher =

work page 2025
[37]

Siddiq, Mohammed Latif and Santos, Joanna C. S. , title =. Proceedings of the 46th International Conference on Software Engineering (ICSE '24) , year =. doi:10.1145/3597503.3639106 , url =

work page doi:10.1145/3597503.3639106
[38]

Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24) , year =

Yuan, Zhiqiang and Lou, Yiling and Liu, Mingwei and Ding, Shiji and Li, Kaixuan and Liang, Chendong and Peng, Xin , title =. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24) , year =. doi:10.1145/3691620.3695037 , url =

work page doi:10.1145/3691620.3695037
[39]

Advances in Computers , volume =

Mutation Testing Advances: An Analysis and Survey , author =. Advances in Computers , volume =. 2019 , publisher =

work page 2019
[40]

2016 , publisher =

Coles, Henry and Laurent, Thomas and Henard, Christopher and Papadakis, Mike and Ventresque, Anthony , booktitle =. 2016 , publisher =

work page 2016
[41]

Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE '11) , pages =

Ren. Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE '11) , pages =. 2011 , publisher =

work page 2011
[42]

Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '14) , pages =

Are Mutants a Valid Substitute for Real Faults in Software Testing? , author =. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '14) , pages =. 2014 , publisher =

work page 2014
[43]

2020 , publisher =

Tufano, Michele and Kimko, Jason and Wang, Shiya and Watson, Cody and Bavota, Gabriele and Di Penta, Massimiliano and Poshyvanyk, Denys , booktitle =. 2020 , publisher =

work page 2020
[44]

2022 , publisher =

Degiovanni, Renzo and Papadakis, Mike , booktitle =. 2022 , publisher =

work page 2022
[46]

2024 , url =

Liu, Tianyang and Xu, Canwen and McAuley, Julian , booktitle =. 2024 , url =

work page 2024
[47]

2023 , pages =

Ding, Yangruibo and Wang, Zijian and Ahmad, Wasi Uddin and Muralidharan, Huchao and Ramanathan, Murali Krishna and Nallapati, Ramesh and Bhatia, Parminder and Roth, Dan and Xiang, Bing , booktitle =. 2023 , pages =

work page 2023
[48]

2025 , url =

Anthropic , title =. 2025 , url =

work page 2025
[49]

2025 , url =

DeepSeek-AI , title =. 2025 , url =

work page 2025
[50]

2025 , eprint=

Kimi K2: Open Agentic Intelligence , author=. 2025 , eprint=

work page 2025
[51]

2025 , url =

OpenAI , title =. 2025 , url =

work page 2025
[52]

2025 , eprint=

gpt-oss-120b & gpt-oss-20b Model Card , author=. 2025 , eprint=

work page 2025
[53]

Introducing GLM-4.6 , year =

work page
[54]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025
[55]

2024 , url=

John Yang and Carlos E Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik R Narasimhan and Ofir Press , booktitle=. 2024 , url=

work page 2024
[56]

Benchmarking Practices in

Happe, Andreas and Cito, J. Benchmarking Practices in. 2024 IEEE Secure Development Conference (SecDev) , year =. doi:10.1109/SecDev60338.2024.00013 , url =

work page doi:10.1109/secdev60338.2024.00013 2024
[57]

Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE '24) , year =

Zhu, Hengcheng and Yang, Zhou and Wang, Kailong and Li, Li and Ren, Ziyou and Liu, Yan and Lo, David and Wang, Haoyu , title =. Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE '24) , year =

work page
[58]

2024 , url =

Tian, Yongqiang and Wu, Yuxiang and Wan, Yao and Zhang, Hongyu , journal =. 2024 , url =

work page 2024
[59]

2024 , url =

OpenAI , title =. 2024 , url =

work page 2024
[60]

Barr and Mark Harman and Phil McMinn and Muzammil Shahbaz and Shin Yoo

Earl T. Barr and Mark Harman and Phil McMinn and Muzammil Shahbaz and Shin Yoo. The Oracle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering. 2015 , month=

work page 2015
[62]

The Thirteenth International Conference on Learning Representations , year=

Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[63]

Semantic-Aligned Code Summarization: Bridging the Gap Between Code and Natural Language Through Data Flow Analysis , year=

Zhao, Yuze and Huang, Zhenya and Zhang, Kai and Gao, Weibo and Liu, Qi and Liu, Xukai and Yao, Fangzhou and Chen, Enhong , journal=. Semantic-Aligned Code Summarization: Bridging the Gap Between Code and Natural Language Through Data Flow Analysis , year=

work page
[66]

Anthropic. 2025. https://www.anthropic.com/ Ai research and products that put safety at the frontier

work page 2025
[67]

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and 1 others. 2021. https://arxiv.org/abs/2108.07732 Program synthesis with large language models . arXiv preprint arXiv:2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[68]

Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo

Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2015. The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering , 41(5):507--525

work page 2015
[69]

Cristian Cadar and Koushik Sen. 2013. Symbolic execution for software testing: three decades later. Communications of the ACM, 56(2):82--90

work page 2013
[70]

Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. 2023. Multi PL-E : a scalable and polyglot approach to benchmarking neural code generation. IEEE Trans. Softw. Eng., 49(7):3675--3691

work page 2023
[71]

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. C ode T : Code generation with generated tests. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6462--6477. Association for Computational Linguistics

work page 2023
[72]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. https://arxiv.org/abs/2107.03374 Evaluating large language models trained on code . arXiv preprint arXiv:2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[73]

Henry Coles, Thomas Laurent, Christopher Henard, Mike Papadakis, and Anthony Ventresque. 2016. https://doi.org/10.1145/2931037.2946338 PIT : A practical mutation testing tool for Java . In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA '16), pages 449--452. ACM

work page doi:10.1145/2931037.2946338 2016
[74]

DeepSeek-AI. 2025. https://api-docs.deepseek.com/news/news250821 Deepseek-v3.1 release

work page 2025
[75]

Renzo Degiovanni and Mike Papadakis. 2022. https://doi.org/10.1109/ICSTW55395.2022.00047 BERT : Mutation testing using pre-trained language models . In Proceedings of the 15th IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW '22), pages 160--169. IEEE

work page doi:10.1109/icstw55395.2022.00047 2022
[76]

Yanmin Dong, Zhenya Huang, Zheng Zhang, Guanhao Zhao, Likang Wu, Hongke Zhao, Binbin Jin, and Qi Liu. 2025. Enhancing code search intent with programming context exploration. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, WSDM '25, page 596–605, New York, NY, USA. Assoc... https://doi.org/10.1145/3701551.3703537

[77]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(8):1–79. https://doi.org/10.1145/3695988

[78]

Ali Reza Ibrahimzada, Yang Chen, Ryan Rong, and Reyhaneh Jabbarvand. 2025. Challenging bug prediction and repair models with synthetic bugs. In Proceedings of the 25th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM '25). IEEE. https://arxiv.org/abs/2310.02407

[79]

Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. 2025. TestGenEval: A real world unit test generation and test completion benchmark. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=Agqf3qX150

[80]

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations.

[81]

René Just, Darioush Jalali, Laura Inozemtseva, Michael D Ernst, Reid Holmes, and Gordon Fraser. 2014a. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 654–665.

[82]

René Just, Darioush Jalali, Laura Inozemtseva, Michael D Ernst, Reid Holmes, and Gordon Fraser. 2014b. Are mutants a valid substitute for real faults in software testing? In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '14), pages 654–665. ACM. https://doi.org/10.1145/2635868.2635929

[83]

René Just, Franz Schweiggert, and Gregory M. Kapfhammer. 2011. MAJOR: An efficient and extensible tool for mutation analysis in a Java compiler. In Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE '11), pages 612–615. ACM. https://doi.org/10.1109/ASE.2011.6100138

[84]

Kimi Team, Yifan Bai, and 1 others. 2025. Kimi K2: Open agentic intelligence. Preprint, arXiv:2507.20534. https://arxiv.org/abs/2507.20534

[85]

James C King. 1976. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394.

[86]

Hung Le, Yue Wang, Akhilesh D Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In Advances in Neural Information Processing Systems, volume 35, pages 21314–21328.

[87]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, volume 36.
