Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Pith reviewed 2026-05-16 06:33 UTC · model grok-4.3
The pith
Agent-written tests do not meaningfully improve LLM code agents' success at resolving repository issues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In LLM agents that iteratively edit code and validate patches on repository-level tasks, the frequency of on-the-fly test generation is statistically similar between resolved and unresolved issues; prompt interventions that increase or decrease test volume likewise leave patch success rates statistically unchanged across four models.
What carries the argument
Prompt-intervention experiments that revise agent instructions to increase or reduce test writing and then compare patch resolution rates on SWE-bench Verified trajectories.
If this is right
- Test writing in these agents serves chiefly as an observational feedback channel rather than a verification mechanism.
- Agents that write fewer tests can complete the same tasks at lower interaction cost.
- Benchmark results that reward test coverage may overstate the practical value of agent-generated tests.
- Future agent designs can safely de-emphasize automatic test generation without losing resolution performance.
Where Pith is reading between the lines
- The finding suggests that exploration and verification phases in agent workflows can be decoupled without performance loss.
- Similar patterns may appear in other iterative agent settings where intermediate outputs are cheap to generate but expensive to validate.
- Agent training or prompting that targets outcome signals directly, rather than process mimicry, could yield efficiency gains.
Load-bearing premise
The chosen prompt edits and benchmark tasks isolate the causal effect of test writing from other changes in how the models explore or edit the code.
What would settle it
A direct comparison in which the same models solve the identical SWE-bench Verified issues once with test-writing tools available and once with those tools disabled, measuring whether resolution rates differ.
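A paired design like this could be analyzed with an exact McNemar test on the issues whose outcome differs between the two runs. The sketch below is illustrative only: the function name `mcnemar_exact` and the counts are hypothetical and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(only_with_tests: int, only_without_tests: int) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes.

    only_with_tests: issues resolved only when test-writing tools were enabled.
    only_without_tests: issues resolved only when they were disabled.
    Issues with the same outcome in both runs carry no information here.
    """
    n = only_with_tests + only_without_tests
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_with_tests, only_without_tests)
    # Under the null, each discordant pair favors either condition with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration, not results from the paper.
p_value = mcnemar_exact(only_with_tests=18, only_without_tests=15)
print(f"exact McNemar p-value: {p_value:.3f}")
```

Because each issue is run under both conditions, only the discordant issues enter the test, which is why a paired comparison can be more sensitive than comparing two overall resolution rates.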
Original abstract
Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, but the value of this behavior remains unclear. For example, GPT-5.2 writes almost no new tests yet achieves performance comparable to top-ranking agents. This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget? To better understand the role of agent-written tests, we analyze trajectories produced by six strong LLMs on SWE-bench Verified. Our results show that test writing is common, but resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. When tests are written, they mainly serve as observational feedback channels, with value-revealing print statements appearing much more often than assertion-based checks. Based on these insights, we perform a prompt-intervention study by revising the prompts used with four models to either increase or reduce test writing. The results suggest that prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes in this setting. Taken together, these results suggest that current agent-written testing practices reshape process and cost more than final task outcomes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes trajectories from six LLMs on SWE-bench Verified to assess the role of agent-generated tests in resolving repository-level issues. It observes that test writing is frequent but similar between resolved and unresolved tasks, with tests primarily used for observational feedback via print statements rather than assertions. A prompt-intervention experiment with four models, modifying instructions to increase or decrease test writing, finds no significant change in final outcomes, leading to the conclusion that such tests mainly impact process and cost rather than success rates.
Significance. If the findings hold after addressing isolation concerns, this would challenge the assumption that on-the-fly test generation is a key driver of success for LLM code agents on repository tasks. It offers empirical trajectory analysis and controlled interventions on an established benchmark, highlighting potential efficiency gains by de-emphasizing test writing in agent designs.
major comments (2)
- §4 (prompt-intervention study): The claim that prompt revisions successfully modulate only test-writing volume (and thus isolate its causal effect on outcomes) is load-bearing for the null result. The manuscript must report secondary metrics such as tool-call counts, reasoning token usage, and edit sizes before/after intervention to demonstrate that other behaviors remain stable; absent this, compensatory shifts could mask or mimic the effect of test volume changes.
- §3 (trajectory analysis): The observation that resolved and unresolved tasks exhibit similar test-writing frequencies lacks reported sample sizes per model, statistical tests (e.g., p-values or effect sizes), and exclusion criteria for trajectories. This detail is needed to support the claim that test writing does not correlate with resolution success within models.
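The stability check requested for §4 could be organized as a per-condition summary of secondary metrics. The following is a minimal sketch under stated assumptions: the record fields (`condition`, `tool_calls`, `reasoning_tokens`, `lines_changed`), the condition names, and every value are hypothetical placeholders, not the paper's schema or data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-trajectory records; field names and values are placeholders.
trajectories = [
    {"condition": "baseline", "tool_calls": 42, "reasoning_tokens": 9100, "lines_changed": 14},
    {"condition": "baseline", "tool_calls": 38, "reasoning_tokens": 8700, "lines_changed": 9},
    {"condition": "more_tests", "tool_calls": 51, "reasoning_tokens": 9900, "lines_changed": 12},
    {"condition": "more_tests", "tool_calls": 47, "reasoning_tokens": 9400, "lines_changed": 11},
    {"condition": "fewer_tests", "tool_calls": 33, "reasoning_tokens": 8800, "lines_changed": 13},
    {"condition": "fewer_tests", "tool_calls": 35, "reasoning_tokens": 8600, "lines_changed": 10},
]

METRICS = ("tool_calls", "reasoning_tokens", "lines_changed")


def summarize(records):
    """Return {condition: {metric: mean value}} for the secondary metrics."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["condition"]].append(record)
    return {
        condition: {m: mean(r[m] for r in rows) for m in METRICS}
        for condition, rows in grouped.items()
    }


summary = summarize(trajectories)
for condition, stats in summary.items():
    print(condition, stats)
```

If a prompt revision shifted these means substantially relative to the baseline, that would suggest the intervention changed more than test-writing volume.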
minor comments (2)
- [Methods] Clarify the exact definition and counting method for 'agent-written tests' (e.g., whether it includes print statements, assertions, or all new test functions) in the methods section to improve reproducibility.
- [Abstract] The abstract references 'GPT-5.2'; confirm in the text whether this is a specific variant or a placeholder for a current model like GPT-4o.
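One possible way to make the counting method for agent-written tests explicit is a small syntax-tree pass that separates print calls from assertion-style checks. The sketch below assumes the agent-written files are Python and treats bare `assert` statements and `assert*` method calls as assertions; the function name `count_feedback_signals` and the sample script are hypothetical, and the paper's actual counting rules may differ.

```python
import ast


def count_feedback_signals(source: str) -> dict:
    """Count print calls versus assertion-style checks in Python source."""
    counts = {"prints": 0, "assertions": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            counts["assertions"] += 1  # bare `assert` statement
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == "print":
                counts["prints"] += 1
            elif isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                counts["assertions"] += 1  # e.g. unittest's self.assertEqual(...)
    return counts


# Hypothetical agent-written script, for illustration only.
example = '''
result = parse("a=1")
print(result)
print(type(result))
assert result is not None
'''
print(count_feedback_signals(example))
```

Parsing the source rather than matching text avoids counting the word "print" inside strings or comments, though it still misses logging calls and other feedback channels.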
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our analysis of agent-generated tests in LLM-based software engineering agents. The comments help clarify the evidentiary requirements for our claims about test-writing frequency and the prompt-intervention results. We address each major comment below and indicate the revisions we will make.
Point-by-point responses
- Referee: §4 (prompt-intervention study): The claim that prompt revisions successfully modulate only test-writing volume (and thus isolate its causal effect on outcomes) is load-bearing for the null result. The manuscript must report secondary metrics such as tool-call counts, reasoning token usage, and edit sizes before/after intervention to demonstrate that other behaviors remain stable; absent this, compensatory shifts could mask or mimic the effect of test volume changes.
  Authors: We agree that confirming the intervention primarily affected test-writing volume, without major compensatory changes in other behaviors, is necessary to support the causal interpretation of the null result. In the revised manuscript we will add a new table in §4 reporting secondary metrics (average tool-call counts, reasoning token usage, and edit sizes in lines changed) for the baseline versus increase-test and decrease-test conditions across the four models. These metrics will be computed from the same trajectories to allow direct before/after comparison.
  Revision: yes
- Referee: §3 (trajectory analysis): The observation that resolved and unresolved tasks exhibit similar test-writing frequencies lacks reported sample sizes per model, statistical tests (e.g., p-values or effect sizes), and exclusion criteria for trajectories. This detail is needed to support the claim that test writing does not correlate with resolution success within models.
  Authors: We acknowledge that the trajectory analysis would benefit from explicit statistical details. The revised §3 will report per-model sample sizes (all 500 SWE-bench Verified tasks were attempted by each of the six models), include statistical tests comparing test-writing frequencies between resolved and unresolved tasks within each model (e.g., chi-squared tests for proportions with p-values and effect sizes), and state that no trajectories were excluded beyond those that failed to produce a valid final patch due to runtime or parsing errors.
  Revision: yes
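The chi-squared comparison mentioned in the rebuttal can be sketched for a single model as a 2x2 table of resolution outcome against whether the trajectory contained agent-written tests. This is a minimal illustration using only the standard library; the function name `chi_squared_2x2` and the counts are hypothetical (chosen to sum to 500 tasks) and are not the paper's data.

```python
from math import erfc, sqrt


def chi_squared_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Rows: resolved / unresolved. Columns: wrote tests / wrote no tests.
    Returns (chi2, p_value, phi), where phi is the effect size.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(chi2 / 2))  # survival function of chi-squared, 1 d.o.f.
    phi = sqrt(chi2 / n)
    return chi2, p_value, phi


# Hypothetical counts for one model (sum to 500 tasks); not the paper's data.
chi2, p_value, phi = chi_squared_2x2(a=280, b=70, c=118, d=32)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, phi={phi:.3f}")
```

Reporting phi alongside the p-value matters here because, with 500 tasks per model, a non-significant result is more convincing when the effect size is also close to zero.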
Circularity Check
No circularity: purely empirical observations and interventions
The paper conducts direct empirical analysis of agent trajectories on SWE-bench Verified and controlled prompt interventions to modulate test-writing volume. No equations, fitted parameters, derivations, or self-citations are used to support core claims; results follow from observed frequencies, comparisons between resolved/unresolved tasks, and outcome deltas after prompt changes. All steps are externally verifiable via the benchmark and reported metrics without reducing to self-definition or input renaming.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: SWE-bench Verified tasks are representative of repository-level code repair challenges faced by LLM agents.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear?)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: embed_injective (unclear?)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "value-revealing prints consistently outnumber assertions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...