arxiv: 2602.07900 · v2 · submitted 2026-02-08 · 💻 cs.SE · cs.AI

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Pith reviewed 2026-05-16 06:33 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code agents · agent-generated tests · SWE-bench · prompt intervention · repository-level repair · issue resolution · software engineering agents

The pith

Agent-written tests do not meaningfully improve LLM code agents' success at resolving repository issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether tests that LLM-based software agents generate during their workflows actually help fix bugs in large codebases. Analysis of trajectories from six models on SWE-bench Verified shows that resolved and unresolved tasks have nearly identical rates of test writing. When tests appear, they function mostly as print statements for observation rather than assertions for verification. Prompt edits that deliberately raise or lower the volume of such tests produce no reliable change in final resolution rates. The work therefore concludes that current test-writing behavior mainly alters process steps and token costs without moving task outcomes.
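The distinction between observational print statements and assertion-based checks can be made concrete with a small static count. The sketch below is an illustration only, not the paper's classifier: the function name `count_feedback_signals`, the counting rules, and the sample test are all assumptions introduced here.

```python
import ast


def count_feedback_signals(source: str) -> dict:
    """Count print() calls and assertion-style checks in Python source.

    Assumed rules (not taken from the paper): a bare `assert` statement or
    a method call whose name starts with "assert" (unittest style) counts
    as verification; a call to the built-in name `print` counts as
    observation.
    """
    counts = {"print": 0, "assert": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            counts["assert"] += 1
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == "print":
                counts["print"] += 1
            elif isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                counts["assert"] += 1
    return counts


# Hypothetical agent-written test; it is only parsed, never executed.
EXAMPLE = '''
def test_parse():
    result = parse("1+2")
    print("result:", result)
    print(type(result))
    assert result is not None
'''

signals = count_feedback_signals(EXAMPLE)
print(signals)
```

A purely syntactic count like this cannot tell whether a print statement was actually read by the agent or whether an assertion is meaningful, so it should be treated as a rough proxy.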

Core claim

In LLM agents that iteratively edit code and validate patches on repository-level tasks, the frequency of on-the-fly test generation is statistically similar between resolved and unresolved issues; prompt interventions that increase or decrease test volume likewise leave patch success rates statistically unchanged across four models.

What carries the argument

Prompt-intervention experiments that revise agent instructions to increase or reduce test writing and then compare patch resolution rates on SWE-bench Verified trajectories.

If this is right

  • Test writing in these agents serves chiefly as an observational feedback channel rather than a verification mechanism.
  • Agents that write fewer tests can complete the same tasks at lower interaction cost.
  • Benchmark results that reward test coverage may overstate the practical value of agent-generated tests.
  • Future agent designs can safely de-emphasize automatic test generation without losing resolution performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The finding suggests that exploration and verification phases in agent workflows can be decoupled without performance loss.
  • Similar patterns may appear in other iterative agent settings where intermediate outputs are cheap to generate but expensive to validate.
  • Agent training or prompting that targets outcome signals directly, rather than process mimicry, could yield efficiency gains.

Load-bearing premise

The chosen prompt edits and benchmark tasks isolate the causal effect of test writing from other changes in how the models explore or edit the code.

What would settle it

A direct comparison in which the same models solve the identical SWE-bench Verified issues once with test-writing tools available and once with those tools disabled, measuring whether resolution rates differ.
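Because such a comparison runs the same issues under both conditions, the outcomes are paired, and one reasonable analysis is an exact McNemar test on the issues whose outcome flips. The sketch below assumes one boolean outcome per issue and condition; the function name and the toy data are hypothetical, and the paper does not state that it uses this test.

```python
from math import comb


def mcnemar_exact(with_tools: list[bool], without_tools: list[bool]) -> tuple[int, int, float]:
    """Exact two-sided McNemar test on paired resolve/fail outcomes.

    Returns (b, c, p_value), where b counts issues resolved only with
    test-writing tools and c counts issues resolved only without them.
    """
    if len(with_tools) != len(without_tools):
        raise ValueError("outcome lists must cover the same issues")
    b = sum(1 for w, wo in zip(with_tools, without_tools) if w and not wo)
    c = sum(1 for w, wo in zip(with_tools, without_tools) if wo and not w)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes for six issues (True = resolved).
with_tools = [True, True, False, True, False, False]
without_tools = [True, False, True, True, False, False]
b, c, p_value = mcnemar_exact(with_tools, without_tools)
print(b, c, p_value)
```

Only discordant pairs carry information here: issues that are resolved (or unresolved) in both conditions do not affect the p-value.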

Figures

Figures reproduced from arXiv: 2602.07900 by Chao Peng, David Lo, Lingxiao Jiang, Xiaodong Gu, Yuling Shi, Zhensu Sun, Zhi Chen.

Figure 1: Overview of the study design. RQ1 examines testing behaviors, RQ2 analyzes feedback signals in agent-written tests, ...
Figure 2: Composition of feedback signals in agent-written ...
Figure 3: Outcome-transition distribution on tasks with an ...

Original abstract

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, but the value of this behavior remains unclear. For example, GPT-5.2 writes almost no new tests yet achieves performance comparable to top-ranking agents. This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget? To better understand the role of agent-written tests, we analyze trajectories produced by six strong LLMs on SWE-bench Verified. Our results show that test writing is common, but resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. When tests are written, they mainly serve as observational feedback channels, with value-revealing print statements appearing much more often than assertion-based checks. Based on these insights, we perform a prompt-intervention study by revising the prompts used with four models to either increase or reduce test writing. The results suggest that prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes in this setting. Taken together, these results suggest that current agent-written testing practices reshape process and cost more than final task outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes trajectories from six LLMs on SWE-bench Verified to assess the role of agent-generated tests in resolving repository-level issues. It observes that test writing is frequent but similar between resolved and unresolved tasks, with tests primarily used for observational feedback via print statements rather than assertions. A prompt-intervention experiment with four models, modifying instructions to increase or decrease test writing, finds no significant change in final outcomes, leading to the conclusion that such tests mainly impact process and cost rather than success rates.

Significance. If the findings hold after addressing isolation concerns, this would challenge the assumption that on-the-fly test generation is a key driver of success for LLM code agents on repository tasks. It offers empirical trajectory analysis and controlled interventions on an established benchmark, highlighting potential efficiency gains by de-emphasizing test writing in agent designs.

major comments (2)
  1. [§4] §4 (prompt-intervention study): The claim that prompt revisions successfully modulate only test-writing volume (and thus isolate its causal effect on outcomes) is load-bearing for the null result. The manuscript must report secondary metrics such as tool-call counts, reasoning token usage, and edit sizes before/after intervention to demonstrate that other behaviors remain stable; absent this, compensatory shifts could mask or mimic the effect of test volume changes.
  2. [§3] §3 (trajectory analysis): The observation that resolved and unresolved tasks exhibit similar test-writing frequencies lacks reported sample sizes per model, statistical tests (e.g., p-values or effect sizes), and exclusion criteria for trajectories. This detail is needed to support the claim that test writing does not correlate with resolution success within models.
minor comments (2)
  1. [Methods] Clarify the exact definition and counting method for 'agent-written tests' (e.g., whether it includes print statements, assertions, or all new test functions) in the methods section to improve reproducibility.
  2. [Abstract] The abstract references 'GPT-5.2'; confirm in the text whether this is a specific variant or a placeholder for a current model like GPT-4o.
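The secondary-metric check requested in major comment 1 amounts to a per-condition summary of each trajectory's tool calls, reasoning tokens, and edit size, compared against the baseline prompt. The sketch below shows one way to compute it. The record schema, the condition names, and every number are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-trajectory records; field names and values are assumptions.
TRAJECTORIES = [
    {"condition": "baseline", "tool_calls": 40, "reasoning_tokens": 12000, "lines_changed": 18},
    {"condition": "baseline", "tool_calls": 50, "reasoning_tokens": 14000, "lines_changed": 22},
    {"condition": "more_tests", "tool_calls": 60, "reasoning_tokens": 15000, "lines_changed": 20},
    {"condition": "more_tests", "tool_calls": 70, "reasoning_tokens": 17000, "lines_changed": 24},
    {"condition": "fewer_tests", "tool_calls": 30, "reasoning_tokens": 11000, "lines_changed": 19},
    {"condition": "fewer_tests", "tool_calls": 34, "reasoning_tokens": 13000, "lines_changed": 21},
]
METRICS = ("tool_calls", "reasoning_tokens", "lines_changed")


def summarize(trajectories):
    """Return the mean of each secondary metric, grouped by prompt condition."""
    grouped = defaultdict(list)
    for record in trajectories:
        grouped[record["condition"]].append(record)
    return {
        cond: {m: mean(r[m] for r in rows) for m in METRICS}
        for cond, rows in grouped.items()
    }


def deltas_vs_baseline(summary, baseline="baseline"):
    """Return each condition's shift in mean metric relative to the baseline."""
    base = summary[baseline]
    return {
        cond: {m: vals[m] - base[m] for m in METRICS}
        for cond, vals in summary.items()
        if cond != baseline
    }


summary = summarize(TRAJECTORIES)
deltas = deltas_vs_baseline(summary)
print(deltas)
```

Large shifts in these deltas would indicate that a prompt edit changed more than test-writing volume, which is the confound the comment asks the authors to rule out.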

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our analysis of agent-generated tests in LLM-based software engineering agents. The comments help clarify the evidentiary requirements for our claims about test-writing frequency and the prompt-intervention results. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [§4] §4 (prompt-intervention study): The claim that prompt revisions successfully modulate only test-writing volume (and thus isolate its causal effect on outcomes) is load-bearing for the null result. The manuscript must report secondary metrics such as tool-call counts, reasoning token usage, and edit sizes before/after intervention to demonstrate that other behaviors remain stable; absent this, compensatory shifts could mask or mimic the effect of test volume changes.

    Authors: We agree that confirming the intervention primarily affected test-writing volume, without major compensatory changes in other behaviors, is necessary to support the causal interpretation of the null result. In the revised manuscript we will add a new table in §4 reporting secondary metrics (average tool-call counts, reasoning token usage, and edit sizes in lines changed) for the baseline versus increase-test and decrease-test conditions across the four models. These metrics will be computed from the same trajectories to allow direct before/after comparison.
    Revision: yes

  2. Referee: [§3] §3 (trajectory analysis): The observation that resolved and unresolved tasks exhibit similar test-writing frequencies lacks reported sample sizes per model, statistical tests (e.g., p-values or effect sizes), and exclusion criteria for trajectories. This detail is needed to support the claim that test writing does not correlate with resolution success within models.

    Authors: We acknowledge that the trajectory analysis would benefit from explicit statistical details. The revised §3 will report per-model sample sizes (all 500 SWE-bench Verified tasks were attempted by each of the six models), include statistical tests comparing test-writing frequencies between resolved and unresolved tasks within each model (e.g., chi-squared tests for proportions with p-values and effect sizes), and state that no trajectories were excluded beyond those that failed to produce a valid final patch due to runtime or parsing errors.
    Revision: yes
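The chi-squared comparison mentioned in response 2 can be sketched for a single model as a 2x2 table of resolution outcome against whether any test was written. The version below uses only the standard library, applies no continuity correction, and reports the phi coefficient as the effect size. The function name and the counts are hypothetical, chosen only so that they sum to 500 tasks.

```python
from math import erfc, sqrt


def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float, float]:
    """Pearson chi-squared test for a 2x2 table, without continuity correction.

    Assumed layout: a = resolved with tests, b = resolved without tests,
    c = unresolved with tests, d = unresolved without tests.
    Returns (chi2, p_value, phi).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    phi = sqrt(chi2 / n)
    return chi2, p_value, phi


# Hypothetical counts for one model: 90% of resolved and 90% of unresolved
# tasks contain at least one agent-written test.
chi2, p_value, phi = chi2_2x2(270, 30, 180, 20)
print(chi2, p_value, phi)
```

With identical test-writing rates in both groups the statistic is zero and the p-value is one, which is the pattern the "similar frequencies" claim predicts. A non-significant result alone does not establish equivalence, so the effect size should be reported alongside the p-value.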

Circularity Check

0 steps flagged

No circularity: purely empirical observations and interventions

full rationale

The paper conducts direct empirical analysis of agent trajectories on SWE-bench Verified and controlled prompt interventions to modulate test-writing volume. No equations, fitted parameters, derivations, or self-citations are used to support core claims; results follow from observed frequencies, comparisons between resolved/unresolved tasks, and outcome deltas after prompt changes. All steps are externally verifiable via the benchmark and reported metrics without reducing to self-definition or input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of SWE-bench Verified for real repository-level tasks and the assumption that prompt modifications cleanly vary only test-writing behavior.

axioms (1)
  • Domain assumption: SWE-bench Verified tasks are representative of repository-level code repair challenges faced by LLM agents.
    All trajectory analysis and intervention results are derived exclusively from performance on this benchmark.

pith-pipeline@v0.9.0 · 5538 in / 1154 out tokens · 52726 ms · 2026-05-16T06:33:24.771386+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures

    cs.SE 2026-04 accept novelty 7.0

    Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.

  2. Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

    cs.SE 2026-04 accept novelty 7.0

    Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
