Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Chao Peng; David Lo; Lingxiao Jiang; Xiaodong Gu; Yuling Shi; Zhensu Sun; Zhi Chen

arxiv: 2602.07900 · v2 · submitted 2026-02-08 · 💻 cs.SE · cs.AI

Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu, David Lo, Lingxiao Jiang

Pith reviewed 2026-05-16 06:33 UTC · model grok-4.3

keywords: LLM code agents · agent-generated tests · SWE-bench · prompt intervention · repository-level repair · issue resolution · software engineering agents

The pith

Agent-written tests do not meaningfully improve LLM code agents' success at resolving repository issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether tests that LLM-based software agents generate during their workflows actually help fix bugs in large codebases. Analysis of trajectories from six models on SWE-bench Verified shows that resolved and unresolved tasks have nearly identical rates of test writing. When tests appear, they function mostly as print statements for observation rather than assertions for verification. Prompt edits that deliberately raise or lower the volume of such tests produce no reliable change in final resolution rates. The work therefore concludes that current test-writing behavior mainly alters process steps and token costs without moving task outcomes.
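
The distinction between observational prints and verifying assertions can be made concrete with a small counting sketch. This is only an illustration under stated assumptions: the paper's actual counting method is not given here, `count_feedback_signals` is a hypothetical helper, and the sample script (including the `mypkg.parse` import) is invented for the example.

```python
import ast


def count_feedback_signals(source: str) -> dict:
    """Count print calls and assertion-style checks in Python source.

    Assumption: an "assertion" is either an `assert` statement or a method
    call whose name starts with "assert" (e.g. unittest's assertEqual).
    """
    counts = {"prints": 0, "assertions": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assert):
            counts["assertions"] += 1
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == "print":
                counts["prints"] += 1
            elif isinstance(func, ast.Attribute) and func.attr.startswith("assert"):
                counts["assertions"] += 1
    return counts


# Hypothetical agent-written script: two observational prints, one check.
# The source is only parsed, never executed, so the import need not exist.
EXAMPLE = '''
from mypkg import parse
result = parse("a=1")
print("result:", result)
print(type(result))
assert result == {"a": "1"}
'''

print(count_feedback_signals(EXAMPLE))
```

A script dominated by `print` calls reports values for the agent to read, while an `assert` fails loudly when an expectation is violated; the sketch only tallies the two and says nothing about whether either helps resolution.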

Core claim

In LLM agents that iteratively edit code and validate patches on repository-level tasks, the frequency of on-the-fly test generation is statistically similar between resolved and unresolved issues; prompt interventions that increase or decrease test volume likewise leave patch success rates statistically unchanged across four models.

What carries the argument

Prompt-intervention experiments that revise agent instructions to increase or reduce test writing and then compare patch resolution rates on SWE-bench Verified trajectories.

If this is right

Test writing in these agents serves chiefly as an observational feedback channel rather than a verification mechanism.
Agents that write fewer tests can complete the same tasks at lower interaction cost.
Benchmark results that reward test coverage may overstate the practical value of agent-generated tests.
Future agent designs can safely de-emphasize automatic test generation without losing resolution performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finding suggests that exploration and verification phases in agent workflows can be decoupled without performance loss.
Similar patterns may appear in other iterative agent settings where intermediate outputs are cheap to generate but expensive to validate.
Agent training or prompting that targets outcome signals directly, rather than process mimicry, could yield efficiency gains.

Load-bearing premise

The chosen prompt edits and benchmark tasks isolate the causal effect of test writing from other changes in how the models explore or edit the code.

What would settle it

A direct comparison in which the same models solve the identical SWE-bench Verified issues once with test-writing tools available and once with those tools disabled, measuring whether resolution rates differ.
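
Because the comparison described above runs the same issues under two conditions, the outcomes are paired, and one common way to analyze paired binary outcomes is an exact McNemar test on the discordant pairs. The sketch below is an assumption about how such an analysis could be done, not the paper's procedure; the function name and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(only_with: int, only_without: int) -> float:
    """Two-sided exact McNemar p-value.

    only_with:    issues resolved only when test writing was available
    only_without: issues resolved only when test writing was disabled
    Issues with the same outcome in both runs carry no information here.
    """
    n = only_with + only_without
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_with, only_without)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(only_with=12, only_without=9)
print(f"p = {p_value:.3f}")
```

A large p-value would be consistent with the null result reported in the paper, although failing to reject is not proof that the two conditions are equivalent.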

Figures

Figures reproduced from arXiv: 2602.07900 by Chao Peng, David Lo, Lingxiao Jiang, Xiaodong Gu, Yuling Shi, Zhensu Sun, Zhi Chen.

**Figure 1.** Overview of the study design. RQ1 examines testing behaviors, RQ2 analyzes feedback signals in agent-written tests, ...

**Figure 2.** Composition of feedback signals in agent-written ...

**Figure 3.** Outcome-transition distribution on tasks with an ...

Original abstract

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, but the value of this behavior remains unclear. For example, GPT-5.2 writes almost no new tests yet achieves performance comparable to top-ranking agents. This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget? To better understand the role of agent-written tests, we analyze trajectories produced by six strong LLMs on SWE-bench Verified. Our results show that test writing is common, but resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. When tests are written, they mainly serve as observational feedback channels, with value-revealing print statements appearing much more often than assertion-based checks. Based on these insights, we perform a prompt-intervention study by revising the prompts used with four models to either increase or reduce test writing. The results suggest that prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes in this setting. Taken together, these results suggest that current agent-written testing practices reshape process and cost more than final task outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper finds test-writing frequency similar for resolved and unresolved tasks, with prompt changes to test volume showing no clear effect on outcomes.

The main thing to know is that this paper reports test writing by agents happens at similar rates whether the task gets resolved or not, and that prompt tweaks to push agents toward more or fewer tests do not shift final resolution rates on SWE-bench Verified. They also note that the tests agents do write are mostly print statements for observation rather than assertions. That is the core empirical observation they add.

The trajectory analysis across six models and the controlled prompt interventions on four of them are the parts that feel new and worth looking at. They give concrete numbers on how often agents insert tests and what those tests actually look like in practice. That kind of breakdown is useful for anyone tuning agent workflows.

The soft spot is the one the stress-test note flags: changing the prompt to alter test volume almost certainly changes other parts of the agent's behavior at the same time, such as how much it explores or what it prioritizes in its edits. The paper would be stronger if it showed that tool-call counts, reasoning length, or patch sizes stayed stable across the intervention conditions. Without those checks, the null result on outcomes is harder to interpret cleanly. The benchmark choice and multi-model setup are reasonable, but the causal isolation of test volume is the weakest link.

This is for researchers building or evaluating LLM agents for repository-level repair tasks. It supplies fresh trajectory data on a practical question even if the intervention design needs tightening. I would send it to peer review so referees can press on the secondary metrics and statistical details.

Referee Report

2 major / 2 minor

Summary. The paper analyzes trajectories from six LLMs on SWE-bench Verified to assess the role of agent-generated tests in resolving repository-level issues. It observes that test writing is frequent but similar between resolved and unresolved tasks, with tests primarily used for observational feedback via print statements rather than assertions. A prompt-intervention experiment with four models, modifying instructions to increase or decrease test writing, finds no significant change in final outcomes, leading to the conclusion that such tests mainly impact process and cost rather than success rates.

Significance. If the findings hold after addressing isolation concerns, this would challenge the assumption that on-the-fly test generation is a key driver of success for LLM code agents on repository tasks. It offers empirical trajectory analysis and controlled interventions on an established benchmark, highlighting potential efficiency gains by de-emphasizing test writing in agent designs.

major comments (2)

[§4] §4 (prompt-intervention study): The claim that prompt revisions successfully modulate only test-writing volume (and thus isolate its causal effect on outcomes) is load-bearing for the null result. The manuscript must report secondary metrics such as tool-call counts, reasoning token usage, and edit sizes before/after intervention to demonstrate that other behaviors remain stable; absent this, compensatory shifts could mask or mimic the effect of test volume changes.
[§3] §3 (trajectory analysis): The observation that resolved and unresolved tasks exhibit similar test-writing frequencies lacks reported sample sizes per model, statistical tests (e.g., p-values or effect sizes), and exclusion criteria for trajectories. This detail is needed to support the claim that test writing does not correlate with resolution success within models.

minor comments (2)

[Methods] Clarify the exact definition and counting method for 'agent-written tests' (e.g., whether it includes print statements, assertions, or all new test functions) in the methods section to improve reproducibility.
[Abstract] The abstract references 'GPT-5.2'; confirm in the text whether this is a specific variant or a placeholder for a current model like GPT-4o.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our analysis of agent-generated tests in LLM-based software engineering agents. The comments help clarify the evidentiary requirements for our claims about test-writing frequency and the prompt-intervention results. We address each major comment below and indicate the revisions we will make.

Referee: [§4] §4 (prompt-intervention study): The claim that prompt revisions successfully modulate only test-writing volume (and thus isolate its causal effect on outcomes) is load-bearing for the null result. The manuscript must report secondary metrics such as tool-call counts, reasoning token usage, and edit sizes before/after intervention to demonstrate that other behaviors remain stable; absent this, compensatory shifts could mask or mimic the effect of test volume changes.

Authors: We agree that confirming the intervention primarily affected test-writing volume, without major compensatory changes in other behaviors, is necessary to support the causal interpretation of the null result. In the revised manuscript we will add a new table in §4 reporting secondary metrics (average tool-call counts, reasoning token usage, and edit sizes in lines changed) for the baseline versus increase-test and decrease-test conditions across the four models. These metrics will be computed from the same trajectories to allow direct before/after comparison.
Revision: yes
Referee: [§3] §3 (trajectory analysis): The observation that resolved and unresolved tasks exhibit similar test-writing frequencies lacks reported sample sizes per model, statistical tests (e.g., p-values or effect sizes), and exclusion criteria for trajectories. This detail is needed to support the claim that test writing does not correlate with resolution success within models.

Authors: We acknowledge that the trajectory analysis would benefit from explicit statistical details. The revised §3 will report per-model sample sizes (all 500 SWE-bench Verified tasks were attempted by each of the six models), include statistical tests comparing test-writing frequencies between resolved and unresolved tasks within each model (e.g., chi-squared tests for proportions with p-values and effect sizes), and state that no trajectories were excluded beyond those that failed to produce a valid final patch due to runtime or parsing errors.
Revision: yes
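
The chi-squared comparison mentioned in the response above can be sketched for a single model as a 2x2 table of resolution outcome against whether any test was written. This is a minimal illustration, assuming a Pearson test without continuity correction and the phi coefficient as the effect size; the function name is hypothetical and the counts are invented placeholders that merely sum to 500, not results from the paper.

```python
from math import erfc, sqrt


def chi2_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-squared test for a 2x2 table (no continuity correction).

    a, b: resolved tasks with / without agent-written tests
    c, d: unresolved tasks with / without agent-written tests
    Returns (chi2 statistic, p-value, phi effect size).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one task")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    phi = sqrt(chi2 / n)
    return chi2, p_value, phi


# Placeholder counts for one model over 500 tasks (NOT from the paper).
chi2, p_value, phi = chi2_2x2(a=280, b=70, c=118, d=32)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, phi = {phi:.3f}")
```

Reporting phi alongside the p-value matters here because, with 500 tasks per model, a small p-value alone would not show whether a difference in test-writing frequency is large enough to be meaningful.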

Circularity Check

0 steps flagged

No circularity: purely empirical observations and interventions

The paper conducts direct empirical analysis of agent trajectories on SWE-bench Verified and controlled prompt interventions to modulate test-writing volume. No equations, fitted parameters, derivations, or self-citations are used to support core claims; results follow from observed frequencies, comparisons between resolved/unresolved tasks, and outcome deltas after prompt changes. All steps are externally verifiable via the benchmark and reported metrics without reducing to self-definition or input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of SWE-bench Verified for real repository-level tasks and the assumption that prompt modifications cleanly vary only test-writing behavior.

axioms (1)

Domain assumption: SWE-bench Verified tasks are representative of repository-level code repair challenges faced by LLM agents.
All trajectory analysis and intervention results are derived exclusively from performance on this benchmark.

pith-pipeline@v0.9.0 · 5538 in / 1154 out tokens · 52726 ms · 2026-05-16T06:33:24.771386+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "prompt-induced changes in the volume of agent-written tests do not significantly change final outcomes"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "value-revealing prints consistently outnumber assertions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
cs.SE 2026-04 accept novelty 7.0

Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
cs.SE 2026-04 accept novelty 7.0

Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...


[1] [1]

Anthropic. 2025. Introducing Claude Opus 4.5. Anthropic Newsroom. https: //www.anthropic.com/news/claude-opus-4-5 Model announcement. See also the Claude Opus 4.5 system card page: https://www.anthropic.com/claude-opus-4-5- system-card

work page 2025

[2] [2]

Anthropic. 2026. Agent Skills. https://platform.claude.com/docs/en/agents-and- tools/agent-skills/overview. Claude API Docs. Accessed: 2026-01-30

work page 2026

[3] [3]

Anthropic. 2026. Create custom subagents. https://code.claude.com/docs/en/sub- agents. Claude Code Docs. Accessed: 2026-03-26

work page 2026

[4] [4]

Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, and Pankaj Jalote. 2024. Unit test generation using generative AI: A comparative performance analysis of autogeneration tools. InProceedings of the 1st International Workshop on Large Language Models for Code. 54–61

work page 2024

[5]

Islem Bouzenia and Michael Pradel. 2025. Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). Sacramento, CA, USA

[6]

Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, and Baishakhi Ray. 2025. Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study. arXiv preprint arXiv:2506.08311 (2025)

[7]

Zhi Chen and Lingxiao Jiang. 2024. Promise and peril of collaborative code generation models: Balancing effectiveness and memorization. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 493–505

[8]

Zhi Chen and Lingxiao Jiang. 2025. Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios. In Proceedings of the 32nd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)

[9]

Zhi Chen, Wei Ma, and Lingxiao Jiang. 2026. Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). Rio de Janeiro, Brazil

[10]

Cognition Labs. 2024. Introducing Devin, the First AI Software Engineer. https://cognition.ai/blog/introducing-devin

[11]

DeepSeek. 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. (2025). arXiv:2512.02556 [cs.CL] https://arxiv.org/abs/2512.02556

[12]

DeepSeek. 2025. Reasoning Model (deepseek-reasoner). DeepSeek API Documentation. https://api-docs.deepseek.com/guides/reasoning_model Official API guide for DeepSeek reasoning model endpoint

[13]

Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, and Azalia Mirhoseini. 2025. CodeMonkeys: Scaling Test-Time Compute for Software Engineering. arXiv preprint arXiv:2501.14723 (2025)

[14]

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. 2025. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046 (2025)

[15]

Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, and Xia Liu. 2025. Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling. arXiv preprint arXiv:2507.23370 (2025)

[16]

Google Cloud. 2025. Gemini 3 Pro on Vertex AI. Vertex AI Model Documentation. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-pro Official model documentation (includes preview variants)

[17]

Mark Harman, Jillian Ritchey, Inna Harper, Shubho Sengupta, Ke Mao, Abhishek Gulati, Christopher Foster, and Hervé Robert. 2025. Mutation-guided llm-based test generation at meta. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 180–191

[18]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. 2024. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representations

[20]

Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, and Siheng Chen. 2025. Self-Evolving Multi-Agent Collaboration Networks for Software Development. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=4R71pdPBZp

[21]

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations

[22]

Kabir Khandpur, Kilian Lieret, Carlos E. Jimenez, Ofir Press, and John Yang. 2025. SWE-bench Multilingual. https://www.swebench.com/multilingual.html

[23]

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, et al. 2025. Towards a science of scaling agent systems. arXiv preprint arXiv:2512.08296 (2025)

[24]

LangChain. 2026. LangSmith Observability. https://docs.langchain.com/langsmith/observability. Official documentation. Accessed: 2026-03-26

[25]

Yihao Li, Pan Liu, Haiyang Wang, Jie Chu, and W Eric Wong. 2025. Evaluating large language models for software testing. Computer Standards & Interfaces 93 (2025), 103942

[26]

Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, and Li Zhang. 2025. An Empirical Study on Failures in Automated Issue Solving. arXiv preprint arXiv:2509.13941 (2025)

[28]

Yizhou Liu, Pengfei Gao, Xinchen Wang, Jie Liu, Yexuan Shi, Zhao Zhang, and Chao Peng. 2024. MarsCode Agent: AI-native Automated Bug Fixing. arXiv preprint arXiv:2409.00899 (2024)

[29]

Andrea Lops, Fedelucio Narducci, Azzurra Ragone, Michelantonio Trizio, and Claudio Bartolini. 2025. A system for automated unit test generation using large language models and assessment of generated test suites. In 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 29–36

[30]

Oorja Majgaonkar, Zhiwei Fei, Xiang Li, Federica Sarro, and He Ye. 2026. Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). Rio de Janeiro, Brazil

[31]

MiniMax. 2025. MiniMax-M2. MiniMax News. https://www.minimax.io/news/minimax-m2 Official release note / technical overview

[32]

Davide Molinelli, Luca Di Grazia, Alberto Martin-Lopez, Michael D. Ernst, and Mauro Pezzè. 2025. Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset. In ASE 2025: Proceedings of the 39th Annual International Conference on Automated Software Engineering. Seoul, South Korea

[33]

Moonshot AI. 2025. Introducing Kimi K2 Thinking. Project page. https://moonshotai.github.io/Kimi-K2/thinking.html Official page describing Kimi K2 Thinking

[34]

Fangwen Mu, Junjie Wang, Lin Shi, Song Wang, Shoubin Li, and Qing Wang. 2025. EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair. arXiv preprint arXiv:2506.10484 (2025)

[35]

Niels Mündler, Mark Müller, Jingxuan He, and Martin Vechev. 2024. SWT-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems 37 (2024), 81857–81887

[36]

OpenAI. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/

[37]

OpenAI. 2025. Update to GPT-5 System Card: GPT-5.2. Technical Report. OpenAI. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf System card (PDF)

[38]

OpenAI. 2026. Why SWE-bench Verified no longer measures frontier coding capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/. Published: February 23, 2026. Accessed: 2026-03-26

[39]

Albert Örwall. 2024. Moatless Tools. https://github.com/aorwall/moatless-tools

[40]

pytest developers. 2026. Good Integration Practices: Conventions for Python test discovery. pytest documentation. https://docs.pytest.org/en/stable/explanation/goodpractices.html#conventions-for-python-test-discovery

Conference’26, October, 2026, Washington, DC, USA. Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu, David Lo, and Lingxiao Jiang

[41]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. 2024. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15174–15186

[42]

Maxime Robeyns, Martin Szummer, and Laurence Aitchison. 2025. A Self-Improving Coding Agent. In Scaling Self-Improving Foundation Models without Human Supervision. https://openreview.net/forum?id=rShJCyLsOr

[43]

Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2024. Specrover: Code intent extraction via llms. arXiv preprint arXiv:2408.02232 (2024)

[44]

Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. 2024. Code-aware prompting: A study of coverage-guided test generation in regression setting using llm. Proceedings of the ACM on Software Engineering 1, FSE (2024), 951–971

[45]

Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering 50, 1 (2023), 85–105

[46]

Jiho Shin, Sepehr Hashtroudi, Hadi Hemmati, and Song Wang. 2024. Domain adaptation for code model-based unit test case generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1211–1222

[47]

SWE-agent Team. 2024. mini-SWE-agent. https://github.com/SWE-agent/mini-swe-agent

[48]

SWE-bench Team. 2024. SWE-bench Bash-only Leaderboard. https://www.swebench.com/bash-only.html

[49]

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering 50, 4 (2024), 911–936

[50]

Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. 2025. Testeval: Benchmarking large language models for test case generation. In Findings of the Association for Computational Linguistics: NAACL 2025. 3547–3562

[51]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2024. OpenHands: An Open Platform for AI Software Developers as Generalist Agents. ...

[52]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based Software Engineering Agents. arXiv preprint arXiv:2407.01489 (2024)

[53]

Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. 2025. Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? arXiv preprint arXiv:2511.13646 (2025)

[55]

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems, Vol. 37. 50528–50652

[56]

Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, et al. 2024. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1607–1619

[57]

Zhenzhen Yang, Rubing Huang, Chenhui Cui, Nan Niu, and Dave Towey. 2025. Requirements-based test generation: A comprehensive survey. ACM Transactions on Software Engineering and Methodology (2025)

[58]

Shengcheng Yu, Chunrong Fang, Yuchen Ling, Chentian Wu, and Zhenyu Chen. 2023. Llm for test script generation and migration: Challenges, capabilities, and opportunities. In 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS). IEEE, 206–217

[60]

Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. 2024. Evaluating and improving chatgpt for unit test generation. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1703–1726

[61]

Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. 2025. Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents. arXiv:2505.22954 [cs.AI] https://arxiv.org/abs/2505.22954

[62]

Quanjun Zhang, Weifeng Sun, Chunrong Fang, Bowen Yu, Hongyan Li, Meng Yan, Jianyi Zhou, and Zhenyu Chen. 2025. Exploring automated assertion generation via large language models. ACM Transactions on Software Engineering and Methodology 34, 3 (2025), 1–25

[63]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604