pith. sign in

arxiv: 2607.00990 · v1 · pith:FAVOD2EUnew · submitted 2026-07-01 · 💻 cs.SE · cs.AI

SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests

Pith reviewed 2026-07-02 08:20 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM agentssoftware bug fixingbug reproduction testsruntime diagnosispatch generationSWE-benchsoftware engineeringmulti-faceted testing
0
0 comments X

The pith

SWE-Doctor improves software issue resolution by guiding patch generation with runtime diagnoses from multi-faceted bug reproduction tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that directly applying bug reproduction tests to guide patch generation in LLM-based agents can mislead or limit the fixes because single tests may not capture all aspects of an issue. SWE-Doctor addresses this by creating multiple tests for different requirements in the issue description, executing and debugging them to build diagnosis records grounded in runtime behavior, and feeding those records plus localization details into the patch generation process. This leads to consistently better performance than existing agents on two major benchmarks for real-world Python bug fixes. Readers would care if this approach makes automated repair more reliable by turning test execution into actionable diagnostic signals rather than just pass/fail checks.

Core claim

SWE-Doctor first generates multi-faceted BRTs for different behavioral requirements stated in the issue, then executes and debugs these BRTs to construct runtime-grounded diagnosis records, and finally uses the diagnoses together with localization information inferred during BRT generation to guide patch generation and reduce partial patches. It outperforms baselines across all LLM-benchmark combinations with average rates of 75.7% on SWE-bench Verified and 59.4% on SWE-bench Pro, improving the latter by 8.0-8.9 points.

What carries the argument

Runtime-grounded diagnosis records built from executing and debugging multi-faceted bug reproduction tests.

If this is right

  • Outperforms existing agents on all 10 LLM-benchmark combinations tested.
  • Achieves 75.7% average resolution on SWE-bench Verified and 59.4% on SWE-bench Pro.
  • Improves resolution by 8.0-8.9 percentage points on the more challenging SWE-bench Pro.
  • Reduces partial patches by incorporating diagnoses from multiple behavioral aspects of the issue.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could allow agents to handle issues with more complex or multi-part symptoms by systematically covering behavioral requirements.
  • Combining such diagnoses with other tools like static analysis might further improve localization accuracy in future agent designs.
  • Applying the same multi-faceted approach to test generation in non-bug-fixing tasks like feature addition could yield similar benefits.
  • The method suggests that runtime feedback can be more effective when structured as diagnostic records rather than raw test outcomes.

Load-bearing premise

Multi-faceted BRTs for different behavioral requirements will yield reliable runtime-grounded diagnosis records that reduce partial patches without adding misleading signals or gaps in coverage.

What would settle it

Observing lower resolution rates than baselines on a set of issues where the generated multi-faceted BRTs fail to cover key manifestations or produce incorrect debug information.

Figures

Figures reproduced from arXiv: 2607.00990 by Jie M. Zhang, Yang Liu, Yaoqi Guo, Yiling Lou, Yun Ma, Zhenpeng Chen.

Figure 1
Figure 1. Figure 1: Illustration of the unresolved django__django-13512 issue, where a generated BRT covers only the editable form-field path and guides the agent to produce a partial patch. generated BRT is a fail-to-pass test (denoted as F→P issues) and issues for which the generated BRT is a fail-to-fail test (denoted as F→F issues). A generated BRT is a fail-to￾pass test if it fails on the original repository, i.e., befor… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of SWE-Doctor. unsupported requirements and helps ensure that the generated BRTs target the behaviors actually requested by the issue. Targeted BRT generation. For each extracted requirement, SWE-Doctor uses an LLM to generate a targeted BRT that exercises the behavior described by it. To make test gener￾ation focused, SWE-Doctor performs requirement-level bug localization before generating the BR… view at source ↗
Figure 3
Figure 3. Figure 3: (RQ1) Venn diagrams showing the overlap of issues resolved by [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Large language model (LLM)-based software engineering agents are increasingly developed to resolve software issues by generating patches from issue reports and code repositories. Bug reproduction tests (BRTs) are an important building block for such agents and have been shown useful for patch validation. However, it remains unclear whether BRTs can also help the more central stage of patch generation. We first conduct a preliminary study and find that directly using advanced BRT generators to guide patch generation is not beneficial: fail-to-fail BRTs can mislead agents, while even fail-to-pass BRTs bring limited or negative gains. Our analysis reveals two reasons: fail-to-pass BRTs may cover only one manifestation of the reported issue, leading to partial patches, whereas fail-to-fail BRTs are unreliable as direct patch-generation targets. Motivated by these insights, we propose SWE-Doctor, a software issue resolution agent that guides patch generation with runtime diagnoses derived from multi-faceted BRT executions. SWE-Doctor first generates multi-faceted BRTs for different behavioral requirements stated in the issue, then executes and debugs these BRTs to construct runtime-grounded diagnosis records, and finally uses the diagnoses together with localization information inferred during BRT generation to guide patch generation and reduce partial patches. We evaluate SWE-Doctor on Python bug-fixing issues from the widely adopted SWE-bench Verified and SWE-bench Pro across five LLM backends. SWE-Doctor consistently outperforms existing agents across all 10 LLM-benchmark combinations, achieving average resolution rates of 75.7% on SWE-bench Verified and 59.4% on SWE-bench Pro. In particular, on the more challenging SWE-bench Pro, SWE-Doctor improves the average resolution rate by 8.0-8.9 percentage points over the baseline agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SWE-Doctor, an LLM-based software engineering agent that addresses limitations of direct bug reproduction test (BRT) use for patch generation. A preliminary study identifies that fail-to-pass BRTs can lead to partial patches (covering only one issue manifestation) and fail-to-fail BRTs are unreliable. SWE-Doctor generates multi-faceted BRTs for different behavioral requirements in the issue, executes and debugs them to build runtime-grounded diagnosis records, and combines these with localization information to guide patch generation. Evaluation on SWE-bench Verified and SWE-bench Pro across five LLM backends shows consistent outperformance, with average resolution rates of 75.7% and 59.4%, and gains of 8.0-8.9 percentage points on the Pro benchmark over baselines.

Significance. If the results hold, the work offers a concrete way to extend BRTs from validation to the patch-generation stage in SE agents, directly targeting the partial-patch and unreliability issues identified in the preliminary study. The evaluation on two widely used external benchmarks (SWE-bench Verified and Pro) across multiple backends is a positive aspect, as is the explicit diagnosis of failure modes before proposing the solution. This could inform practical improvements in agent design for real-world bug fixing.

major comments (2)
  1. [§3] §3 (Method description): The central claim that runtime diagnoses from multi-faceted BRT executions reliably steer patch generation and reduce partial patches rests on the assumption that these diagnoses avoid misleading signals or coverage gaps. However, the method provides no explicit mechanism (such as cross-BRT consistency checks, coverage metrics, or oracle validation of diagnosis records) to ensure this, despite the preliminary study explicitly documenting these risks for single BRTs. This is load-bearing for the reported performance gains.
  2. [Evaluation / Abstract] Evaluation section and abstract: The headline results (75.7% on Verified, 59.4% on Pro, +8.0-8.9 pp on Pro across all 10 LLM-benchmark combinations) are presented as averages without reported statistical significance, variance across runs, or detailed experimental controls. This weakens the ability to assess whether the consistent outperformance is robust rather than sensitive to specific conditions.
minor comments (1)
  1. [Abstract] Abstract: The range '8.0-8.9 percentage points' is stated without clarifying whether it represents the minimum and maximum improvements across the five backends or a different aggregation; adding this detail would improve precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below with clarifications from the manuscript and indicate planned revisions.

read point-by-point responses
  1. Referee: [§3] §3 (Method description): The central claim that runtime diagnoses from multi-faceted BRT executions reliably steer patch generation and reduce partial patches rests on the assumption that these diagnoses avoid misleading signals or coverage gaps. However, the method provides no explicit mechanism (such as cross-BRT consistency checks, coverage metrics, or oracle validation of diagnosis records) to ensure this, despite the preliminary study explicitly documenting these risks for single BRTs. This is load-bearing for the reported performance gains.

    Authors: The preliminary study documents risks specifically for single BRTs, which directly motivated SWE-Doctor's use of multi-faceted BRTs that target distinct behavioral requirements stated in the issue. Each BRT is executed and debugged to produce runtime-grounded diagnosis records; the debugging step analyzes execution traces and failure modes to derive actionable diagnoses rather than raw pass/fail signals. These diagnoses are then combined with localization information obtained during BRT generation. We agree that §3 would benefit from more explicit discussion of how this design reduces (rather than eliminates) coverage gaps and misleading signals. We will revise the method section to elaborate on the role of multi-faceted coverage and debugging in mitigating the risks identified in the preliminary study, and we will add a brief description of an optional cross-BRT consistency step during diagnosis aggregation. revision: partial

  2. Referee: [Evaluation / Abstract] Evaluation section and abstract: The headline results (75.7% on Verified, 59.4% on Pro, +8.0-8.9 pp on Pro across all 10 LLM-benchmark combinations) are presented as averages without reported statistical significance, variance across runs, or detailed experimental controls. This weakens the ability to assess whether the consistent outperformance is robust rather than sensitive to specific conditions.

    Authors: The reported averages reflect performance across five distinct LLM backends on two separate benchmarks, with outperformance observed in every one of the 10 LLM-benchmark combinations. This cross-backend consistency is presented as the primary indicator of robustness. We did not report per-run variance or conduct formal significance tests, primarily because of the substantial computational cost of repeated full agent runs. We will revise the evaluation section to explicitly acknowledge this limitation, to highlight the 10/10 consistency result as supporting evidence, and to note any available seed-level data if it can be extracted from our existing logs without new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks

full rationale

The paper conducts a preliminary study on limitations of single BRTs, proposes SWE-Doctor using multi-faceted BRTs for diagnosis-guided patching, and reports resolution rates on the external SWE-bench Verified and SWE-bench Pro benchmarks across multiple LLMs. These benchmarks are standard, independent datasets not constructed or fitted inside the paper. The headline performance figures (75.7%, 59.4%, +8–8.9 pp) are measured outcomes on those benchmarks rather than quantities derived by construction from the method. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps; the reliability assumption for diagnoses is an empirical claim tested on the external data rather than a self-definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on domain assumptions that BRT generators can produce tests covering distinct behavioral requirements and that runtime debugging of those tests yields actionable, non-misleading diagnoses; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption BRT generators can reliably produce tests that each target a distinct behavioral requirement stated in the issue report.
    Invoked in the method description when stating that multi-faceted BRTs are generated for different requirements.
  • domain assumption Runtime execution and debugging of BRTs produces diagnosis records that are more informative for patch generation than the BRTs themselves.
    Central premise of the motivation and SWE-Doctor design.

pith-pipeline@v0.9.1-grok · 5885 in / 1295 out tokens · 27980 ms · 2026-07-02T08:20:32.867893+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 3 canonical work pages

  1. [1]

    SWE-agent: Agent-computer interfaces enable automated software engineering,

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” inAdvances in Neural Information Processing Systems 37 (NeurIPS), 2024

  2. [2]

    Demystifying LLM-based software engineering agents,

    C. S. Xia, Y . Deng, S. Dunn, and L. Zhang, “Demystifying LLM-based software engineering agents,”Proceedings of the ACM on Software Engineering (PACMSE), vol. 2, no. FSE, pp. 801–824, 2025

  3. [3]

    SWE-bench: Can language models resolve real-world GitHub issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” inThe Twelfth International Conference on Learning Representations (ICLR), 2024

  4. [4]

    Heterogeneous prompting and execution feedback for software engineering issue test generation and selection,

    T. Ahmed, J. Ganhotra, A. Shinnar, and M. Hirzel, “Heterogeneous prompting and execution feedback for software engineering issue test generation and selection,” inProceedings of the 48th IEEE/ACM Inter- national Conference on Software Engineering (ICSE), 2026

  5. [5]

    Agentic bug reproduction for effective automated program repair at google,

    R. Cheng, M. Tufano, J. Cito, J. Cambronero, P. Rondon, R. Wei, A. Sun, and S. Chandra, “Agentic bug reproduction for effective automated program repair at google,”arXiv, 2025

  6. [6]

    Issue2Test: Gen- erating reproducing test cases from issue reports,

    N. Nashid, I. Bouzenia, M. Pradel, and A. Mesbah, “Issue2Test: Gen- erating reproducing test cases from issue reports,” inProceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE), 2026, to appear

  7. [7]

    AssertFlip: Reproducing bugs via inversion of llm-generated passing tests,

    L. Khatib, N. S. Mathews, and M. Nagappan, “AssertFlip: Reproducing bugs via inversion of llm-generated passing tests,” inProceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE), 2026

  8. [8]

    SWE-Bench Pro: Can AI agents solve long-horizon software engineer- ing tasks?

    X. Deng, J. Da, E. Pan, Y . He, C. Ide, K. Garg, N. Lauffer, A. Park, N. Pasari, C. Rane, K. Sampath, M. Krishnan, S. Kundurthy, S. M. Hendryx, Z. Wang, C. B. C. Zhang, N. Jacobson, B. Liu, and B. Kenstler, “SWE-Bench Pro: Can AI agents solve long-horizon software engineer- ing tasks?”arXiv, 2025

  9. [9]

    Live-SWE-agent: Can software engineering agents self-evolve on the fly?

    C. S. Xia, Z. Wang, Y . Yang, Y . Wei, and L. Zhang, “Live-SWE-agent: Can software engineering agents self-evolve on the fly?”arXiv, 2025

  10. [10]

    SWE-Doctor,

    “SWE-Doctor,” https://github.com/SWE-Doctor/SWE-Doctor, 2026

  11. [11]

    Gistify! codebase-level understanding via runtime execution,

    H. Lee, M. Kim, C. Singh, M. Pereira, A. Sonwane, I. White, E. Stengel- Eskin, M. Bansal, Z. Shi, A. Sordoni, M.-A. C ˆot´e, X. Yuan, and L. Cac- cia, “Gistify! codebase-level understanding via runtime execution,” in International Conference on Learning Representations (ICLR), 2026

  12. [12]

    Swenergy: An em- pirical study on energy efficiency in agentic issue resolution frameworks with slms,

    A. Tripathy, C. P. Harshit, and K. Vaidhyanathan, “Swenergy: An em- pirical study on energy efficiency in agentic issue resolution frameworks with slms,” inProceedings of the 2026 International Workshop on Agentic Engineering, 2026, pp. 104–111

  13. [13]

    Zeller,Why Programs Fail: A Guide to Systematic Debugging, 2nd ed

    A. Zeller,Why Programs Fail: A Guide to Systematic Debugging, 2nd ed. Morgan Kaufmann, 2009

  14. [14]

    Using hypotheses as a debugging aid,

    A. Alaboudi and T. D. LaToza, “Using hypotheses as a debugging aid,” inProceedings of the 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 2020

  15. [15]

    pdb — the Python debugger,

    Python Software Foundation, “pdb — the Python debugger,” https://docs.python.org/3/library/pdb.html, 2024

  16. [16]

    Klear- agentforge: Forging agentic intelligence through posttraining scaling,

    Q. Wang, H. Zhang, J. Fu, K. Fu, Y . Liu, T. Zhang, C. Sun, G. Jiang, J. Tang, X. Ji, Y . Yue, J. Zhang, F. Zhang, K. Gai, and G. Zhou, “Klear- agentforge: Forging agentic intelligence through posttraining scaling,” arXiv preprint arXiv:2511.05951, 2025

  17. [17]

    Swe-replay: Efficient test-time scaling for software engineering agents,

    Y . Ding and L. Zhang, “Swe-replay: Efficient test-time scaling for software engineering agents,”arXiv preprint arXiv:2601.22129, 2026

  18. [18]

    arXiv:2504.08703 (2025) Springer Nature 2021 LATEX template 58Projecting the Emerging Mindset of SWE Agent

    M. S. Rashid, C. Bock, Y . Zhuang, A. Buchholz, T. Esler, S. Valentin, L. Franceschi, M. Wistuba, P. T. Sivaprasad, W. J. Kimet al., “Swe- polybench: A multi-language benchmark for repository level evaluation of coding agents,”arXiv preprint arXiv:2504.08703, 2025

  19. [19]

    GPT-5.4 thinking system card,

    OpenAI, “GPT-5.4 thinking system card,” https://openai.com/index/gpt- 5-4-thinking-system-card/, 2026, published March 5, 2026

  20. [20]

    Introducing GPT-5.4 mini and nano,

    ——, “Introducing GPT-5.4 mini and nano,” https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026

  21. [21]

    Claude Sonnet 4.6 system card,

    Anthropic, “Claude Sonnet 4.6 system card,” https://www.anthropic.com/claude-sonnet-4-6-system-card, 2026, published February 17, 2026

  22. [22]

    Deepseek v4 technical documentation,

    DeepSeek, “Deepseek v4 technical documentation,” https://fe- static.deepseek.com/chat/transparency/deepseek-V4-model-card-EN.pdf, 2026, published April 27, 2026

  23. [23]

    Xiaomi MiMo-V2.5-Pro,

    Xiaomi, “Xiaomi MiMo-V2.5-Pro,” https://mimo.xiaomi.com/mimo-v2- 5-pro/, 2026, released April 27, 2026

  24. [24]

    Delve: A debugger for the Go programming language,

    D. Parker and The Delve Contributors, “Delve: A debugger for the Go programming language,” https://github.com/go-delve/delve, 2014

  25. [25]

    Large language models are few-shot testers: Exploring LLM-based general bug reproduction,

    S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring LLM-based general bug reproduction,” inProceedings of the 45th IEEE/ACM International Conference on Software Engineer- ing (ICSE), 2023, pp. 2312–2323

  26. [26]

    Automatic generation of test cases based on bug reports: A feasibility study with large language models,

    L. Plein, W. C. Ou ´edraogo, J. Klein, and T. F. Bissyand ´e, “Automatic generation of test cases based on bug reports: A feasibility study with large language models,” inProceedings of the 46th IEEE/ACM Inter- national Conference on Software Engineering: Companion Proceedings (ICSE-Companion), 2024, pp. 360–361

  27. [27]

    SWT-Bench: Testing and validating real-world bug-fixes with code agents,

    N. M ¨undler, M. N. M ¨uller, J. He, and M. Vechev, “SWT-Bench: Testing and validating real-world bug-fixes with code agents,” inAdvances in Neural Information Processing Systems 37 (NeurIPS), 2024

  28. [28]

    TDD-Bench verified: Can LLMs generate tests for issues before they get resolved?

    T. Ahmed, M. Hirzel, R. Pan, A. Shinnar, and S. Sinha, “TDD-Bench verified: Can LLMs generate tests for issues before they get resolved?” arXiv, 2024

  29. [29]

    Otter: Generating tests from issues to validate SWE patches,

    T. Ahmed, J. Ganhotra, R. Pan, A. Shinnar, S. Sinha, and M. Hirzel, “Otter: Generating tests from issues to validate SWE patches,” in Proceedings of the 42nd International Conference on Machine Learning (ICML), vol. 267, 2025, pp. 752–771

  30. [30]

    Eet: Experience-driven early termination for cost-efficient software engineering agents,

    Y . Guo, Y . Xiao, J. M. Zhang, M. Harman, Y . Lou, Y . Liu, and Z. Chen, “Eet: Experience-driven early termination for cost-efficient software engineering agents,” inFindings of the Association for Computational Linguistics (ACL), 2026

  31. [31]

    OpenHands: An open platform for AI software developers as generalist agents,

    X. Wang, B. Li, Y . Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y . Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y . Shao, N. Muennighoff, Y . Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig, “OpenHands: An open platform for AI software developers as generalist agents,” inThe Thirteenth International Conference on Le...

  32. [32]

    AutoCodeRover: Autonomous program improvement,

    Y . Zhang, H. Ruan, Z. Fan, and A. Roychoudhury, “AutoCodeRover: Autonomous program improvement,” inProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2024, pp. 1592–1604

  33. [33]

    MAGIS: LLM-based multi-agent framework for GitHub issue resolution,

    W. Tao, Y . Zhou, Y . Wang, W. Zhang, H. Zhang, and Y . Cheng, “MAGIS: LLM-based multi-agent framework for GitHub issue resolution,” in Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

  34. [34]

    Diversity empowers intelligence: Integrating expertise of software engineering agents,

    K. Zhang, W. Yao, Z. Liu, Y . Feng, Z. Liu, R. Murthy, T. Lan, L. Li, R. Lou, J. Xu, B. Pang, Y . Zhou, S. Heinecke, S. Savarese, H. Wang, and C. Xiong, “Diversity empowers intelligence: Integrating expertise of software engineering agents,” inThe Thirteenth International Con- ference on Learning Representations (ICLR), 2025

  35. [35]

    Trae agent: An LLM-based agent for software engineering with test-time scaling,

    P. Gao, Z. Tian, X. Meng, X. Wang, R. Hu, Y . Xiao, Y . Liu, Z. Zhang, J. Chen, C. Gao, Y . Lin, Y . Xiong, C. Peng, and X. Liu, “Trae agent: An LLM-based agent for software engineering with test-time scaling,” arXiv, 2025. 11

  36. [36]

    CodeMonkeys: Scaling test-time compute for software engineering,

    R. Ehrlich, B. Brown, J. Juravsky, R. Clark, C. R ´e, and A. Mirhoseini, “CodeMonkeys: Scaling test-time compute for software engineering,” arXiv, 2025

  37. [37]

    S*: Test time scaling for code generation,

    D. Li, S. Cao, C. Cao, X. Li, S. Tan, K. Keutzer, J. Xing, J. E. Gonzalez, and I. Stoica, “S*: Test time scaling for code generation,” inFindings of the Association for Computational Linguistics: EMNLP, 2025, pp. 15 964–15 978. 12