Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?
Pith reviewed 2026-05-15 20:14 UTC · model grok-4.3
The pith
Giving coding agents access to interactive debuggers improves their bug-fixing performance by more than 20 percent over the baseline for certain models on GitBug-Java and SWE-Bench-Live.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Debug2Fix equips software engineering agents with interactive debuggers via a subagent architecture so that the main agent can inspect runtime states, variables, and execution traces while fixing bugs. The approach is implemented for Java and Python and evaluated on GitBug-Java and SWE-Bench-Live, where, for certain models, it improves performance by more than 20 percent over baselines that use only static analysis or iterative test execution. The results also show that weaker models equipped with Debug2Fix match or exceed the bug-fixing performance of stronger models, suggesting that better tool design can matter as much as switching to a more expensive model.
What carries the argument
A subagent architecture that runs an interactive debugger session, exposing runtime inspection commands and output directly to the main coding agent.
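As a rough illustration of the idea, the sketch below drives Python's standard `pdb` debugger with a scripted list of commands and hands the captured transcript back to the caller, which stands in for the main agent. It is not the paper's implementation: the names `run_debugger_subagent` and `buggy_mean`, the fixed command list, and the choice to return a raw transcript are all assumptions made for this example.

```python
import io
import pdb


def buggy_mean(values):
    """Toy target with an off-by-one bug in the divisor (hypothetical)."""
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)


def run_debugger_subagent(func, args, commands):
    """Run func under pdb with scripted commands; return (result, transcript).

    Hypothetical stand-in for a debugging subagent: a real one would pick
    its commands with a model instead of taking a fixed list.
    """
    # End with "continue" so the target runs to completion after inspection.
    script = io.StringIO("\n".join(list(commands) + ["continue"]) + "\n")
    transcript = io.StringIO()
    # Passing stdout makes pdb read commands from the given stdin
    # instead of prompting on the terminal.
    debugger = pdb.Pdb(stdin=script, stdout=transcript,
                       nosigint=True, readrc=False)
    result = debugger.runcall(func, *args)  # pauses at the first line of func
    return result, transcript.getvalue()


result, transcript = run_debugger_subagent(
    buggy_mean, ([2, 4],), ["p values", "p len(values)"]
)
print(transcript)
print("returned:", result)
```

The transcript contains the values printed by the `p` commands, which is the kind of runtime evidence a main agent could use to notice that the divisor is one less than the list length. Keeping the debugger session inside a separate function mirrors the subagent separation only loosely; how the paper's subagent summarizes or filters debugger output is not described on this page.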
If this is right
- Bug-fixing success rates rise when agents receive live runtime data instead of relying on static code or test outcomes alone.
- Weaker models become competitive with stronger models once they share the same debugging tool set.
- Both the subagent separation and the debugger integration are necessary; removing either reduces the observed gains.
- The framework applies to existing Java and Python codebases without changing the underlying language runtimes.
Where Pith is reading between the lines
- Agent designers may achieve larger gains by expanding available tools than by increasing model size alone.
- Debugging agents could eventually handle bugs whose symptoms appear only at runtime and are invisible to static or test-based methods.
- Integration of debuggers into agents could reduce the manual debugging load for human developers on routine defects.
Load-bearing premise
The runtime information returned by the debugger can be correctly interpreted and used by the agent without creating new errors or hallucinations that erase the gains.
What would settle it
Re-running the GitBug-Java and SWE-Bench-Live evaluations after adding debugger access and finding that the agent produces more incorrect patches than the baseline because it misreads stack traces or variable values.
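One way to run that check, sketched under assumptions: given per-instance pass/fail outcomes for the baseline and for the debugger-equipped agent, count the instances that flip in each direction. The instance IDs and outcomes below are made-up placeholders, not results from the paper, and `count_flips` is a hypothetical helper.

```python
def count_flips(baseline, with_debugger):
    """Count paired outcome changes between two runs over the same instances.

    Both arguments map instance ID -> True if the patch passed the tests.
    """
    shared = baseline.keys() & with_debugger.keys()
    # Fixed by the baseline but broken once the debugger is available.
    regressions = sorted(
        i for i in shared if baseline[i] and not with_debugger[i]
    )
    # Failed under the baseline but fixed with the debugger.
    improvements = sorted(
        i for i in shared if not baseline[i] and with_debugger[i]
    )
    return {"regressions": regressions, "improvements": improvements}


# Placeholder outcomes for illustration only.
baseline = {"bug-1": True, "bug-2": False, "bug-3": False, "bug-4": True}
with_debugger = {"bug-1": True, "bug-2": True, "bug-3": False, "bug-4": False}

flips = count_flips(baseline, with_debugger)
print(flips)
```

If regressions outnumbered improvements on the real benchmarks, the premise above would be in doubt; a manual look at each regression would then show whether misread debugger output was the cause.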
Original abstract
While significant progress has been made in automating various aspects of software development through coding agents, there is still significant room for improvement in their bug fixing capabilities. Debugging and investigation of runtime behavior remains largely a manual, developer-driven process. Popular coding agents typically rely on either static analysis of the code or iterative test-fix cycles, which is akin to trial and error debugging. We posit that there is a wealth of rich runtime information that developers routinely access while debugging code, which agents are currently deprived of due to design limitations. Despite how prevalent debuggers are in modern IDEs and command-line tools, they have surprisingly not made their way into coding agents. In this work, we introduce Debug2Fix, a novel framework that incorporates interactive debugging as a core component of a software engineering agent via a subagent architecture. We incorporate debuggers for Java and Python into our agent framework and evaluate against GitBug-Java and SWE-Bench-Live and achieve >20% improvement in performance compared to the baseline for certain models. Furthermore, using our framework, we're able to make weaker models like GPT-5 and Claude Haiku 4.5 match or exceed the performances of stronger models like Claude Sonnet 4.5, showing that better tool design is often just as important as switching to a more expensive model. Finally, we conduct systematic ablations demonstrating the importance of both the subagent architecture and debugger integration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Debug2Fix, a framework that integrates interactive debugging into coding agents via a subagent architecture, incorporating debuggers for Java and Python. It evaluates the approach on GitBug-Java and SWE-Bench-Live, reporting >20% performance gains over baselines for certain models, and claims that the framework enables weaker models (e.g., GPT-5, Claude Haiku 4.5) to match or exceed stronger models (e.g., Claude Sonnet 4.5). Systematic ablations are presented to demonstrate the contributions of the subagent architecture and debugger integration.
Significance. If the empirical results hold under full verification, the work provides evidence that runtime debugging tools can meaningfully improve bug-fixing capabilities in coding agents, with the potential to reduce dependence on larger models by emphasizing tool design. The subagent-based integration and benchmark results offer a practical demonstration of operationalizing interactive debugging in agents, which could influence future agent architectures in software engineering.
major comments (2)
- [Experimental Evaluation and Ablation Studies] The headline claims of >20% gains and weaker models matching stronger ones on GitBug-Java and SWE-Bench-Live rest on the unverified assumption that the debugging subagent reliably parses and acts on runtime outputs (stack traces, variable states) without net error increase from hallucinations or misinterpretations. The ablations show value for the debugger component but do not isolate or measure the rate at which debugger use introduces incorrect patches relative to baseline trial-and-error loops.
- [Results and Evaluation] The abstract and results sections report performance improvements without accompanying error bars, exact baseline configurations, or full experimental protocols (e.g., number of runs, temperature settings, or failure mode analysis), which prevents independent verification of the central performance claims.
minor comments (2)
- [Abstract] The abstract refers to gains 'for certain models' without naming them; specifying the models and exact percentages per benchmark would improve precision.
- [Framework Description] Notation for the subagent architecture and its interaction with the main agent could be clarified with a diagram or pseudocode to aid reproducibility.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our paper. We address the major comments below and will revise the manuscript accordingly to improve clarity and verifiability of our results.
- Referee: [Experimental Evaluation and Ablation Studies] The headline claims of >20% gains and weaker models matching stronger ones on GitBug-Java and SWE-Bench-Live rest on the unverified assumption that the debugging subagent reliably parses and acts on runtime outputs (stack traces, variable states) without net error increase from hallucinations or misinterpretations. The ablations show value for the debugger component but do not isolate or measure the rate at which debugger use introduces incorrect patches relative to baseline trial-and-error loops.
  Authors: We thank the referee for highlighting this important point. Our ablations compare the full Debug2Fix framework against baselines without the debugger, showing consistent gains that suggest the debugging subagent provides net benefit. However, we agree that we have not explicitly measured the rate of errors introduced by the subagent's parsing of runtime outputs. In the revised version, we will add a detailed failure mode analysis, including a breakdown of cases where debugger interactions led to incorrect patches, and compare the error rates to the baseline. This will be supported by manual review of a subset of the results.
  Revision: yes
- Referee: [Results and Evaluation] The abstract and results sections report performance improvements without accompanying error bars, exact baseline configurations, or full experimental protocols (e.g., number of runs, temperature settings, or failure mode analysis), which prevents independent verification of the central performance claims.
  Authors: We agree that additional details are necessary for reproducibility. In the revised manuscript, we will include error bars from repeated experiments, specify exact baseline configurations including model parameters and temperatures, detail the number of runs and random seeds used, and provide a comprehensive experimental protocol along with failure mode analysis in the appendix.
  Revision: yes
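For concreteness, error bars over repeated runs could be computed along the following lines. This is a minimal sketch, not the authors' protocol: the helper `summarize_runs`, the run counts, and the benchmark size are placeholders chosen for illustration, and the standard error of the mean is only one of several reasonable choices.

```python
import math
import statistics


def summarize_runs(resolved_counts, num_instances):
    """Return (mean resolve rate, standard error of the mean) across runs.

    resolved_counts holds the number of instances fixed in each
    independent run; num_instances is the benchmark size.
    """
    if len(resolved_counts) < 2:
        raise ValueError("need at least two runs to estimate spread")
    rates = [count / num_instances for count in resolved_counts]
    mean_rate = statistics.mean(rates)
    # Sample standard deviation divided by sqrt(number of runs).
    sem = statistics.stdev(rates) / math.sqrt(len(rates))
    return mean_rate, sem


# Placeholder numbers, not results from the paper.
mean_rate, sem = summarize_runs([50, 54, 52], num_instances=100)
print(f"resolve rate: {mean_rate:.3f} +/- {sem:.3f} (SEM, n=3 runs)")
```

Reporting the number of runs alongside the interval matters here, since with only a few runs the standard error is itself a noisy estimate.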
Circularity Check
No circularity: empirical results on external benchmarks
The paper introduces Debug2Fix as an engineering framework and reports performance gains (>20% on GitBug-Java and SWE-Bench-Live) via direct experimental comparison to baselines. No derivation chain, equations, fitted parameters, or predictions appear; results are measured outcomes on held-out benchmarks rather than quantities defined or fitted inside the paper. Self-citations, if present, are not load-bearing for the central empirical claim.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agents can effectively interpret and act on debugger output without systematic misinterpretation.
invented entities (1)
- Debugging subagent: no independent evidence
Forward citations
Cited by 1 Pith paper
- Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
  A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.