arxiv: 2602.18571 · v2 · submitted 2026-02-20 · 💻 cs.SE · cs.AI

Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?

Spandan Garg, Yufan Huang

Pith reviewed 2026-05-15 20:14 UTC · model grok-4.3

keywords coding agents · interactive debugging · bug fixing · software engineering agents · subagent architecture · runtime information · Java · Python

The pith

Adding interactive debuggers to coding agents improves bug-fixing performance by more than 20 percent over the baseline for certain models on GitBug-Java and SWE-Bench-Live.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that coding agents currently miss rich runtime information that human developers use during debugging, relying instead on static code review or blind test-fix loops. It presents Debug2Fix, a framework that adds interactive debugging through a dedicated subagent for Java and Python programs. On GitBug-Java and SWE-Bench-Live, this produces performance gains exceeding 20 percent for several models. The same framework allows weaker models such as GPT-5 and Claude Haiku 4.5 to reach or surpass stronger models like Claude Sonnet 4.5. Ablation studies isolate the contributions of the subagent structure and the debugger component itself.

Core claim

Debug2Fix equips software engineering agents with interactive debuggers via a subagent architecture so that the main agent can inspect runtime states, variables, and execution traces while fixing bugs. The approach is implemented for Java and Python and evaluated on GitBug-Java and SWE-Bench-Live, where, for certain models, it improves performance by more than 20 percent over baselines that use only static analysis or iterative test execution. The results also show that weaker models equipped with Debug2Fix match or exceed the bug-fixing performance of stronger models, suggesting that debugger access can matter as much as switching to a larger, more expensive model.

What carries the argument

A subagent architecture that runs an interactive debugger session, exposing runtime inspection commands and output directly to the main coding agent.
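As a rough illustration of the mechanism, the sketch below scripts a `pdb` session using only Python's standard library and returns the transcript as text, which is the kind of runtime evidence a debug subagent could hand back to the main agent. It is not the paper's implementation: `buggy_mean`, `run_debug_session`, and the fixed command list are hypothetical names chosen for this example, and a real subagent would presumably pick debugger commands with a model rather than from a hard-coded list.

```python
import io
import pdb


def buggy_mean(values):
    """Toy function with an off-by-one bug in the divisor."""
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)  # bug: should divide by len(values)


def run_debug_session(func, args, commands):
    """Run func under pdb, feeding it scripted debugger commands.

    Returns (result, transcript). The transcript is everything pdb wrote,
    i.e. the text a subagent would summarise for the main agent.
    """
    stdin = io.StringIO("\n".join(commands) + "\n")
    stdout = io.StringIO()
    # readrc=False ignores any local .pdbrc; nosigint=True skips the SIGINT handler.
    debugger = pdb.Pdb(stdin=stdin, stdout=stdout, nosigint=True, readrc=False)
    result = debugger.runcall(func, *args)
    return result, stdout.getvalue()


if __name__ == "__main__":
    result, transcript = run_debug_session(
        buggy_mean,
        ([2, 4, 6],),
        ["p values", "p len(values) - 1", "continue"],
    )
    print(transcript)
    print("returned:", result)
```

The session stops at the first line of `buggy_mean`, prints the argument and the suspicious divisor, then resumes. Keeping the debugger behind a function that returns plain text mirrors the separation the paper describes: the main agent sees only the findings, not the interactive session itself.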

If this is right

Bug-fixing success rates rise when agents receive live runtime data instead of relying on static code or test outcomes alone.
Weaker models become competitive with stronger models once they share the same debugging tool set.
Both the subagent separation and the debugger integration are necessary; removing either reduces the observed gains.
The framework applies to existing Java and Python codebases without changing the underlying language runtimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent designers may achieve larger gains by expanding available tools than by increasing model size alone.
Debugging agents could eventually handle bugs whose symptoms appear only at runtime and are invisible to static or test-based methods.
Integration of debuggers into agents could reduce the manual debugging load for human developers on routine defects.

Load-bearing premise

The runtime information returned by the debugger can be correctly interpreted and used by the agent without creating new errors or hallucinations that erase the gains.

What would settle it

Re-running the GitBug-Java and SWE-Bench-Live evaluations after adding debugger access and finding that the agent produces more incorrect patches than the baseline because it misreads stack traces or variable values.
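One way to operationalise this check is to label each benchmark run with an outcome and compare the share of incorrect patches between the two arms. The sketch below assumes a hypothetical three-way labelling (`"fixed"`, `"wrong_patch"`, `"no_patch"`); the labels in the example are made up for illustration and are not results from the paper.

```python
from collections import Counter

OUTCOMES = ("fixed", "wrong_patch", "no_patch")  # hypothetical label set


def outcome_rates(labels):
    """Return the fraction of runs that ended in each outcome."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in OUTCOMES}


def debugger_hurts(baseline_labels, debugger_labels):
    """True if the debugger arm emits a larger share of wrong patches."""
    base = outcome_rates(baseline_labels)
    dbg = outcome_rates(debugger_labels)
    return dbg["wrong_patch"] > base["wrong_patch"]


# Illustrative labels only (hypothetical, not from the paper).
baseline = ["fixed", "no_patch", "wrong_patch", "no_patch"]
with_debugger = ["fixed", "fixed", "wrong_patch", "no_patch"]
print(outcome_rates(baseline))
print(outcome_rates(with_debugger))
print(debugger_hurts(baseline, with_debugger))
```

Separating `wrong_patch` from `no_patch` matters here: a rise in fixes can coexist with a rise in confidently wrong patches, and only the latter would support the failure scenario described above.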

Figures

Figures reproduced from arXiv: 2602.18571 by Spandan Garg, Yufan Huang.

**Figure 2.** A bug from a popular open-source Python repository on GitHub. We see very different trajectories taken by ...

**Figure 3.** System prompt for the Debug Subagent. The ...

**Figure 4.** Instructions added (green) to the main agent ...

**Figure 5.** Plot showing an aggregated view of all the tra...

Original abstract

While significant progress has been made in automating various aspects of software development through coding agents, there is still significant room for improvement in their bug fixing capabilities. Debugging and investigation of runtime behavior remains largely a manual, developer-driven process. Popular coding agents typically rely on either static analysis of the code or iterative test-fix cycles, which is akin to trial and error debugging. We posit that there is a wealth of rich runtime information that developers routinely access while debugging code, which agents are currently deprived of due to design limitations. Despite how prevalent debuggers are in modern IDEs and command-line tools, they have surprisingly not made their way into coding agents. In this work, we introduce Debug2Fix, a novel framework that incorporates interactive debugging as a core component of a software engineering agent via a subagent architecture. We incorporate debuggers for Java and Python into our agent framework and evaluate against GitBug-Java and SWE-Bench-Live and achieve >20% improvement in performance compared to the baseline for certain models. Furthermore, using our framework, we're able to make weaker models like GPT-5 and Claude Haiku 4.5 match or exceed the performances of stronger models like Claude Sonnet 4.5, showing that better tool design is often just as important as switching to a more expensive model. Finally, we conduct systematic ablations demonstrating the importance of both the subagent architecture and debugger integration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Debug2Fix, a framework that integrates interactive debugging into coding agents via a subagent architecture, incorporating debuggers for Java and Python. It evaluates the approach on GitBug-Java and SWE-Bench-Live, reporting >20% performance gains over baselines for certain models, and claims that the framework enables weaker models (e.g., GPT-5, Claude Haiku 4.5) to match or exceed stronger models (e.g., Claude Sonnet 4.5). Systematic ablations are presented to demonstrate the contributions of the subagent architecture and debugger integration.

Significance. If the empirical results hold under full verification, the work provides evidence that runtime debugging tools can meaningfully improve bug-fixing capabilities in coding agents, with the potential to reduce dependence on larger models by emphasizing tool design. The subagent-based integration and benchmark results offer a practical demonstration of operationalizing interactive debugging in agents, which could influence future agent architectures in software engineering.

major comments (2)

[Experimental Evaluation and Ablation Studies] The headline claims of >20% gains and weaker models matching stronger ones on GitBug-Java and SWE-Bench-Live rest on the unverified assumption that the debugging subagent reliably parses and acts on runtime outputs (stack traces, variable states) without net error increase from hallucinations or misinterpretations. The ablations show value for the debugger component but do not isolate or measure the rate at which debugger use introduces incorrect patches relative to baseline trial-and-error loops.
[Results and Evaluation] The abstract and results sections report performance improvements without accompanying error bars, exact baseline configurations, or full experimental protocols (e.g., number of runs, temperature settings, or failure mode analysis), which prevents independent verification of the central performance claims.
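The error bars requested in the second major comment could be obtained in several ways; one common option is a paired bootstrap over per-bug outcomes. The sketch below is an assumption about how such an interval might be computed, not the authors' protocol: `paired_bootstrap_ci` is a hypothetical helper, and it expects two aligned lists of 0/1 outcomes (1 meaning the bug was resolved) for the baseline and the debugger-equipped agent.

```python
import random


def paired_bootstrap_ci(baseline, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in resolve rate.

    baseline and treatment are equal-length lists of 0/1 outcomes,
    aligned so that index i refers to the same bug in both arms.
    Returns (point_estimate, lower, upper).
    """
    if len(baseline) != len(treatment) or not baseline:
        raise ValueError("need two non-empty, equal-length outcome lists")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        # Resample bugs with replacement, keeping each bug's two outcomes paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(treatment[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = (sum(treatment) - sum(baseline)) / n
    return point, lower, upper


if __name__ == "__main__":
    # Illustrative outcomes only (hypothetical, not from the paper).
    base = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
    dbg = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]
    print(paired_bootstrap_ci(base, dbg))
```

Resampling bugs rather than individual runs keeps the comparison paired, so variation in bug difficulty does not inflate the interval. It does not capture run-to-run sampling noise from the model, which would need repeated runs per bug.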

minor comments (2)

[Abstract] The abstract refers to gains 'for certain models' without naming them; specifying the models and exact percentages per benchmark would improve precision.
[Framework Description] Notation for the subagent architecture and its interaction with the main agent could be clarified with a diagram or pseudocode to aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our paper. We address the major comments below and will revise the manuscript accordingly to improve clarity and verifiability of our results.

Referee: [Experimental Evaluation and Ablation Studies] The headline claims of >20% gains and weaker models matching stronger ones on GitBug-Java and SWE-Bench-Live rest on the unverified assumption that the debugging subagent reliably parses and acts on runtime outputs (stack traces, variable states) without net error increase from hallucinations or misinterpretations. The ablations show value for the debugger component but do not isolate or measure the rate at which debugger use introduces incorrect patches relative to baseline trial-and-error loops.

Authors: We thank the referee for highlighting this important point. Our ablations compare the full Debug2Fix framework against baselines without the debugger, showing consistent gains that suggest the debugging subagent provides net benefit. However, we agree that we have not explicitly measured the rate of errors introduced by the subagent's parsing of runtime outputs. In the revised version, we will add a detailed failure mode analysis, including a breakdown of cases where debugger interactions led to incorrect patches, and compare the error rates to the baseline. This will be supported by manual review of a subset of the results.
Revision: yes

Referee: [Results and Evaluation] The abstract and results sections report performance improvements without accompanying error bars, exact baseline configurations, or full experimental protocols (e.g., number of runs, temperature settings, or failure mode analysis), which prevents independent verification of the central performance claims.

Authors: We agree that additional details are necessary for reproducibility. In the revised manuscript, we will include error bars from repeated experiments, specify exact baseline configurations including model parameters and temperatures, detail the number of runs and random seeds used, and provide a comprehensive experimental protocol along with failure mode analysis in the appendix.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

The paper introduces Debug2Fix as an engineering framework and reports performance gains (>20% on GitBug-Java and SWE-Bench-Live) via direct experimental comparison to baselines. No derivation chain, equations, fitted parameters, or predictions appear; results are measured outcomes on held-out benchmarks rather than quantities defined or fitted inside the paper. Self-citations, if present, are not load-bearing for the central empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work is empirical and relies on standard assumptions about agent tool-use and benchmark validity rather than new axioms or fitted parameters.

axioms (1)

Domain assumption: Agents can effectively interpret and act on debugger output without systematic misinterpretation.
Implicit in the claim that debugger access improves fix rates.

invented entities (1)

Debugging subagent (no independent evidence)
Purpose: Separate component that handles interactive debugger sessions within the main coding agent.
New architectural element introduced to incorporate runtime debugging.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
cs.AI 2026-05 unverdicted novelty 4.0

A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.
