Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
Pith reviewed 2026-05-08 03:16 UTC · model grok-4.3
The pith
A function-level debugging interface lets a basic agent resolve 63.8% of SWE-bench Verified repair tasks at low cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an agent-centric debugging interface built around a Frame Lifetime Trace data structure and a small set of high-level navigational commands supplies exactly the execution information LLM agents need for program repair. This interface replaces the cost-inefficient line-by-line interaction of conventional debuggers, enabling a simple agent to achieve 63.8% resolution on SWE-bench Verified at an average cost of $1.28 per task and delivering additive gains when plugged into existing state-of-the-art agents.
What carries the argument
Agent-centric Debugging Interface (ADI): a function-level interaction paradigm powered by the Frame Lifetime Trace, a data structure that records the stateful execution of each function, together with high-level navigational commands that let the agent navigate and inspect execution without line-by-line requests.
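The excerpt does not publish the trace's schema, so here is a minimal sketch of what a Frame Lifetime Trace could plausibly contain, assuming it records a frame's entry arguments, successive local-variable snapshots, callees, and the return value or exception that ends the frame. All field names are illustrative guesses, not the authors' schema:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class FrameLifetimeTrace:
    """One record per function frame, covering entry to exit.

    Hypothetical sketch: field names are illustrative, not the paper's schema.
    """
    qualified_name: str                                                  # e.g. "pkg.module.func"
    call_args: dict[str, Any] = field(default_factory=dict)              # arguments bound at entry
    local_snapshots: list[dict[str, Any]] = field(default_factory=list)  # locals sampled over the frame's life
    child_calls: list[str] = field(default_factory=list)                 # qualified names of callees
    return_value: Optional[Any] = None                                   # set if the frame returned normally
    exception: Optional[str] = None                                      # set if the frame ended by raising
```

The granularity is the point: one such object summarizes a whole frame, so the agent pays one retrieval per function rather than one debugger round-trip per line.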
If this is right
- Basic agents equipped only with ADI reach 63.8% resolution on SWE-bench Verified and slightly exceed the performance of the optimized Claude-Tools agent.
- The same interface added to existing SOTA agents produces consistent gains between 6.2% and 18.5% on resolved tasks.
- Average per-task cost drops to $1.28 when using Claude-Sonnet-3.7, showing that high-level commands reduce token consumption compared with traditional debuggers.
- ADI acts as a plug-and-play module that improves existing agent architectures without requiring changes to the underlying model or workflow.
Where Pith is reading between the lines
- If the trace abstraction proves sufficient for most bugs, similar high-level interfaces could be designed for other agent-driven tasks such as test generation or vulnerability discovery.
- The approach implies that LLM agents benefit more from curated, function-scoped views of execution state than from exhaustive low-level traces.
- Developers of repair benchmarks could add new tasks that explicitly test whether function-level traces are enough or whether cross-function data flows still require extra instrumentation.
Load-bearing premise
That the Frame Lifetime Trace and high-level navigational commands together contain all information an agent requires to diagnose and repair complex multi-function bugs without needing line-by-line inspection or extra low-level feedback.
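On feasibility, CPython's standard tracing hook already delivers frame-granular events (the paper's references to Python frame objects and pdb point the same way). A minimal sketch using sys.settrace that suppresses per-line events to stay at function granularity; the record format and filtering policy here are our assumptions for illustration, not the paper's implementation:

```python
import sys

records = []  # completed frame records: name, entry arguments, return value

def global_tracer(frame, event, arg):
    # The global trace function fires with a 'call' event on each frame entry.
    if event != "call":
        return None
    rec = {"name": frame.f_code.co_name, "args": dict(frame.f_locals)}
    frame.f_trace_lines = False  # drop 'line' events: function-level granularity only

    def local_tracer(fr, ev, a):
        if ev == "return":        # for 'return' events, a carries the return value
            rec["return_value"] = a
            records.append(rec)
        return local_tracer

    return local_tracer

def demo(x):
    return x * 2

sys.settrace(global_tracer)
demo(21)
sys.settrace(None)
print(records)  # [{'name': 'demo', 'args': {'x': 21}, 'return_value': 42}]
```

Note what such a hook does not capture by default: mutations of module globals, I/O, and other cross-frame side effects, which is exactly the gap this premise has to survive.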
What would settle it
A set of repair tasks in which agents using only ADI repeatedly fail while agents given full line-by-line variable inspection or additional low-level traces succeed.
Original abstract
Autonomous agents for automated program repair represent a promising frontier in software engineering, yet their effectiveness is often hindered by reliance on post-mortem, coarse-grained execution feedback. While integrating traditional interactive debuggers seems a natural solution, their low-level, line-by-line interaction paradigm turns out to be cost-inefficient for LLM-based agents, leading to exhausted budgets and unproductive loops. To mitigate this, we introduce Agent-centric Debugging Interface (ADI), a novel agent-centric debugging interface designed for cost-efficient, end-to-end autonomous interaction. Specifically, Agent-centric Debugging Interface realizes a function-level interaction paradigm, powered by our Frame Lifetime Trace, a comprehensive data structure encapsulating a function's stateful execution trace, and a set of high-level navigational commands. Our extensive evaluation on the SWE-bench benchmark demonstrates the effectiveness and efficiency of ADI. By simply equipping a basic agent with ADI, it successfully resolves 63.8% of the tasks on the SWE-bench Verified set, even slightly outperforming the highly optimized and high-investment Claude-Tools agent, at an average cost of USD 1.28 per task with Claude-Sonnet-3.7. Furthermore, we demonstrate ADI's generality by integrating it as a plug-and-play component into existing SOTA agents, delivering consistent gains ranging from 6.2% to 18.5% on the resolved tasks. These results indicate that Agent-centric Debugging Interface can provide a general and efficient enhancement for existing autonomous agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent-centric Debugging Interface (ADI), a function-level debugging paradigm for LLM-based autonomous agents that uses a Frame Lifetime Trace data structure to encapsulate stateful execution and high-level navigational commands instead of line-by-line interaction. On the SWE-bench Verified benchmark, a basic agent equipped with ADI resolves 63.8% of tasks (slightly outperforming the optimized Claude-Tools baseline) at an average cost of USD 1.28 per task with Claude-Sonnet-3.7; integrating ADI as a plug-and-play module into existing SOTA agents yields consistent gains of 6.2% to 18.5% in resolved tasks.
Significance. If the empirical results hold under rigorous validation, the work offers a meaningful advance in software engineering by making dynamic analysis practical and cost-effective for autonomous program repair agents. The plug-and-play integration results and clear cost/resolution figures on a standard benchmark are particular strengths that could influence agent design more broadly.
Major comments (2)
- §4 (Evaluation): The headline 63.8% resolution rate, outperformance over Claude-Tools, and 6.2–18.5% gains are reported without details on experimental controls, number of independent runs, variance due to LLM non-determinism, or statistical significance tests. This is load-bearing for the central empirical claim.
- §3.1 (Frame Lifetime Trace): No ablation study or failure-case breakdown is provided on whether the trace preserves all information needed for multi-function or side-effect bugs (e.g., untraced globals, I/O, or cross-boundary state). The reported gains rest on the assumption that function-level traces plus navigational commands suffice without low-level feedback; this requires explicit validation (a sketch of one such blind spot follows this list).
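To make this concern concrete, a toy Python case (ours, not the paper's) in which every per-frame record can be individually faithful while the defect stays invisible, because it lives in module-level state mutated as a side effect:

```python
_cache = {}  # module-level state shared by all frames below

def expensive_compute(key):
    return len(key)

def lookup(key):
    # A frame-level trace of lookup() records its argument and return value.
    # Whether the result came from a stale _cache entry is invisible unless
    # the trace also captures reads of module globals.
    if key in _cache:
        return _cache[key]
    value = expensive_compute(key)
    _cache[key] = value
    return value

def invalidate(key):
    # Side effect on state no single frame "owns": a per-frame trace of
    # invalidate() shows only an argument and a None return.
    _cache.pop(key, None)
```

An ablation that categorizes failures by whether the bug flows through such unowned state would directly test the trace's coverage.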
Minor comments (2)
- Abstract: Specify the exact baseline numbers for Claude-Tools and confirm whether the 63.8% figure applies to the full SWE-bench Verified set.
- §3.2: Notation for the high-level navigational commands could be formalized with a small table or grammar to improve reproducibility; an illustrative sketch follows this list.
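As one illustration of the formalization requested above, a small Python table mapping commands to signatures and effects; every command name here is invented for this sketch and is not drawn from the paper:

```python
# Hypothetical navigational command set for a function-level debugger.
# Names and signatures are invented for illustration only.
COMMANDS = {
    "step_into": ("callee_index: int",      "descend into the i-th callee of the current frame"),
    "step_out":  ("",                        "pop back to the caller's frame"),
    "goto":      ("qualname: str",           "jump to any traced frame by qualified function name"),
    "inspect":   ("var: str, snapshot: int", "read a variable from the frame's lifetime trace"),
    "search":    ("pattern: str",            "find frames whose traces mention a matching value"),
}
```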
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for improving the rigor of our empirical evaluation and the analysis of our proposed data structure. We address each of the major comments below, indicating the revisions we plan to make.
Point-by-point responses
- Referee: The headline 63.8% resolution rate, outperformance over Claude-Tools, and 6.2–18.5% gains are reported without details on experimental controls, number of independent runs, variance due to LLM non-determinism, or statistical significance tests. This is load-bearing for the central empirical claim.
  Authors: We agree that the current presentation lacks sufficient detail on experimental controls and statistical rigor to fully substantiate the claims given LLM non-determinism. In the revised manuscript, we will expand the evaluation section to report results from multiple independent runs (using different random seeds), including means and standard deviations for resolution rates and costs. We will also perform and report statistical significance tests (such as t-tests) for the observed gains over baselines, and provide clearer documentation of the experimental setup and controls.
  Revision: yes.
- Referee: No ablation study or failure-case breakdown is provided on whether the trace preserves all information needed for multi-function or side-effect bugs (e.g., untraced globals, I/O, or cross-boundary state). The reported gains rest on the assumption that function-level traces plus navigational commands suffice without low-level feedback; this requires explicit validation.
  Authors: We agree that explicit validation of the Frame Lifetime Trace's coverage across bug types would strengthen the paper. Although the SWE-bench tasks encompass multi-function and side-effect bugs, we did not include a dedicated ablation or failure breakdown. In the revision, we will incorporate a failure-case analysis that categorizes tasks by bug characteristics (e.g., involvement of globals, I/O, or cross-function state) and discusses the information preserved by the trace. We will also analyze cases where low-level feedback might have been beneficial. This addresses the need for explicit validation while building on the existing end-to-end results.
  Revision: partial.
Circularity Check
No circularity; empirical results on external benchmark
Full rationale
The paper introduces ADI with Frame Lifetime Trace and high-level commands, then reports direct empirical outcomes on SWE-bench Verified (63.8% resolution, cost figures, and plug-in gains of 6.2–18.5%). No equations, parameter fits, or derivations appear; the central claims are measured performance numbers against an external benchmark rather than quantities that reduce by construction to the paper's own inputs or self-citations. The evaluation is self-contained and falsifiable via the benchmark.
Axiom & Free-Parameter Ledger
Invented entities (2)
- Agent-centric Debugging Interface (ADI): no independent evidence
- Frame Lifetime Trace: no independent evidence
Reference graph
Works this paper leans on
- [1] Whether using test patch is allowed. GitHub issue #16, swe-bench/experiments, 2024. https://github.com/swe-bench/experiments/issues/16
- [2] Hugging Face. 2024. https://huggingface.co
- [3] OpenAI API. 2024. https://openai.com/api
- [4] astropy-12907 GitHub issue. 2025. https://github.com/astropy/astropy/issues/12906
- [5] Astropy GitHub Repository. 2025. https://github.com/astropy/astropy
- [6] django-11119 task PR. 2025. https://github.com/django/django/pull/11119/
- [7] ADI GitHub Repository. 2025. https://github.com/GhabiX/ADI
- [8]
- [9] Rui Abreu, Peter Zoeteweij, and Arjan J. C. Van Gemund. 2007. On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007). IEEE, 89–98.
- [10] Alibaba Cloud. 2025. Qwen3 API Price. https://www.alibabacloud.com/help/zh/model-studio/models. Accessed: 2025-07-12.
- [11] Anthropic. 2024. Introducing Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet. Accessed: 2025-07-15.
- [12] Anthropic. 2025. Claude 3.7 Sonnet and Claude Code. https://www.anthropic.com/news/claude-3-7-sonnet. Accessed: 2025-07-15.
- [13] Anthropic. 2025. Claude API Documentation. https://docs.anthropic.com/en/home. Accessed: 2025-06-30.
- [14] Anthropic. 2025. Raising the Bar on SWE-bench Verified with Claude 3.5 Sonnet. https://www.anthropic.com/engineering/swe-bench-sonnet. Accessed: 2025-09-11.
- [15] Yasharth Bajpai, Bhavya Chopra, Param Biyani, Cagri Aslan, Dustin Coleman, Sumit Gulwani, Chris Parnin, Arjun Radhakrishna, and Gustavo Soares. 2024. Let's fix this together: Conversational debugging with GitHub Copilot. In 2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). IEEE, 1–12.
- [16] Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2023. Grounded Copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 85–111.
- [17]
- [18]
- [19] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
- [20] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 423–435.
- [21] Konstantin Grotov, Artem Borzilov, Maksim Krivobok, Timofey Bryksin, and Yaroslav Zharov. 2024. Debug smarter, not harder: AI agents for error resolution in computational notebooks. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 363–371.
- [22] Yirui He, Ziyao He, Syed Fatiul Huq, and Sam Malek. 2026. ReFLAIR: Detecting Responsive Layout Reflow Issues using Multimodal Generative AI. Proceedings of the ACM on Software Engineering 3, FSE (2026). doi:10.1145/3808136.
- [23] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79.
- [24] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770 (2023).
- [25] Sungmin Kang, Bei Chen, Shin Yoo, and Jian-Guang Lou. 2025. Explainable automated debugging via large language model-driven scientific debugging. Empirical Software Engineering 30, 2 (2025), 45.
- [26] Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-shot testers: Exploring LLM-based general bug reproduction. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323.
- [27] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- [28] Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. 2023. CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931.
- [29] Kyla H. Levin, Nicolas van Kempen, Emery D. Berger, and Stephen N. Freund. 2025. ChatDBG: Augmenting Debugging with Large Language Models. Proceedings of the ACM on Software Engineering 2, FSE (2025), 1892–1913.
- [30]
- [31] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2023. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Advances in Neural Information Processing Systems 36 (2023), 42330–42357.
- [32]
- [33]
- [34]
- [35] OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2025-07-15.
- [36] OpenAI. 2024. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/. Accessed: 2025-06-23.
- [37] Python Docs. 2025. Python ctypes. https://docs.python.org/3/library/ctypes.html
- [38] Python Docs. 2025. Python Frame Objects. https://docs.python.org/3/c-api/frame.html
- [39] Python Software Foundation. 2023. Python 3 Glossary: qualified name. https://docs.python.org/3/glossary.html#term-qualified-name
- [40] Python Software Foundation. 2025. pdb — The Python Debugger. https://docs.python.org/3/library/pdb.html
- [41] Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2025. SpecRover: Code intent extraction via LLMs. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 963–974.
- [42] Richard Stallman, Roland Pesch, Stan Shebs, et al. 1988. Debugging with GDB. Free Software Foundation 675 (1988).
- [43]
- [44] Hanzhuo Tan, Qi Luo, Ling Jiang, Zizheng Zhan, Jing Li, Haotian Zhang, and Yuqun Zhang. 2024. Prompt-based code completion via multi-retrieval augmented generation. ACM Transactions on Software Engineering and Methodology (2024).
- [45] Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. 2024. LLM4Decompile: Decompiling binary code with large language models. 3473–3487.
- [46] Hanzhuo Tan, Xiaolong Tian, Hanrui Qi, Jiaming Liu, Siyi Wang, Zuchen Gao, Qi Luo, Jing Li, and Yuqun Zhang. [n. d.]. Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- [47] Hanzhuo Tan, Chunpu Xu, Jing Li, Yuqun Zhang, Zeyang Fang, Zeyu Chen, and Baohua Lai. 2024. HICL: Hashtag-driven in-context learning for social media natural language understanding. IEEE Transactions on Neural Networks and Learning Systems 36, 4 (2024), 7037–7050.
- [48]
- [49] The SWE-bench Team. 2024. SWE-bench: A Benchmark for Evaluating Large Language Models on Real World Software Issues. https://www.swebench.com. Accessed: 2025-06-28.
- [50] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).
- [51] Maurice V. Wilkes, David J. Wheeler, and Stanley Gill. 1951. The Preparation of Programs for an Electronic Digital Computer. Addison-Wesley Press, Cambridge, MA, USA.
- [52] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based software engineering agents. arXiv preprint arXiv:2407.01489 (2024).
- [53] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1482–1494.
- [54] Chunqiu Steven Xia and Lingming Zhang. 2024. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis.
- [55] Jiahong Xiang. 2026. FramePilot-Artifacts. doi:10.5281/zenodo.19728388.
- [56] Jiahong Xiang, Wenxiao He, Xihua Wang, Hongliang Tian, and Yuqun Zhang. 2026. Evaluating and Improving Automated Repository-Level Rust Issue Resolution with LLM-based Agents. In 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE). IEEE, 816–830. doi:10.1145/3744916.3773108.
- [57]
- [58] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.
- [59] John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. SWE-smith: Scaling data for software engineering agents. arXiv preprint arXiv:2504.21798 (2025).
- [60] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
- [61]
- [62] Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang. 2022. An extensive study on pre-trained models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 39–51.