ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair
Pith reviewed 2026-05-08 17:48 UTC · model grok-4.3
The pith
A multi-granularity graph with intra-procedural data-flow edges lets LLM agents localize bugs more precisely and generate more successful fixes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARISE augments an LLM-based agent with a multi-granularity program graph that extends structural relationships to statement-level nodes connected by intra-procedural definition-use edges. The graph is exposed through a three-tier tool API that treats data-flow slicing as a first-class queryable primitive, letting the model trace in a single call which statements define or consume any variable of interest. Evaluated on SWE-bench Lite using Qwen2.5-Coder-32B-Instruct, ARISE improves Function Recall@1 by 17 points and Line Recall@1 by 15 points over the unmodified SWE-agent baseline. These localization gains raise repair success to 22 percent Pass@1 (66 out of 300 issues), a 4.7-point increase.
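The statement-level definition-use edges at the heart of this claim can be illustrated with a minimal sketch. The toy builder below handles only straight-line Python functions and is not ARISE's graph builder, which would need full reaching-definitions analysis over a control-flow graph; it only shows what a def-use edge is.

```python
import ast

def defuse_edges(func_src: str):
    """Collect (def_line, use_line, var) triples for one function.

    Straight-line approximation: each variable read is linked to the line
    of its most recent assignment in source order. A production builder
    would compute reaching definitions over the control-flow graph.
    """
    func = ast.parse(func_src).body[0]
    last_def = {}                              # var -> line of latest definition
    for arg in func.args.args:                 # parameters defined at the def line
        last_def[arg.arg] = func.lineno
    edges = []
    for stmt in func.body:                     # top-level statements only
        for node in ast.walk(stmt):            # first record the reads...
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                if node.id in last_def:
                    edges.append((last_def[node.id], node.lineno, node.id))
        for node in ast.walk(stmt):            # ...then the writes
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
                last_def[node.id] = node.lineno
    return edges

src = """def f(n):
    total = 0
    total = total + n
    return total
"""
print(defuse_edges(src))   # [(2, 3, 'total'), (1, 3, 'n'), (3, 4, 'total')]
```

A slice for `total` is then just the set of lines connected to it by these edges, which is the kind of answer the paper's API returns in a single call.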
What carries the argument
Multi-granularity program graph with intra-procedural definition-use edges, exposed as a queryable primitive through a three-tier tool API for data-flow slicing.
If this is right
- Localization gains from the graph translate directly into higher rates of valid patches.
- Large code models can use the structured slice output without an extra natural-language summarization step.
- The graph builder and slicing API function as a drop-in addition for other agent frameworks.
- Controlled ablations show the performance lift comes from the data-flow edges rather than the tool interface alone.
Where Pith is reading between the lines
- The same slicing primitive could be tested on tasks beyond repair, such as vulnerability detection or test generation.
- Porting the graph builder to additional languages would allow direct comparison of data-flow benefits across codebases.
- Combining the intra-procedural slices with inter-procedural call edges might further improve localization on bugs that cross function boundaries.
- If the graph construction misses flows in unusually complex procedures, the observed gains would shrink on those specific cases.
Load-bearing premise
The automatically constructed graph accurately records the true definition-use relationships inside each procedure without adding false links or omitting critical flows.
What would settle it
Disable the data-flow edges and slicing tools while keeping the rest of the agent and tool schema unchanged, then re-run on the same 300 SWE-bench Lite issues. If recall and Pass@1 fall back to baseline levels, the contribution of the graph is confirmed; unchanged scores would falsify it.
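Such a re-run would be scored with the benchmark's standard metrics. The sketch below shows how Recall@1 and Pass@1 would be computed; only the 66/300 figure comes from the paper, and the ablated-run count is purely illustrative.

```python
def recall_at_1(top1_preds, gold_sets):
    """Fraction of issues whose top-ranked location is in the gold set."""
    hits = sum(1 for p, g in zip(top1_preds, gold_sets) if p in g)
    return hits / len(gold_sets)

def pass_at_1(resolved):
    """Fraction of issues whose first generated patch resolves the issue."""
    return sum(resolved) / len(resolved)

# Localization: two toy issues, one localized correctly at rank 1.
preds = ["utils.parse_config", "core.run"]
gold = [{"utils.parse_config"}, {"core.main"}]
print(recall_at_1(preds, gold))            # 0.5

# Repair: the full system's 66/300 matches the paper's reported Pass@1;
# the ablated run's 52/300 is an invented placeholder.
full_run = [True] * 66 + [False] * 234
ablated_run = [True] * 52 + [False] * 248
print(pass_at_1(full_run), round(pass_at_1(ablated_run), 3))
```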
Original abstract
Repository-level fault localization (FL) and automated program repair (APR) require an agent to identify the relevant code units across files, follow call and data dependencies, and generate a valid patch. Existing graph-based systems provide structural representations of repositories (files, classes, functions and their relationships) but do not model how variable values flow within procedures, leaving agents without the semantic precision needed for function- and line-level localization. We present ARISE (Agentic Repository-level Issue Solving Engine), which augments an LLM-based agent with a multi-granularity program graph that extends structural relationships down to statement-level nodes connected by intra-procedural definition-use edges. ARISE exposes this graph through a three-tier tool API, which brings data-flow slicing as a first-class, queryable agent primitive that allows the model to trace, in a single call, which statements define or consume a variable of interest. We evaluate on SWE-bench Lite (300 real GitHub issues, 11 Python repositories) using Qwen2.5-Coder-32B-Instruct as the backbone. Compared to the unmodified SWE-agent baseline, ARISE improves Function Recall@1 by 17.0 points and Line Recall@1 by 15.0 points. These localization gains translate directly into repair success, with ARISE achieving 22.0% Pass@1 (66/300), a 4.7 percentage-point improvement over SWE-agent. Controlled ablations confirm that the improvement is driven by the data-flow graph rather than the tool schema, and that large code models consume structured slice output directly without requiring a natural-language summarization layer. The graph builder and slicing API are designed as a framework-agnostic, drop-in toolset for future APR research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents ARISE, a system that augments LLM-based agents for repository-level fault localization and automated program repair with a multi-granularity program graph. This graph extends structural relationships (files, classes, functions) with statement-level nodes connected by intra-procedural definition-use edges. The graph is exposed via a three-tier tool API that makes data-flow slicing a first-class primitive, allowing agents to trace definitions and uses in a single query. Evaluated on SWE-bench Lite (300 issues across 11 Python repositories) with Qwen2.5-Coder-32B-Instruct, ARISE reports a 17-point gain in Function Recall@1 and a 15-point gain in Line Recall@1 over the unmodified SWE-agent baseline; these localization improvements yield a 4.7-point increase in Pass@1 repair success (22.0%, 66/300). Controlled ablations attribute the gains to the data-flow component rather than the tool schema, and show that the LLM backbone consumes structured slice outputs directly without a natural-language summarization layer. The graph builder and slicing API are presented as a framework-agnostic drop-in toolset.
Significance. If the reported gains and ablation results hold under scrutiny, the work makes a practical contribution to agentic APR by demonstrating that explicit intra-procedural data-flow modeling can measurably improve both localization precision and end-to-end repair rates on a standard benchmark. The release of the graph-construction and slicing infrastructure as a reusable toolset is a concrete strength that lowers the barrier for follow-on research. The observation that large code models can directly interpret structured slice output is also useful for system design.
major comments (1)
- [Graph construction and slicing sections (around §3–4)] The central claim that the automatically constructed multi-granularity graph 'accurately captures all relevant intra-procedural definition-use relationships without introducing false dependencies or missing critical flows' (weakest assumption) is load-bearing for the attribution of the 17-point Recall@1 and 4.7-point Pass@1 gains to the data-flow component. The manuscript should provide either (a) a manual audit of a random sample of generated slices against ground-truth def-use on a subset of the benchmark or (b) quantitative metrics on false-positive/negative edges, as the current ablation evidence alone does not rule out that the observed improvement stems from incidental properties of the slice representation rather than semantic fidelity.
minor comments (3)
- [Tool API description] A concrete example (with code snippet, graph fragment, and sample slice output) would help readers understand how the three-tier API is invoked and how the LLM consumes the structured result.
- [Abstract and §6] The abstract states the system is 'framework-agnostic,' yet all experiments are restricted to Python repositories; a brief discussion of the engineering effort required to port the graph builder to another language would clarify the scope of this claim.
- [Evaluation figures/tables] Table or figure captions should explicitly state the exact number of issues (300) and the backbone model (Qwen2.5-Coder-32B-Instruct) so that results can be interpreted without cross-referencing the text.
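To make the first minor comment concrete, here is the kind of worked example the review asks for: a mock slicing query and the structured result an agent would consume. All tool names, file paths, and output fields are hypothetical, since the paper's actual three-tier schema is not reproduced in this review.

```python
# Hypothetical illustration only: the names `slice_variable`, the path
# "auth/session.py", and the defs/uses fields are invented for this sketch.
import json

def slice_variable(graph, path, function, var):
    """Mock of the slicing primitive: one call returns the statements
    that define or consume `var` inside `function`."""
    return graph.get((path, function, var), {"defs": [], "uses": []})

# Toy graph fragment with precomputed intra-procedural def-use data.
graph = {
    ("auth/session.py", "refresh_token", "expiry"): {
        "defs": [{"line": 41, "code": "expiry = now + ttl"}],
        "uses": [{"line": 47, "code": "if expiry < now:"}],
    }
}

result = slice_variable(graph, "auth/session.py", "refresh_token", "expiry")
print(json.dumps(result, indent=2))   # structured output the agent reads directly
```

The point of the example is that the model receives machine-structured defs/uses rather than a prose summary, matching the paper's claim that no natural-language summarization layer is needed.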
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. We address the single major comment below.
Point-by-point responses
-
Referee: The central claim that the automatically constructed multi-granularity graph 'accurately captures all relevant intra-procedural definition-use relationships without introducing false dependencies or missing critical flows' (weakest assumption) is load-bearing for the attribution of the 17-point Recall@1 and 4.7-point Pass@1 gains to the data-flow component. The manuscript should provide either (a) a manual audit of a random sample of generated slices against ground-truth def-use on a subset of the benchmark or (b) quantitative metrics on false-positive/negative edges, as the current ablation evidence alone does not rule out that the observed improvement stems from incidental properties of the slice representation rather than semantic fidelity.
Authors: We appreciate the referee's identification of this load-bearing assumption. The ablations in §5.3 hold the tool schema and structural graph fixed while removing only the intra-procedural def-use edges, producing consistent drops in both localization and repair metrics; this design makes it unlikely that gains arise solely from incidental formatting of the slice output. Nevertheless, we agree that direct quantification of edge-level fidelity would strengthen attribution. A full ground-truth audit across the 300 issues is not feasible within the revision window, as it would require exhaustive manual annotation of def-use relations. In the revised manuscript we have therefore (i) added a dedicated paragraph in §4.2 describing the static-analysis rules used for edge construction (reaching definitions via AST-based use-def chains) and (ii) inserted a Limitations subsection (§6) that explicitly states the assumption, notes that the analysis follows standard sound techniques for Python, and provides one fully worked example of a generated slice with manual verification. We view this as a partial but substantive response to the comment.
revision: partial
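The reaching-definitions analysis the rebuttal cites is a standard dataflow technique. A minimal fixpoint sketch over a hand-built control-flow graph follows; the node ids, definition labels, and two-branch example are illustrative, not taken from ARISE.

```python
def reaching_definitions(succ, gen, kill):
    """Iterate the standard dataflow equations to a fixpoint.

    succ maps node -> successor set; gen/kill map node -> definition-id sets.
    Returns IN[n]: the definitions that reach the entry of each node.
    """
    nodes = list(succ)
    IN = {n: set() for n in nodes}
    OUT = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            new_in = set()
            for p in nodes:
                if n in succ[p]:               # p is a predecessor of n
                    new_in |= OUT[p]
            new_out = gen[n] | (new_in - kill[n])
            if new_in != IN[n] or new_out != OUT[n]:
                IN[n], OUT[n] = new_in, new_out
                changed = True
    return IN

# x is assigned on both branches (nodes 1 and 2) and read at the join (node 3),
# so both definitions must reach node 3 for the def-use edges to be complete.
succ = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
gen = {0: set(), 1: {"x@1"}, 2: {"x@2"}, 3: set()}
kill = {0: set(), 1: {"x@2"}, 2: {"x@1"}, 3: set()}
print(sorted(reaching_definitions(succ, gen, kill)[3]))   # ['x@1', 'x@2']
```

Branch joins like this are exactly where a last-definition shortcut would drop an edge, which is why the referee's request for edge-level fidelity metrics is pertinent.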
Circularity Check
No significant circularity identified
full rationale
The paper describes an empirical system (ARISE) that augments an LLM agent with a multi-granularity repository graph including intra-procedural def-use edges, exposed via a three-tier tool API. Central claims consist of measured gains in Function/Line Recall@1 and Pass@1 on the external SWE-bench Lite benchmark (300 issues), plus ablations attributing gains to the data-flow component rather than tool schema. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text; the argument rests on direct experimental comparison against a public baseline (SWE-agent) with controlled conditions. This is self-contained empirical evidence with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The automatically extracted intra-procedural definition-use edges are sufficiently accurate and complete for the downstream agent to improve localization and repair.
invented entities (1)
- ARISE multi-granularity program graph with statement-level def-use edges (no independent evidence)
Reference graph
Works this paper leans on
- [1] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In ICLR. https://miltos.allamanis.com/publicationfiles/allamanis2018learning/allamanis2018learning.pdf
- [2] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning Distributed Representations of Code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 40:1–40:29. doi:10.1145/3290353
- [3] Amazon Web Services. 2024. Reimagining Software Development with the Amazon Q Developer Agent. AWS Machine Learning Blog. https://aws.amazon.com/blogs/machine-learning/reimagining-software-development-with-the-amazon-q-developer-agent/
- [4] Ramakrishna Bairi et al. 2024. CodePlan: Repository-level Coding using LLMs and Planning. ACM Transactions on Software Engineering and Methodology (TOSEM) (2024). doi:10.1145/3643757
- [5] Sebastian Baltes, Oliver Moseler, Fabian Beck, and Stephan Diehl. 2017. Navigate, Understand, Communicate: How Developers Locate Performance Bugs. In ICPC. 260–270. doi:10.1109/ICPC.2017.21
- [6] Andreas Bexell, Emma Söderberg, Christofer Rydenfält, and Sigrid Eldh. 2024. How Do Developers Approach Their First Bug in an Unfamiliar Code Base? An Exploratory Study of Large Program Comprehension. In PPIG. https://ppig.org/files/2024-PPIG-35th-bexell.pdf
- [7] Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, and Xingyao Wang. 2025. LocAgent: Graph-Guided LLM Agents for Code Localization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Aus...
- [8] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The Program Dependence Graph and Its Use in Optimization. ACM Transactions on Programming Languages and Systems 9, 3 (1987), 319–349. doi:10.1145/24039.24041
- [9] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, et al. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In ICLR. https://openreview.net/forum?id=jLoC4ez43PZ
- [10] Susan Horwitz, Thomas Reps, and David Binkley. 1990. Interprocedural Slicing Using Dependence Graphs. ACM Transactions on Programming Languages and Systems (TOPLAS) 12, 1 (1990), 26–60.
- [11] Soneya Binta Hossain, Nan Jiang, Qiang Zhou, Xiaopeng Li, Wen-Hao Chiang, Yingjun Lyu, Hoan Nguyen, and Omer Tripp. 2024. A Deep Dive into Large Language Models for Automated Bug Localization and Repair. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1471–1493.
- [12] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186 (2024).
- [13] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770 (2023).
- [14] James A. Jones and Mary Jean Harrold. 2005. Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering. 273–282.
- [15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP ’23). ACM. doi:10.1145/3600006.3613165
- [16] Jia Li et al. 2025. LONGCODEU: Benchmarking Long-Context Language Models on Long Code Understanding. In ACL. https://aclanthology.org/2025.acl-long.1324.pdf
- [17] Chunyan Liu, Yan Lei, Huan Xie, Jinping Wang, Yue Yu, and David Lo. 2026. Survey on Learning-Based Dynamic Fault Localization: From Traditional Machine Learning to Large Language Models. Comput. Surveys 58, 9 (2026), 1–39.
- [18] Jia Liu et al. 2024. RepoQA: Evaluating Long Context Code Understanding. In ICLR (Workshop/Poster). https://openreview.net/pdf?id=hK9YSrFuGf
- [19] Tianyang Liu, Canwen Xu, and Julian McAuley. 2024. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. In ICLR. https://proceedings.iclr.cc/paper_files/paper/2024/file/d191ba4c8923ed8fd8935b7c98658b5f-Paper-Conference.pdf
- [20] Wenjun Liu, Yihui Sun, Jiefeng Wei, Yiheng Li, Yiran Chen, Hai Zhao, Shuai Wang, Shizhe Fu, Ge Sun, and Kai Zhang. 2024. GraphCoder: Enhancing Repository-Level Code Completion via Coarse-to-Fine Retrieval Based on Code Context Graph. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE/ACM, 570–582. doi...
- [21] Xiangyan Liu, Bo Lan, Zhiyuan Hu, Yang Liu, Zhicheng Zhang, Fei Wang, Michael Qizhe Shieh, and Wenmeng Zhou. 2025. CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 142–160.
- [23] Yingwei Ma, Qingping Yang, Rongyu Cao, Binhua Li, Fei Huang, and Yongbin Li. 2025. Alibaba LingmaAgent: Improving Automated Issue Resolution via Comprehensive Repository Exploration. In Companion Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (FSE Companion ’25). ACM, New York, NY, USA. doi:10.1145/3696630.3728549
- [24] Fangwen Mu, Junjie Wang, Lin Shi, Song Wang, Shoubin Li, and Qing Wang. 2025. EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair. arXiv preprint arXiv:2506.10484 (2025).
- [25] Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. 2025. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. In Proceedings of the International Conference on Learning Representations (ICLR). https://proceedings.iclr.cc/paper_files/paper/2025/file/4a4a3c197deac0424...
- [26] Mike Papadakis and Yves Le Traon. 2015. Metallaxis-FL: Mutation-Based Fault Localization. Software Testing, Verification and Reliability 25, 5–7 (2015), 605–628.
- [27] Shani Pearce, Abhay Singh, Luke Hales, Emma Finlayson, and Brett A. Becker. 2024. Needles in a Haystack: Student Struggles with Working on Large Code Bases. In SIGCSE. doi:10.1145/3702652.3744218
- [28] Rachel Potvin and Josh Levenberg. 2016. Why Google Stores Billions of Lines of Code in a Single Repository. Commun. ACM 59, 7 (2016), 78–87. doi:10.1145/2854146
- [29]
- [30] Samuel Rando et al. 2025. Evaluating Coding LLMs at 1M Context Windows: LongCodeBench. OpenReview preprint. https://openreview.net/pdf?id=GFPoM8Ylp8
- [31] Melika Sepidband, Hamed Taherkhani, Hung Viet Pham, and Hadi Hemmati. 2026. RGFL: Reasoning Guided Fault Localization for Automated Program Repair Using Large Language Models. arXiv e-prints (2026), arXiv–2601.
- [32] Akihiro Takahashi, Yoshiki Higo, and Shinji Kusumoto. 2021. An Extensive Study on Smell-Aware Bug Localization. Journal of Systems and Software 177 (2021), 110957. doi:10.1016/j.jss.2021.110957
- [33] Tianyi Tang, Tianyi Xu, Sumon Karmakar, and Toby Jia-Jun Li. 2023. An Empirical Study of Developer Behaviors for Validating and Repairing AI-Generated Code. In PLATEAU@SPLASH. https://toby.li/files/plateau23-tang-copilot.pdf
- [34] Frank Tip. 1995. A Survey of Program Slicing Techniques. Journal of Programming Languages 3, 3 (1995), 121–189.
- [35] Mark Weiser. 1981. Program Slicing. IEEE Transactions on Software Engineering 4 (1981), 352–357.
- [36] W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, Franz Wotawa, and Dongcheng Li. 2023. Software Fault Localization: An Overview of Research, Techniques, and Tools. Handbook of Software Fault Localization: Foundations and Advances (2023), 1–117.
- [37] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. Demystifying LLM-Based Software Engineering Agents. Proceedings of the ACM on Software Engineering 2, FSE (2025), 801–824. doi:10.1145/3715754
- [38] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and Discovering Vulnerabilities with Code Property Graphs. In IEEE Symposium on Security and Privacy. 590–604. doi:10.1109/SP.2014.44
- [39] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
- [40] Boyang Yang, Jiadong Ren, Shunfu Jin, Yang Liu, Feng Liu, Bach Le, and Haoye Tian. 2025. Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs. arXiv preprint arXiv:2503.21710 (2025).
- [41] John Yang, Carlos E. Jiménez, Alexander Wettig, Kilian Lieret, Shunyu Yao, et al. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems (NeurIPS). https://papers.nips.cc/paper_files/paper/2024/file/5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf
- [42] Seohyun Youm, Hojun Yeon, Eunjong Kim, Eunjong Lee, Eunjong Park, et al. 2018. Bench4BL: Reproducibility Study on the Performance of IR-based Bug Localization. In Proceedings of ISSTA. doi:10.1145/3213846.3213856
- [43] Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, and Jishen Zhao. 2025. OrcaLoca: An LLM Agent Framework for Software Issue Localization. In Proceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, 73416–73436. https://proceedings.mlr.press/v267/yu25x.html
- [44] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA ’24). ACM, New York, NY, USA, 1592–1604. doi:10.1145/3650212.3680384
- [45] Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In NeurIPS. 10197–10207. https://papers.neurips.cc/paper/2019/file/49265d2447bc3bbfe9e76306ce40a31f-Paper.pdf