pith. machine review for the scientific record.

arxiv: 2604.16021 · v2 · submitted 2026-04-17 · 💻 cs.SE · cs.AI

Recognition: unknown

Neurosymbolic Repo-level Code Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:23 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords localization · logicloc · performance · reasoning · benchmarks · code · ka-logicquery · structural

The pith

LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI systems for finding relevant code files or functions in large projects often succeed by spotting obvious names or paths rather than understanding how the code actually connects. The authors created a new test set called KA-LogicQuery that strips away those name hints so only genuine structural reasoning works. Their LogicLoc system asks an LLM to write logical rules in Datalog, checks the rules with a parser, and runs them on a fast engine to locate the right code precisely and with less computation than repeated LLM calls.
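
A minimal sketch of such a closed loop, in Python. Everything here is hypothetical scaffolding inferred from the summary above: `llm` and `engine` are stand-in callables, and the regex gate is a crude proxy for the real Datalog parser the paper describes.

```python
import re

# One clause per line: a fact `rel(args).` or a rule `head :- body.`
DATALOG_CLAUSE = re.compile(r'^\w+\([\w\s,"]*\)\s*(:-\s*[\w\s,()".!]+)?\.$')

def parser_gate(program: str) -> list[str]:
    """Collect every line that is not a syntactically plausible Datalog clause."""
    errors = []
    for line in filter(None, (ln.strip() for ln in program.splitlines())):
        if not DATALOG_CLAUSE.match(line):
            errors.append(f"syntax error: {line!r}")
    return errors

def localize(query: str, llm, engine, max_rounds: int = 3) -> list[str]:
    """Synthesize -> validate -> execute, feeding parse errors back to the LLM."""
    feedback = ""
    for _ in range(max_rounds):
        program = llm(f"Write Datalog rules that answer: {query}\n{feedback}")
        errors = parser_gate(program)
        if errors:  # parser-gated validation: never run an invalid program
            feedback = "Fix these errors:\n" + "\n".join(errors)
            continue
        return engine(program)  # deterministic engine does the structural traversal
    return []  # abstain rather than guess when no valid program emerges
```

The point of the structure is that the expensive, fallible component (the LLM) only ever writes and repairs a small program, while the traversal of the repository graph happens once, deterministically, in the engine.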

Core claim

LogicLoc significantly outperforms SOTA methods on KA-LogicQuery while maintaining competitive performance on popular issue-driven benchmarks, with markedly lower token consumption and faster execution.

Load-bearing premise

That LLM-synthesized Datalog programs correctly capture the necessary structural facts of the codebase, and that parser-gated validation plus mutation feedback reliably prevents incorrect or inefficient rules from affecting the final localization result.
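
One hedged reading of the mutation step (the authors' actual diagnostic may differ): treat each intermediate rule as a head plus a list of body atoms, drop one atom at a time, and re-run the rule; atoms whose removal unlocks many extra results are the likely over-constraints, and the ranked report becomes feedback text for the LLM.

```python
def diagnose(head: str, body_atoms: list[str], run) -> list[tuple[str, int]]:
    """Rank body atoms by how many extra facts appear when each is dropped.

    `run(head, atoms)` is assumed to execute the rule on the deterministic
    Datalog engine and return the set of derived facts.
    """
    baseline = run(head, body_atoms)
    report = []
    for i, atom in enumerate(body_atoms):
        mutant = body_atoms[:i] + body_atoms[i + 1:]  # rule with one atom removed
        gained = len(run(head, mutant)) - len(baseline)
        if gained > 0:  # this atom was filtering results away
            report.append((atom, gained))
    return sorted(report, key=lambda r: -r[1])
```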

Figures

Figures reproduced from arXiv:2604.16021 by Xiufeng Xu, Xiuheng Wu, Zejun Zhang, and Yi Li.

Figure 1
Figure 1: A motivating example of a logic query. Given a natural language query q (e.g., a bug report or feature request) and a target codebase C, the goal of repository-level code localization is to identify a list of relevant code locations L = {l1, l2, ..., lk}. Effective localization bridges the gap between informal natural language and the rigid execution logic of software. We identify three primary challenge…
Figure 2
Figure 2: Overview of the LogicLoc framework. …set of potential locations to maximize precision. To mitigate hallucinations and enhance abstention ability, we decouple reasoning from generation. A deterministic engine handles inference while the LLM functions as a coordinator for query analysis and result calibration. The agent is equipped with three basic tools: “exec_dl” for executing Datalog programs, and “get_file_con…
Figure 3
Figure 3: Intermediate rule mutation and feedback.
Figure 4
Figure 4: Average execution time and token consumption of …
Original abstract

Code localization is a cornerstone of autonomous software engineering. Recent advancements have achieved impressive performance on real-world issue benchmarks. However, we identify a critical yet overlooked bias: these benchmarks are saturated with keyword references (e.g. file paths, function names), encouraging models to rely on superficial lexical matching rather than genuine structural reasoning. We term this phenomenon the Keyword Shortcut. To address this, we formalize the challenge of Keyword-Agnostic Logical Code Localization (KA-LCL) and introduce KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without any naming hints. Our evaluation reveals a catastrophic performance drop of state-of-the-art approaches on KA-LogicQuery, exposing their lack of deterministic reasoning capabilities. We propose LogicLoc, a novel agentic framework that combines large language models with the rigorous logical reasoning of Datalog for precise localization. LogicLoc extracts program facts from the codebase and leverages an LLM to synthesize Datalog programs, with parser-gated validation and mutation-based intermediate-rule diagnostic feedback to ensure correctness and efficiency. The validated programs are executed by a high-performance inference engine, enabling accurate and verifiable localization in a fully automated, closed-loop workflow. Experimental results demonstrate that LogicLoc significantly outperforms SOTA methods on KA-LogicQuery while maintaining competitive performance on popular issue-driven benchmarks. Notably, LogicLoc attains superior performance with significantly lower token consumption and faster execution by offloading structural traversal to a deterministic engine, reducing the overhead of iterative LLM inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Circularity Check

0 steps flagged

No circularity in LogicLoc's neurosymbolic derivation or benchmark evaluation

full rationale

The paper introduces KA-LogicQuery as a new diagnostic benchmark to expose keyword bias in prior issue-driven evaluations, then presents LogicLoc as an agentic pipeline that extracts facts, uses an LLM to synthesize Datalog programs, applies parser-gated validation plus mutation feedback, and executes on an external high-performance engine. All performance claims are empirical measurements on these benchmarks; no equation, definition, or central result reduces by construction to its own inputs, fitted parameters, or a self-citation chain. The validation steps are explicit and external to the final localization output, keeping the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the assumption that Datalog can faithfully encode code structure for localization and that LLM-generated programs can be made reliable through the described validation loop; no free parameters are evident from the abstract, and the three invented entities are terminological rather than physical.

axioms (1)
  • domain assumption Datalog can faithfully represent structural facts extracted from codebases for the purpose of precise localization queries.
    The framework extracts program facts and executes Datalog programs to enable accurate localization; a toy illustration follows this ledger.
invented entities (3)
  • Keyword Shortcut no independent evidence
    purpose: Term for the bias where models rely on lexical matching rather than structural reasoning.
    Introduced to name the overlooked phenomenon in existing benchmarks.
  • KA-LogicQuery no independent evidence
    purpose: Diagnostic benchmark requiring structural reasoning without naming hints.
    New benchmark proposed to expose the limitations of current approaches.
  • LogicLoc no independent evidence
    purpose: Neurosymbolic agentic framework combining LLMs and Datalog with validation.
    The proposed system for keyword-agnostic logical code localization.
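
To make the ledger's single axiom concrete, here is a toy, self-contained illustration in Python: structural facts (who defines what, who calls whom) held as relations, with a recursive reachability rule evaluated to a fixpoint. The file and function names are invented; the paper's fact extraction and inference engine are of course far richer than this.

```python
# Invented toy facts; LogicLoc extracts such relations from real codebases.
defines = {("parser.py", "parse"), ("engine.py", "eval_rule"), ("cli.py", "main")}
calls = {("main", "parse"), ("parse", "eval_rule")}

# Datalog being modeled:
#   reaches(X, Y) :- calls(X, Y).
#   reaches(X, Z) :- calls(X, Y), reaches(Y, Z).
reaches = set(calls)
while True:
    derived = {(x, z) for (x, y) in calls for (y2, z) in reaches if y == y2}
    if derived <= reaches:  # fixpoint: nothing new is derivable
        break
    reaches |= derived

#   relevant(File) :- defines(File, Fn), reaches(Fn, "eval_rule").
relevant = {f for (f, fn) in defines
            if fn == "eval_rule" or (fn, "eval_rule") in reaches}
print(sorted(relevant))  # ['cli.py', 'engine.py', 'parser.py']
```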

pith-pipeline@v0.9.0 · 5556 in / 1420 out tokens · 55670 ms · 2026-05-10T08:23:03.030759+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 18 canonical work pages · 7 internal anchors

  1. [1]

    Jialiang Chen, Kaifa Zhao, Jie Liu, Chao Peng, Jierui Liu, Hang Zhu, Pengfei Gao, Ping Yang, and Shuiguang Deng

  2. [2]

    CoreQA: uncovering potentials of language models in code repository question answering. arXiv preprint arXiv:2501.03447 (2025).

  3. [3]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).

  4. [4]

    Zhaoling Chen, Robert Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, and Xingyao Wang. 2025. LocAgent: Graph-Guided LLM Agents for Code Localization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Moh...

  5. [5]

    Le Deng, Zhonghao Jiang, Jialun Cao, Michael Pradel, and Zhongxin Liu. 2025. NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition. arXiv preprint arXiv:2507.18130 (2025).

  6. [6]

    Luca Di Grazia and Michael Pradel. 2023. Code search: A survey of techniques for finding code. Comput. Surveys 55, 11 (2023), 1–31.

  7. [7]

    Xinyu Gao, Zhijie Wang, Yang Feng, Lei Ma, Zhenyu Chen, and Baowen Xu. 2023. Benchmarking robustness of ai-enabled multi-sensor fusion systems: Challenges and opportunities. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 871–882.

  8. [8]

    Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, and Cuiyun Gao. 2024. CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering. arXiv preprint arXiv:2412.14764 (2024).

  9. [9]

    Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. 2024. R2E: Turning any GitHub repository into a programming agent environment. In Forty-first International Conference on Machine Learning.

  10. [10]

    Nan Jiang, Qi Li, Lin Tan, and Tianyi Zhang. 2024. Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code. arXiv:2410.09997 [cs.SE] https://arxiv.org/abs/2410.09997

  11. [11]

    Zhonghao Jiang, Xiaoxue Ren, Meng Yan, Wei Jiang, Yong Li, and Zhongxin Liu. 2025. CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching. arXiv preprint arXiv:2503.22424 (2025).

  12. [12]

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues? https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite. Accessed: 2025-10-04.

  13. [13]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. [n. d.]. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations.

  14. [14]

    Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024. Infibench: Evaluating the question-answering capabilities of code large language models. Advances in Neural Information Processing Systems 37 (2024), 128668–128698.

  15. [15]

    Zehan Li, Jianfei Zhang, Chuantao Yin, Yuanxin Ouyang, and Wenge Rong. 2024. ProCQA: a large-scale community-based programming question answering dataset for code search. arXiv preprint arXiv:2403.16702 (2024).

  16. [16]

    Chenxiao Liu and Xiaojun Wan. 2021. CodeQA: A question answering dataset for source code comprehension. arXiv preprint arXiv:2109.08365 (2021).

  17. [17]

    Wei Liu, Chao Peng, Pengfei Gao, Aofan Liu, Wei Zhang, Haiyan Zhao, and Zhi Jin. 2025. GraphLocator: Graph-guided Causal Reasoning for Issue Localization. arXiv preprint arXiv:2512.22469 (2025).

  18. [18]

    Zexiong Ma, Chao Peng, Pengfei Gao, Xiangxin Meng, Yanzhen Zou, and Bing Xie. 2025. SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning. CoRR (2025).

  19. [19]

    Niels Mündler, Mark Müller, Jingxuan He, and Martin Vechev. 2024. SWT-bench: Testing and validating real-world bug-fixes with code agents. Advances in Neural Information Processing Systems 37 (2024), 81857–81887.

  20. [20]

    Changan Niu, Chuanyi Li, Vincent Ng, and Bin Luo. 2023. Crosscodebench: Benchmarking cross-task generalization of source code models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 537–549.

  21. [21]

    Yicheng Ouyang, Jun Yang, and Lingming Zhang. 2024. Benchmarking automated program repair: An extensive study on both real-world and artificial bugs. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 440–452.

  22. [22]

    Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. 2025. SWE-QA: Can Language Models Answer Repository-level Code Questions? arXiv preprint arXiv:2509.14635 (2025).

  23. [23]

    Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, and Shafiq Joty. 2025. SweRank: Software Issue Localization with Code Ranking. arXiv preprint arXiv:2505.07849 (2025).

  24. [24]

    Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, and Shirish Shevade

  25. [25]

    Codequeries: A dataset of semantic queries over code. In Proceedings of the 17th Innovations in Software Engineering Conference. 1–11.

  26. [26]

    Jan Strich, Florian Schneider, Irina Nikishina, and Chris Biemann. 2024. On Improving Repository-Level Code QA for Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). 209–244.

  27. [27]

    Tarun Suresh, Revanth Gangi Reddy, Yifei Xu, Zach Nussbaum, Andriy Mulyar, Brandon Duderstadt, and Heng Ji

  28. [28]

    Cornstack: High-quality contrastive data for better code ranking. arXiv e-prints (2024), arXiv–2412.

  29. [29]

    Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, and Yu Cheng. 2024. Magis: Llm-based multi-agent framework for github issue resolution. Advances in Neural Information Processing Systems 37 (2024), 51963–51993.

  30. [30]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741 (2024).

  31. [31]

    Xiuheng Wu, Chenguang Zhu, and Yi Li. 2021. Diffbase: A differential factbase for effective software evolution management. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 503–515.

  32. [32]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489 (2024).

  33. [33]

    Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, and Kai Chen. 2025. SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution. In ICLR 2025 Third Workshop on Deep Learning for Code.

  34. [34]

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. 2024. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems 37 (2024), 52040–52094.

  35. [35]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37 (2024), 50528–50652.

  36. [36]

    John Yang, Kilian Lieret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, and Diyi Yang. 2025. SWE-smith: Scaling Data for Software Engineering Agents. arXiv:2504.21798 [cs.SE] https://arxiv.org/abs/2504.21798

  37. [37]

    Zhongming Yu, Hejia Zhang, Yujie Zhao, Hanxian Huang, Matrix Yao, Ke Ding, and Jishen Zhao. 2025. OrcaLoca: An LLM Agent Framework for Software Issue Localization. arXiv:2502.00350 [cs.SE] https://arxiv.org/abs/2502.00350

  38. [38]

    Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, and Bing Xiang. 2024. Code representation learning at scale. arXiv preprint arXiv:2402.01935 (2024).

  39. [39]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1592–1604.

  40. [40]

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. 2024. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877 (2024).