SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-14 18:02 UTC · model grok-4.3
The pith
SieveFL prunes 79% of methods to let local LLMs localize 41.8% of bugs at top-1 accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SieveFL achieves a Top-1 accuracy of 41.8% (165/395 bugs) and an MRR of 0.469 on Defects4J v1.2.0 using a mid-sized local MoE model on a commodity workstation. Runtime pruning removes 79% of candidate methods and cuts token consumption by 49% while preserving Top-1 accuracy exactly and improving Top-3 through Top-10 metrics by up to 2.4 percentage points. The system outperforms the strongest prior agent-based baseline by 2.1 percentage points in Top-1 accuracy.
What carries the argument
The five-stage hierarchical pipeline that converts a failing test into a natural-language failure description, then combines dense vector retrieval, JaCoCo runtime-trace pruning, per-method LLM screening, and comparative re-ranking to reduce thousands of methods to a small feasible set.
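The five stages can be sketched as a filtering chain. Everything below is an illustrative stand-in, not the paper's implementation: the naive term-overlap retriever replaces dense vector retrieval, and the screening predicate and re-ranking scorer stand in for the LLM calls.

```python
# Illustrative sketch of a SieveFL-style five-stage filtering chain.
# All function names, signatures, and heuristics are placeholders.

def describe_failure(failing_test: dict) -> str:
    """Stage 1: turn a failing test into a natural-language description."""
    return f"Test {failing_test['name']} failed: {failing_test['message']}"

def retrieve_files(description: str, index: dict, top_k: int = 5) -> list:
    """Stage 2: dense-retrieval stand-in, ranking files by term overlap."""
    terms = set(description.lower().split())
    scored = [(len(terms & set(text.lower().split())), path)
              for path, text in index.items()]
    return [path for score, path in sorted(scored, reverse=True)[:top_k]
            if score > 0]

def prune_unexecuted(methods: list, executed: set) -> list:
    """Stage 3: keep only methods the failing test executed (JaCoCo's role)."""
    return [m for m in methods if m in executed]

def screen(methods: list, is_suspicious) -> list:
    """Stage 4: per-method LLM screening, modeled as a yes/no predicate."""
    return [m for m in methods if is_suspicious(m)]

def rerank(methods: list, score) -> list:
    """Stage 5: single comparative re-ranking pass, modeled as a scorer."""
    return sorted(methods, key=score, reverse=True)
```

The key property the review examines lives in stages 2 and 3: any method dropped there can never be recovered downstream.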
Load-bearing premise
JaCoCo traces and vector retrieval will reliably retain the true faulty method in the pruned candidate set without systematic exclusion.
What would settle it
A failing test whose faulty method is not executed during the test run or is missed by the initial vector retrieval, causing the method to be pruned before the LLM step and producing a complete miss.
read the original abstract
Automated fault localization requires connecting an observed test failure to the responsible method across thousands of candidates--a task that purely statistical approaches handle with limited precision and that LLMs cannot yet handle at full project scale due to prohibitive token cost and signal dilution. We present SieveFL, a five-stage hierarchical framework that resolves this tension through aggressive pre-LLM filtering. SieveFL converts a failing test into a natural-language failure description, uses dense vector retrieval to narrow the search to a small set of suspicious files, and then eliminates any method not executed during the failing test via JaCoCo runtime traces. Only the surviving candidates are passed to the LLM, which screens each method individually and re-ranks the confirmed suspects in a single comparative pass. We evaluate SieveFL on 395 bugs from Defects4J v1.2.0 using a mid-sized, openly available MoE model deployed on a commodity workstation (32 GB RAM, 8 GB GPU) via Ollama--no frontier APIs or datacenter hardware required. Treating 12 incomplete runs as failures, SieveFL achieves Top-1 accuracy of 41.8% (165/395 bugs) and an MRR of 0.469, outperforming the strongest prior agent-based baseline (AgentFL) by 2.1 pp in Top-1. Runtime pruning removes 79% of candidate methods and reduces input token consumption by 49%, while simultaneously improving ranking quality: Top-1 is preserved exactly and Top-3 through Top-10 improve by up to 2.4 pp. These results demonstrate that, with the right filtering architecture, capable fault localization does not require proprietary frontier models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SieveFL, a five-stage hierarchical framework for scalable LLM-based fault localization. It converts failing tests to natural-language descriptions, applies dense vector retrieval to suspicious files, uses JaCoCo traces to prune unexecuted methods (removing 79% of candidates), and then performs LLM-based screening and re-ranking on the reduced set. Evaluated on 395 Defects4J v1.2.0 bugs with a local mid-sized MoE model, it reports 41.8% Top-1 accuracy (165/395) and MRR 0.469, outperforming AgentFL by 2.1 pp while cutting token consumption by 49% and preserving Top-1 exactly (with gains in Top-3 to Top-10).
Significance. If the pruning stages reliably retain the ground-truth faulty method, the work shows that aggressive pre-LLM filtering enables accurate fault localization on commodity hardware without frontier models or accuracy loss. This is a practical advance for the field, directly addressing token cost and signal dilution at project scale while providing reproducible evaluation against an independent baseline.
major comments (2)
- [Results / Evaluation] Results section (and abstract claim that 'Top-1 is preserved exactly' after pruning): the headline accuracy numbers rest on the unverified assumption that the combined dense-retrieval + JaCoCo filter retains the true faulty method for all 395 bugs. No aggregate or per-bug recall figure is supplied for this two-stage filter, so it is impossible to determine whether the preserved Top-1 (and improved higher ranks) reflects a generally safe architecture or only the subset of bugs for which the filter happens to be safe.
- [Abstract / Evaluation] Abstract and §4 (evaluation setup): treating the 12 incomplete runs as failures is stated without describing the exact failure mode, retry policy, or how these cases are distributed across the reported metrics; this detail is needed to assess robustness of the 41.8% Top-1 figure.
minor comments (2)
- [Abstract] Abstract lacks any description of the prompt templates used for LLM screening and re-ranking, the vector-retrieval hyperparameters (e.g., top-k, similarity threshold), or the precise failure-handling logic for the 12 incomplete runs.
- [Methods] Notation for MRR and Top-k metrics is used without an explicit definition or reference to the standard formulas in the methods section.
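The standard definitions the referee asks for are compact: Top-k accuracy is the fraction of bugs whose faulty method appears within the first k ranked candidates, and MRR averages the reciprocal rank of the first correct candidate, counting complete misses as zero. A minimal sketch (function names are illustrative):

```python
def top_k_accuracy(ranks, k):
    """Fraction of bugs whose faulty method is ranked within the top k.
    `ranks` holds the 1-based rank of the true method, or None for a miss."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR = (1/N) * sum of 1/rank_i, with misses contributing 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Toy example: three bugs localized at ranks 1, 3, and not at all.
ranks = [1, 3, None]
```

On this toy data, Top-1 is 1/3, Top-3 is 2/3, and MRR is (1 + 1/3 + 0) / 3 = 4/9.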
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional metrics.
read point-by-point responses
- Referee: [Results / Evaluation] Results section (and abstract claim that 'Top-1 is preserved exactly' after pruning): the headline accuracy numbers rest on the unverified assumption that the combined dense-retrieval + JaCoCo filter retains the true faulty method for all 395 bugs. No aggregate or per-bug recall figure is supplied for this two-stage filter, so it is impossible to determine whether the preserved Top-1 (and improved higher ranks) reflects a generally safe architecture or only the subset of bugs for which the filter happens to be safe.
Authors: We agree that an explicit recall figure for the two-stage filter is necessary to fully substantiate the claim that Top-1 accuracy is preserved. In the revised manuscript we will add a new paragraph (and supporting table) in §5 reporting the aggregate recall of the dense-retrieval + JaCoCo pruning stage across all 395 bugs, together with a brief per-bug breakdown or histogram showing that the ground-truth faulty method is retained in the large majority of cases. This addition will confirm that the observed preservation of Top-1 (and gains at higher ranks) is not limited to a safe subset. revision: yes
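The promised recall figure reduces to a one-line computation once per-bug filter outputs are recorded. The data layout below is hypothetical, not the paper's:

```python
def filter_recall(bugs):
    """Fraction of bugs whose ground-truth faulty method survives the
    retrieval + coverage filter. Each bug is a (faulty_method, surviving_set)
    pair; this layout is an assumed, illustrative representation."""
    retained = sum(1 for faulty, surviving in bugs if faulty in surviving)
    return retained / len(bugs)

bugs = [
    ("Foo.parse", {"Foo.parse", "Foo.init"}),  # faulty method retained
    ("Bar.eval",  {"Bar.apply"}),              # pruned away: guaranteed miss
]
```

Here `filter_recall(bugs)` is 0.5; any bug in the second category is unlocalizable regardless of LLM quality, which is exactly why the referee treats this number as load-bearing.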
- Referee: [Abstract / Evaluation] Abstract and §4 (evaluation setup): treating the 12 incomplete runs as failures is stated without describing the exact failure mode, retry policy, or how these cases are distributed across the reported metrics; this detail is needed to assess robustness of the 41.8% Top-1 figure.
Authors: We acknowledge that the current description of the 12 incomplete runs lacks sufficient detail. In the revised §4 we will explicitly describe the failure modes (primarily timeout or out-of-memory errors during local LLM inference), the retry policy (a single retry with reduced context length), and confirm that these cases are counted as failures in every reported metric. We will also note that their distribution does not alter the relative ranking versus the AgentFL baseline. revision: yes
Circularity Check
No circularity; results from direct external benchmark evaluation
full rationale
The paper's central claims consist of empirical performance numbers (Top-1 accuracy 41.8%, MRR 0.469) obtained by running the described five-stage pipeline on the external Defects4J v1.2.0 benchmark and comparing against the independent prior system AgentFL. No equations, fitted parameters, or derivations are presented that reduce by construction to the reported outputs; the pruning stages are methodological choices whose safety is asserted via the observed accuracy preservation rather than by any self-referential definition or self-citation chain. The evaluation is therefore self-contained against an external benchmark.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: JaCoCo coverage traces accurately identify all executed methods in Java programs.
- domain assumption: Dense vector embeddings of failure descriptions can surface the files containing the fault.
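The first assumption can be probed mechanically: JaCoCo's XML report nests counter elements under package, class, and method, and a method counts as executed when some counter reports covered instructions. A hedged stdlib sketch (the `Class#method` naming convention here is illustrative, not prescribed by the paper):

```python
import xml.etree.ElementTree as ET

def executed_methods(report_xml: str) -> set:
    """Collect methods with covered instructions from a JaCoCo XML report.
    Returns 'Class#method' identifiers; real reports may need the method
    descriptor as well to disambiguate overloads."""
    root = ET.fromstring(report_xml)
    executed = set()
    for cls in root.iter("class"):
        for method in cls.findall("method"):
            for counter in method.findall("counter"):
                if (counter.get("type") == "INSTRUCTION"
                        and int(counter.get("covered", "0")) > 0):
                    executed.add(f"{cls.get('name')}#{method.get('name')}")
    return executed
```

Checking whether a bug's ground-truth method appears in this set for its failing test is precisely the falsification test the review names under "What would settle it."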
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, Cost/FunctionalEquation.lean (reality_from_one_distinction, washburn_uniqueness_aczel): tagged unclear
  unclear: the relation between the paper passage and the cited Recognition theorem is not established.
  Linked passage: "five-stage hierarchical framework... LLM-based Test Analysis, Suspicious File Identification, Runtime-Aware Candidate Pruning via JaCoCo, Per-Method LLM Screening, LLM-Based Re-ranking"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, "A survey on software fault localization," IEEE Transactions on Software Engineering, vol. 42, no. 8, pp. 707–740, 2016.
- [2] P. S. Kochhar, X. Xia, D. Lo, and S. Li, "Practitioners' expectations on automated fault localization," in Proceedings of the 25th International Symposium on Software Testing and Analysis, pp. 165–176, 2016.
- [3] R. Abreu, P. Zoeteweij, and A. J. Van Gemund, "On the accuracy of spectrum-based fault localization," in Testing: Academic and Industrial Conference Practice and Research Techniques (TAICPART-MUTATION 2007), pp. 89–98, IEEE, 2007.
- [4] J. Zhou, H. Zhang, and D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports," in 2012 34th International Conference on Software Engineering (ICSE), pp. 14–24, IEEE, 2012.
- [5] R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry, "Improving bug localization using structured information retrieval," in 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 345–355, IEEE, 2013.
- [6] Z. Li, X. Bai, H. Wang, and Y. Liu, "IRBFL: An information retrieval based fault localization approach," in 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 991–996, IEEE, 2020.
- [7] S. Kang, G. An, and S. Yoo, "A quantitative and qualitative evaluation of LLM-based explainable fault localization," Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1424–1446, 2024.
- [8] Y. Qin, S. Wang, Y. Lou, J. Dong, K. Wang, X. Li, and X. Mao, "AgentFL: Scaling LLM-based fault localization to project-level context," arXiv preprint arXiv:2403.16362, 2024.
- [9] X. Shi, Z. Li, and A. R. Chen, "Enhancing LLM-based fault localization with a functionality-aware retrieval-augmented generation framework," arXiv preprint arXiv:2509.20552, 2025.
- [10] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A database of existing faults to enable controlled testing studies for Java programs," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp. 437–440, 2014.
- [11] W. E. Wong, V. Debroy, R. Gao, and Y. Li, "The DStar method for effective software fault localization," IEEE Transactions on Reliability, vol. 63, no. 1, pp. 290–308, 2013.
- [12] S. Heiden, L. Grunske, T. Kehrer, F. Keller, A. Van Hoorn, A. Filieri, and D. Lo, "An evaluation of pure spectrum-based fault localization techniques for large-scale software systems," Software: Practice and Experience, vol. 49, no. 8, pp. 1197–1224, 2019.
- [13] X. Li, W. Li, Y. Zhang, and L. Zhang, "DeepFL: Integrating multiple fault diagnosis dimensions for deep fault localization," in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 169–180, 2019.
- [14] Y. Lou, Q. Zhu, J. Dong, X. Li, Z. Sun, D. Hao, L. Zhang, and L. Zhang, "Boosting coverage-based fault localization via graph-based representation learning," in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 664–676, 2021.
- [15] W. Du, C. Li, S. Yin, H. Lin, H. Li, F. Zhan, Q. Ning, and Q. Ma, "Augmenting automated spectrum-based multi-fault localization via borderline confident learning," ACM Transactions on Internet Technology, 2026.
- [16] C. Northcutt, L. Jiang, and I. Chuang, "Confident learning: Estimating uncertainty in dataset labels," Journal of Artificial Intelligence Research, vol. 70, pp. 1373–1411, 2021.
- [17] S. Robertson and H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, vol. 4. Now Publishers Inc, 2009.
- [18] C. S. Xia and L. Zhang, "Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT," in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 819–831, 2024.
- [19] A. Z. Yang, C. Le Goues, R. Martins, and V. Hellendoorn, "Large language models for test-free fault localization," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1–12, 2024.
- [20] R. Widyasari, J. W. Ang, T. G. Nguyen, N. Sharma, and D. Lo, "Demystifying faulty code with LLM: Step-by-step reasoning for explainable fault localization," arXiv preprint arXiv:2403.10507, 2024.
- [21] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, "Retrieval-augmented generation for large language models: A survey," arXiv preprint arXiv:2312.10997, vol. 2, no. 1, 2023.
- [22] Z. Yang, H. Zhu, Q. Zhang, R. Gupta, and A. Kundu, "SemLoc: Structured grounding of free-form LLM reasoning for fault localization," arXiv preprint arXiv:2603.29109, 2026.
- [23] F. Liu, T. Wang, L. Zhang, Z. Yang, J. Jiang, and Z. Sun, "Explainable fault localization for programming assignments via LLM-guided annotation," arXiv preprint arXiv:2509.25676, 2025.
- [24] M. Sepidband, H. V. Pham, and H. Hemmati, "On the role of fault localization context for LLM-based program repair," arXiv preprint arXiv:2604.05481, 2026.
- [25] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, and F. Silvestri, "The power of noise: Redefining retrieval for RAG systems," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 719–729, 2024.
- [26] M. N. Rafi, D. J. Kim, T.-H. Chen, and S. Wang, "A multi-agent approach to fault localization via graph-based retrieval and reflexion," arXiv preprint arXiv:2409.13642, 2024.
- [27] I. Yeo, D. Ryu, and J. Baik, "Improving LLM-based fault localization with external memory and project context," arXiv preprint arXiv:2506.03585, 2025.
- [28] Y. Wang, Y. Zhang, G. Li, C. Zhi, B. Li, F. Huang, Y. Li, and S. Deng, "InspectCoder: Dynamic analysis-driven self repair through interactive LLM-debugger collaboration," Proceedings of the ACM on Programming Languages, vol. 10, no. OOPSLA1, pp. 1041–1069, 2026.
- [29] L. Zhong, Z. Wang, and J. Shang, "Debug like a human: A large language model debugger via verifying runtime execution step by step," in Findings of the Association for Computational Linguistics: ACL 2024, pp. 851–870, 2024.
- [30] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023.
- [31] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, 2019.
- [32] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou, "The Faiss library," IEEE Transactions on Big Data, 2025.
- [33] G. Salton, A. Wong, and C.-S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.
- [34] "JaCoCo: Java code coverage library." https://www.jacoco.org/jacoco/. Accessed: 2026-02-24.
- [36] J. Xuan and M. Monperrus, "Learning to combine multiple ranking metrics for fault localization," in 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 191–200, IEEE, 2014.
- [37] Y.-A. Xiao, P. Gao, C. Peng, and Y. Xiong, "Improving the efficiency of LLM agent systems through trajectory reduction," arXiv preprint arXiv:2509.23586, 2025.
- [38] J. Sohn and S. Yoo, "FLUCCS: Using code and change metrics to improve fault localization," in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 273–283, 2017.
discussion (0)