SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization
Recognition: 1 Lean theorem link
Pith reviewed 2026-05-14 18:02 UTC · model grok-4.3
The pith
SieveFL prunes 79% of methods to let local LLMs localize 41.8% of bugs at top-1 accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SieveFL achieves a Top-1 accuracy of 41.8% (165/395 bugs) and an MRR of 0.469 on Defects4J v1.2.0 using a mid-sized local MoE model on a commodity workstation. Runtime pruning removes 79% of candidate methods and cuts token consumption by 49% while preserving Top-1 accuracy exactly and improving Top-3 through Top-10 metrics by up to 2.4 percentage points. The system outperforms the strongest prior agent-based baseline by 2.1 percentage points in Top-1 accuracy.
What carries the argument
The five-stage hierarchical pipeline that converts a failing test into a natural-language failure description, then combines dense vector retrieval, JaCoCo runtime-trace pruning, per-method LLM screening, and comparative re-ranking to reduce thousands of methods to a small feasible set.
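The five stages can be sketched as a filtering chain. Everything below is an illustrative stand-in, not the paper's implementation: the naive term-overlap retriever replaces dense vector retrieval, and the screening predicate and re-ranking scorer stand in for the LLM calls.

```python
# Illustrative sketch of a SieveFL-style five-stage filtering chain.
# All function names, signatures, and heuristics are placeholders.

def describe_failure(failing_test: dict) -> str:
    """Stage 1: turn a failing test into a natural-language description."""
    return f"Test {failing_test['name']} failed: {failing_test['message']}"

def retrieve_files(description: str, index: dict, top_k: int = 5) -> list:
    """Stage 2: dense-retrieval stand-in, ranking files by term overlap."""
    terms = set(description.lower().split())
    scored = [(len(terms & set(text.lower().split())), path)
              for path, text in index.items()]
    return [path for score, path in sorted(scored, reverse=True)[:top_k]
            if score > 0]

def prune_unexecuted(methods: list, executed: set) -> list:
    """Stage 3: keep only methods the failing test executed (JaCoCo's role)."""
    return [m for m in methods if m in executed]

def screen(methods: list, is_suspicious) -> list:
    """Stage 4: per-method LLM screening, modeled as a yes/no predicate."""
    return [m for m in methods if is_suspicious(m)]

def rerank(methods: list, score) -> list:
    """Stage 5: single comparative re-ranking pass, modeled as a scorer."""
    return sorted(methods, key=score, reverse=True)
```

The key property the review examines lives in stages 2 and 3: any method dropped there can never be recovered downstream.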
Load-bearing premise
JaCoCo traces and vector retrieval will reliably retain the true faulty method in the pruned candidate set without systematic exclusion.
What would settle it
A failing test whose faulty method is not executed during the test run or is missed by the initial vector retrieval, causing the method to be pruned before the LLM step and producing a complete miss.
read the original abstract
Automated fault localization requires connecting an observed test failure to the responsible method across thousands of candidates--a task that purely statistical approaches handle with limited precision and that LLMs cannot yet handle at full project scale due to prohibitive token cost and signal dilution. We present SieveFL, a five-stage hierarchical framework that resolves this tension through aggressive pre-LLM filtering. SieveFL converts a failing test into a natural-language failure description, uses dense vector retrieval to narrow the search to a small set of suspicious files, and then eliminates any method not executed during the failing test via JaCoCo runtime traces. Only the surviving candidates are passed to the LLM, which screens each method individually and re-ranks the confirmed suspects in a single comparative pass. We evaluate SieveFL on 395 bugs from Defects4J v1.2.0 using a mid-sized, openly available MoE model deployed on a commodity workstation (32 GB RAM, 8 GB GPU) via Ollama--no frontier APIs or datacenter hardware required. Treating 12 incomplete runs as failures, SieveFL achieves Top-1 accuracy of 41.8% (165/395 bugs) and an MRR of 0.469, outperforming the strongest prior agent-based baseline (AgentFL) by 2.1 pp in Top-1. Runtime pruning removes 79% of candidate methods and reduces input token consumption by 49%, while simultaneously improving ranking quality: Top-1 is preserved exactly and Top-3 through Top-10 improve by up to 2.4 pp. These results demonstrate that, with the right filtering architecture, capable fault localization does not require proprietary frontier models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SieveFL, a five-stage hierarchical framework for scalable LLM-based fault localization. It converts failing tests to natural-language descriptions, applies dense vector retrieval to suspicious files, uses JaCoCo traces to prune unexecuted methods (removing 79% of candidates), and then performs LLM-based screening and re-ranking on the reduced set. Evaluated on 395 Defects4J v1.2.0 bugs with a local mid-sized MoE model, it reports 41.8% Top-1 accuracy (165/395) and MRR 0.469, outperforming AgentFL by 2.1 pp while cutting token consumption by 49% and preserving Top-1 exactly (with gains in Top-3 to Top-10).
Significance. If the pruning stages reliably retain the ground-truth faulty method, the work shows that aggressive pre-LLM filtering enables accurate fault localization on commodity hardware without frontier models or accuracy loss. This is a practical advance for the field, directly addressing token cost and signal dilution at project scale while providing reproducible evaluation against an independent baseline.
major comments (2)
- [Results / Evaluation] Results section (and abstract claim that 'Top-1 is preserved exactly' after pruning): the headline accuracy numbers rest on the unverified assumption that the combined dense-retrieval + JaCoCo filter retains the true faulty method for all 395 bugs. No aggregate or per-bug recall figure is supplied for this two-stage filter, so it is impossible to determine whether the preserved Top-1 (and improved higher ranks) reflects a generally safe architecture or only the subset of bugs for which the filter happens to be safe.
- [Abstract / Evaluation] Abstract and §4 (evaluation setup): treating the 12 incomplete runs as failures is stated without describing the exact failure mode, retry policy, or how these cases are distributed across the reported metrics; this detail is needed to assess robustness of the 41.8% Top-1 figure.
minor comments (2)
- [Abstract] Abstract lacks any description of the prompt templates used for LLM screening and re-ranking, the vector-retrieval hyperparameters (e.g., top-k, similarity threshold), or the precise failure-handling logic for the 12 incomplete runs.
- [Methods] Notation for MRR and Top-k metrics is used without an explicit definition or reference to the standard formulas in the methods section.
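The standard definitions the referee asks for are compact: Top-k accuracy is the fraction of bugs whose faulty method appears within the first k ranked candidates, and MRR averages the reciprocal rank of the first correct candidate, counting complete misses as zero. A minimal sketch (function names are illustrative):

```python
def top_k_accuracy(ranks, k):
    """Fraction of bugs whose faulty method is ranked within the top k.
    `ranks` holds the 1-based rank of the true method, or None for a miss."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR = (1/N) * sum of 1/rank_i, with misses contributing 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Toy example: three bugs localized at ranks 1, 3, and not at all.
ranks = [1, 3, None]
```

On this toy data, Top-1 is 1/3, Top-3 is 2/3, and MRR is (1 + 1/3 + 0) / 3 = 4/9.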
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional metrics.
read point-by-point responses
- Referee: [Results / Evaluation] Results section (and abstract claim that 'Top-1 is preserved exactly' after pruning): the headline accuracy numbers rest on the unverified assumption that the combined dense-retrieval + JaCoCo filter retains the true faulty method for all 395 bugs. No aggregate or per-bug recall figure is supplied for this two-stage filter, so it is impossible to determine whether the preserved Top-1 (and improved higher ranks) reflects a generally safe architecture or only the subset of bugs for which the filter happens to be safe.
Authors: We agree that an explicit recall figure for the two-stage filter is necessary to fully substantiate the claim that Top-1 accuracy is preserved. In the revised manuscript we will add a new paragraph (and supporting table) in §5 reporting the aggregate recall of the dense-retrieval + JaCoCo pruning stage across all 395 bugs, together with a brief per-bug breakdown or histogram showing that the ground-truth faulty method is retained in the large majority of cases. This addition will confirm that the observed preservation of Top-1 (and gains at higher ranks) is not limited to a safe subset. revision: yes
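The promised recall figure reduces to a one-line computation once per-bug filter outputs are recorded. The data layout below is hypothetical, not the paper's:

```python
def filter_recall(bugs):
    """Fraction of bugs whose ground-truth faulty method survives the
    retrieval + coverage filter. Each bug is a (faulty_method, surviving_set)
    pair; this layout is an assumed, illustrative representation."""
    retained = sum(1 for faulty, surviving in bugs if faulty in surviving)
    return retained / len(bugs)

bugs = [
    ("Foo.parse", {"Foo.parse", "Foo.init"}),  # faulty method retained
    ("Bar.eval",  {"Bar.apply"}),              # pruned away: guaranteed miss
]
```

Here `filter_recall(bugs)` is 0.5; any bug in the second category is unlocalizable regardless of LLM quality, which is exactly why the referee treats this number as load-bearing.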
- Referee: [Abstract / Evaluation] Abstract and §4 (evaluation setup): treating the 12 incomplete runs as failures is stated without describing the exact failure mode, retry policy, or how these cases are distributed across the reported metrics; this detail is needed to assess robustness of the 41.8% Top-1 figure.
Authors: We acknowledge that the current description of the 12 incomplete runs lacks sufficient detail. In the revised §4 we will explicitly describe the failure modes (primarily timeout or out-of-memory errors during local LLM inference), the retry policy (a single retry with reduced context length), and confirm that these cases are counted as failures in every reported metric. We will also note that their distribution does not alter the relative ranking versus the AgentFL baseline. revision: yes
Circularity Check
No circularity; results from direct external benchmark evaluation
full rationale
The paper's central claims consist of empirical performance numbers (Top-1 accuracy 41.8%, MRR 0.469) obtained by running the described five-stage pipeline on the external Defects4J v1.2.0 benchmark and comparing against the independent prior system AgentFL. No equations, fitted parameters, or derivations are presented that reduce by construction to the reported outputs; the pruning stages are methodological choices whose safety is asserted via the observed accuracy preservation rather than by any self-referential definition or self-citation chain. The evaluation is therefore self-contained against an external benchmark.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: JaCoCo coverage traces accurately identify all executed methods in Java programs.
- domain assumption: Dense vector embeddings of failure descriptions can surface the files containing the fault.
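The first assumption can be probed mechanically: JaCoCo's XML report nests counter elements under package, class, and method, and a method counts as executed when some counter reports covered instructions. A hedged stdlib sketch (the `Class#method` naming convention here is illustrative, not prescribed by the paper):

```python
import xml.etree.ElementTree as ET

def executed_methods(report_xml: str) -> set:
    """Collect methods with covered instructions from a JaCoCo XML report.
    Returns 'Class#method' identifiers; real reports may need the method
    descriptor as well to disambiguate overloads."""
    root = ET.fromstring(report_xml)
    executed = set()
    for cls in root.iter("class"):
        for method in cls.findall("method"):
            for counter in method.findall("counter"):
                if (counter.get("type") == "INSTRUCTION"
                        and int(counter.get("covered", "0")) > 0):
                    executed.add(f"{cls.get('name')}#{method.get('name')}")
    return executed
```

Checking whether a bug's ground-truth method appears in this set for its failing test is precisely the falsification test the review names under "What would settle it."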
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, Cost/FunctionalEquation.lean (reality_from_one_distinction, washburn_uniqueness_aczel): tagged unclear
  unclear: the relation between the paper passage and the cited Recognition theorem is not established.
  Linked passage: "five-stage hierarchical framework... LLM-based Test Analysis, Suspicious File Identification, Runtime-Aware Candidate Pruning via JaCoCo, Per-Method LLM Screening, LLM-Based Re-ranking"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, "A survey on software fault localization," IEEE Transactions on Software Engineering, vol. 42, no. 8, pp. 707–740, 2016.
- [2] P. S. Kochhar, X. Xia, D. Lo, and S. Li, "Practitioners' expectations on automated fault localization," in Proceedings of the 25th International Symposium on Software Testing and Analysis, pp. 165–176, 2016.
- [3] R. Abreu, P. Zoeteweij, and A. J. Van Gemund, "On the accuracy of spectrum-based fault localization," in Testing: Academic and Industrial Conference Practice and Research Techniques (TAICPART-MUTATION 2007), pp. 89–98, IEEE, 2007.
- [4] J. Zhou, H. Zhang, and D. Lo, "Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports," in 2012 34th International Conference on Software Engineering (ICSE), pp. 14–24, IEEE, 2012.
- [5] R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry, "Improving bug localization using structured information retrieval," in 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 345–355, IEEE, 2013.
- [6] Z. Li, X. Bai, H. Wang, and Y. Liu, "IRBFL: An information retrieval based fault localization approach," in 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 991–996, IEEE, 2020.
- [7] S. Kang, G. An, and S. Yoo, "A quantitative and qualitative evaluation of LLM-based explainable fault localization," Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1424–1446, 2024.
- [8] Y. Qin, S. Wang, Y. Lou, J. Dong, K. Wang, X. Li, and X. Mao, "AgentFL: Scaling LLM-based fault localization to project-level context," arXiv preprint arXiv:2403.16362, 2024.
- [9] X. Shi, Z. Li, and A. R. Chen, "Enhancing LLM-based fault localization with a functionality-aware retrieval-augmented generation framework," arXiv preprint arXiv:2509.20552, 2025.
- [10] R. Just, D. Jalali, and M. D. Ernst, "Defects4J: A database of existing faults to enable controlled testing studies for Java programs," in Proceedings of the 2014 International Symposium on Software Testing and Analysis, pp. 437–440, 2014.
- [11] W. E. Wong, V. Debroy, R. Gao, and Y. Li, "The DStar method for effective software fault localization," IEEE Transactions on Reliability, vol. 63, no. 1, pp. 290–308, 2013.
- [12] S. Heiden, L. Grunske, T. Kehrer, F. Keller, A. Van Hoorn, A. Filieri, and D. Lo, "An evaluation of pure spectrum-based fault localization techniques for large-scale software systems," Software: Practice and Experience, vol. 49, no. 8, pp. 1197–1224, 2019.
- [13] X. Li, W. Li, Y. Zhang, and L. Zhang, "DeepFL: Integrating multiple fault diagnosis dimensions for deep fault localization," in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 169–180, 2019.
- [14] Y. Lou, Q. Zhu, J. Dong, X. Li, Z. Sun, D. Hao, L. Zhang, and L. Zhang, "Boosting coverage-based fault localization via graph-based representation learning," in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 664–676, 2021.
- [15] W. Du, C. Li, S. Yin, H. Lin, H. Li, F. Zhan, Q. Ning, and Q. Ma, "Augmenting automated spectrum-based multi-fault localization via borderline confident learning," ACM Transactions on Internet Technology, 2026.
- [16] C. Northcutt, L. Jiang, and I. Chuang, "Confident learning: Estimating uncertainty in dataset labels," Journal of Artificial Intelligence Research, vol. 70, pp. 1373–1411, 2021.
- [17] S. Robertson and H. Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, vol. 4. Now Publishers Inc, 2009.
- [18] C. S. Xia and L. Zhang, "Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT," in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 819–831, 2024.
- [19] A. Z. Yang, C. Le Goues, R. Martins, and V. Hellendoorn, "Large language models for test-free fault localization," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1–12, 2024.
- [20] R. Widyasari, J. W. Ang, T. G. Nguyen, N. Sharma, and D. Lo, "Demystifying faulty code with LLM: Step-by-step reasoning for explainable fault localization," arXiv preprint arXiv:2403.10507, 2024.
- [21] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, "Retrieval-augmented generation for large language models: A survey," arXiv preprint arXiv:2312.10997, vol. 2, no. 1, 2023.
- [22] Z. Yang, H. Zhu, Q. Zhang, R. Gupta, and A. Kundu, "SemLoc: Structured grounding of free-form LLM reasoning for fault localization," arXiv preprint arXiv:2603.29109, 2026.
- [23] F. Liu, T. Wang, L. Zhang, Z. Yang, J. Jiang, and Z. Sun, "Explainable fault localization for programming assignments via LLM-guided annotation," arXiv preprint arXiv:2509.25676, 2025.
- [24] M. Sepidband, H. V. Pham, and H. Hemmati, "On the role of fault localization context for LLM-based program repair," arXiv preprint arXiv:2604.05481, 2026.
- [25] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, and F. Silvestri, "The power of noise: Redefining retrieval for RAG systems," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 719–729, 2024.
- [26] M. N. Rafi, D. J. Kim, T.-H. Chen, and S. Wang, "A multi-agent approach to fault localization via graph-based retrieval and reflexion," arXiv preprint arXiv:2409.13642, 2024.
- [27] I. Yeo, D. Ryu, and J. Baik, "Improving LLM-based fault localization with external memory and project context," arXiv preprint arXiv:2506.03585, 2025.
- [28] Y. Wang, Y. Zhang, G. Li, C. Zhi, B. Li, F. Huang, Y. Li, and S. Deng, "InspectCoder: Dynamic analysis-driven self repair through interactive LLM-debugger collaboration," Proceedings of the ACM on Programming Languages, vol. 10, no. OOPSLA1, pp. 1041–1069, 2026.
- [29] L. Zhong, Z. Wang, and J. Shang, "Debug like a human: A large language model debugger via verifying runtime execution step by step," in Findings of the Association for Computational Linguistics: ACL 2024, pp. 851–870, 2024.
- [30] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, pp. 8634–8652, 2023.
- [31] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, 2019.
- [32] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou, "The Faiss library," IEEE Transactions on Big Data, 2025.
- [33] G. Salton, A. Wong, and C.-S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.
- [34] "JaCoCo: Java code coverage library." https://www.jacoco.org/jacoco/. Accessed: 2026-02-24.
- [36] J. Xuan and M. Monperrus, "Learning to combine multiple ranking metrics for fault localization," in 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 191–200, IEEE, 2014.
- [37] Y.-A. Xiao, P. Gao, C. Peng, and Y. Xiong, "Improving the efficiency of LLM agent systems through trajectory reduction," arXiv preprint arXiv:2509.23586, 2025.
- [38] J. Sohn and S. Yoo, "FLUCCS: Using code and change metrics to improve fault localization," in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 273–283, 2017.
discussion (0)