arxiv: 2605.17290 · v1 · pith:YF5ULMXEnew · submitted 2026-05-17 · 💻 cs.SE

Debug Like a Human: Scaling LLM-based Fault Localization to Processor Design via Block-Level Instruction-Oriented Slicing

Pith reviewed 2026-05-19 22:51 UTC · model grok-4.3

classification 💻 cs.SE
keywords fault localization · LLM-based debugging · processor design · RISC-V · SystemVerilog · hardware verification · bug localization · code slicing

The pith

BluesFL more than triples Top-1 bug localization on a large processor design (24 bugs vs. 7) using block-level instruction-oriented slicing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents BluesFL to address the challenge of localizing bugs in large-scale processor designs using LLMs. It proposes a dataflow-based code blockization that focuses the LLM on critical local context, and a Block-Level Instruction-Oriented Slicing (Blues) algorithm that lets LLMs analyze instruction execution paths and processor states the way human debuggers do. Evaluated on a 19K-line RISC-V SystemVerilog core, the approach localizes 24 bugs at Top-1, a 242.9% improvement over the prior state of the art (7 bugs). At a reported average cost of $0.257 per bug, this makes automated fault localization more practical and cost-effective for hardware verification.

Core claim

BluesFL is a block-level LLM-based fault localization framework for processor designs that uses dataflow-based blockization and the Blues slicing algorithm to enable LLMs to mimic human reasoning on instruction paths and processor states, achieving correct localization of 24 bugs at Top-1 in a real-world 19K-line RISC-V core.

What carries the argument

The Block-Level Instruction-Oriented Slicing (Blues) algorithm that guides LLMs to focus on relevant code blocks derived from dataflow analysis and to examine instruction execution paths and processor states.
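As a rough illustration of what slicing at block granularity can look like, the sketch below walks backward over a block-level dependency graph and lets a callback decide which predecessor blocks to follow. It is not the paper's Blues algorithm: the graph layout `deps`, the block names, and the `is_relevant` callback (a stand-in for the LLM's per-block decision) are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, List, Set


def backward_block_slice(
    deps: Dict[str, List[str]],
    start: str,
    is_relevant: Callable[[str], bool],
) -> List[str]:
    """Return the blocks visited backward from `start`, in visit order.

    `deps[b]` lists the blocks whose outputs block `b` reads (hypothetical
    structure). `is_relevant` is a placeholder for the LLM's decision on
    whether a predecessor block is worth inspecting.
    """
    visited: Set[str] = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        block = queue.popleft()
        order.append(block)
        for pred in deps.get(block, []):
            if pred not in visited and is_relevant(pred):
                visited.add(pred)
                queue.append(pred)
    return order


# Hypothetical five-block pipeline; the load/store block is pruned by the callback.
deps = {
    "writeback": ["alu", "lsu"],
    "alu": ["decode"],
    "lsu": ["decode"],
    "decode": ["fetch"],
    "fetch": [],
}
slice_blocks = backward_block_slice(deps, "writeback", lambda b: b != "lsu")
print(slice_blocks)
```

Pruning through a callback keeps the traversal itself deterministic, so the only non-deterministic part of such a pipeline would be the model's relevance decisions.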

If this is right

  • Reduces the manual effort in processor verification by automating a key step.
  • Keeps the average cost of localizing a bug to about $0.26 ($0.257 as reported).
  • Outperforms existing module-level LLM approaches by a large margin on project-level designs.
  • Provides a scalable method applicable to other complex hardware designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The slicing technique might apply to debugging other large codebases in software engineering.
  • Future work could combine this with simulation traces for even better accuracy.
  • It suggests LLMs can handle hardware-specific reasoning when given structured context.

Load-bearing premise

That the dataflow-based blockization and Blues algorithm supply the right local context for LLMs to effectively replicate human debugging without overlooking important dependencies across blocks.

What would settle it

Testing the framework on another large processor design with a known set of injected bugs and measuring whether the top-1 hit rate remains significantly higher than baseline methods.
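The Top-1 hit count in such a replication can be computed as in the minimal sketch below. The bug identifiers, block names, and rankings are hypothetical placeholders, not data from the paper.

```python
from typing import Dict, List


def top_k_hits(rankings: Dict[str, List[str]], truth: Dict[str, str], k: int = 1) -> int:
    """Count bugs whose true faulty location is among the first k ranked candidates."""
    return sum(1 for bug, ranked in rankings.items() if truth[bug] in ranked[:k])


# Hypothetical ranked suspect lists per injected bug (best candidate first).
rankings = {
    "bug_a": ["blk_3", "blk_7"],
    "bug_b": ["blk_1", "blk_2"],
    "bug_c": ["blk_9", "blk_4"],
}
# Hypothetical ground-truth faulty block for each bug.
truth = {"bug_a": "blk_3", "bug_b": "blk_2", "bug_c": "blk_5"}

top1 = top_k_hits(rankings, truth, k=1)
top2 = top_k_hits(rankings, truth, k=2)
print(top1, top2)
```

Running the same function over the baseline's rankings on the identical bug set gives the two counts that the comparison rests on.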

Figures

Figures reproduced from arXiv: 2605.17290 by Deheng Yang, Guangda Zhang, Jiang Wu, Jianjun Xu, Jiayu He, Xiaoguang Mao, Yan Lei, Yihao Qin, Zizhen Liu.

Figure 1: Overview of BluesFL.
§3.2 Dataflow-based Code Blockization. Definition 1 (Code Block). Let 𝐿 = {𝑙1, 𝑙2, …, 𝑙𝑚} denote the set of all code lines in the HDL source code. A code block 𝑏 is defined as a subset of lines 𝑏 ⊆ 𝐿 such that for any two distinct blocks 𝑏𝑖, 𝑏𝑗, we have 𝑏𝑖 ∩ 𝑏𝑗 = ∅. The set of all code blocks is denoted as 𝐵 = {𝑏1, 𝑏2, …, 𝑏𝑛}. Based on Definition 1, we propose a dataflow-base…
Figure 3: The prompt template of BluesFL. We prompt LLM reasoning and make decisions via tool calls at each state (𝑏, 𝑡). As shown in …
Figure 4: Histogram of Block Sizes in Ibex (Log Y Scale).
Figure 5: Distribution of the number of checked blocks across …
Figure 6: A case to show how …
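Definition 1, quoted with Figure 1 above, requires every code block to be a subset of the source lines and any two distinct blocks to be disjoint. A minimal sketch of that check follows; the function name and the line-number representation are assumptions for illustration, and the sketch does not attempt the dataflow-based blockization itself.

```python
from itertools import combinations
from typing import List, Set


def is_valid_blocking(all_lines: Set[int], blocks: List[Set[int]]) -> bool:
    """True if each block is a subset of all_lines and blocks are pairwise disjoint."""
    if any(not block <= all_lines for block in blocks):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(blocks, 2))


# Hypothetical 10-line source file, with lines identified by number.
lines = set(range(1, 11))
disjoint_blocks = [{1, 2, 3}, {4, 5}, {8}]
overlapping_blocks = [{1, 2, 3}, {3, 4}]   # line 3 appears twice
out_of_range_blocks = [{1, 2}, {11}]       # line 11 is not in the file

print(is_valid_blocking(lines, disjoint_blocks))
print(is_valid_blocking(lines, overlapping_blocks))
print(is_valid_blocking(lines, out_of_range_blocks))
```

Note that the definition as quoted asks only for disjointness, so the check does not require the blocks to cover every line.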
Original abstract

Fault localization in modern processor design code is a critical yet time-consuming step during processor verification. While recent advances in LLM-based techniques for module-level hardware design have shown promising results, automatically localizing bugs in large-scale, project-level processor designs remains challenging. In this paper, we present BluesFL, a novel block-level LLM-based fault localization framework for processor designs. Inspired by the way engineers debug processors, we first propose a dataflow-based code blockization approach to guide LLMs to focus on critical local code context. We further propose a Block-Level Instruction-Oriented Slicing (Blues) algorithm that enables LLMs to mimic human reasoning by analyzing instruction execution paths and processor states. We evaluate BluesFL on a real-world RISC-V processor core comprising 19K lines of SystemVerilog code. Experimental results demonstrate that BluesFL correctly localizes 24 bugs at Top-1, achieving 242.9% improvement over the existing state-of-the-art (7 bugs). Cost analysis shows that BluesFL requires an average of only $0.257 to localize a single bug.
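The 242.9% figure follows directly from the two Top-1 counts in the abstract. The short check below reproduces the arithmetic; the function name is illustrative only.

```python
def relative_improvement(new: int, baseline: int) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


improvement = relative_improvement(24, 7)  # (24 - 7) / 7 * 100
ratio = 24 / 7                             # how many times the baseline count

print(round(improvement, 1))  # 242.9
print(round(ratio, 2))        # 3.43
```

In other words, BluesFL localizes roughly 3.4 times as many bugs at Top-1 as the baseline, which is the same result as a 242.9% improvement.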

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents BluesFL, a block-level LLM-based fault localization framework for processor designs. It proposes a dataflow-based code blockization approach and a Block-Level Instruction-Oriented Slicing (Blues) algorithm to enable LLMs to focus on critical local contexts and mimic human debugging by analyzing instruction execution paths and processor states. Evaluated on a real-world 19K-line RISC-V SystemVerilog core, BluesFL achieves Top-1 localization of 24 bugs, a 242.9% improvement over the prior state-of-the-art (7 bugs), with an average cost of $0.257 per bug.

Significance. If the central performance claims hold under fair and controlled comparisons, the work has moderate-to-high significance for scaling automated fault localization to project-level hardware designs. The human-inspired slicing and blockization techniques address a genuine scalability gap in LLM-based hardware debugging, and the low per-bug cost supports practical adoption. The evaluation on a real 19K-line core is a strength, though the absence of detailed ablations and baseline equivalence details limits the immediate impact.

major comments (3)
  1. [Abstract and Evaluation] Abstract and Evaluation section: the headline claim of 242.9% improvement (24 vs. 7 Top-1 localizations) is load-bearing for the paper's contribution, yet the manuscript does not state whether the SOTA baseline received equivalent dataflow-based blockization and Blues slicing inputs. If the baseline operated on raw module- or file-level code while BluesFL used pre-sliced blocks, the reported gain conflates algorithmic innovation with input-format differences.
  2. [§4 and §5] §4 (Experimental Setup) and §5 (Results): no ablation is reported that isolates the effect of the Block-Level Instruction-Oriented Slicing algorithm from the blockization step or from standard LLM prompting. Without such controls, it is unclear whether the increase from 7 to 24 correctly localized bugs is attributable to the proposed Blues algorithm rather than other factors.
  3. [Evaluation] Evaluation section: the comparison lacks explicit details on bug selection criteria, whether identical bug instances were used across methods, the LLM backend and prompt configurations applied to the SOTA baseline, and any statistical significance testing for the Top-1 counts.
minor comments (2)
  1. [Abstract] Abstract: the reference to 'existing state-of-the-art (7 bugs)' should include a citation to the specific prior work being compared.
  2. [Figures/Tables] Figures: additional figures illustrating the slicing process and example bug localizations would make it easier to follow how the instruction-oriented analysis operates.
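One way to carry out the significance testing requested in major comment 3, assuming both methods are run on identical bug instances, is an exact McNemar test on the bugs where the two methods disagree. The sketch below is a suggestion rather than anything the paper reports, and the discordant counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact_p(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value.

    only_a: bugs localized at Top-1 by method A but not by method B.
    only_b: bugs localized at Top-1 by method B but not by method A.
    Under the null hypothesis each discordant bug is equally likely
    to favor either method, so the count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, not taken from the paper.
p_value = mcnemar_exact_p(only_a=10, only_b=2)
print(p_value)
```

The test uses only the discordant pairs, which is why it depends on confirming that the same bug instances were given to every method.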

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and completeness of the evaluation.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] the headline claim of 242.9% improvement (24 vs. 7 Top-1 localizations) is load-bearing for the paper's contribution, yet the manuscript does not state whether the SOTA baseline received equivalent dataflow-based blockization and Blues slicing inputs. If the baseline operated on raw module- or file-level code while BluesFL used pre-sliced blocks, the reported gain conflates algorithmic innovation with input-format differences.

    Authors: The SOTA baseline was evaluated using the original module- and file-level code inputs as described in the prior work, without our dataflow-based blockization or Blues slicing. This is the appropriate comparison to demonstrate the benefit of the proposed techniques. We will revise the abstract and evaluation section to explicitly state the input formats used for the baseline versus BluesFL. revision: yes

  2. Referee: [§4 and §5] no ablation is reported that isolates the effect of the Block-Level Instruction-Oriented Slicing algorithm from the blockization step or from standard LLM prompting. Without such controls, it is unclear whether the increase from 7 to 24 correctly localized bugs is attributable to the proposed Blues algorithm rather than other factors.

    Authors: We agree that isolating the contributions would strengthen the claims. In the revised manuscript we will add an ablation study comparing standard LLM prompting on raw code, blockization without Blues slicing, and the full BluesFL pipeline to attribute performance gains to each component. revision: yes

  3. Referee: [Evaluation] the comparison lacks explicit details on bug selection criteria, whether identical bug instances were used across methods, the LLM backend and prompt configurations applied to the SOTA baseline, and any statistical significance testing for the Top-1 counts.

    Authors: We will expand the evaluation section to detail bug selection from real verification failures, confirm identical bug instances across methods, specify the LLM backend and prompt configurations used for the baseline, and report statistical significance testing on the Top-1 results. revision: yes

Circularity Check

0 steps flagged

Empirical framework with no derivation chain or fitted inputs

Full rationale

The paper proposes BluesFL as an empirical LLM-based fault localization method using dataflow blockization and the Blues slicing algorithm, then reports experimental Top-1 localization counts (24 vs. 7) on a 19K-line RISC-V core. No equations, parameters fitted to subsets of data, or self-citation chains are described that would reduce any claimed result to its own inputs by construction. The performance numbers are presented as direct experimental outcomes rather than derivations, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the contribution is an empirical engineering method rather than a theoretical derivation.



