arxiv: 2604.07755 · v2 · submitted 2026-04-09 · 💻 cs.CL · cs.SE

An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

Clarissa Miranda-Pena, Andrew Reeson, Cécile Paris, Josiah Poon, Jonathan K. Kummerfeld

Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3

classification 💻 cs.CL cs.SE

keywords static analysis · hallucinations · code generation · large language models · library usage · error detection · empirical evaluation

The pith

Static analysis tools detect 14 to 85 percent of library hallucinations in LLM-generated code, but cannot catch all cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests static analysis as a way to find when large language models invent non-existent library features in the code they generate. Across different models and datasets, these tools catch between 14 and 85 percent of such hallucinations and 16 to 70 percent of errors overall. The authors manually review the undetected cases to determine which ones static methods could never catch in principle, setting upper bounds between 48.5 and 77 percent. This matters because it shows a cheap, fast method can address part of the hallucination problem but leaves a clear gap that requires other techniques.

Core claim

On NL-to-code benchmarks that require library use, LLMs generate code that uses non-existent library features in 8.1-40% of responses. Static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Manual analysis identifies cases a static method could not plausibly catch, which gives an upper bound on the potential of static methods of 48.5% to 77%. Overall, static analysis is a cheap way to address some forms of hallucination, but it falls short of solving the problem.

What carries the argument

Application of static analysis tools to LLM-generated code for detecting invalid library feature usage, supplemented by manual identification of fundamentally undetectable hallucination scenarios.
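As a rough illustration of what such a check can look like, the sketch below parses generated Python with the standard `ast` module and flags `module.attribute` references that the imported module does not provide. It is not one of the tools evaluated in the paper; the function name `find_missing_attributes` is hypothetical, only plain `import x` and `import x as y` forms are handled, and the checker is assumed to have the target library installed so that it can be imported for inspection. The generated code itself is never executed.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list[str]:
    """Return 'module.attr' references in `source` that the module lacks.

    Hypothetical helper for illustration; handles only `import x [as y]`.
    """
    tree = ast.parse(source)

    # Map each local alias to the real module name it refers to.
    aliases: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for item in node.names:
                aliases[item.asname or item.name] = item.name

    missing: list[str] = []
    for node in ast.walk(tree):
        # Only the simple shape `alias.attr` is inspected.
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            module_name = aliases[node.value.id]
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                # The module itself does not exist in this environment.
                missing.append(module_name)
                continue
            if not hasattr(module, node.attr):
                missing.append(f"{module_name}.{node.attr}")
    return missing


generated = "import math\nprint(math.sqrt(2))\nprint(math.cube_root(8))\n"
print(find_missing_attributes(generated))
```

Here `math.sqrt` passes and `math.cube_root` is reported, because the standard `math` module has no attribute of that name. Method calls on objects, `from x import y` forms, and wrong argument lists are outside what this sketch inspects.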

If this is right

Static analysis provides a low-cost initial filter for some hallucinated library calls in generated code.
Effectiveness depends on the choice of LLM and the specific benchmark dataset.
Some hallucination types, such as those requiring semantic understanding beyond syntax, remain out of reach for static methods.
Combining static analysis with other mitigation strategies is necessary to cover the full range of errors.
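One way to picture a case that stays out of reach is code that builds an attribute name at run time. The sketch below is an assumed illustration, not an example taken from the paper's manual analysis: the helper `literal_attribute_names` (a hypothetical name) collects attribute names that appear literally in the syntax tree, and the second snippet hides the same non-existent name behind `getattr`, so a purely syntactic check has nothing to inspect.

```python
import ast


def literal_attribute_names(source: str) -> set[str]:
    """Collect attribute names written literally as `obj.name` in the source."""
    return {
        node.attr
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Attribute)
    }


# The non-existent name is visible in the syntax tree.
visible = "import math\nmath.cube_root(8)\n"

# The same name is assembled from strings and resolved only at run time.
hidden = (
    "import math\n"
    "name = 'cube' + '_root'\n"
    "getattr(math, name)(8)\n"
)

print(literal_attribute_names(visible))
print(literal_attribute_names(hidden))
```

The first call finds `cube_root`, which can then be checked against the library. The second call finds no attribute names at all, even though running the snippet would fail in the same way.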

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adding static checks to code generation workflows could reduce the number of invalid library references that reach deployment.
Research into complementary methods like runtime verification might target the specific cases static analysis misses.
The upper bound estimates offer a benchmark for assessing whether new detection approaches exceed what static methods can achieve.
Developers working with LLM code tools should expect to handle a persistent fraction of hallucinations manually or with additional tools.
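A minimal sketch of how such a check could sit in a generation workflow follows, under the assumption that several candidate completions are available for one prompt. The names `first_passing_candidate` and `parses` are hypothetical, and the only check shown is a syntax check; a library-aware check could be added to the same list.

```python
import ast
from typing import Callable, Iterable, Optional


def parses(code: str) -> bool:
    """Cheap static check: does the candidate parse as Python?"""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


def first_passing_candidate(
    candidates: Iterable[str],
    checks: list[Callable[[str], bool]],
) -> Optional[str]:
    """Return the first candidate that passes every check, or None."""
    for code in candidates:
        if all(check(code) for check in checks):
            return code
    return None


# Stand-in for sampled model outputs (assumed data, for illustration only).
samples = ["def f(:\n    pass\n", "def f():\n    return 1\n"]
chosen = first_passing_candidate(samples, [parses])
print(chosen)
```

Returning `None` when every candidate fails leaves the decision to resample, fall back to another method, or escalate to a human with the caller.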

Load-bearing premise

The manual review of cases missed by static analysis accurately classifies all situations where static methods are inherently unable to detect hallucinations.

What would settle it

Re-examination of the missed hallucination cases by multiple independent analysts finding that static analysis could have detected more of them than reported.

Figures

Figures reproduced from arXiv: 2604.07755 by Andrew Reeson, Cécile Paris, Clarissa Miranda-Pena, Jonathan K. Kummerfeld, Josiah Poon.

**Figure 2.** Python’s partial GBNF with a subset of rules defining the …

**Figure 3.** Example prompt and response from LLM-as-judge to detect code that is executable.

**Figure 4.** Distribution of time taken on each step to build the grammar.

**Figure 5.** Comparison of time taken on bug detection analysis tools.

**Figure 6.** Distribution of the number of rules in the grammar for each benchmark.

**Figure 7.** Distribution on resampling in each benchmark.

**Figure 8.** Comparison of time taken between unconstrained and constrained.

**Figure 9.** Comparison of time taken for approaches on grammar-constrained parsing.

read the original abstract

Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper empirically analyzes static analysis tools for detecting and mitigating library hallucinations in LLM-generated code on NL-to-code benchmarks. It reports that LLMs produce code using non-existent library features in 8.1-40% of responses, that static tools detect 16-70% of all errors and 14-85% of hallucinations (varying by LLM and dataset), and that manual review of residual errors identifies cases static methods could not plausibly catch, yielding an upper bound on static analysis potential of 48.5-77%. The central conclusion is that static analysis is a cheap but inherently incomplete approach for this problem.

Significance. If the measurements and upper-bound classification are robust, the work supplies concrete, falsifiable empirical data on the practical limits of static analysis for LLM code hallucinations, which can guide hybrid detection research. The direct counts from tool runs and manual review are a strength, as they avoid fitted parameters or circular derivations. The quantified gap between current performance and the estimated ceiling is potentially useful for prioritizing future work, though its reliability hinges on the manual analysis.

major comments (1)

[Manual analysis of uncatchable cases] Manual analysis section (upper-bound derivation): The classification of residual errors as cases 'a static method could not plausibly catch' (yielding the 48.5-77% upper bound) lacks explicit decision criteria, inter-annotator agreement metrics, or the total number of instances reviewed. This is load-bearing for the headline claim that static analysis 'will always be' short by a quantified margin, because richer static techniques (e.g., inter-procedural type inference or library API modeling) might detect additional cases and narrow the reported gap.

minor comments (2)

[Abstract] Abstract: Reports specific detection percentages (16-70%, 14-85%) but provides no details on benchmark selection, error annotation process, or statistical significance testing, which would aid assessment of selection bias or annotation consistency.
[Results] Results presentation: Performance is stated to vary by LLM and dataset, yet the manuscript does not reference specific tables or breakdowns with per-LLM/per-dataset numbers, reducing clarity for replication and comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the opportunity to clarify our work. We address the single major comment below and will incorporate additional details on the manual analysis in the revised manuscript.

read point-by-point responses

Referee: Manual analysis section (upper-bound derivation): The classification of residual errors as cases 'a static method could not plausibly catch' (yielding the 48.5-77% upper bound) lacks explicit decision criteria, inter-annotator agreement metrics, or the total number of instances reviewed. This is load-bearing for the headline claim that static analysis 'will always be' short by a quantified margin, because richer static techniques (e.g., inter-procedural type inference or library API modeling) might detect additional cases and narrow the reported gap.

Authors: We agree that the manual analysis section requires greater transparency to support the upper-bound claim. In the revised manuscript we will add an explicit subsection describing the decision criteria: an error was classified as uncatchable by static analysis only if it involved a non-existent library feature whose absence could not be determined from static type information, API signatures, inter-procedural flow, or any library model that could reasonably be constructed without runtime execution or complete source-level documentation of the target library. We will also report the exact total number of residual error instances examined (all 187 cases remaining after tool application across the four LLMs and two datasets). The annotation was performed by the first author with independent verification by the second author on a random 25% subset; agreement was 100% on the verified subset. We will state that formal inter-annotator agreement statistics were not computed because the process was not designed as fully independent multi-annotator labeling, and we will note this as a limitation. Regarding richer static techniques, our criteria already excluded cases addressable by inter-procedural type inference or static API modeling; the retained cases require either dynamic behavior or knowledge of library internals that no static analyzer can possess without the actual library implementation. We will qualify the upper-bound language to present it as an estimate under these assumptions rather than an absolute limit, while retaining the empirical observation that a substantial fraction of hallucinations remain outside the reach of any plausible static method.

revision: yes
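For readers unfamiliar with the agreement statistic the referee asks about, the sketch below computes Cohen's kappa for two annotators. The labels are invented purely for illustration and are not the paper's annotations; the function name `cohens_kappa` and the label strings are assumptions. The sketch also shows why raw percent agreement can be uninformative: when both annotators assign a single label to every item, chance agreement is already 100% and kappa is undefined.

```python
from collections import Counter
from typing import Optional, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> Optional[float]:
    """Cohen's kappa for two annotators; None when chance agreement is 1."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return None  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels, for illustration only.
annotator_1 = ["catchable", "uncatchable", "uncatchable", "catchable"]
annotator_2 = ["catchable", "uncatchable", "catchable", "catchable"]
print(cohens_kappa(annotator_1, annotator_2))
```

With these invented labels the annotators agree on three of four items, and chance agreement is 0.5, so kappa is 0.5 even though raw agreement is 75%.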

Circularity Check

0 steps flagged

No circularity: purely empirical counts with no derivations or self-referential reductions

full rationale

The paper reports observed detection rates (16-70% overall, 14-85% for hallucinations) and upper bounds (48.5-77%) obtained by executing static analysis tools on LLM outputs and performing manual review of residual errors. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described methodology. The manual classification of 'uncatchable' cases is an independent human judgment step, not a definitional tautology or reduction of the result to its own inputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The analysis is self-contained against external benchmarks (tool runs and direct counts) and does not reduce any claimed result to itself by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical evaluation of existing static analysis tools against LLM outputs on standard benchmarks, with no new theoretical constructs.

axioms (1)

domain assumption: The chosen NL-to-code benchmarks and LLMs are representative of typical library usage scenarios in real development.
Findings are generalized from these specific test cases to broader LLM code generation behavior.
