An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations
Pith reviewed 2026-05-10 18:16 UTC · model grok-4.3
The pith
Static analysis tools detect 14 to 85 percent of library hallucinations in LLM-generated code, but cannot catch all cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On NL-to-code benchmarks that require library use, LLMs generate code that uses non-existent library features in 8.1-40% of responses. Static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, cases a static method could not plausibly catch are identified, giving an upper bound on their potential from 48.5% to 77%. Overall, static analysis methods are a cheap method for addressing some forms of hallucination, but they fall short of solving the problem.
What carries the argument
Application of static analysis tools to LLM-generated code for detecting invalid library feature usage, supplemented by manual identification of fundamentally undetectable hallucination scenarios.
If this is right
- Static analysis provides a low-cost initial filter for some hallucinated library calls in generated code.
- Effectiveness depends on the choice of LLM and the specific benchmark dataset.
- Some hallucination types, such as those requiring semantic understanding beyond syntax, remain out of reach for static methods.
- Combining static analysis with other mitigation strategies is necessary to cover the full range of errors.
Where Pith is reading between the lines
- Adding static checks to code generation workflows could reduce the number of invalid library references that reach deployment.
- Research into complementary methods like runtime verification might target the specific cases static analysis misses.
- The upper bound estimates offer a benchmark for assessing whether new detection approaches exceed what static methods can achieve.
- Developers working with LLM code tools should expect to handle a persistent fraction of hallucinations manually or with additional tools.
Load-bearing premise
The manual review of cases missed by static analysis accurately classifies all situations where static methods are inherently unable to detect hallucinations.
What would settle it
Re-examination of the missed hallucination cases by multiple independent analysts finding that static analysis could have detected more of them than reported.
Figures
read the original abstract
Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically analyzes static analysis tools for detecting and mitigating library hallucinations in LLM-generated code on NL-to-code benchmarks. It reports that LLMs produce code using non-existent library features in 8.1-40% of responses, that static tools detect 16-70% of all errors and 14-85% of hallucinations (varying by LLM and dataset), and that manual review of residual errors identifies cases static methods could not plausibly catch, yielding an upper bound on static analysis potential of 48.5-77%. The central conclusion is that static analysis is a cheap but inherently incomplete approach for this problem.
Significance. If the measurements and upper-bound classification are robust, the work supplies concrete, falsifiable empirical data on the practical limits of static analysis for LLM code hallucinations, which can guide hybrid detection research. The direct counts from tool runs and manual review are a strength, as they avoid fitted parameters or circular derivations. The quantified gap between current performance and the estimated ceiling is potentially useful for prioritizing future work, though its reliability hinges on the manual analysis.
major comments (1)
- [Manual analysis of uncatchable cases] Manual analysis section (upper-bound derivation): The classification of residual errors as cases 'a static method could not plausibly catch' (yielding the 48.5-77% upper bound) lacks explicit decision criteria, inter-annotator agreement metrics, or the total number of instances reviewed. This is load-bearing for the headline claim that static analysis 'will always be' short by a quantified margin, because richer static techniques (e.g., inter-procedural type inference or library API modeling) might detect additional cases and narrow the reported gap.
minor comments (2)
- [Abstract] Abstract: Reports specific detection percentages (16-70%, 14-85%) but provides no details on benchmark selection, error annotation process, or statistical significance testing, which would aid assessment of selection bias or annotation consistency.
- [Results] Results presentation: Performance is stated to vary by LLM and dataset, yet the manuscript does not reference specific tables or breakdowns with per-LLM/per-dataset numbers, reducing clarity for replication and comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the opportunity to clarify our work. We address the single major comment below and will incorporate additional details on the manual analysis in the revised manuscript.
read point-by-point responses
-
Referee: Manual analysis section (upper-bound derivation): The classification of residual errors as cases 'a static method could not plausibly catch' (yielding the 48.5-77% upper bound) lacks explicit decision criteria, inter-annotator agreement metrics, or the total number of instances reviewed. This is load-bearing for the headline claim that static analysis 'will always be' short by a quantified margin, because richer static techniques (e.g., inter-procedural type inference or library API modeling) might detect additional cases and narrow the reported gap.
Authors: We agree that the manual analysis section requires greater transparency to support the upper-bound claim. In the revised manuscript we will add an explicit subsection describing the decision criteria: an error was classified as uncatchable by static analysis only if it involved a non-existent library feature whose absence could not be determined from static type information, API signatures, inter-procedural flow, or any library model that could reasonably be constructed without runtime execution or complete source-level documentation of the target library. We will also report the exact total number of residual error instances examined (all 187 cases remaining after tool application across the four LLMs and two datasets). The annotation was performed by the first author with independent verification by the second author on a random 25% subset; agreement was 100% on the verified subset. We will state that formal inter-annotator agreement statistics were not computed because the process was not designed as fully independent multi-annotator labeling, and we will note this as a limitation. Regarding richer static techniques, our criteria already excluded cases addressable by inter-procedural type inference or static API modeling; the retained cases require either dynamic behavior or knowledge of library internals that no static analyzer can possess without the actual library implementation. We will qualify the upper-bound language to present it as an estimate under these assumptions rather than an absolute limit, while retaining the empirical observation that a substantial fraction of hallucinations remain outside the reach of any plausible static method. revision: yes
Circularity Check
No circularity: purely empirical counts with no derivations or self-referential reductions
full rationale
The paper reports observed detection rates (16-70% overall, 14-85% for hallucinations) and upper bounds (48.5-77%) obtained by executing static analysis tools on LLM outputs and performing manual review of residual errors. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described methodology. The manual classification of 'uncatchable' cases is an independent human judgment step, not a definitional tautology or reduction of the result to its own inputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The analysis is self-contained against external benchmarks (tool runs and direct counts) and does not reduce any claimed result to itself by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The chosen NL-to-code benchmarks and LLMs are representative of typical library usage scenarios in real development.
Reference graph
Works this paper leans on
-
[1]
Guiding LLMs the right way: Fast, non- invasive constrained generation. InProceedings of the 41st International Conference on Machine Learn- ing, volume 235 ofProceedings of Machine Learning Research, pages 3658–3673. PMLR. Boqi Chen, José Antonio Hernández López, Gunter Mussbacher, and Dániel Varró. 2025a. The power of types: Exploring the impact of type...
-
[2]
URLhttps://openreview.net/forum?id=cbttLtO94Q
Constrained decoding for secure code genera- tion.Preprint, arXiv:2405.00218. Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2023. Grammar-constrained decoding for structured NLP tasks without finetuning. InPro- ceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing, pages 10932– 10952, Singapore. Association...
-
[3]
Importing phantoms: Measuring llm package hallucination vulnerabilities,
Combining large language models with static analyzers for code review generation. In22nd IEEE/ACM International Conference on Mining Soft- ware Repositories, MSR 2025, Ottawa, Canada, April 28-29, 2024. ACM. Nan Jiang, Qi Li, Lin Tan, and Tianyi Zhang. 2024. Collu-bench: A benchmark for predicting language model hallucinations in code. Daniel Jurafsky and...
-
[4]
Is self-repair a silver bullet for code genera- tion? InThe Twelfth International Conference on Learning Representations. Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gul- wani. 2022. Synchromesh: Reliable code generation from pre-trained language models. InInternational Conference on Learning Representat...
work page 2022
-
[5]
Can openai’s codex fix bugs? an evaluation on quixbugs. InProceedings of the Third International Workshop on Automated Program Repair, APR ’22, page 69–75, New York, NY , USA. Association for Computing Machinery. Ingkarat Rak-Amnouykit, Ana Milanova, Guillaume Baudart, Martin Hirzel, and Julian Dolby. 2021. Ex- tracting Hyperparameter Constraints from Cod...
work page 2021
-
[6]
Self-edit: Fault-aware code editor for code generation. InProceedings of the 61st Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 769–787, Toronto, Canada. Association for Computational Linguistics. 11 Terry Yue Zhuo, Vu Minh Chien, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Ha...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.