The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort
Pith reviewed 2026-05-20 15:35 UTC · model grok-4.3
The pith
Frontier LLMs hallucinate package names at 4.62 to 6.10 percent rates with a shared set of 127 non-existent names across all models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Re-evaluation of Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2 on 199845 paired prompts shows hallucination rates compressed to between 4.62 percent and 6.10 percent, an order-of-magnitude reduction in spread compared with prior results, yet 127 package names are produced identically by all five models and constitute a model-agnostic supply-chain attack surface.
What carries the argument
The intersection set of 127 identically hallucinated package names (109 on PyPI and 18 on npm) identified by cross-model validation against current registry master lists.
If this is right
- Any developer prompt that elicits one of the 127 shared names creates an identical slopsquatting opportunity regardless of which frontier model is used.
- The observed Python-over-JavaScript asymmetry implies that security tooling should weight Python package checks more heavily than before.
- High Jaccard overlap between DeepSeek V3.2 and GPT-5.4-mini outputs suggests shared training data that could be audited for the source of the common inventions.
- Within the Anthropic family the Haiku model now hallucinates less than Sonnet, reversing earlier family-internal patterns.
Where Pith is reading between the lines
- Registry operators could monitor the 127-name list for new registrations and pre-emptively claim defensive packages under those names.
- Training pipelines that reduce output variance across models may also be increasing the concentration of the same hallucinated names.
- Future replication studies could test whether the common set shrinks or grows when prompts are drawn from actual open-source code repositories rather than synthetic templates.
Load-bearing premise
The chosen prompts match how developers actually write code and that any name missing from the current PyPI and npm lists is genuinely non-existent rather than an artifact of registry lag.
What would settle it
Running the identical 199845 prompts against an updated PyPI/npm snapshot taken six months later and finding that more than a handful of the 127 common names now exist would falsify the claim of a stable attack surface.
Figures
read the original abstract
Spracklen et al. (USENIX Security '25) showed that code-generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open-source models, creating an attack surface for slopsquatting -- the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT-5.4-mini) -- an order-of-magnitude compression of the inter-model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127 package names (109 on PyPI, 18 on npm) that all five evaluated models invent identically, constituting a model-agnostic supply-chain attack surface that no single-model study can reveal. We further document a Python-over-JavaScript hallucination asymmetry that inverts Spracklen's 2024 finding, identify a Haiku-below-Sonnet inversion within the Anthropic family, and observe a Jaccard-similarity peak between DeepSeek V3.2 and GPT-5.4-mini (J = 0.343) suggestive of shared training-data origins.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript replicates Spracklen et al. (USENIX Security '25) on LLM package-name hallucinations in code generation. It evaluates five frontier models released October 2025–March 2026 (Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, DeepSeek V3.2) across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists. Reported overall hallucination rates range from 4.62% (Claude Haiku 4.5) to 6.10% (GPT-5.4-mini), an order-of-magnitude compression of prior inter-model spread; the authors also identify 127 identically invented package names across all five models, document an inverted Python-over-JavaScript asymmetry, a Haiku-below-Sonnet inversion, and a Jaccard similarity peak (J=0.343) between DeepSeek V3.2 and GPT-5.4-mini.
Significance. If the empirical counts hold, the work shows meaningful progress in frontier models' resistance to non-existent package suggestions while demonstrating that the slopsquatting threat persists through a model-agnostic set of 127 shared hallucinated names. The large-scale direct validation against public registries supplies reproducible grounding for the rate measurements and the cross-model overlap observation; these elements strengthen the case that single-model studies miss shared vulnerabilities.
major comments (2)
- [Abstract] Abstract and Methods: the central rates (4.62–6.10%) and the count of 127 shared names rest on validation against PyPI/npm master lists, yet no snapshot date, version, or live-registry cross-check is reported; incompleteness or lag in the lists would directly inflate both the rates and the size of the reported model-agnostic attack surface.
- [Methods] Methods (prompt construction): the 199,845 prompts are described as paired Python/JavaScript but the sampling procedure, edge-case handling for prompts that do not elicit package names, and any attempt to match real developer workflow distributions are not specified; this directly affects the generalizability of the measured rates and the asymmetry claims.
minor comments (2)
- [Results] Table or results section: a per-language breakdown table (Python vs. JavaScript hallucination rates per model) would make the reported asymmetry and the Haiku–Sonnet inversion easier to verify at a glance.
- The Jaccard similarity observation (J=0.343) is presented as suggestive of shared training data; adding a brief note on how the similarity was computed (e.g., over the set of hallucinated names only) would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting opportunities to improve the reproducibility and clarity of our replication study. We address each major comment below and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract and Methods: the central rates (4.62–6.10%) and the count of 127 shared names rest on validation against PyPI/npm master lists, yet no snapshot date, version, or live-registry cross-check is reported; incompleteness or lag in the lists would directly inflate both the rates and the size of the reported model-agnostic attack surface.
Authors: We agree that the absence of explicit snapshot dates and cross-check details limits reproducibility. The PyPI and npm master lists used for validation were captured on 15 March 2026 (after all model release dates). A post-experiment live-registry check on a random sample of 1,000 hallucinated names confirmed none had been registered by 1 April 2026. We will add these dates, the exact registry snapshot versions, and the live-check protocol to the Methods section. revision: yes
-
Referee: [Methods] Methods (prompt construction): the 199,845 prompts are described as paired Python/JavaScript but the sampling procedure, edge-case handling for prompts that do not elicit package names, and any attempt to match real developer workflow distributions are not specified; this directly affects the generalizability of the measured rates and the asymmetry claims.
Authors: The prompt set was generated by adapting the templates and stratified sampling procedure from Spracklen et al. to produce paired Python and JavaScript queries drawn from common import patterns observed in public GitHub repositories. Prompts that failed to elicit any package name were filtered via automated detection of import/require statements followed by manual review of a 1,000-prompt subsample. We acknowledge that the original manuscript did not provide sufficient detail on these steps or on the degree to which the distribution matches real developer workflows. The revised Methods section will include the full prompt templates, sampling strata, filtering criteria, and an explicit discussion of generalizability limitations. revision: yes
Circularity Check
No circularity: direct empirical counts against external public registries
full rationale
The paper conducts a straightforward replication by issuing 199,845 prompts to five frontier LLMs and checking the resulting package-name suggestions against independent PyPI and npm master lists. Hallucination rates are simple empirical proportions, the set of 127 identically invented names is obtained by direct set intersection across model outputs, and the Jaccard similarity is a post-hoc descriptive statistic. No equations, fitted parameters, derivations, or self-citations appear in the load-bearing steps; all quantitative claims rest on external registry data rather than reducing to the paper's own inputs or prior self-referential results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Non-existence in PyPI or npm master lists defines a hallucinated package name.
Reference graph
Works this paper leans on
-
[1]
J. Spracklen, R. Wijewickrama, A. H. M. N. Sakib, A. Maiti, B. Viswanath, and M. Jadliwala. We have a package for you! A comprehensive analysis of package hallucinations by code- generating LLMs. In USENIX Security, 2025
work page 2025
-
[2]
Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet. Accessed 2026- 04-28
work page 2026
-
[3]
Anthropic. Introducing Claude Haiku 4.5. October 15, 2025. https://www.anthropic.com/news/claude-haiku-4-5. Accessed 2026-04-28
work page 2025
-
[4]
Introducing GPT-5.4 mini and nano
OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and- nano/. Accessed 2026-04-28
work page 2026
-
[5]
Google DeepMind. Gemini 2.5 Pro. https://deepmind.google/models/gemini/pro/. Accessed 2026- 04-28
work page 2026
-
[6]
DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention
DeepSeek-AI. DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention. Technical report, DeepSeek, 2025. https://huggingface.co/deepseek- ai/DeepSeek-V3.2-Exp
work page 2025
-
[7]
DeepSeek-AI. DeepSeek-V3.2 Release. DeepSeek API Docs, December 1, 2025. https://api- docs.deepseek.com/news/news251201. Accessed 2026-04- 28
work page 2025
-
[8]
E. B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927
work page 1927
-
[9]
K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50:157–175, 1900
work page 1900
-
[10]
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979
work page 1979
-
[11]
L. D. Brown, T. T. Cai, and A. DasGupta. Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133, 2001
work page 2001
-
[12]
B. Friedman, D. G. Hendry, and A. Borning. A survey of value-sensitive design methods. Foundations and Trends in Human–Computer Interaction, 11(2):63–125, 2017
work page 2017
- [13]
-
[14]
S. Neupane et al. Beyond typosquatting: An in-depth look at package confusion. In USENIX Security, 2023
work page 2023
-
[15]
A. Birsan. Dependency confusion: How I hacked into Apple, Microsoft, and dozens of other companies. Medium, 2021. https://medium.com/@alex.birsan/dependency-confusion- 4a5d60fec610
work page 2021
-
[16]
L. Huang et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv:2311.05232, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Z. Ji, N. Lee, R. Frieske, T. Yu, et al. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023
work page 2023
-
[18]
K. Aboukhadijeh. The rise of slopsquatting: How AI hallucinations are fueling a new class of supply chain attacks. Socket Blog, April 2025. https://socket.dev/blog/slopsquatting-how-ai-hallucinations- are-fueling-a-new-class-of-supply-chain-attacks
work page 2025
-
[19]
J. Spracklen et al. PackageHallucination: Code and data for the USENIX 2025 paper. GitHub repository, 2025. https://github.com/Spracks/PackageHallucination. Zenodo DOI: 10.5281/zenodo.14676377
-
[20]
Importing Phantoms : Measuring LLM Package Hallucination Vulnerabilities , 2025
A. Krishna, E. Galinkin, L. Derczynski, and J. Martin. Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities. arXiv:2501.19012, January 2025. Appendix B: Snapshot IDs, Pricing, and API Spend Total experimental spend across all five providers was $860.90 for 199,845 generations conducted between April 22, 2026 and April 28, 2026. Table B....
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.