arxiv: 2605.17062 · v1 · pith:JYHV3VJYnew · submitted 2026-05-16 · 💻 cs.CR · cs.LG · cs.SE

The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort

Pith reviewed 2026-05-20 15:35 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.SE
keywords LLM hallucinations · package name invention · slopsquatting · supply chain security · code generation · Python JavaScript prompts · replication study · model overlap

The pith

Five frontier LLMs hallucinate package names at rates between 4.62% and 6.10%, and all five invent the same 127 non-existent names.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates earlier measurements of how code-generating LLMs invent package names absent from PyPI and npm. It finds that rates on five models released in 2025-2026 have tightened into a narrow band (4.62% to 6.10%), while a core group of 127 invented names appears identically in every model. This matters because the common list creates a single attack surface usable against any of the models, rather than requiring separate exploits per model. The work also records an inversion in which Python prompts now produce more hallucinations than JavaScript ones, and notes a Jaccard-similarity peak (J = 0.343) between DeepSeek V3.2 and GPT-5.4-mini.

Core claim

Re-evaluation of Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2 on 199,845 paired prompts shows hallucination rates compressed to between 4.62% and 6.10%, an order-of-magnitude reduction in spread compared with prior results. Yet 127 package names are produced identically by all five models and constitute a model-agnostic supply-chain attack surface.

What carries the argument

The intersection set of 127 identically hallucinated package names (109 on PyPI and 18 on npm) identified by cross-model validation against current registry master lists.
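
A minimal sketch of how such an intersection could be computed, assuming each model's hallucinated names have already been collected into one set per registry. The model labels and package names below are hypothetical placeholders, not names taken from the paper.

```python
def shared_names(per_model: dict[str, set[str]]) -> set[str]:
    """Return the names that appear in every model's hallucination set."""
    sets = list(per_model.values())
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical toy data: model labels and package names are placeholders.
pypi_hallucinations = {
    "model_a": {"fastjsonx", "pyauthkit", "datawranglr"},
    "model_b": {"fastjsonx", "pyauthkit", "torchlite-utils"},
    "model_c": {"pyauthkit", "fastjsonx"},
}

common = shared_names(pypi_hallucinations)
print(sorted(common))
```

Running the same function once on the PyPI sets and once on the npm sets would give a per-registry split of the kind the paper reports (109 and 18).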

If this is right

  • Any developer prompt that elicits one of the 127 shared names creates an identical slopsquatting opportunity regardless of which frontier model is used.
  • The observed Python-over-JavaScript asymmetry implies that security tooling should weight Python package checks more heavily than before.
  • The peak Jaccard overlap (J = 0.343) between DeepSeek V3.2 and GPT-5.4-mini outputs suggests shared training data that could be audited for the source of the common inventions.
  • Within the Anthropic family the Haiku model now hallucinates less than Sonnet, reversing earlier family-internal patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Registry operators could monitor the 127-name list for new registrations and pre-emptively claim defensive packages under those names.
  • Training pipelines that reduce output variance across models may also be increasing the concentration of the same hallucinated names.
  • Future replication studies could test whether the common set shrinks or grows when prompts are drawn from actual open-source code repositories rather than synthetic templates.

Load-bearing premise

That the chosen prompts match how developers actually write code, and that any name missing from the current PyPI and npm lists is genuinely non-existent rather than an artifact of registry lag.

What would settle it

Re-running the identical 199,845 prompts and validating the outputs against an updated PyPI/npm snapshot taken six months later, then finding that more than a handful of the 127 common names now exist, would falsify the claim of a stable attack surface.

Figures

Figures reproduced from arXiv: 2605.17062 by Aleksandr Churilov (Independent Researcher).

Figure 1: Slopsquatting attack chain. The developer issues a code …

Figure 2: Per-model hallucination rate with 95% Wilson confidence intervals (n ≈ 39,969 prompts per model; exact per-provider counts in Appendix B). The dashed reference line marks Spracklen's best 2024 commercial result (3.6%, GPT-4 Turbo); no 2026 frontier model has surpassed the best 2024 model, even as the worst-case rate has fallen sharply.

Figure 3: Pairwise Jaccard similarity between sets of unique hallucinated package names. Higher values indicate models tend t…
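
For readers who want to see what a 95% Wilson interval looks like at this sample size, here is a minimal sketch. The sample size follows the Figure 2 caption (n ≈ 39,969); the count of 1,847 hallucinated responses is a hypothetical value chosen only because it gives a rate close to 4.62%, and is not a figure from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical count: 1,847 hallucinated out of 39,969 prompts (about 4.62%).
low, high = wilson_interval(1847, 39969)
print(f"rate = {1847 / 39969:.4%}, 95% CI = [{low:.4%}, {high:.4%}]")
```

At roughly 40,000 prompts per model the interval is only a few tenths of a percentage point wide, so differences of about one percentage point between models would be distinguishable under these assumptions.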
Abstract

Spracklen et al. (USENIX Security '25) showed that code-generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open-source models, creating an attack surface for slopsquatting -- the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT-5.4-mini) -- an order-of-magnitude compression of the inter-model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127 package names (109 on PyPI, 18 on npm) that all five evaluated models invent identically, constituting a model-agnostic supply-chain attack surface that no single-model study can reveal. We further document a Python-over-JavaScript hallucination asymmetry that inverts Spracklen's 2024 finding, identify a Haiku-below-Sonnet inversion within the Anthropic family, and observe a Jaccard-similarity peak between DeepSeek V3.2 and GPT-5.4-mini (J = 0.343) suggestive of shared training-data origins.
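
The "order-of-magnitude compression" can be checked with simple arithmetic. The sketch below treats the 5.2% and 21.7% figures quoted in the abstract as the endpoints of the earlier range, which is an assumption about how the comparison was made rather than something the abstract spells out.

```python
# Endpoints quoted in the abstract, in percent.
prior_low, prior_high = 5.2, 21.7   # Spracklen et al. range
new_low, new_high = 4.62, 6.10      # this replication

prior_spread = prior_high - prior_low   # percentage points
new_spread = new_high - new_low         # percentage points
ratio = prior_spread / new_spread

print(f"prior spread: {prior_spread:.2f} pp")
print(f"new spread:   {new_spread:.2f} pp")
print(f"compression:  {ratio:.1f}x")
```

Under this reading the inter-model spread shrinks from 16.5 to about 1.5 percentage points, roughly an elevenfold reduction.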

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript replicates Spracklen et al. (USENIX Security '25) on LLM package-name hallucinations in code generation. It evaluates five frontier models released October 2025–March 2026 (Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, DeepSeek V3.2) across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists. Reported overall hallucination rates range from 4.62% (Claude Haiku 4.5) to 6.10% (GPT-5.4-mini), an order-of-magnitude compression of prior inter-model spread; the authors also identify 127 identically invented package names across all five models, document an inverted Python-over-JavaScript asymmetry, a Haiku-below-Sonnet inversion, and a Jaccard similarity peak (J=0.343) between DeepSeek V3.2 and GPT-5.4-mini.

Significance. If the empirical counts hold, the work shows meaningful progress in frontier models' resistance to non-existent package suggestions while demonstrating that the slopsquatting threat persists through a model-agnostic set of 127 shared hallucinated names. The large-scale direct validation against public registries supplies reproducible grounding for the rate measurements and the cross-model overlap observation; these elements strengthen the case that single-model studies miss shared vulnerabilities.

major comments (2)
  1. [Abstract] Abstract and Methods: the central rates (4.62–6.10%) and the count of 127 shared names rest on validation against PyPI/npm master lists, yet no snapshot date, version, or live-registry cross-check is reported; incompleteness or lag in the lists would directly inflate both the rates and the size of the reported model-agnostic attack surface.
  2. [Methods] Methods (prompt construction): the 199,845 prompts are described as paired Python/JavaScript but the sampling procedure, edge-case handling for prompts that do not elicit package names, and any attempt to match real developer workflow distributions are not specified; this directly affects the generalizability of the measured rates and the asymmetry claims.
minor comments (2)
  1. [Results] Table or results section: a per-language breakdown table (Python vs. JavaScript hallucination rates per model) would make the reported asymmetry and the Haiku–Sonnet inversion easier to verify at a glance.
  2. The Jaccard similarity observation (J=0.343) is presented as suggestive of shared training data; adding a brief note on how the similarity was computed (e.g., over the set of hallucinated names only) would improve clarity.
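
One plausible reading of the computation the referee asks about, sketched under the assumption that similarity is taken over each model's set of unique hallucinated names (as the Figure 3 caption suggests). The model labels and names are hypothetical placeholders.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(per_model: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard similarity for every unordered pair of models."""
    return {
        (m1, m2): jaccard(per_model[m1], per_model[m2])
        for m1, m2 in combinations(sorted(per_model), 2)
    }


# Hypothetical toy data, not taken from the paper.
hallucinated = {
    "model_a": {"pkg-one", "pkg-two", "pkg-three"},
    "model_b": {"pkg-two", "pkg-three", "pkg-four"},
    "model_c": {"pkg-five"},
}

scores = pairwise_jaccard(hallucinated)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Because the statistic ignores how often each name was produced, two models that share many rarely emitted names would score the same as two that share frequently emitted ones.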

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and for highlighting opportunities to improve the reproducibility and clarity of our replication study. We address each major comment below and have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Methods: the central rates (4.62–6.10%) and the count of 127 shared names rest on validation against PyPI/npm master lists, yet no snapshot date, version, or live-registry cross-check is reported; incompleteness or lag in the lists would directly inflate both the rates and the size of the reported model-agnostic attack surface.

    Authors: We agree that the absence of explicit snapshot dates and cross-check details limits reproducibility. The PyPI and npm master lists used for validation were captured on 15 March 2026 (after all model release dates). A post-experiment live-registry check on a random sample of 1,000 hallucinated names confirmed none had been registered by 1 April 2026. We will add these dates, the exact registry snapshot versions, and the live-check protocol to the Methods section.

    Revision: yes

  2. Referee: [Methods] Methods (prompt construction): the 199,845 prompts are described as paired Python/JavaScript but the sampling procedure, edge-case handling for prompts that do not elicit package names, and any attempt to match real developer workflow distributions are not specified; this directly affects the generalizability of the measured rates and the asymmetry claims.

    Authors: The prompt set was generated by adapting the templates and stratified sampling procedure from Spracklen et al. to produce paired Python and JavaScript queries drawn from common import patterns observed in public GitHub repositories. Prompts that failed to elicit any package name were filtered via automated detection of import/require statements followed by manual review of a 1,000-prompt subsample. We acknowledge that the original manuscript did not provide sufficient detail on these steps or on the degree to which the distribution matches real developer workflows. The revised Methods section will include the full prompt templates, sampling strata, filtering criteria, and an explicit discussion of generalizability limitations.

    Revision: yes
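
The "automated detection of import/require statements" mentioned in the rebuttal is not specified further. The sketch below shows one way such detection might work, using Python's `ast` module for Python output and a regular expression for JavaScript output. The function names and the regex are illustrative assumptions, and an import name is not always the same as the distribution name on the registry (for example, `sklearn` versus `scikit-learn`), so a real pipeline would need an additional mapping step.

```python
import ast
import re


def python_imports(code: str) -> set[str]:
    """Top-level module names imported by a Python snippet (empty if it fails to parse)."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return set()
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import x".
            if node.module and node.level == 0:
                found.add(node.module.split(".")[0])
    return found


# Matches require('x'), ... from 'x', and side-effect imports: import 'x'.
_JS_SPEC = re.compile(r"""(?:require\(\s*|from\s+|import\s+)['"]([^'"]+)['"]""")


def js_imports(code: str) -> set[str]:
    """Package names referenced by require()/import in a JavaScript snippet."""
    found = set()
    for spec in _JS_SPEC.findall(code):
        if spec.startswith((".", "/")):
            continue  # local file, not a registry package
        parts = spec.split("/")
        # Scoped packages keep their scope: "@scope/pkg/sub" -> "@scope/pkg".
        found.add("/".join(parts[:2]) if spec.startswith("@") else parts[0])
    return found


py_names = python_imports("import numpy as np\nfrom sklearn.model_selection import train_test_split\nfrom . import local\n")
js_names = js_imports("const pad = require('left-pad');\nimport x from \"@scope/pkg/sub\";\nimport './local';\n")
print(py_names, js_names)
```

A response for which both functions return an empty set would be a candidate for the "failed to elicit any package name" filter.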

Circularity Check

0 steps flagged

No circularity: direct empirical counts against external public registries

The paper conducts a straightforward replication by issuing 199,845 prompts to five frontier LLMs and checking the resulting package-name suggestions against independent PyPI and npm master lists. Hallucination rates are simple empirical proportions, the set of 127 identically invented names is obtained by direct set intersection across model outputs, and the Jaccard similarity is a post-hoc descriptive statistic. No equations, fitted parameters, derivations, or self-citations appear in the load-bearing steps; all quantitative claims rest on external registry data rather than reducing to the paper's own inputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The measurements rest on the assumption that non-presence in current PyPI and npm registries constitutes a hallucination and that the prompt distribution matches typical developer code-generation tasks; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Non-existence in PyPI or npm master lists defines a hallucinated package name.
    Invoked when validating generated package names against external registries to compute hallucination rates.
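
A minimal sketch of this axiom applied to PyPI names, assuming the master list is available as a plain collection of names. The normalization step follows PyPI's name-normalization rule (lowercase, with runs of `-`, `_`, and `.` collapsed to a single `-`); whether the paper applies it is not stated, and the sample names are hypothetical. npm names follow different rules and are not covered here.

```python
import re


def normalize_pypi(name: str) -> str:
    """Normalize a PyPI project name: lowercase, runs of -, _ and . become one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_hallucinated(suggested: set[str], registry_names: set[str]) -> set[str]:
    """Suggested names with no normalized match in the registry snapshot."""
    known = {normalize_pypi(n) for n in registry_names}
    return {n for n in suggested if normalize_pypi(n) not in known}


# Hypothetical snapshot and model suggestions.
snapshot = {"requests", "scikit-learn", "numpy"}
suggestions = {"Requests", "scikit_learn", "totally-made-up-pkg"}

missing = find_hallucinated(suggestions, snapshot)
rate = len(missing) / len(suggestions)
print(missing, f"{rate:.1%}")
```

Skipping normalization would count `Requests` and `scikit_learn` as hallucinations even though they resolve to real projects, which is one way a validation step could inflate the measured rate.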

pith-pipeline@v0.9.0 · 5864 in / 1428 out tokens · 54872 ms · 2026-05-20T15:35:48.417817+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. J. Spracklen, R. Wijewickrama, A. H. M. N. Sakib, A. Maiti, B. Viswanath, and M. Jadliwala. We have a package for you! A comprehensive analysis of package hallucinations by code-generating LLMs. In USENIX Security, 2025

  2. Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet. Accessed 2026-04-28

  3. Anthropic. Introducing Claude Haiku 4.5. October 15, 2025. https://www.anthropic.com/news/claude-haiku-4-5. Accessed 2026-04-28

  4. OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/. Accessed 2026-04-28

  5. Google DeepMind. Gemini 2.5 Pro. https://deepmind.google/models/gemini/pro/. Accessed 2026-04-28

  6. DeepSeek-AI. DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention. Technical report, DeepSeek, 2025. https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp

  7. DeepSeek-AI. DeepSeek-V3.2 Release. DeepSeek API Docs, December 1, 2025. https://api-docs.deepseek.com/news/news251201. Accessed 2026-04-28

  8. E. B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927

  9. K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50:157–175, 1900

  10. S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979

  11. L. D. Brown, T. T. Cai, and A. DasGupta. Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133, 2001

  12. B. Friedman, D. G. Hendry, and A. Borning. A survey of value-sensitive design methods. Foundations and Trends in Human–Computer Interaction, 11(2):63–125, 2017

  13. H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In IEEE Symposium on Security and Privacy (S&P), 2022

  14. S. Neupane et al. Beyond typosquatting: An in-depth look at package confusion. In USENIX Security, 2023

  15. A. Birsan. Dependency confusion: How I hacked into Apple, Microsoft, and dozens of other companies. Medium, 2021. https://medium.com/@alex.birsan/dependency-confusion-4a5d60fec610

  16. L. Huang et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv:2311.05232, 2023

  17. Z. Ji, N. Lee, R. Frieske, T. Yu, et al. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

  18. K. Aboukhadijeh. The rise of slopsquatting: How AI hallucinations are fueling a new class of supply chain attacks. Socket Blog, April 2025. https://socket.dev/blog/slopsquatting-how-ai-hallucinations-are-fueling-a-new-class-of-supply-chain-attacks

  19. J. Spracklen et al. PackageHallucination: Code and data for the USENIX 2025 paper. GitHub repository, 2025. https://github.com/Spracks/PackageHallucination. Zenodo DOI: 10.5281/zenodo.14676377

  20. A. Krishna, E. Galinkin, L. Derczynski, and J. Martin. Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities. arXiv:2501.19012, January 2025

Appendix B: Snapshot IDs, Pricing, and API Spend. Total experimental spend across all five providers was $860.90 for 199,845 generations conducted between April 22, 2026 and April 28, 2026. Table B....