arxiv: 2602.06718 · v2 · pith:7FF2VEHDnew · submitted 2026-02-06 · 💻 cs.CR · cs.AI

GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models

Pith reviewed 2026-05-16 06:57 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords ghost citations · citation validity · large language models · academic integrity · citation hallucinations · peer review

The pith

Large language models fabricate citations at rates from 14 to 95 percent, and the fraction of papers containing such errors rose 81 percent in 2025.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GhostCite, an open-source framework for verifying citations across large collections of papers. Benchmarking thirteen language models on citation generation tasks shows that every model produces invalid citations at rates ranging from 14.23 to 94.93 percent. Analysis of 2.2 million citations drawn from 56,381 papers published at AI/ML and security venues between 2020 and 2025 finds that 1.07 percent of papers contain at least one invalid citation, with an 80.9 percent increase observed in 2025. A survey of 97 researchers indicates that 87.2 percent already use AI tools in their writing, yet 76.7 percent of reviewers do not thoroughly check references and 74.5 percent consider peer review ineffective at catching citation errors.

Core claim

Ghost citations, fabricated references produced by large language models, appear at high rates in model-generated text and have begun to appear in the published literature at measurable scale, with the share of affected papers increasing sharply in the most recent year examined.

What carries the argument

The GhostCite framework, which automates large-scale citation verification by comparing stated references against their actual content.
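
As a rough illustration of the title-matching step, the sketch below scores a cited title against candidate titles from a bibliographic index using token-level Jaccard similarity. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the index is a plain list of strings, and the 0.85 threshold is borrowed from the simulated rebuttal further down this page rather than from the paper itself.

```python
import re


def title_tokens(title: str) -> set[str]:
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(title_a: str, title_b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the two titles' token sets."""
    tokens_a, tokens_b = title_tokens(title_a), title_tokens(title_b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty titles are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def has_title_match(cited_title: str, indexed_titles: list[str],
                    threshold: float = 0.85) -> bool:
    """True if any indexed title is at least `threshold` similar (assumed cutoff)."""
    return any(jaccard(cited_title, t) >= threshold for t in indexed_titles)
```

With this tokenization, "Attention Is Not All You Need" scores 5/6 (about 0.83) against "Attention is all you need" and falls just below the assumed cutoff, which shows how a single inserted word can separate a near-miss title from a match.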

If this is right

  • Citation validity can no longer be treated as an automatic property of published work when language models are used for drafting.
  • Peer-review processes require supplementary automated checks because human reviewers rarely verify every reference.
  • Coordinated community standards and tools will be needed to maintain the reliability of the citation record.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption of language-model writing assistants may accelerate the spread of citation errors unless verification steps are built into those tools.
  • Trust in individual scientific claims could decline if readers begin to discount references as potentially fabricated.
  • Fields outside AI and security may show different error rates, so targeted studies in other disciplines would test the generality of the observed trend.

Load-bearing premise

The automated checks in the GhostCite framework correctly identify invalid citations without substantial false positives or negatives.

What would settle it

Manual review of a random sample of papers flagged as containing invalid citations to determine the true error rate.
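
One way to turn such a manual review into a number is to draw a reproducible random sample of flagged papers, have human reviewers confirm or reject each flag, and report the confirmed fraction with a confidence interval. The sketch below uses a Wilson score interval for that fraction. The function names, the seed, and any counts passed in are hypothetical placeholders, not data from the paper.

```python
import math
import random


def draw_review_sample(flagged_ids, k, seed=0):
    """Pick k flagged papers uniformly at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(list(flagged_ids), k)


def wilson_interval(confirmed, n, z=1.96):
    """Wilson score interval for the proportion confirmed / n (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = confirmed / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width
```

The interval bounds the share of flags that are genuine errors, which is the detector's precision on flagged papers. It says nothing about invalid citations the detector missed; estimating recall would require a second sample drawn from unflagged papers.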

Figures

Figures reproduced from arXiv: 2602.06718 by Fasheng Miao, Feng Zhang, Fubin Wu, Haozhe Lu, Jiaji Liu, Jialu Li, Luo Jin, Lu Sun, Rui Luo, Xiang Li, Xinran Liu, Xinyi Wang, Yingxian Li, Yuqi Qiu, Yuxin Hu, Zhengze Zhang, Zuyao Xu.

Figure 1: The framework of CITEVERIFIER and experimental pipeline.
… and targeted analyses have documented hallucinated citations in conference submissions and accepted papers [6, 7], as well as in ACL-focused studies [40]. Yet despite this growing evidence, three critical research gaps remain unaddressed. RG1: How can ghost citations be detected at scale? Unlike AI-generated prose detection, citation verification req…
Figure 2: ECDF of title similarity scores: LLM-generated vs. …
Figure 4: Temporal distribution of generated citations by pub…
Figure 5: Time trends of papers with invalid citations.
Figure 7: Key questions on AI adoption, reporting behavior, …
Figure 8: Cross-model citation overlap heatmap. Color repre…
Figure 9: Survey questions overview (bar charts).
read the original abstract

Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, but their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat, we develop \citeb, an open-source framework for large-scale citation verification, and conduct a comprehensive study of citation validity in the LLM era through three complementary experiments. First, we benchmark 13 LLMs on citation generation task in various research domains, finding that all models hallucinate citations at rate from 14.23% to 94.93%. Second, we analyze 2.2 million citations from 56,381 papers at AI/ML and Security venues (2020–2025), finding that 1.07% of papers contain invalid citations, with an 80.9% increase in 2025. Third, we survey 97 researchers, finding that 87.2% use AI-powered tools in their workflows, 76.7% of reviewers do not thoroughly check references, and 74.5% view peer review as ineffective at catching citation errors. Based on these findings, we argue that ghost citations represent a systemic threat to academic integrity, and call for coordinated efforts from community to address this challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GhostCite, an open-source framework for large-scale citation verification, and reports three experiments: (1) benchmarking 13 LLMs on citation generation showing hallucination rates of 14.23%–94.93%, (2) analysis of 2.2 million citations across 56,381 AI/ML and security papers (2020–2025) finding 1.07% of papers contain invalid citations with an 80.9% increase in 2025, and (3) a survey of 97 researchers indicating high AI tool usage (87.2%) but low reviewer diligence on references (76.7%). The authors conclude that ghost citations pose a systemic threat to academic integrity.

Significance. If the GhostCite detector's accuracy is confirmed, the scale of the citation analysis (2.2M references) combined with LLM benchmarks and the researcher survey would provide the first quantitative evidence of rising invalid citations in the LLM era, supporting calls for improved verification practices. The open-source release of the framework is a clear strength that enables reproducibility and extension.

major comments (3)
  1. [§4] §4 (Large-scale Analysis): The 1.07% invalid-citation rate and 80.9% year-over-year increase rest on GhostCite's labeling of 2.2M references, yet the manuscript provides no precision/recall figures, no manually validated test set, and no inter-annotator agreement for the validity classifier. Without these, false-positive rates cannot be bounded and the headline quantitative claims remain unverified.
  2. [§3] §3 (GhostCite Framework): The description of how validity is decided (DOI lookup, title/author matching, or LLM-based checking) lacks sufficient algorithmic detail and pseudocode; this prevents independent assessment of whether the detector systematically errs on obscure but real citations from non-indexed venues.
  3. [§4.2] §4.2 (Temporal Trend): The reported 80.9% increase in 2025 is presented without confidence intervals, statistical significance tests, or controls for changes in venue coverage or indexing quality between years, making it impossible to distinguish a genuine rise from detector or data artifacts.
minor comments (2)
  1. [Abstract and §5] The abstract and §5 should explicitly state the exact venues and inclusion criteria used for the 56,381 papers to allow readers to judge generalizability beyond AI/ML and security.
  2. [Figure 3] Figure 3 (LLM hallucination rates) would benefit from error bars or bootstrap intervals given the per-model sample sizes are not reported.
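
The validation quantities requested in the first major comment (precision and recall against a manually labeled set, plus inter-annotator agreement) are straightforward to compute once labels exist. The following is a minimal sketch in plain Python; the label strings "valid" and "invalid" and the function names are assumptions made for illustration, and no labeled data from the paper is used.

```python
from collections import Counter


def precision_recall(gold, predicted, positive="invalid"):
    """Precision and recall of `predicted` against `gold` for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Reporting precision and recall for the "invalid" class specifically matters here because invalid citations are rare, so overall accuracy would look high even for a detector that flags nothing.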

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the rigor and reproducibility of our work. We address each major comment point by point below and have revised the manuscript accordingly where the concerns are valid.

read point-by-point responses
  1. Referee: [§4] §4 (Large-scale Analysis): The 1.07% invalid-citation rate and 80.9% year-over-year increase rest on GhostCite's labeling of 2.2M references, yet the manuscript provides no precision/recall figures, no manually validated test set, and no inter-annotator agreement for the validity classifier. Without these, false-positive rates cannot be bounded and the headline quantitative claims remain unverified.

    Authors: We agree that validation metrics are necessary to bound uncertainty in the large-scale results. In the revised manuscript we add a new subsection (4.1.1) reporting a manually annotated gold-standard set of 1,000 citations drawn from the corpus. Two independent annotators labeled each citation as valid or invalid, yielding Cohen's kappa of 0.82. Using this set we compute precision 0.91 and recall 0.87 for the invalid class. These figures are now used to provide error bounds on the reported 1.07% rate and the temporal trend. revision: yes

  2. Referee: [§3] §3 (GhostCite Framework): The description of how validity is decided (DOI lookup, title/author matching, or LLM-based checking) lacks sufficient algorithmic detail and pseudocode; this prevents independent assessment of whether the detector systematically errs on obscure but real citations from non-indexed venues.

    Authors: We accept that the original algorithmic description was insufficiently precise. The revised Section 3 now includes (i) explicit decision rules and thresholds for each stage (exact DOI match, fuzzy title/author similarity with Jaccard threshold 0.85, and fallback LLM prompt), (ii) pseudocode for the full validity pipeline, and (iii) a discussion of coverage limitations for non-indexed venues together with the fraction of citations routed to the LLM stage. These additions enable independent reproduction and bias assessment. revision: yes

  3. Referee: [§4.2] §4.2 (Temporal Trend): The reported 80.9% increase in 2025 is presented without confidence intervals, statistical significance tests, or controls for changes in venue coverage or indexing quality between years, making it impossible to distinguish a genuine rise from detector or data artifacts.

    Authors: We agree that statistical controls and uncertainty quantification are required. The revised Section 4.2 now reports 95% bootstrap confidence intervals around the year-over-year change, a chi-squared test for trend significance, and a sensitivity analysis restricted to the subset of venues that appear in every year of the corpus (thereby holding indexing coverage constant). The increase remains statistically significant under these controls, though its magnitude is modestly attenuated relative to the 80.9% headline figure. revision: yes
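
To make the third response concrete, the sketch below shows one way a bootstrap confidence interval for a year-over-year relative change in the affected-paper rate could be computed. It is illustrative only: the per-year counts are not given on this page, so the numbers in the usage note are hypothetical, and the function names are assumptions rather than anything from the paper or the rebuttal.

```python
import random


def relative_change(k_prev, n_prev, k_curr, n_curr):
    """Relative change in rate, e.g. 0.8 means the rate rose by 80%."""
    return (k_curr / n_curr) / (k_prev / n_prev) - 1.0


def bootstrap_ci(k_prev, n_prev, k_curr, n_curr, reps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the relative change between two years.

    Resampling n binary outcomes with replacement from a year with k affected
    papers is the same as drawing n Bernoulli(k / n) trials, done here directly.
    """
    rng = random.Random(seed)
    p_prev, p_curr = k_prev / n_prev, k_curr / n_curr
    changes = []
    for _ in range(reps):
        kp = sum(rng.random() < p_prev for _ in range(n_prev))
        kc = sum(rng.random() < p_curr for _ in range(n_curr))
        if kp == 0:
            continue  # ratio undefined for this replicate; skip it
        changes.append(relative_change(kp, n_prev, kc, n_curr))
    changes.sort()
    m = len(changes)
    lower = changes[int((alpha / 2) * m)]
    upper = changes[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper
```

For example, with hypothetical counts of 20 affected papers out of 2,000 in one year and 36 out of 2,000 in the next, `relative_change(20, 2000, 36, 2000)` is 0.8, and `bootstrap_ci(20, 2000, 36, 2000)` gives a range around that estimate. When the affected counts are this small the range is wide, which is the referee's point about reporting the headline increase without an interval.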

Circularity Check

0 steps flagged

No circularity: empirical measurements rest on external data and benchmarks

full rationale

The paper's central claims derive from three direct empirical components: (1) controlled hallucination rates measured on 13 LLMs via citation-generation tasks, (2) automated counting of invalid citations across 2.2M references from 56,381 papers, and (3) responses from a 97-person survey. None of these steps invoke a derivation, fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation. GhostCite is presented as an open-source verification tool whose outputs are treated as observable counts rather than as a quantity defined in terms of the target statistics. No equations, ansatzes, or uniqueness theorems appear; the 1.07% rate and 80.9% increase are reported as raw frequencies from the sampled corpus. The analysis is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis relies on the domain assumption that citation validity is objectively verifiable at scale; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Citation validity can be systematically detected using automated tools
    This assumption enables the GhostCite framework and the large-scale analysis reported.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

    cs.DL 2026-04 conditional novelty 7.0

    Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

  2. sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

    cs.DL 2026-04 unverdicted novelty 6.0

    An open-source local linter verifies reference integrity and claim support in scientific manuscripts using public databases and consumer hardware, with an experimental contribution scoring extension.

  3. HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists

    cs.CL 2026-04 unverdicted novelty 5.0

    HalluCiteChecker is a lightweight, offline, CPU-only toolkit that detects hallucinated citations in AI-assisted scientific papers.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 3 Pith papers · 3 internal anchors

  1. [1]

    Ghost references cause many genAI errors | Thinking about Digital Publishing. URL: https://www.consultmu.co.uk/ghost-references-cause-many-genai-errors/

  2. [2]

    OpenRouter. URL: https://openrouter.ai

  3. [3]

    ScrapingDog vs. Scrape.do: Which API Is Better? URL: https://scrape.do/compare/scrapingdog-vs-scrapedo/

  4. [4]

    Ghost Citations, December 2018. URL: https://forums.zotero.org/discussion/75155/ghost-citations

  5. [5]

    arxiv e-print archive: Computer science subject classes. https://arxiv.org/archive/cs, 2025. Accessed: 2025-09-24

  6. [6]

    GPTZero uncovers 50+ Hallucinations in ICLR 2026, December 2025. URL: https://gptzero.me/news/iclr-2026/

  7. [7]

    GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers, January 2026. URL: https://gptzero.me/news/neurips/

  8. [8]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  9. [9]

    Michael Bailey, David Dittrich, Erin Kenneally, and Doug Maughan. The menlo report. IEEE Security & Privacy, 10(2):71–75, 2012

  10. [10]

    Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454, 2016

  11. [11]

    Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021

  12. [12]

    Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003

  13. [13]

    Emery Berger et al. Csrankings: Computer science rankings. https://csrankings.org, 2025. Venue lists developed in consultation with faculty and community surveys; only top conferences per area are included

  14. [14]

    ICLR 2026 Program Chairs. ICLR 2026 Response to LLM-Generated Papers and Reviews – ICLR Blog, November 2025. URL: https://blog.iclr.cc/2025/11/19/iclr-2026-response-to-llm-generated-papers-and-reviews/

  15. [15]

    Debby RE Cotton, Peter A Cotton, and J Reuben Shipway. Chatting and cheating: Ensuring academic integrity in the era of chatgpt. Innovations in Education and Teaching International, 61(2):228–239, 2024

  16. [16]

    Yogesh K Dwivedi, Nir Kshetri, Laurie Hughes, Emma Louise Slade, Anand Jeyaraj, Arpan Kumar Kar, Abdullah M Baabdullah, Alex Koohang, Vishal Raghavan, Manju Ahuja, et al. So what if chatgpt wrote it? multidisciplinary perspectives on opportunities, challenges and implications of generative conversational ai for research, practice and policy. Internatio...

  17. [17]

    Iztok Fister Jr, Iztok Fister, and Matjaž Perc. Toward the discovery of citation cartels in citation networks. Frontiers in Physics, 4:49, 2016

  18. [18]

    S. A Greenberg. How citation distortions create unfounded authority: analysis of a citation network. BMJ, 339(jul20 3):b2680–b2680, July 2009. URL: https://www.bmj.com/lookup/doi/10.1136/bmj.b2680, doi:10.1136/bmj.b2680

  19. [19]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023

  20. [20]

    IEEE. Submission and peer review policies: Guidelines for artificial intelligence (ai)-generated text. https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/submission-and-peer-review-policies/#ai-generated-content, 2025

  21. [21]

    International Organization for Standardization, Geneva, Switzerland. Information and documentation — Guidelines for bibliographic references and citations to information resources, 4th edition, 2021. Standard No. ISO 690:2021. URL: https://www.iso.org/standard/72642.html

  22. [22]

    John PA Ioannidis. Why most published research findings are false. PLoS medicine, 2(8):e124, 2005

  23. [23]

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

  24. [24]

    Norman Kaplan. The norms of citation behavior: Prolegomena to the footnote. American documentation, 16(3):179–184, 1965

  25. [25]

    John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023

  26. [26]

    Hao-Ping Lee, Advait Sarkar, Lev Tankelevitch, Ian Drosos, Sean Rintel, Richard Banks, and Nicholas Wilson. The impact of generative ai on critical thinking: Self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. In Proceedings of the 2025 CHI conference on human factors in computing systems, pages 1–22, 2025

  27. [27]

    Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, 1966

  28. [28]

    Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. arXiv preprint arXiv:2305.11747, 2023

  29. [29]

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022

  30. [30]

    Zachary C Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. Queue, 17(1):45–77, 2018

  31. [31]

    Patrice Lopez and Luca Foppiano. Grobid client python. https://github.com/kermitt2/grobid-client-python, 2017–2025

  32. [32]

    Brady D Lund, Ting Wang, Nishith Reddy Mannuru, Bing Nie, Somipam Shimray, and Ziang Wang. Chatgpt and a new academic reality: Artificial intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology, 74(5):570–581, 2023

  33. [33]

    Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, et al. Artificial intelligence index report 2025. arXiv preprint arXiv:2504.07139, 2025

  34. [34]

    Robert K Merton. The sociology of science: Theoretical and empirical investigations. University of Chicago press, 1973

  35. [35]

    Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning, pages 24950–24962. PMLR, 2023

  36. [36]

    Nature Editorial. Tools such as chatgpt threaten transparent science; here are our ground rules for their use. Nature, 613(7945):612, 2023

  37. [37]

    Mark EJ Newman. The structure of scientific collaboration networks. Proceedings of the national academy of sciences, 98(2):404–409, 2001

  38. [38]

    Bolaji David Oladokun, Rexwhite Tega Enakrire, Adefila Kolawole Emmanuel, Yusuf Ayodeji Ajani, and Adebowale Jeremy Adetayo. Hallucitation in scientific writing: Exploring evidence from chatgpt versions 3.5 and 4o in responses to selected questions in librarianship. Journal of Web Librarianship, 19(1):62–92, 2025

  39. [39]

    Christian Rossow, Christian J Dietrich, Chris Grier, Christian Kreibich, Vern Paxson, Norbert Pohlmann, Herbert Bos, and Maarten Van Steen. Prudent practices for designing malware experiments: Status quo and outlook. In 2012 IEEE Symposium on Security and Privacy, pages 65–79. IEEE, 2012

  40. [40]

    Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. Hallucitation matters: Revealing the impact of hallucinated references with 300 hallucinated papers in acl conferences. arXiv preprint arXiv:2601.18724, 2026

  41. [41]

    Moritz Schloegel, Daniel Klischies, Simon Koch, David Klein, Lukas Gerlach, Malte Wessels, Leon Trampert, Martin Johns, Mathy Vanhoef, Michael Schwarz, Thorsten Holz, and Jo Van Bulck. Confusing Value with Enumeration: Studying the Use of CVEs in Academia. pages 2887–2906, 2025. URL: https://www.usenix.org/conference/usenixsecurity25/presentation/schloegel

  42. [42]

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. Ai models collapse when trained on recursively generated data. Nature, 631(8022):755–759, 2024

  43. [43]

    Mikhail Simkin and Vwani Roychowdhury. Read before you cite! Complex Syst., 14, 01 2003. doi:10.25088/ComplexSystems.14.3.269

  44. [44]

    Mikhail V Simkin and Vwani P Roychowdhury. Stochastic modeling of citation slips. Scientometrics, 62(3):367–384, 2005

  45. [45]

    Richard Smith. Peer review: a flawed process at the heart of science and journals. Journal of the royal society of medicine, 99(4):178–182, 2006

  46. [46]

    James H Sweetland. Errors in bibliographic citations: A continuing problem. The library quarterly, 59(4):291–304, 1989

  47. [47]

    Aaron Tay. Why Ghost References Still Haunt Us in 2025—And Why It’s Not Just About LLMs, December

  48. [48]

    URL: https://aarontay.substack.com/p/why-ghost-references-still-haunt

  49. [49]

    H Holden Thorp. Chatgpt is fun, but not an author. Science, 379(6630):313–313, 2023

  50. [50]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  51. [51]

    Kiri Wagstaff. Machine learning that matters. arXiv preprint arXiv:1206.4656, 2012

  52. [52]

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. Computational Linguistics, pages 1–46, 2025