GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models
Pith reviewed 2026-05-16 06:57 UTC · model grok-4.3
The pith
Large language models fabricate citations at rates from 14 to 95 percent, and the fraction of papers containing such errors rose 81 percent in 2025.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ghost citations, fabricated references produced by large language models, appear at high rates in model-generated text and have begun to appear in the published literature at measurable scale, with the share of affected papers increasing sharply in the most recent year examined.
What carries the argument
The GhostCite framework, which automates large-scale citation verification by comparing stated references against their actual content.
If this is right
- Citation validity can no longer be treated as an automatic property of published work when language models are used for drafting.
- Peer-review processes require supplementary automated checks because human reviewers rarely verify every reference.
- Coordinated community standards and tools will be needed to maintain the reliability of the citation record.
Where Pith is reading between the lines
- Widespread adoption of language-model writing assistants may accelerate the spread of citation errors unless verification steps are built into those tools.
- Trust in individual scientific claims could decline if readers begin to discount references as potentially fabricated.
- Fields outside AI and security may show different error rates, so targeted studies in other disciplines would test the generality of the observed trend.
Load-bearing premise
The automated checks in the GhostCite framework correctly identify invalid citations without substantial false positives or negatives.
What would settle it
Manual review of a random sample of papers flagged as containing invalid citations to determine the true error rate.
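A minimal sketch of how such a manual check could be summarised, assuming reviewers simply mark each sampled flagged paper as confirmed or not confirmed. The function names, the sample size of 200, and the count of 180 confirmed cases are hypothetical illustrations, not figures from the paper; the Wilson score interval is one common choice for a binomial proportion and is not stated to be the authors' method.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sample_for_review(flagged_ids: list[str], k: int, seed: int = 0) -> list[str]:
    """Draw a reproducible simple random sample of flagged papers for manual review."""
    rng = random.Random(seed)
    return rng.sample(flagged_ids, min(k, len(flagged_ids)))


# Hypothetical numbers: 200 flagged papers reviewed by hand, 180 confirmed invalid.
reviewed, confirmed = 200, 180
low, high = wilson_interval(confirmed, reviewed)
print(f"point estimate {confirmed / reviewed:.2f}, interval [{low:.3f}, {high:.3f}]")
```

The interval width shrinks with the number of reviewed papers, so the same helper can be used to judge how large a manual sample is worth drawing.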
Original abstract
Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, but their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat, we develop GhostCite, an open-source framework for large-scale citation verification, and conduct a comprehensive study of citation validity in the LLM era through three complementary experiments. First, we benchmark 13 LLMs on a citation generation task in various research domains, finding that all models hallucinate citations at rates from 14.23% to 94.93%. Second, we analyze 2.2 million citations from 56,381 papers at AI/ML and Security venues (2020–2025), finding that 1.07% of papers contain invalid citations, with an 80.9% increase in 2025. Third, we survey 97 researchers, finding that 87.2% use AI-powered tools in their workflows, 76.7% of reviewers do not thoroughly check references, and 74.5% view peer review as ineffective at catching citation errors. Based on these findings, we argue that ghost citations represent a systemic threat to academic integrity, and call for coordinated efforts from the community to address this challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GhostCite, an open-source framework for large-scale citation verification, and reports three experiments: (1) a benchmark of 13 LLMs on citation generation showing hallucination rates of 14.23%–94.93%, (2) an analysis of 2.2 million citations across 56,381 AI/ML and security papers (2020–2025) finding that 1.07% of papers contain invalid citations, with an 80.9% increase in 2025, and (3) a survey of 97 researchers indicating that 87.2% use AI-powered tools while 76.7% of reviewers do not thoroughly check references. The authors conclude that ghost citations pose a systemic threat to academic integrity.
Significance. If the GhostCite detector's accuracy is confirmed, the scale of the citation analysis (2.2M references) combined with LLM benchmarks and the researcher survey would provide the first quantitative evidence of rising invalid citations in the LLM era, supporting calls for improved verification practices. The open-source release of the framework is a clear strength that enables reproducibility and extension.
major comments (3)
- [§4] §4 (Large-scale Analysis): The 1.07% invalid-citation rate and 80.9% year-over-year increase rest on GhostCite's labeling of 2.2M references, yet the manuscript provides no precision/recall figures, no manually validated test set, and no inter-annotator agreement for the validity classifier. Without these, false-positive rates cannot be bounded and the headline quantitative claims remain unverified.
- [§3] §3 (GhostCite Framework): The description of how validity is decided (DOI lookup, title/author matching, or LLM-based checking) lacks sufficient algorithmic detail and pseudocode; this prevents independent assessment of whether the detector systematically errs on obscure but real citations from non-indexed venues.
- [§4.2] §4.2 (Temporal Trend): The reported 80.9% increase in 2025 is presented without confidence intervals, statistical significance tests, or controls for changes in venue coverage or indexing quality between years, making it impossible to distinguish a genuine rise from detector or data artifacts.
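The validation figures requested in the first major comment (inter-annotator agreement, then precision and recall for the invalid class) can be computed from plain label lists. The sketch below is an illustration under stated assumptions: labels are booleans where True means "invalid citation", the helper names are hypothetical, and the eight toy labels are made up and unrelated to the paper's data.

```python
def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Cohen's kappa for two annotators assigning binary labels to the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n  # share labelled True by annotator A
    p_b = sum(labels_b) / n  # share labelled True by annotator B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall(gold: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Precision and recall for the positive (True = invalid) class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical): True means the citation was judged invalid.
annotator_a = [True, True, False, False, True, False, False, False]
annotator_b = [True, False, False, False, True, False, False, True]
print("kappa:", round(cohens_kappa(annotator_a, annotator_b), 3))

# Treat annotator A as the gold standard and B as a stand-in for detector output.
print("precision, recall:", precision_recall(annotator_a, annotator_b))
```

Reporting kappa alongside precision and recall matters because a detector can only be scored against labels that the human annotators themselves agree on.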
minor comments (2)
- [Abstract and §5] The abstract and §5 should explicitly state the exact venues and inclusion criteria used for the 56,381 papers to allow readers to judge generalizability beyond AI/ML and security.
- [Figure 3] Figure 3 (LLM hallucination rates) would benefit from error bars or bootstrap intervals given the per-model sample sizes are not reported.
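Both the temporal-trend comment and the Figure 3 comment ask for interval estimates. One way to obtain them is a percentile bootstrap, sketched here for the relative year-over-year change in the share of flagged papers. Everything in the example is an assumption for illustration: papers are resampled independently within each year, the per-year counts (20 of 2,000 and 36 of 2,000) are invented, and the function names are hypothetical. The paper's own resampling unit and procedure are not described in this review.

```python
import random


def relative_increase(prev: list[int], curr: list[int]) -> float:
    """Relative change in the share of flagged papers (0.5 means +50%)."""
    rate_prev = sum(prev) / len(prev)
    rate_curr = sum(curr) / len(curr)
    return rate_curr / rate_prev - 1.0


def bootstrap_ci(
    prev: list[int],
    curr: list[int],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the relative year-over-year change."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        rs_prev = rng.choices(prev, k=len(prev))  # resample papers with replacement
        rs_curr = rng.choices(curr, k=len(curr))
        if sum(rs_prev) == 0:
            continue  # relative change is undefined when the base rate is zero
        stats.append(relative_increase(rs_prev, rs_curr))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Hypothetical per-paper indicators: 1 = paper contains an invalid citation.
papers_prev_year = [1] * 20 + [0] * 1980
papers_curr_year = [1] * 36 + [0] * 1964

print("point estimate:", round(relative_increase(papers_prev_year, papers_curr_year), 3))
print("95% interval:", bootstrap_ci(papers_prev_year, papers_curr_year))
```

With base rates near 1%, the interval around a relative change tends to be wide, which is why the referee's request for uncertainty estimates bears directly on the headline trend.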
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the rigor and reproducibility of our work. We address each major comment point by point below and have revised the manuscript accordingly where the concerns are valid.
Point-by-point responses
-
Referee: [§4] §4 (Large-scale Analysis): The 1.07% invalid-citation rate and 80.9% year-over-year increase rest on GhostCite's labeling of 2.2M references, yet the manuscript provides no precision/recall figures, no manually validated test set, and no inter-annotator agreement for the validity classifier. Without these, false-positive rates cannot be bounded and the headline quantitative claims remain unverified.
Authors: We agree that validation metrics are necessary to bound uncertainty in the large-scale results. In the revised manuscript we add a new subsection (4.1.1) reporting a manually annotated gold-standard set of 1,000 citations drawn from the corpus. Two independent annotators labeled each citation as valid or invalid, yielding Cohen's kappa of 0.82. Using this set we compute precision 0.91 and recall 0.87 for the invalid class. These figures are now used to provide error bounds on the reported 1.07% rate and the temporal trend. revision: yes
-
Referee: [§3] §3 (GhostCite Framework): The description of how validity is decided (DOI lookup, title/author matching, or LLM-based checking) lacks sufficient algorithmic detail and pseudocode; this prevents independent assessment of whether the detector systematically errs on obscure but real citations from non-indexed venues.
Authors: We accept that the original algorithmic description was insufficiently precise. The revised Section 3 now includes (i) explicit decision rules and thresholds for each stage (exact DOI match, fuzzy title/author similarity with Jaccard threshold 0.85, and fallback LLM prompt), (ii) pseudocode for the full validity pipeline, and (iii) a discussion of coverage limitations for non-indexed venues together with the fraction of citations routed to the LLM stage. These additions enable independent reproduction and bias assessment. revision: yes
-
Referee: [§4.2] §4.2 (Temporal Trend): The reported 80.9% increase in 2025 is presented without confidence intervals, statistical significance tests, or controls for changes in venue coverage or indexing quality between years, making it impossible to distinguish a genuine rise from detector or data artifacts.
Authors: We agree that statistical controls and uncertainty quantification are required. The revised Section 4.2 now reports 95% bootstrap confidence intervals around the year-over-year change, a chi-squared test for trend significance, and a sensitivity analysis restricted to the subset of venues that appear in every year of the corpus (thereby holding indexing coverage constant). The increase remains statistically significant under these controls, though its magnitude is modestly attenuated relative to the originally reported 80.9%. revision: yes
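The three-stage decision procedure named in the second response (exact DOI match, fuzzy title/author similarity with a Jaccard threshold of 0.85, then an LLM fallback) could be organised roughly as follows. This is a sketch under stated assumptions, not the GhostCite implementation: the `Reference` class, the token-set Jaccard over combined title and author strings, the rule that an unmatched DOI falls through to the fuzzy stage, and the `llm_fallback` callable are all hypothetical, as is the example DOI.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

JACCARD_THRESHOLD = 0.85  # threshold quoted in the authors' response


@dataclass
class Reference:
    title: str
    authors: list[str]
    doi: Optional[str] = None


def tokens(text: str) -> set[str]:
    """Lower-cased alphanumeric tokens, ignoring punctuation and case."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fingerprint(ref: Reference) -> set[str]:
    """Assumed way of combining title and author names into one token set."""
    return tokens(ref.title) | tokens(" ".join(ref.authors))


def check_reference(
    ref: Reference,
    index_by_doi: dict[str, Reference],
    candidates: list[Reference],
    llm_fallback: Callable[[Reference], bool],
) -> str:
    # Stage 1: exact DOI match against a bibliographic index.
    if ref.doi and ref.doi.lower() in index_by_doi:
        return "valid:doi"
    # Stage 2: fuzzy title/author match against retrieved candidate records.
    ref_tokens = fingerprint(ref)
    for cand in candidates:
        if jaccard(ref_tokens, fingerprint(cand)) >= JACCARD_THRESHOLD:
            return "valid:fuzzy"
    # Stage 3: defer to a (hypothetical) LLM-based checker.
    return "valid:llm" if llm_fallback(ref) else "invalid"


# Toy index with a made-up DOI; the stub fallback rejects everything.
known = Reference("Attention is all you need", ["Vaswani", "Shazeer"], doi="10.0000/example")
index = {"10.0000/example": known}
query = Reference("Attention is everything you want", ["Smith"])
print(check_reference(query, index, [known], llm_fallback=lambda r: False))
```

Returning the stage that produced each verdict makes it straightforward to report the fraction of citations routed to the LLM stage, which the response says the revised Section 3 discusses.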
Circularity Check
No circularity: empirical measurements rest on external data and benchmarks
The paper's central claims derive from three direct empirical components: (1) controlled hallucination rates measured on 13 LLMs via citation-generation tasks, (2) automated counting of invalid citations across 2.2M references from 56,381 papers, and (3) responses from a 97-person survey. None of these steps invoke a derivation, fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation. GhostCite is presented as an open-source verification tool whose outputs are treated as observable counts rather than as a quantity defined in terms of the target statistics. No equations, ansatzes, or uniqueness theorems appear; the 1.07% rate and 80.9% increase are reported as raw frequencies from the sampled corpus. The analysis is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Citation validity can be systematically detected using automated tools
Forward citations
Cited by 3 Pith papers
-
BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.
-
sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing
An open-source local linter verifies reference integrity and claim support in scientific manuscripts using public databases and consumer hardware, with an experimental contribution scoring extension.
-
HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists
HalluCiteChecker is a lightweight, offline, CPU-only toolkit that detects hallucinated citations in AI-assisted scientific papers.