GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models

Zuyao Xu, Yuqi Qiu, Lu Sun, Fasheng Miao, Fubin Wu, Xiang Li, Xinyi Wang, Haozhe Lu, Zhengze Zhang, Yuxin Hu, Jialu Li, Luo Jin, Feng Zhang, Rui Luo, Xinran Liu, Yingxian Li, Jiaji Liu

arxiv: 2602.06718 · v2 · pith:7FF2VEHDnew · submitted 2026-02-06 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-16 06:57 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords ghost citations · citation validity · large language models · academic integrity · citation hallucinations · peer review

The pith

Large language models fabricate citations at rates from 14 to 95 percent, and the fraction of papers containing such errors rose 81 percent in 2025.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GhostCite, an open-source framework for verifying citations across large collections of papers. Benchmarking thirteen language models on citation generation tasks shows that every model produces invalid citations at rates ranging from 14.23 to 94.93 percent. Analysis of 2.2 million citations drawn from 56,381 papers published at AI/ML and security venues between 2020 and 2025 finds that 1.07 percent of papers contain at least one invalid citation, with an 80.9 percent increase observed in 2025. A survey of 97 researchers indicates that 87.2 percent already use AI tools in their writing, yet 76.7 percent of reviewers do not thoroughly check references and 74.5 percent consider peer review ineffective at catching citation errors.
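
As a quick sanity check on scale, the paper-level rate can be turned into an approximate count. This is back-of-envelope arithmetic from the figures quoted above, not a number the paper reports. The symbols $N$, $p$, $r_{2024}$ and $r_{2025}$ are introduced here for illustration, and reading the 80.9 percent figure as a relative change from 2024 to 2025 is an assumption.

```latex
\[
N \cdot p = 56{,}381 \times 0.0107 \approx 603
\quad \text{papers with at least one invalid citation}
\]
\[
r_{2025} \approx (1 + 0.809)\, r_{2024} = 1.809\, r_{2024}
\]
```

Under that reading, a large relative rise on a base near 1 percent is still a small absolute change in the share of affected papers.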

Core claim

Ghost citations, fabricated references produced by large language models, appear at high rates in model-generated text and have begun to appear in the published literature at measurable scale, with the share of affected papers increasing sharply in the most recent year examined.

What carries the argument

The GhostCite framework, which automates large-scale citation verification by comparing stated references against their actual content.
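
To make the comparison step concrete, here is a minimal sketch of a title-matching check. It assumes token-set Jaccard similarity with the 0.85 threshold that the simulated rebuttal further down attributes to the revised framework. The function names (`normalize`, `jaccard`, `title_matches`) and the tokenization are hypothetical and are not taken from the GhostCite code, which per the referee report also involves DOI lookup and an LLM-based check.

```python
import re

# Threshold quoted in the simulated rebuttal; everything else here is illustrative.
JACCARD_THRESHOLD = 0.85


def normalize(title: str) -> set[str]:
    """Lowercase a title and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def title_matches(cited: str, indexed: str, threshold: float = JACCARD_THRESHOLD) -> bool:
    """Return True when the cited title is close enough to an indexed title."""
    return jaccard(normalize(cited), normalize(indexed)) >= threshold


if __name__ == "__main__":
    # Same tokens after normalization, so the similarity is 1.0.
    print(title_matches("Attention Is All You Need", "Attention is all you need."))
    # One extra token out of six gives 5/6, which is below 0.85.
    print(title_matches("Attention Is All You Need", "Attention Is Not All You Need"))
```

Short titles are sensitive to a single token difference under this measure, which is one reason a fixed threshold trades false positives against false negatives.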

If this is right

Citation validity can no longer be treated as an automatic property of published work when language models are used for drafting.
Peer-review processes require supplementary automated checks because human reviewers rarely verify every reference.
Coordinated community standards and tools will be needed to maintain the reliability of the citation record.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption of language-model writing assistants may accelerate the spread of citation errors unless verification steps are built into those tools.
Trust in individual scientific claims could decline if readers begin to discount references as potentially fabricated.
Fields outside AI and security may show different error rates, so targeted studies in other disciplines would test the generality of the observed trend.

Load-bearing premise

The automated checks in the GhostCite framework correctly identify invalid citations without substantial false positives or negatives.
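
This premise can be quantified with standard accuracy and agreement metrics. The sketch below computes precision and recall for the invalid class and Cohen's kappa between two sets of binary labels. The label lists are made-up toy data and the helper names are hypothetical; nothing here reproduces the paper's evaluation.

```python
def precision_recall(gold: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Precision and recall for the positive (invalid-citation) class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from each rater's marginal label rates.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


if __name__ == "__main__":
    # Toy labels: True means "invalid citation".
    gold = [True, True, True, False, False, False, False, False]
    detector = [True, True, False, True, False, False, False, False]
    print(precision_recall(gold, detector))
    print(cohens_kappa(gold, detector))
```

Precision bounds the false-positive side of the premise and recall the false-negative side, so both are needed before the corpus-level rate can be interpreted.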

What would settle it

Manual review of a random sample of papers flagged as containing invalid citations to determine the true error rate.
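
A manual audit of this kind yields a proportion with sampling uncertainty. As a hedged illustration, the sketch below computes a Wilson score interval for the share of flagged citations that an audit confirms as invalid. The counts are invented, and `wilson_interval` is a hypothetical helper, not part of GhostCite.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Invented audit: 182 of 200 sampled flags confirmed as truly invalid.
    low, high = wilson_interval(182, 200)
    print(f"confirmed share {182 / 200:.3f}, interval [{low:.3f}, {high:.3f}]")
```

The interval narrows roughly with the square root of the audit size, which gives a way to decide how many flagged papers need manual review.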

Figures

Figures reproduced from arXiv: 2602.06718 by Fasheng Miao, Feng Zhang, Fubin Wu, Haozhe Lu, Jiaji Liu, Jialu Li, Luo Jin, Lu Sun, Rui Luo, Xiang Li, Xinran Liu, Xinyi Wang, Yingxian Li, Yuqi Qiu, Yuxin Hu, Zhengze Zhang, Zuyao Xu.

**Figure 1.** The framework of CITEVERIFIER and experimental pipeline. […] and targeted analyses have documented hallucinated citations in conference submissions and accepted papers [6, 7], as well as in ACL-focused studies [40]. Yet despite this growing evidence, three critical research gaps remain unaddressed. RG1: How can ghost citations be detected at scale? Unlike AI-generated prose detection, citation verification req…

**Figure 2.** ECDF of title similarity scores: LLM-generated vs. …

**Figure 4.** Temporal distribution of generated citations by pub…

**Figure 5.** Time trends of papers with invalid citations.

**Figure 7.** Key questions on AI adoption, reporting behavior, …

**Figure 8.** Cross-model citation overlap heatmap. Color repre…

**Figure 9.** Survey questions overview (bar charts).

Original abstract

Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, but their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat, we develop \citeb, an open-source framework for large-scale citation verification, and conduct a comprehensive study of citation validity in the LLM era through three complementary experiments. First, we benchmark 13 LLMs on citation generation task in various research domains, finding that all models hallucinate citations at rate from 14.23% to 94.93%. Second, we analyze 2.2 million citations from 56,381 papers at AI/ML and Security venues (2020–2025), finding that 1.07% of papers contain invalid citations, with an 80.9% increase in 2025. Third, we survey 97 researchers, finding that 87.2% use AI-powered tools in their workflows, 76.7% of reviewers do not thoroughly check references, and 74.5% view peer review as ineffective at catching citation errors. Based on these findings, we argue that ghost citations represent a systemic threat to academic integrity, and call for coordinated efforts from community to address this challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper supplies new benchmarks on LLM citation hallucinations plus a temporal scan of real papers showing a modest rise in invalid citations, but the detection method lacks reported validation so the headline numbers need scrutiny.

The core contribution here is the set of fresh measurements: hallucination rates across 13 LLMs ranging from 14% to 95% on citation generation, an analysis of 2.2 million citations drawn from 56k AI/ML and security papers, and a 97-person survey on AI tool use and reviewer habits. The open-source GhostCite framework is a practical addition if it holds up, and the 80.9% year-over-year increase in 2025 is the kind of concrete signal that could prompt follow-up work. Those pieces give readers something to cite or build on without relying on speculation. The survey numbers on AI adoption and lax reference checking are straightforward and worth having on record. The large-scale scan is the part that stands out most, since earlier citation studies did not tie trends directly to LLM rollout.

The main gap is the absence of any reported precision, recall, or manual validation set for how GhostCite labels a citation invalid. Without that, it is hard to know whether the 1.07% base rate reflects real errors or includes false positives from obscure but legitimate sources. The abstract and available details do not show inter-annotator checks or error bars on the temporal jump, which leaves the systemic-threat claim resting on untested assumptions about the detector. This is not a fatal flaw for a first cut, but it is the part that needs tightening.

The work is aimed at researchers who study research integrity, citation practices, or LLM deployment in writing workflows. A reader already following AI impacts on academia will find usable data points here even if the exact percentages shift after review. It is worth sending to referees so the methods can be stress-tested and the dataset made fully reproducible.

Referee Report

3 major / 2 minor

Summary. The paper introduces GhostCite, an open-source framework for large-scale citation verification, and reports three experiments: (1) a benchmark of 13 LLMs on citation generation showing hallucination rates of 14.23%–94.93%, (2) an analysis of 2.2 million citations across 56,381 AI/ML and security papers (2020–2025) finding that 1.07% of papers contain invalid citations, with an 80.9% increase in 2025, and (3) a survey of 97 researchers indicating high AI tool usage (87.2%) alongside low reviewer diligence on references (76.7% of reviewers do not thoroughly check them). The authors conclude that ghost citations pose a systemic threat to academic integrity.

Significance. If the GhostCite detector's accuracy is confirmed, the scale of the citation analysis (2.2M references) combined with LLM benchmarks and the researcher survey would provide the first quantitative evidence of rising invalid citations in the LLM era, supporting calls for improved verification practices. The open-source release of the framework is a clear strength that enables reproducibility and extension.

major comments (3)

[§4] §4 (Large-scale Analysis): The 1.07% invalid-citation rate and 80.9% year-over-year increase rest on GhostCite's labeling of 2.2M references, yet the manuscript provides no precision/recall figures, no manually validated test set, and no inter-annotator agreement for the validity classifier. Without these, false-positive rates cannot be bounded and the headline quantitative claims remain unverified.
[§3] §3 (GhostCite Framework): The description of how validity is decided (DOI lookup, title/author matching, or LLM-based checking) lacks sufficient algorithmic detail and pseudocode; this prevents independent assessment of whether the detector systematically errs on obscure but real citations from non-indexed venues.
[§4.2] §4.2 (Temporal Trend): The reported 80.9% increase in 2025 is presented without confidence intervals, statistical significance tests, or controls for changes in venue coverage or indexing quality between years, making it impossible to distinguish a genuine rise from detector or data artifacts.

minor comments (2)

[Abstract and §5] The abstract and §5 should explicitly state the exact venues and inclusion criteria used for the 56,381 papers to allow readers to judge generalizability beyond AI/ML and security.
[Figure 3] Figure 3 (LLM hallucination rates) would benefit from error bars or bootstrap intervals given the per-model sample sizes are not reported.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the rigor and reproducibility of our work. We address each major comment point by point below and have revised the manuscript accordingly where the concerns are valid.

Referee: [§4] §4 (Large-scale Analysis): The 1.07% invalid-citation rate and 80.9% year-over-year increase rest on GhostCite's labeling of 2.2M references, yet the manuscript provides no precision/recall figures, no manually validated test set, and no inter-annotator agreement for the validity classifier. Without these, false-positive rates cannot be bounded and the headline quantitative claims remain unverified.

Authors: We agree that validation metrics are necessary to bound uncertainty in the large-scale results. In the revised manuscript we add a new subsection (4.1.1) reporting a manually annotated gold-standard set of 1,000 citations drawn from the corpus. Two independent annotators labeled each citation as valid or invalid, yielding Cohen's kappa of 0.82. Using this set we compute precision 0.91 and recall 0.87 for the invalid class. These figures are now used to provide error bounds on the reported 1.07% rate and the temporal trend.
revision: yes
Referee: [§3] §3 (GhostCite Framework): The description of how validity is decided (DOI lookup, title/author matching, or LLM-based checking) lacks sufficient algorithmic detail and pseudocode; this prevents independent assessment of whether the detector systematically errs on obscure but real citations from non-indexed venues.

Authors: We accept that the original algorithmic description was insufficiently precise. The revised Section 3 now includes (i) explicit decision rules and thresholds for each stage (exact DOI match, fuzzy title/author similarity with Jaccard threshold 0.85, and fallback LLM prompt), (ii) pseudocode for the full validity pipeline, and (iii) a discussion of coverage limitations for non-indexed venues together with the fraction of citations routed to the LLM stage. These additions enable independent reproduction and bias assessment.
revision: yes
Referee: [§4.2] §4.2 (Temporal Trend): The reported 80.9% increase in 2025 is presented without confidence intervals, statistical significance tests, or controls for changes in venue coverage or indexing quality between years, making it impossible to distinguish a genuine rise from detector or data artifacts.

Authors: We agree that statistical controls and uncertainty quantification are required. The revised Section 4.2 now reports 95% bootstrap confidence intervals around the year-over-year change, a chi-squared test for trend significance, and a sensitivity analysis restricted to the subset of venues that appear in every year of the corpus (thereby holding indexing coverage constant). The 2025 increase remains statistically significant under these controls, though its magnitude is modestly attenuated relative to the 80.9% headline figure.
revision: yes
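
The kind of uncertainty estimate the referee asks for in §4.2 can be sketched with a percentile bootstrap over papers. The counts below are invented for illustration, since this page gives no per-year numbers, and `bootstrap_relative_change` is a hypothetical helper rather than the authors' procedure.

```python
import random


def bootstrap_relative_change(
    flagged_prev: int, total_prev: int,
    flagged_curr: int, total_curr: int,
    n_resamples: int = 500, seed: int = 0,
) -> tuple[float, float, float]:
    """Point estimate and 95% percentile-bootstrap interval for the relative
    change in the share of flagged papers between two years."""
    rng = random.Random(seed)

    def resampled_rate(flagged: int, total: int) -> float:
        # Resampling papers with replacement is equivalent to a binomial draw.
        p = flagged / total
        return sum(rng.random() < p for _ in range(total)) / total

    point = (flagged_curr / total_curr) / (flagged_prev / total_prev) - 1
    changes = []
    for _ in range(n_resamples):
        prev = resampled_rate(flagged_prev, total_prev)
        curr = resampled_rate(flagged_curr, total_curr)
        if prev > 0:  # skip degenerate resamples with no flagged papers
            changes.append(curr / prev - 1)
    changes.sort()
    low = changes[int(0.025 * len(changes))]
    high = changes[int(0.975 * len(changes)) - 1]
    return point, low, high


if __name__ == "__main__":
    # Invented counts: 20 of 2,000 papers flagged in one year, 36 of 2,000 in the next.
    point, low, high = bootstrap_relative_change(20, 2000, 36, 2000)
    print(f"relative change {point:.1%}, 95% interval [{low:.1%}, {high:.1%}]")
```

With only a few dozen flagged papers per year the interval is wide, which is why a relative increase reported without one is hard to interpret.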

Circularity Check

0 steps flagged

No circularity: empirical measurements rest on external data and benchmarks

The paper's central claims derive from three direct empirical components: (1) controlled hallucination rates measured on 13 LLMs via citation-generation tasks, (2) automated counting of invalid citations across 2.2M references from 56,381 papers, and (3) responses from a 97-person survey. None of these steps invoke a derivation, fitted parameter renamed as prediction, self-definitional loop, or load-bearing self-citation. GhostCite is presented as an open-source verification tool whose outputs are treated as observable counts rather than as a quantity defined in terms of the target statistics. No equations, ansatzes, or uniqueness theorems appear; the 1.07% rate and 80.9% increase are reported as raw frequencies from the sampled corpus. The analysis is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis relies on the domain assumption that citation validity is objectively verifiable at scale; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: Citation validity can be systematically detected using automated tools.
This assumption enables the GhostCite framework and the large-scale analysis reported.

pith-pipeline@v0.9.0 · 5612 in / 1225 out tokens · 46683 ms · 2026-05-16T06:57:42.158691+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
cs.DL · 2026-04 · conditional · novelty 7.0
Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing
cs.DL · 2026-04 · unverdicted · novelty 6.0
An open-source local linter verifies reference integrity and claim support in scientific manuscripts using public databases and consumer hardware, with an experimental contribution scoring extension.

HalluCiteChecker: A Lightweight Toolkit for Hallucinated Citation Detection and Verification in the Era of AI Scientists
cs.CL · 2026-04 · unverdicted · novelty 5.0
HalluCiteChecker is a lightweight, offline, CPU-only toolkit that detects hallucinated citations in AI-assisted scientific papers.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 3 Pith papers · 3 internal anchors

[1] Ghost references cause many genAI errors | Thinking about Digital Publishing. URL: https://www.consultmu.co.uk/ghost-references-cause-many-genai-errors/
[2] OpenRouter. URL: https://openrouter.ai
[3] ScrapingDog vs. Scrape.do: Which API Is Better? URL: https://scrape.do/compare/scrapingdog-vs-scrapedo/
[4] Ghost Citations, December 2018. URL: https://forums.zotero.org/discussion/75155/ghost-citations
[5] arxiv e-print archive: Computer science subject classes. https://arxiv.org/archive/cs, 2025. Accessed: 2025-09-24
[6] GPTZero uncovers 50+ Hallucinations in ICLR 2026, December 2025. URL: https://gptzero.me/news/iclr-2026/
[7] GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers, January 2026. URL: https://gptzero.me/news/neurips/
[8] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023
[9] Michael Bailey, David Dittrich, Erin Kenneally, and Doug Maughan. The menlo report. IEEE Security & Privacy, 10(2):71–75, 2012
[10] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604):452–454, 2016
[11] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021
[12] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003
[13] Emery Berger et al. Csrankings: Computer science rankings. https://csrankings.org, 2025. Venue lists developed in consultation with faculty and community surveys; only top conferences per area are included
[14] ICLR 2026 Program Chairs. ICLR 2026 Response to LLM-Generated Papers and Reviews – ICLR Blog, November 2025. URL: https://blog.iclr.cc/2025/11/19/iclr-2026-response-to-llm-generated-papers-and-reviews/
[15] Debby RE Cotton, Peter A Cotton, and J Reuben Shipway. Chatting and cheating: Ensuring academic integrity in the era of chatgpt. Innovations in Education and Teaching International, 61(2):228–239, 2024
[16] Yogesh K Dwivedi, Nir Kshetri, Laurie Hughes, Emma Louise Slade, Anand Jeyaraj, Arpan Kumar Kar, Abdullah M Baabdullah, Alex Koohang, Vishal Raghavan, Manju Ahuja, et al. So what if chatgpt wrote it? multidisciplinary perspectives on opportunities, challenges and implications of generative conversational ai for research, practice and policy. Internatio...
[17] Iztok Fister Jr, Iztok Fister, and Matjaž Perc. Toward the discovery of citation cartels in citation networks. Frontiers in Physics, 4:49, 2016
[18] S. A Greenberg. How citation distortions create unfounded authority: analysis of a citation network. BMJ, 339(jul20 3):b2680–b2680, July 2009. URL: https://www.bmj.com/lookup/doi/10.1136/bmj.b2680, doi:10.1136/bmj.b2680
[19] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023
[20] IEEE. Submission and peer review policies: Guidelines for artificial intelligence (ai)-generated text. https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/submission-and-peer-review-policies/#ai-generated-content, 2025
[21] International Organization for Standardization, Geneva, Switzerland. Information and documentation — Guidelines for bibliographic references and citations to information resources, 4th edition, 2021. Standard No. ISO 690:2021. URL: https://www.iso.org/standard/72642.html
[22] John PA Ioannidis. Why most published research findings are false. PLoS medicine, 2(8):e124, 2005
[23] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023
[24] Norman Kaplan. The norms of citation behavior: Prolegomena to the footnote. American documentation, 16(3):179–184, 1965
[25] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023
[26] Hao-Ping Lee, Advait Sarkar, Lev Tankelevitch, Ian Drosos, Sean Rintel, Richard Banks, and Nicholas Wilson. The impact of generative ai on critical thinking: Self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. In Proceedings of the 2025 CHI conference on human factors in computing systems, pages 1–22, 2025
[27] Vladimir I Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, 1966
[28] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. Halueval: A large-scale hallucination evaluation benchmark for large language models. arXiv preprint arXiv:2305.11747, 2023
[29] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022
[30] Zachary C Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. Queue, 17(1):45–77, 2018
[31] Patrice Lopez and Luca Foppiano. Grobid client python. https://github.com/kermitt2/grobid-client-python, 2017–2025
[32] Brady D Lund, Ting Wang, Nishith Reddy Mannuru, Bing Nie, Somipam Shimray, and Ziang Wang. Chatgpt and a new academic reality: Artificial intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology, 74(5):570–581, 2023
[33] Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, et al. Artificial intelligence index report 2025. arXiv preprint arXiv:2504.07139, 2025
[34] Robert K Merton. The sociology of science: Theoretical and empirical investigations. University of Chicago press, 1973
[35] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning, pages 24950–24962. PMLR, 2023
[36] Nature Editorial. Tools such as chatgpt threaten transparent science; here are our ground rules for their use. Nature, 613(7945):612, 2023
[37] Mark EJ Newman. The structure of scientific collaboration networks. Proceedings of the national academy of sciences, 98(2):404–409, 2001
[38] Bolaji David Oladokun, Rexwhite Tega Enakrire, Adefila Kolawole Emmanuel, Yusuf Ayodeji Ajani, and Adebowale Jeremy Adetayo. Hallucitation in scientific writing: Exploring evidence from chatgpt versions 3.5 and 4o in responses to selected questions in librarianship. Journal of Web Librarianship, 19(1):62–92, 2025
[39] Christian Rossow, Christian J Dietrich, Chris Grier, Christian Kreibich, Vern Paxson, Norbert Pohlmann, Herbert Bos, and Maarten Van Steen. Prudent practices for designing malware experiments: Status quo and outlook. In 2012 IEEE Symposium on Security and Privacy, pages 65–79. IEEE, 2012
[40] Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. Hallucitation matters: Revealing the impact of hallucinated references with 300 hallucinated papers in acl conferences. arXiv preprint arXiv:2601.18724, 2026
[41] Moritz Schloegel, Daniel Klischies, Simon Koch, David Klein, Lukas Gerlach, Malte Wessels, Leon Trampert, Martin Johns, Mathy Vanhoef, Michael Schwarz, Thorsten Holz, and Jo Van Bulck. Confusing Value with Enumeration: Studying the Use of CVEs in Academia. pages 2887–2906, 2025. URL: https://www.usenix.org/conference/usenixsecurity25/presentation/schloegel
[42] Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. Ai models collapse when trained on recursively generated data. Nature, 631(8022):755–759, 2024
[43] Mikhail Simkin and Vwani Roychowdhury. Read before you cite! Complex Syst., 14, 01 2003. doi:10.25088/ComplexSystems.14.3.269
[44] Mikhail V Simkin and Vwani P Roychowdhury. Stochastic modeling of citation slips. Scientometrics, 62(3):367–384, 2005
[45] Richard Smith. Peer review: a flawed process at the heart of science and journals. Journal of the royal society of medicine, 99(4):178–182, 2006
[46] James H Sweetland. Errors in bibliographic citations: A continuing problem. The library quarterly, 59(4):291–304, 1989
[47] Aaron Tay. Why Ghost References Still Haunt Us in 2025—And Why It’s Not Just About LLMs, December
[48] URL: https://aarontay.substack.com/p/why-ghost-references-still-haunt
[49] H Holden Thorp. Chatgpt is fun, but not an author. Science, 379(6630):313–313, 2023
[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
[51] Kiri Wagstaff. Machine learning that matters. arXiv preprint arXiv:1206.4656, 2012
[52] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. Computational Linguistics, pages 1–46, 2025
