CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
Pith reviewed 2026-05-15 18:48 UTC · model grok-4.3
The pith
CiteAudit introduces a multi-agent pipeline that detects hallucinated citations more accurately than direct queries to state-of-the-art LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The multi-agent verification pipeline in CiteAudit achieves superior verification performance over state-of-the-art LLMs and commercial baselines on a human-validated dataset that covers diverse domains and hallucination types, enabling scalable auditing of scientific references.
What carries the argument
A multi-agent pipeline that decomposes citation checking into metadata extraction, memory lookup, web-based retrieval, and final judgment.
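As a rough illustration of that decomposition, here is a minimal Python sketch. It is not the authors' implementation: the `Citation` fields, the regex-based extractor, the in-memory `KNOWN_WORKS` store, and the stubbed `no_web` lookup are all hypothetical stand-ins, chosen only to show how the four stages could hand off to one another.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Citation:
    title: str
    year: Optional[int]


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def extract_metadata(raw: str) -> Citation:
    """Stage 1 (toy heuristic): first 4-digit year, then the sentence after it as the title."""
    match = re.search(r"\b(?:19|20)\d{2}\b", raw)
    year = int(match.group(0)) if match else None
    rest = raw[match.end():] if match else raw
    title = rest.strip(" .").split(".")[0]
    return Citation(title=title, year=year)


# Stage 2 store: a toy cache of already-verified works (an assumption for this sketch).
KNOWN_WORKS = {
    normalize("A review of the literature on citation impact indicators"): 2016,
}


def memory_lookup(c: Citation) -> Optional[int]:
    """Stage 2: return the cached publication year if this title was verified before."""
    return KNOWN_WORKS.get(normalize(c.title))


def judge(c: Citation, found_year: Optional[int]) -> str:
    """Stage 4: compare the claimed metadata against whatever evidence was found."""
    if found_year is None:
        return "unverified"
    return "verified" if c.year == found_year else "mismatch"


def audit(raw: str, web_lookup: Callable[[Citation], Optional[int]]) -> str:
    """Run the stages in order; web retrieval (stage 3) is consulted only on a cache miss."""
    citation = extract_metadata(raw)
    found = memory_lookup(citation)
    if found is None:
        found = web_lookup(citation)
    return judge(citation, found)


def no_web(_: Citation) -> Optional[int]:
    """Offline stub for stage 3; a real system would query a search engine or index."""
    return None


real = "Ludo Waltman. 2016. A review of the literature on citation impact indicators."
fake = "A. Author. 2024. A title that matches nothing in the store."
print(audit(real, no_web))  # verified
print(audit(fake, no_web))  # unverified
```

Passing the retrieval step in as a callable keeps the sketch runnable offline and makes it easy to swap in a real lookup later.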
If this is right
- Automated citation auditing becomes feasible at the scale of entire research corpora.
- Scholarly publishing can incorporate routine checks that flag fabricated references before submission.
- The benchmark dataset supports standardized evaluation of future citation-verification tools.
- Trust in LLM-assisted literature reviews increases when outputs pass the multi-agent filter.
Where Pith is reading between the lines
- Writing assistants could embed similar stepwise verification to prevent citation errors at the drafting stage.
- The decomposition approach may generalize to checking other generated factual claims beyond references.
- Publishers might adopt automated audits as a standard gate before peer review.
Load-bearing premise
The human-validated dataset spanning diverse domains and hallucination types is representative of real-world citation errors and the multi-agent pipeline generalizes beyond the test set.
What would settle it
The superiority claim would be falsified by a fresh collection of hallucinated citations, generated by a different LLM or drawn from an unseen domain, on which the pipeline's accuracy falls below that of the best baseline.
Original abstract
Scientific research relies on citation integrity, yet large language models (LLMs) have introduced a critical risk: fabricated references that appear plausible but correspond to no real publications. As manual verification becomes infeasible and existing automated tools remain fragile, we introduce CiteAudit, a comprehensive benchmark and detection framework for hallucinated citations. We design a multi-agent verification pipeline that decomposes citation checking into metadata extraction, memory lookup, web-based retrieval, and final judgment. To evaluate this, we construct a large-scale, human-validated dataset spanning diverse domains and hallucination types. Experiments demonstrate that our framework achieves superior verification performance over state-of-the-art LLMs and commercial baselines. Our work provides the necessary infrastructure to audit citations at scale and safeguard the trustworthiness of scholarly discourse. Code is available at https://github.com/shiiiikw/CiteAudit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CiteAudit, a benchmark and multi-agent verification framework for detecting hallucinated scientific citations generated by LLMs. The framework decomposes verification into metadata extraction, memory lookup, web-based retrieval, and final judgment stages. It is evaluated on a large-scale human-validated dataset spanning diverse domains and hallucination types, with experiments claiming superior performance over state-of-the-art LLMs and commercial baselines. Code is released at a public GitHub repository.
Significance. If the superiority result holds under a representative test distribution and with full methodological transparency, the work supplies practical infrastructure for scalable citation auditing, directly addressing a pressing integrity issue in LLM-assisted scholarly writing. The multi-agent decomposition and public code release are concrete strengths that could enable follow-on research.
major comments (2)
- [Abstract / Dataset section] Abstract and dataset construction section: the claim of a 'large-scale, human-validated dataset spanning diverse domains and hallucination types' is load-bearing for the superiority result, yet no information is provided on the generation protocol, sampling strategy, hallucination-type distribution, or inter-annotator agreement; without these, it is impossible to rule out that the test distribution was constructed in a manner that favors the proposed multi-agent pipeline over the baselines.
- [Experiments] Experiments section: the reported superior verification performance lacks error bars, ablation results on individual pipeline components (e.g., retrieval vs. judgment), or statistical significance tests against the SOTA LLM and commercial baselines; this prevents assessment of whether the gains are robust or merely artifacts of the evaluation setup.
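The inter-annotator agreement asked for in the first major comment is often reported as Fleiss' kappa when more than two annotators label each item. The sketch below shows the computation on invented counts; the two categories and the three-annotator setup are assumptions for illustration, not details from the paper.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of raters.

    counts[i][j] is how many raters assigned item i to category j.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: fraction of rater pairs that agree on item i.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used

    return (p_bar - p_e) / (1 - p_e)


# Toy data: 4 citations, 3 annotators, columns = [real, hallucinated].
toy_counts = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(f"{fleiss_kappa(toy_counts):.3f}")  # 0.625
```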
minor comments (2)
- [Dataset description] The manuscript should clarify the exact definition of each hallucination type and how the human validation protocol was conducted to improve reproducibility.
- [Results figures/tables] Figure and table captions could be expanded to include the precise metrics (e.g., precision, recall, F1) and baseline model versions used.
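To make the metrics named in the last comment concrete, the sketch below computes precision, recall, and F1 for a binary "hallucinated" label. The label strings and the toy predictions are illustrative assumptions, not data or conventions taken from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str], pred: Sequence[str], positive: str = "hallucinated"
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for a single positive class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    # Fall back to 0.0 when a denominator is empty rather than dividing by zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


toy_gold = ["hallucinated", "hallucinated", "real", "real", "hallucinated"]
toy_pred = ["hallucinated", "real", "real", "hallucinated", "hallucinated"]
p, r, f = precision_recall_f1(toy_gold, toy_pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Reporting all three per baseline, together with the exact model version, would let readers see whether a system trades missed fabrications for false alarms.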
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for improving transparency and rigor. We address each major comment below and will incorporate the suggested additions in the revised manuscript to strengthen the presentation of the dataset and experimental results.
- Referee: Abstract and dataset construction section: the claim of a 'large-scale, human-validated dataset spanning diverse domains and hallucination types' is load-bearing for the superiority result, yet no information is provided on the generation protocol, sampling strategy, hallucination-type distribution, or inter-annotator agreement; without these, it is impossible to rule out that the test distribution was constructed in a manner that favors the proposed multi-agent pipeline over the baselines.
  Authors: We agree that the original manuscript provided insufficient detail on dataset construction. In the revision, we will expand the Dataset section with: (1) the full generation protocol for hallucinated citations (including how plausible fakes were created across domains), (2) the sampling strategy used to ensure domain diversity and balanced hallucination types, (3) the exact distribution of hallucination categories in the final dataset, and (4) inter-annotator agreement statistics (e.g., Fleiss' kappa or percentage agreement) from the human validation process. These additions will allow independent assessment of whether the test distribution is representative and unbiased with respect to our pipeline.
  Revision: yes
- Referee: Experiments section: the reported superior verification performance lacks error bars, ablation results on individual pipeline components (e.g., retrieval vs. judgment), or statistical significance tests against the SOTA LLM and commercial baselines; this prevents assessment of whether the gains are robust or merely artifacts of the evaluation setup.
  Authors: We concur that the experimental reporting requires greater statistical transparency. The revised Experiments section will include: (1) error bars (standard deviation across 5 random seeds or cross-validation folds) for all reported metrics, (2) ablation studies isolating the contribution of each pipeline stage (metadata extraction, memory lookup, web retrieval, and judgment), and (3) statistical significance tests (paired t-tests with Bonferroni correction or McNemar's test) comparing CiteAudit against each baseline. These results will be presented in new tables and figures to demonstrate that the performance gains are robust rather than setup-dependent.
  Revision: yes
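Of the tests the authors list, McNemar's test suits the case where two systems are scored on the same citations, because it looks only at the items on which they disagree. The sketch below implements the exact (binomial) two-sided version; the helper names and the counts in the example are invented for illustration and do not come from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    correct_a: Sequence[bool], correct_b: Sequence[bool]
) -> Tuple[int, int]:
    """Count items only system A got right (b) and items only system B got right (c)."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    b = sum(a and not o for a, o in zip(correct_a, correct_b))
    c = sum(o and not a for a, o in zip(correct_a, correct_b))
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    Under the null hypothesis the discordant pairs split 50/50, so the
    smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example: 15 items only the pipeline gets right, 5 only the baseline gets right.
p_value = mcnemar_exact(15, 5)
print(f"{p_value:.4f}")  # 0.0414
```

With several baselines, a Bonferroni correction would compare each p-value against the chosen significance level divided by the number of comparisons.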
Circularity Check
No significant circularity detected
The paper presents an empirical benchmark and multi-agent pipeline for citation verification, with performance claims resting on experiments against external baselines using a human-validated dataset. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential reductions appear in the provided text. The evaluation relies on external web retrieval and human labels rather than any self-citation chain or ansatz that collapses the central result to its own inputs. This is a standard non-circular empirical contribution.
Forward citations
Cited by 4 Pith papers
- Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection
  CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.
- Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
  A new framework parses and evaluates citations in LLM deep research reports across link validity, relevance, and factuality, finding 94%+ link success but only 39-77% factual accuracy.
- BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
  Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.
- sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing
  An open-source local linter verifies reference integrity and claim support in scientific manuscripts using public databases and consumer hardware, with an experimental contribution scoring extension.