
arxiv: 2602.23452 · v3 · submitted 2026-02-26 · 💻 cs.CL · cs.DL

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Pith reviewed 2026-05-15 18:48 UTC · model grok-4.3

classification 💻 cs.CL cs.DL
keywords citation verification · hallucinated references · LLM · multi-agent pipeline · benchmark dataset · scientific integrity · reference checking · fabricated citations

The pith

CiteAudit introduces a multi-agent pipeline that detects hallucinated citations more accurately than direct queries to state-of-the-art LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Scientific papers depend on citation integrity, yet LLMs frequently generate plausible but nonexistent references. The paper presents CiteAudit as both a benchmark and a detection framework to address this risk at scale. It decomposes verification into metadata extraction, memory lookup, web retrieval, and judgment using multiple agents. A large human-validated dataset spanning domains and hallucination types serves as the testbed. Experiments show the pipeline outperforms existing LLMs and commercial baselines, supplying infrastructure for ongoing citation audits.

Core claim

The multi-agent verification pipeline in CiteAudit achieves superior verification performance over state-of-the-art LLMs and commercial baselines on a human-validated dataset that covers diverse domains and hallucination types, enabling scalable auditing of scientific references.

What carries the argument

A multi-agent pipeline that decomposes citation checking into metadata extraction, memory lookup, web-based retrieval, and final judgment.
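
The four stages named above can be sketched as a small function chain. This is a minimal illustration only, not the authors' implementation: the `Citation` record, the helpers `extract_metadata`, `normalize`, and `verify`, the title heuristic, and the three verdict labels are all hypothetical, and the web-retrieval stage is represented by a caller-supplied `search` callback rather than a real search API.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Citation:
    title: str
    year: Optional[int] = None


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so titles compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def extract_metadata(raw: str) -> Citation:
    """Stage 1 (hypothetical heuristic): pull a year and a title from a raw reference."""
    year_match = re.search(r"\b(19|20)\d{2}\b", raw)
    year = int(year_match.group(0)) if year_match else None
    # Assumption: the title is the longest period-delimited segment.
    segments = [s.strip() for s in raw.split(".") if s.strip()]
    title = max(segments, key=len) if segments else raw.strip()
    return Citation(title=title, year=year)


def verify(
    raw: str,
    memory: Dict[str, Citation],
    search: Callable[[str], Optional[Citation]],
) -> str:
    """Run the four stages in order and return a hypothetical verdict label."""
    cited = extract_metadata(raw)          # Stage 1: metadata extraction
    key = normalize(cited.title)
    record = memory.get(key)               # Stage 2: memory lookup
    if record is None:
        record = search(cited.title)       # Stage 3: retrieval fallback (stubbed)
    # Stage 4: judgment by simple field comparison
    if record is None:
        return "not_found"
    if normalize(record.title) != key:
        return "mismatch"
    if cited.year is not None and record.year is not None and cited.year != record.year:
        return "mismatch"
    return "verified"


# Example wiring with an in-memory store and a search stub that finds nothing.
known = Citation("A Study of Imaginary Widgets", 2024)
memory = {normalize(known.title): known}
verdict = verify(
    "Jane Doe. 2024. A Study of Imaginary Widgets. Journal of Examples",
    memory,
    lambda title: None,
)
```

Checking the cheap local store before falling back to retrieval is one plausible reason for ordering the stages this way; the paper's own rationale is not stated in the text above.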

If this is right

  • Automated citation auditing becomes feasible at the scale of entire research corpora.
  • Scholarly publishing can incorporate routine checks that flag fabricated references before submission.
  • The benchmark dataset supports standardized evaluation of future citation-verification tools.
  • Trust in LLM-assisted literature reviews increases when outputs pass the multi-agent filter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Writing assistants could embed similar stepwise verification to prevent citation errors at the drafting stage.
  • The decomposition approach may generalize to checking other generated factual claims beyond references.
  • Publishers might adopt automated audits as a standard gate before peer review.

Load-bearing premise

The human-validated dataset spanning diverse domains and hallucination types is representative of real-world citation errors and the multi-agent pipeline generalizes beyond the test set.

What would settle it

The superiority claim would be falsified if the pipeline's accuracy fell below that of the best baseline on a fresh collection of hallucinated citations, either generated by a different LLM or drawn from an unseen domain.

Figures

Figures reproduced from arXiv: 2602.23452 by Kaiwen Shi, Lichao Sun, Nitesh V. Chawla, Weixiang Sun, Yanfang Ye, Zheyuan Zhang.

Figure 1: Motivation overview of our citation hallucination
Figure 2: Taxonomy of citation hallucination types.
Figure 3: Overview of CiteAudit, a SOP-driven multi-agent citation verification framework.
Figure 4: Case studies of citation verification
Original abstract

Scientific research relies on citation integrity, yet large language models (LLMs) have introduced a critical risk: fabricated references that appear plausible but correspond to no real publications. As manual verification becomes infeasible and existing automated tools remain fragile, we introduce CiteAudit, a comprehensive benchmark and detection framework for hallucinated citations. We design a multi-agent verification pipeline that decomposes citation checking into metadata extraction, memory lookup, web-based retrieval, and final judgment. To evaluate this, we construct a large-scale, human-validated dataset spanning diverse domains and hallucination types. Experiments demonstrate that our framework achieves superior verification performance over state-of-the-art LLMs and commercial baselines. Our work provides the necessary infrastructure to audit citations at scale and safeguard the trustworthiness of scholarly discourse. Code is available at https://github.com/shiiiikw/CiteAudit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CiteAudit, a benchmark and multi-agent verification framework for detecting hallucinated scientific citations generated by LLMs. The framework decomposes verification into metadata extraction, memory lookup, web-based retrieval, and final judgment stages. It is evaluated on a large-scale human-validated dataset spanning diverse domains and hallucination types, with experiments claiming superior performance over state-of-the-art LLMs and commercial baselines. Code is released at a public GitHub repository.

Significance. If the superiority result holds under a representative test distribution and with full methodological transparency, the work supplies practical infrastructure for scalable citation auditing, directly addressing a pressing integrity issue in LLM-assisted scholarly writing. The multi-agent decomposition and public code release are concrete strengths that could enable follow-on research.

major comments (2)
  1. [Abstract / Dataset section] Abstract and dataset construction section: the claim of a 'large-scale, human-validated dataset spanning diverse domains and hallucination types' is load-bearing for the superiority result, yet no information is provided on the generation protocol, sampling strategy, hallucination-type distribution, or inter-annotator agreement; without these, it is impossible to rule out that the test distribution was constructed in a manner that favors the proposed multi-agent pipeline over the baselines.
  2. [Experiments] Experiments section: the reported superior verification performance lacks error bars, ablation results on individual pipeline components (e.g., retrieval vs. judgment), or statistical significance tests against the SOTA LLM and commercial baselines; this prevents assessment of whether the gains are robust or merely artifacts of the evaluation setup.
minor comments (2)
  1. [Dataset description] The manuscript should clarify the exact definition of each hallucination type and how the human validation protocol was conducted to improve reproducibility.
  2. [Results figures/tables] Figure and table captions could be expanded to include the precise metrics (e.g., precision, recall, F1) and baseline model versions used.
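
For reference, the metrics named in the last minor comment can be computed as below. This is a generic sketch, assuming the positive class is "citation is hallucinated" and that labels are plain booleans; the function name `precision_recall_f1` and the example labels are illustrative and are not taken from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (True = hallucinated)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    # Define each ratio as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only: three hallucinated and two genuine citations.
truth = [True, True, True, False, False]
flags = [True, True, False, True, False]
precision, recall, f1 = precision_recall_f1(truth, flags)
```

Reporting all three matters here because a detector that flags every reference reaches perfect recall with poor precision, so a single accuracy figure can hide the trade-off.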

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving transparency and rigor. We address each major comment below and will incorporate the suggested additions in the revised manuscript to strengthen the presentation of the dataset and experimental results.

  1. Referee: Abstract and dataset construction section: the claim of a 'large-scale, human-validated dataset spanning diverse domains and hallucination types' is load-bearing for the superiority result, yet no information is provided on the generation protocol, sampling strategy, hallucination-type distribution, or inter-annotator agreement; without these, it is impossible to rule out that the test distribution was constructed in a manner that favors the proposed multi-agent pipeline over the baselines.

    Authors: We agree that the original manuscript provided insufficient detail on dataset construction. In the revision, we will expand the Dataset section with: (1) the full generation protocol for hallucinated citations (including how plausible fakes were created across domains), (2) the sampling strategy used to ensure domain diversity and balanced hallucination types, (3) the exact distribution of hallucination categories in the final dataset, and (4) inter-annotator agreement statistics (e.g., Fleiss' kappa or percentage agreement) from the human validation process. These additions will allow independent assessment of whether the test distribution is representative and unbiased with respect to our pipeline. revision: yes

  2. Referee: Experiments section: the reported superior verification performance lacks error bars, ablation results on individual pipeline components (e.g., retrieval vs. judgment), or statistical significance tests against the SOTA LLM and commercial baselines; this prevents assessment of whether the gains are robust or merely artifacts of the evaluation setup.

    Authors: We concur that the experimental reporting requires greater statistical transparency. The revised Experiments section will include: (1) error bars (standard deviation across 5 random seeds or cross-validation folds) for all reported metrics, (2) ablation studies isolating the contribution of each pipeline stage (metadata extraction, memory lookup, web retrieval, and judgment), and (3) statistical significance tests (paired t-tests with Bonferroni correction or McNemar's test) comparing CiteAudit against each baseline. These results will be presented in new tables and figures to demonstrate that the performance gains are robust rather than setup-dependent. revision: yes
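
Two of the statistics named in the responses, Fleiss' kappa and McNemar's test, can be sketched with the standard library alone. This is a textbook-style illustration under stated assumptions, not the authors' analysis code: the function names, the input layout (per-item category counts for kappa, discordant-pair counts for McNemar), and the choice of the exact binomial form of McNemar's test are all choices made for this example.

```python
from math import comb
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa. counts[i][j] = number of raters assigning item i to category j.

    Assumes every item was rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    # Per-item observed agreement, then its mean.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    n_cats = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items system A got right and system B got wrong; c = the reverse.
    Under the null hypothesis the discordant pairs follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Illustrative inputs only: 3 raters, 2 categories (real / hallucinated).
kappa = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
p_value = mcnemar_exact(0, 5)
```

McNemar's test uses only the items on which the two systems disagree, which suits paired comparisons on a shared test set; the exact form avoids the chi-square approximation when discordant counts are small.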

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents an empirical benchmark and multi-agent pipeline for citation verification, with performance claims resting on experiments against external baselines using a human-validated dataset. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential reductions appear in the provided text. The evaluation relies on external web retrieval and human labels rather than any self-citation chain or ansatz that collapses the central result to its own inputs. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work rests on standard LLM prompting and web-search primitives.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection

    cs.CL 2026-05 accept novelty 7.0

    CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.

  2. Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A new framework parses and evaluates citations in LLM deep research reports across link validity, relevance, and factuality, finding 94%+ link success but only 39-77% factual accuracy.

  3. BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

    cs.DL 2026-04 conditional novelty 7.0

    Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

  4. sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

    cs.DL 2026-04 unverdicted novelty 6.0

    An open-source local linter verifies reference integrity and claim support in scientific manuscripts using public databases and consumer hardware, with an experimental contribution scoring extension.
