pith. machine review for the scientific record.

arxiv: 2605.08583 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: no theorem link

Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords citation hallucination detection · multi-agent framework · bibliographic verification · LLM reliability · synthetic benchmark · field-level adjudication · fabricated references
0 comments

The pith

CiteTracer detects citation hallucinations at 97.1 percent accuracy by retrieving evidence across sources and adjudicating each citation field against a three-class taxonomy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that citation hallucination detection improves when reframed as taxonomy-aligned field-level adjudication rather than binary found-or-not decisions. It introduces CiteTracer, a cascading multi-agent system that extracts citations from PDFs and BibTeX files, gathers evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to specialist judgers. The system is evaluated on a new benchmark of 2,450 synthetic citations created from real seeds with controlled mutations and 957 real-world fabricated citations from conference submissions. A sympathetic reader would care because large language models are now used for scientific writing and can insert plausible but unverifiable references that undermine the integrity of published work.

Core claim

CiteTracer is a cascading multi-agent detector built on a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. It extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. On the synthetic benchmark it reaches 97.1 percent accuracy with class-level F1 scores of 97.0, 95.8, and 98.5; on the real-world set it detects 97.1 percent of fabrications without abstaining.
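The cascading evidence collection described here can be sketched as a cost-ordered fallback chain. This is a hypothetical illustration, assuming a simple `Evidence` dict and toy stage functions; the paper's actual connector interfaces are not shown in the review.

```python
from typing import Callable, Optional

# Hypothetical sketch of a cascading evidence collector: stages are tried in
# order of increasing cost, and the cascade stops at the first stage that
# returns usable evidence. Stage names and the Evidence shape are assumptions.
Evidence = dict  # e.g. {"title": ..., "year": ..., "source": ...}

def cascade_retrieve(citation: dict,
                     stages: list[tuple[str, Callable[[dict], Optional[Evidence]]]]
                     ) -> Optional[Evidence]:
    """Return the first non-empty evidence hit, tagged with its source stage."""
    for name, fetch in stages:
        evidence = fetch(citation)
        if evidence:
            evidence["source"] = name
            return evidence
    return None  # no stage produced evidence: abstain or route to judgers

# Toy stages standing in for cache lookup, URL fetch, scholar connectors,
# and web search; only the cache is populated in this sketch.
cache = {"10.1000/xyz": {"title": "A Real Paper", "year": 2023}}
stages = [
    ("cache", lambda c: dict(cache[c["doi"]]) if c.get("doi") in cache else None),
    ("url_fetch", lambda c: None),   # would dereference c.get("url")
    ("scholar", lambda c: None),     # would query the scholar connectors
    ("web_search", lambda c: None),  # last-resort web search
]

hit = cascade_retrieve({"doi": "10.1000/xyz"}, stages)    # served from cache
miss = cascade_retrieve({"doi": "10.9999/none"}, stages)  # falls through: None
```

Because cheap stages short-circuit expensive ones, a well-populated cache keeps most lookups off the network; that ordering is the point of a cascade.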

What carries the argument

CiteTracer, the cascading multi-agent detector that performs taxonomy-aligned field-level adjudication after multi-source evidence retrieval and deterministic matching.

If this is right

  • Auditors receive field-level signals rather than simple binary verification outcomes.
  • The detector identifies 97.1 percent of fabrications in a collection of desk-rejected real submissions.
  • Class-level F1 scores exceed 95 percent across Real, Potential, and Hallucinated categories on synthetic data.
  • The released benchmark of mutated real seeds paired with actual fabrications supports further detector development.
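How field-level signals might roll up into the three top-level classes can be sketched as follows; the paper's 12-code taxonomy is far finer-grained, and this `adjudicate` mapping is purely an illustrative assumption.

```python
from typing import Optional

def adjudicate(field_results: Optional[dict]) -> str:
    """Collapse per-field match booleans into Real / Potential / Hallucinated.

    Illustrative rule of thumb only; the paper's class-specialist judgers
    apply a 12-code taxonomy, not this three-way shortcut.
    """
    if not field_results:
        return "Hallucinated"  # no evidence retrieved for any field
    matched = sum(field_results.values())
    if matched == len(field_results):
        return "Real"          # every checked field agrees with the evidence
    if matched > 0:
        return "Potential"     # partial agreement: the ambiguous middle band
    return "Hallucinated"      # evidence found, but every field disagrees
```

The middle band is what binary found/not-found detectors flatten away, and it is where the review says ambiguous cases get routed to specialist judgers.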

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration into writing assistants could flag suspicious citations during drafting instead of after submission.
  • Performance may drop for citations from low-indexed or non-English sources where web retrieval is incomplete.
  • Extending the taxonomy or adding domain-specific retrieval agents could address edge cases the current pipeline misses.

Load-bearing premise

The retrieval pipeline of cache lookup, URL fetch, scholar connectors, and web search plus deterministic field matching will produce sufficient evidence for most cases, and the controlled LLM mutations in the synthetic benchmark adequately represent real-world citation fabrications.
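Deterministic field matching of the kind this premise leans on can be sketched with stdlib tools; the normalization rules, field names, and the 0.9 title threshold below are assumptions for illustration, not the paper's actual matcher.

```python
import re
from difflib import SequenceMatcher

def _norm(s: str) -> str:
    """Lowercase and strip punctuation so cosmetic differences don't count."""
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def match_fields(claimed: dict, evidence: dict,
                 title_threshold: float = 0.9) -> dict:
    """Compare a claimed citation against retrieved evidence, field by field.

    Exact match for DOI and year, fuzzy ratio for the title; each field gets
    its own verdict, which is the field-level signal auditors receive.
    """
    results = {}
    if "doi" in claimed and "doi" in evidence:
        results["doi"] = claimed["doi"].lower() == evidence["doi"].lower()
    if "year" in claimed and "year" in evidence:
        results["year"] = int(claimed["year"]) == int(evidence["year"])
    if "title" in claimed and "title" in evidence:
        ratio = SequenceMatcher(None, _norm(claimed["title"]),
                                _norm(evidence["title"])).ratio()
        results["title"] = ratio >= title_threshold
    return results

claimed = {"title": "Attention Is All You Need!", "year": "2017", "doi": "10.1/abc"}
evidence = {"title": "Attention Is All You Need", "year": 2017, "doi": "10.1/ABC"}
verdicts = match_fields(claimed, evidence)  # all three fields match here
```

Such rules are deterministic and auditable, but they only fire when retrieval returns evidence at all, which is exactly the premise being flagged.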

What would settle it

A set of real-world fabricated citations where the retrieval pipeline returns no usable evidence for any field, causing the system to miss the fabrications or abstain, would show that the accuracy claims do not hold outside the tested conditions.

Figures

Figures reproduced from arXiv: 2605.08583 by Mingzhe Li, Shiqing Ma, Zhiqiang Lin.

Figure 1. Overview of CiteTracer. Four stages run in sequence: (1) the Reference Extractor parses each citation block into a structured field-level record; (2) the Cascading Evidence Collector walks a memory cache, URL fetch, eight Scholar Connectors, and web search; (3) the Field Matcher compares the record against the evidence field by field; (4) Class-specialist Judgers adjudicate ambiguous cases and emit a taxo…

Figure 2. Confusion matrix on BibTeX input.

Figure 3. Seed-pool composition of the 2,270 synthetic entries that derive from a real publication. Left: distribution over the six Scholar Connectors that returned the canonical record. Right: distribution over the 15 research topics used to query the connectors. P3 pure fabrications (180 entries) are excluded from both panels by construction.

Figure 4. Per-subtype TPR (left, %) and FPR (right, %) across the four chatbot baselines and …
read the original abstract

Large language models are increasingly used in scientific writing, yet they can fabricate citation-shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not-found decisions and rely on brittle parsing or incomplete retrieval, offering little field-level signal to auditors. We reframe citation hallucination detection as taxonomy-aligned field-level adjudication and introduce a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi-agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real-world fabricated citations drawn from ICLR 2026 and an anonymous conference desk-rejected submissions. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class-level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real-world set without abstaining. Code: https://github.com/aaFrostnova/CiteTracer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces CiteTracer, a cascading multi-agent detector for citation hallucinations that uses a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Citations are extracted from PDF/BibTeX, evidence is retrieved via cache/URL/scholar/web search, deterministic field matching is applied, and ambiguous cases are routed to specialist judgers. It reports 97.1% accuracy (with per-class F1 of 97.0/95.8/98.5) on a synthetic benchmark of 2,450 controlled LLM-mutated citations and 97.1% detection of 957 real-world fabricated citations from ICLR 2026 and desk-rejected submissions, with code released.

Significance. If the results hold, the work offers a practical, taxonomy-aligned alternative to binary found/not-found detectors for a timely problem in LLM-assisted scientific writing. The release of code, the synthetic benchmark built from real seeds, and the real-world set constitute concrete contributions that could support follow-on research and auditing tools.

major comments (1)
  1. [Abstract] Abstract: the 957 real-world fabricated citations are described only as 'drawn from ICLR 2026 and an anonymous conference desk-rejected submissions' with no account of identification, verification, labeling, inclusion criteria, or how many candidates were screened. This detail is load-bearing for the 97.1% detection claim, because the reported rate could be conditioned on an easier subset (e.g., obvious non-existent DOIs or failures already caught by retrieval pipelines similar to CiteTracer's) rather than the full distribution of citation hallucinations.
minor comments (2)
  1. [Evaluation] No error bars, ablation results on the retrieval components, or failure-case analysis are mentioned, which would help assess robustness beyond the headline numbers.
  2. [Abstract] The abstract refers to 'an anonymous conference' without further clarification; if possible, more detail on the source distribution would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for highlighting an important point about transparency in the real-world evaluation. We address the major comment below and will incorporate revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 957 real-world fabricated citations are described only as 'drawn from ICLR 2026 and an anonymous conference desk-rejected submissions' with no account of identification, verification, labeling, inclusion criteria, or how many candidates were screened. This detail is load-bearing for the 97.1% detection claim, because the reported rate could be conditioned on an easier subset (e.g., obvious non-existent DOIs or failures already caught by retrieval pipelines similar to CiteTracer's) rather than the full distribution of citation hallucinations.

    Authors: We agree that the current description of the real-world dataset is insufficiently detailed and that this information is necessary to evaluate the 97.1% detection rate and rule out selection bias. In the revised manuscript we will expand the Experiments section (and update the abstract accordingly) with a full account of dataset construction. This will include: the identification process (initial flagging of suspicious citations in ICLR 2026 submissions and desk-rejected papers via automated DOI/URL checks combined with reviewer or organizer reports); verification steps (multi-source retrieval attempts confirming absence or mismatch, followed by author adjudication); labeling procedure (application of the 12-code taxonomy by multiple annotators, with reported inter-annotator agreement); explicit inclusion criteria (citations that were fabricated yet presented in a form that could plausibly pass casual inspection); and screening statistics (total candidates examined and the fraction retained as the final 957). We will also add a short discussion of the distribution of hallucination types in this set to demonstrate that it is not limited to trivial cases already caught by basic retrieval. These changes will make the evaluation transparent while respecting the anonymity constraints of the source conference. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on externally constructed benchmarks independent of the detector definition.

full rationale

The paper introduces a taxonomy, retrieval pipeline, and multi-agent adjudication framework whose definitions and components are specified prior to and independently of the reported accuracy numbers. The synthetic benchmark is generated from real citation seeds via controlled external LLM mutations, and the real-world set is drawn from conference submissions; neither is defined in terms of the detector's outputs or fitted parameters. No equations, self-referential predictions, or load-bearing self-citations appear in the provided text that would reduce the 97.1% accuracy or F1 scores to tautological inputs by construction. The evaluation is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard assumptions about LLM citation behavior and retrieval reliability plus the newly introduced taxonomy and agent structure. No free parameters are explicitly fitted in the abstract; the system uses deterministic matching rules.

axioms (1)
  • domain assumption LLMs can generate plausible but non-existent citations that require external verification
    Stated as background motivation in the abstract.
invented entities (2)
  • 12-code taxonomy for Real, Potential, and Hallucinated citations no independent evidence
    purpose: To provide fine-grained classification beyond binary decisions
    Introduced by the authors to structure adjudication
  • CiteTracer cascading multi-agent detector no independent evidence
    purpose: To extract, retrieve, match, and judge citations
    Core system contribution of the paper

pith-pipeline@v0.9.0 · 5543 in / 1418 out tokens · 48090 ms · 2026-05-12T01:14:42.048041+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

  1. [1]

    Transparency report on AI-generated citations in ACM CCS 2026 submissions

    ACM CCS 2026 Program Committee. Transparency report on AI-generated citations in ACM CCS 2026 submissions. https://github.com/ACM-CCS-2026/Transparency-Report

  2. [2]

    Claude (opus 4.7 version) [large language model], 2026

    Anthropic. Claude (opus 4.7 version) [large language model], 2026

  3. [3]

    UMass citation field extraction dataset

    Sam Anzaroot and Andrew McCallum. UMass citation field extraction dataset. http://www.iesl.cs.umass.edu/data/data-umasscitationfield, 2013

  4. [4]

    Learning soft linear constraints with application to citation field extraction. arXiv preprint arXiv:1403.1349, 2014

    Sam Anzaroot, Alexandre Passos, David Belanger, and Andrew McCallum. Learning soft linear constraints with application to citation field extraction. arXiv preprint arXiv:1403.1349, 2014

  5. [5]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. Journal of Medical Internet Research, 26(1):e53164, 2024

    Mikaël Chelli, Jules Descamps, Vincent Lavoué, Christophe Trojani, Michel Azar, Marcel Deckert, Jean-Luc Raynier, Gilles Clowez, Pascal Boileau, Caroline Ruetsch-Chelli, et al. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. Journal of Medical Internet Research, 26(1):e53164, 2024

  7. [7]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025

  8. [8]

    CiteCheck: AI-powered citation verification

    CiteCheck. CiteCheck: AI-powered citation verification. https://citecheck.ai/, 2024. Accessed: 2026-04

  9. [9]

    Citely: AI citation assistant. https://citely.ai/, 2024

    Citely. Citely: AI citation assistant. https://citely.ai/, 2024. Accessed: 2026-04

  10. [10]

    Gemini (3.1 pro version) [large language model], 2026

    Google. Gemini (3.1 pro version) [large language model], 2026

  11. [11]

    GPTZero finds over 50 hallucinations in ICLR 2026 submissions

    GPTZero. GPTZero finds over 50 hallucinations in ICLR 2026 submissions. https://gptzero.me/news/iclr-2026, 2025

  12. [12]

    GPTZero flags fabricated citations in NeurIPS submissions

    GPTZero. GPTZero flags fabricated citations in NeurIPS submissions. https://gptzero.me/news/neurips/, 2025. Accessed: 2026-05

  13. [13]

    GPTZero: Detecting AI-generated text, 2023

    GPTZero Team. GPTZero: Detecting AI-generated text, 2023

  14. [14]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

  15. [15]

    Kimi k2.5: Visual agentic intelligence, 2026

    Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence, 2026

  16. [16]

    Autodata: A multi-agent system for open web data collection. arXiv preprint arXiv:2505.15859, 2025

    Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, et al. Autodata: A multi-agent system for open web data collection. arXiv preprint arXiv:2505.15859, 2025

  17. [17]

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of EMNLP, 2023

  18. [18]

    ChatGPT (5.5 version) [large language model], 2026

    OpenAI. ChatGPT (5.5 version) [large language model], 2026

  19. [19]

    Hallucination to truth: a review of fact-checking and factuality evaluation in large language models. Artificial Intelligence Review, 2026

    Subhey Sadi Rahman, Md Adnanul Islam, Md Mahbub Alam, Musarrat Zeba, Md Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, and Sami Azam. Hallucination to truth: a review of fact-checking and factuality evaluation in large language models. Artificial Intelligence Review, 2026

  20. [20]

    RefCheck-AI

    RefCheck-AI. RefCheck-AI. https://github.com/HuaHenry/RefCheck_ai, 2024. Accessed: 2026-04

  21. [21]

    Academic urban legends. Social Studies of Science, 44(4):638–654, 2014

    Ole Bjørn Rekdal. Academic urban legends. Social Studies of Science, 44(4):638–654, 2014

  22. [22]

    HalluCitation matters: Revealing the impact of hallucinated references with 300 hallucinated papers in ACL conferences

    Y. Sakai, H. Kamigaito, and T. Watanabe. HalluCitation matters: Revealing the impact of hallucinated references with 300 hallucinated papers in ACL conferences. https://arxiv.org/abs/2601.18724, 2026

  23. [23]

    Assessing citation integrity in biomedical publications: corpus annotation and NLP models

    Maria Janina Sarol, Shufan Ming, Shruthan Radhakrishna, Jodi Schneider, and Halil Kilicoglu. Assessing citation integrity in biomedical publications: corpus annotation and NLP models. Bioinformatics, 40(7):btae420, 2024

  24. [24]

    Hallucinator: A citation hallucination checker

    Gianluca Sbardella. Hallucinator: A citation hallucination checker. https://github.com/gianlucasb/hallucinator, 2024

  25. [25]

    SwanRef: Reference verification platform

    SwanRef. SwanRef: Reference verification platform. https://www.swanref.org/, 2024. Accessed: 2026-04

  26. [26]

    AI conference’s papers contaminated by AI hallucinations

    The Register. AI conference’s papers contaminated by AI hallucinations. https://www.theregister.com/2026/01/22/neurips_papers_contaiminated_ai_hallucinations/, 2026

  27. [27]

    Web retrieval agents for evidence-based misinformation detection. arXiv preprint arXiv:2409.00009, 2024

    Jacob-Junqi Tian, Hao Yu, Yury Orlovskiy, Tyler Vergho, Mauricio Rivera, Mayank Goel, Zachary Yang, Jean-Francois Godbout, Reihaneh Rabbany, and Kellin Pelrine. Web retrieval agents for evidence-based misinformation detection. arXiv preprint arXiv:2409.00009, 2024

  28. [28]

    S. M. T. I. Tonmoy, S. M. Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313, 2024

  29. [29]

    L. J. Janse van Rensburg. AI-powered citation auditing: A zero-assumption protocol for systematic reference verification in academic research, 2025

  30. [30]

    Fabrication and errors in the bibliographic citations generated by ChatGPT

    William H. Walters and Esther Isabelle Wilder. Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13(1):14045, 2023

  31. [31]

    A review of the literature on citation impact indicators. Journal of Informetrics, 10(2):365–391, 2016

    Ludo Waltman. A review of the literature on citation impact indicators. Journal of Informetrics, 10(2):365–391, 2016

  32. [32]

    DeepSeek-OCR 2: Visual causal flow. arXiv preprint arXiv:2601.20552, 2026

    Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR 2: Visual causal flow. arXiv preprint arXiv:2601.20552, 2026

  33. [33]

    CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

    Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V Chawla, and Yanfang Ye. CiteAudit: You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452, 2026