CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text

Alexander Tessier; Khashayar Khajavi; Rise Adhikari; Shaghayegh Sadeghi

arxiv: 2605.27700 · v1 · pith:3SI5TAFEnew · submitted 2026-05-26 · 💻 cs.DL · cs.AI

Pith reviewed 2026-06-29 13:55 UTC · model grok-4.3

classification 💻 cs.DL cs.AI

keywords citation hallucination detection · LLM scientific text · retrieval grounded verification · metadata fidelity · physics benchmark

The pith

CiteCheck detects LLM citation hallucinations by retrieving real publications and using a structured verifier to label them Exact, Minor, or Major.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate scientific text that includes citations which may be fabricated or contain incorrect details. The paper introduces CiteCheck, which retrieves candidate publications from scholarly sources and compares the given citation to them using a structured LLM-based verifier. This produces labels of Exact match, Minor issues, or Major problems. A benchmark of 982 physics citations with controlled corruptions is used to test the system. The method outperforms several LLM baselines on accuracy and F1 score for this task.

Core claim

CiteCheck retrieves candidate publications from external scholarly sources, compares the citation against the retrieved candidate using a structured LLM verifier, and maps verifier scores into three labels: Exact, Minor, and Major. On the held-out test set of the 982-citation physics benchmark with controlled corruptions, CiteCheck achieves 88.7 macro-F1 and 88.9% accuracy, outperforming GPT, Claude, and Gemini baselines including web-search and few-shot variants.
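For readers less familiar with the two headline metrics, the following is a minimal sketch of how macro-F1 and accuracy are computed over the three labels. The function name and the toy label lists are illustrative assumptions and are not taken from the paper or its benchmark.

```python
from collections import Counter

LABELS = ("Exact", "Minor", "Major")


def macro_f1_and_accuracy(gold, pred, labels=LABELS):
    """Return (macro_f1, accuracy) for parallel lists of gold and predicted labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    # Macro-F1 is the unweighted mean of per-label F1 scores.
    macro_f1 = sum(f1_scores) / len(labels)
    accuracy = sum(tp.values()) / len(gold)
    return macro_f1, accuracy


# Toy data for illustration only; these are not benchmark results.
gold = ["Exact", "Exact", "Minor", "Minor", "Major", "Major"]
pred = ["Exact", "Minor", "Minor", "Minor", "Major", "Exact"]
macro_f1, accuracy = macro_f1_and_accuracy(gold, pred)
```

Because macro-F1 weights each label equally, weak performance on one label (for example Minor) lowers the score even when overall accuracy stays high.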

What carries the argument

The hybrid framework of scholarly retrieval followed by structured LLM comparison that maps to Exact, Minor, and Major labels.
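The retrieve-then-verify flow can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: `Citation`, `check_citation`, the in-memory `INDEX`, the field-overlap scorer, and the rule that an unretrievable citation is labeled Major are all assumptions made for the example. Real scholarly APIs and the LLM verifier are replaced by toy stand-ins, and the second-pass Reviewer LLM described in the paper's pipeline figure is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Citation:
    authors: str
    year: str
    title: str


Retriever = Callable[[Citation], Optional[Citation]]
Verifier = Callable[[Citation, Citation], float]


def check_citation(citation, retrievers, verifier, to_label):
    """Try each retriever in order; score the first candidate found."""
    for retrieve in retrievers:
        candidate = retrieve(citation)
        if candidate is not None:
            return to_label(verifier(citation, candidate))
    # Assumption: a citation with no retrievable candidate is labeled Major.
    return "Major"


# Toy stand-ins (hypothetical): an in-memory index instead of scholarly APIs,
# and a field-overlap score instead of an LLM verifier.
INDEX = {"a toy title": Citation("A. Author", "2020", "A Toy Title")}


def lookup_by_title(citation):
    return INDEX.get(citation.title.lower())


def field_overlap(citation, candidate):
    fields = ("authors", "year", "title")
    same = sum(getattr(citation, f) == getattr(candidate, f) for f in fields)
    return same / len(fields)


def toy_label(score):
    return "Exact" if score == 1.0 else "Minor" if score >= 0.5 else "Major"


# A citation with the wrong year matches two of three fields.
label = check_citation(
    Citation("A. Author", "2019", "A Toy Title"),
    [lookup_by_title], field_overlap, toy_label,
)
```

Passing the retrievers as an ordered list mirrors the idea of a retrieval cascade: cheaper or more precise lookups can be tried first, with broader fallbacks later.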

If this is right

Citation hallucinations can be caught more reliably when external sources are used to ground the check rather than relying on model knowledge alone.
Controlled corruptions in a benchmark provide a way to test for both subtle metadata errors and complete fabrications.
Structured decision rules improve upon unstructured LLM judgments for citation verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-plus-verifier pattern could apply to citation checking in fields other than physics if suitable databases exist.
Embedding the checker into LLM generation pipelines might prevent bad citations before they appear in output.
Minor-issue detections could be used to suggest automatic fixes for metadata drift.
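As a rough illustration of the last point, a fix suggestion could be as simple as a field-by-field diff between the cited metadata and the retrieved record. The sketch below is an editorial assumption: the function `suggest_fixes`, the dictionary layout, and the normalization rule are hypothetical and do not come from the paper.

```python
def suggest_fixes(citation, source, fields=("authors", "year", "title")):
    """Return {field: (cited_value, source_value)} for fields that differ.

    Values are compared after lowercasing and collapsing whitespace, so
    capitalization and spacing differences are not reported.
    """
    def norm(value):
        return " ".join(str(value).lower().split())

    return {
        field: (citation.get(field), source.get(field))
        for field in fields
        if norm(citation.get(field, "")) != norm(source.get(field, ""))
    }


# Toy records for illustration only.
cited = {"authors": "A. Author", "year": "2019", "title": "a toy  title"}
retrieved = {"authors": "A. Author", "year": "2020", "title": "A Toy Title"}
fixes = suggest_fixes(cited, retrieved)
```

Here only the year would be flagged, since the title differs from the retrieved record in capitalization and spacing alone.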

Load-bearing premise

The 982-citation physics benchmark with controlled corruptions sufficiently represents the citation hallucinations that LLMs produce when generating scientific text.
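To make the premise concrete, a controlled corruption can be pictured as a small, labeled transformation of a real reference. The sketch below is an assumption-laden illustration: the operators `shift_year` and `fabricate`, the record layout, and the mapping from operator to label are hypothetical and are not the paper's actual corruption set.

```python
import random


def shift_year(ref, rng):
    """Metadata drift: move the year by a small non-zero offset."""
    out = dict(ref)
    out["year"] = str(int(ref["year"]) + rng.choice([-2, -1, 1, 2]))
    return out


def fabricate(ref, rng):
    """Full fabrication: replace title and authors with placeholders."""
    out = dict(ref)
    out["title"] = "Fabricated title %d" % rng.randrange(10**6)
    out["authors"] = ["N. Obody"]
    return out


# Hypothetical mapping from target label to corruption operator.
CORRUPTIONS = {
    "Exact": lambda ref, rng: dict(ref),
    "Minor": shift_year,
    "Major": fabricate,
}


def corrupt(ref, label, seed=0):
    """Return (corrupted_reference, label), reproducible for a given seed."""
    rng = random.Random(seed)
    return CORRUPTIONS[label](ref, rng), label


# Toy reference for illustration only.
real = {"authors": ["A. Author"], "year": "2020", "title": "A Toy Title"}
minor_example, minor_label = corrupt(real, "Minor", seed=1)
major_example, major_label = corrupt(real, "Major", seed=1)
```

The sketch also shows why the premise matters: every error the benchmark contains comes from a fixed operator set, so error types outside that set are never tested.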

What would settle it

Evaluating CiteCheck on citations actually produced by LLMs during real scientific writing tasks and measuring agreement with verified ground-truth references.

Figures

Figures reproduced from arXiv: 2605.27700 by Alexander Tessier, Khashayar Khajavi, Rise Adhikari, Shaghayegh Sadeghi.

**Figure 1.** The CiteCheck pipeline. A raw citation string is first parsed into structured metadata, then matched against candidates from a retrieval cascade. An LLM verifier scores the citation against the retrieved candidate and the score is mapped to one of three labels (Exact / Minor / Major). When the candidate’s identifier match is suspicious, a Reviewer LLM performs a second-pass check before the final label is …

**Figure 2.** Performance of CiteCheck when the verifier (and web-search fallback) are instantiated with different LLMs, across Exact, Minor, Major, macro-F1, and accuracy. The shaded region marks the performance gap between the strongest and weakest configurations.

… confuse with acceptable citation variation. CiteCheck improves most clearly in this difficult regime, reaching 81.7 Minor F1 compared with 76.6 for the stro…

Original abstract

Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible while containing corrupted metadata or pointing to papers that do not exist. We introduce CiteCheck, a hybrid framework for citation hallucination detection that verifies whether a citation corresponds to a real scholarly work and whether its metadata is faithful to that work. CiteCheck retrieves candidate publications from external scholarly sources, compares the citation against the retrieved candidate using a structured LLM verifier, and maps verifier scores into three labels: Exact, Minor, and Major. We also construct a 982-citation physics benchmark with controlled corruptions that capture both subtle metadata drift and fully fabricated references. On the held-out test set, CiteCheck achieves 88.7 macro-F1 and 88.9% accuracy, outperforming GPT, Claude, and Gemini baselines, including web-search and few-shot variants. These results show that reliable citation verification benefits from combining scholarly retrieval, structured LLM-based comparison, and calibrated decision rules.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CiteCheck pairs retrieval with a structured LLM verifier and beats baselines on its synthetic physics benchmark, but the controlled corruptions may not match the citation errors LLMs actually produce.

The paper's concrete contribution is the hybrid pipeline: it pulls candidate papers from scholarly sources, runs a structured comparison through an LLM, and outputs one of three labels. They also release a 982-citation physics set created by applying fixed corruptions to real references. The reported 88.7 macro-F1 and outperformance over GPT, Claude, and Gemini variants (including search and few-shot) show the combination can work under those conditions.

The approach is straightforward and the three-label scheme gives more granularity than binary checks. Grounding the decision in external retrieval is a clear step beyond pure model self-assessment.

The soft spot is the benchmark. The corruptions are deliberate metadata drift and outright fabrications, but real LLM hallucinations often look different, such as citing a genuine but unrelated paper or blending details across multiple sources into a single plausible entry. If those patterns dominate in practice, the held-out numbers will not transfer. The work is also scoped to physics, which limits how far the results can be read.

This is for people building or evaluating tools that clean up AI-generated scientific text. A reader who wants a ready-to-test method and baseline comparisons will get usable material.

It deserves peer review. The method is implementable and the results are stated clearly enough for referees to examine the benchmark construction and generalization questions directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CiteCheck, a hybrid retrieval-plus-LLM framework that retrieves candidate publications from external scholarly sources, applies a structured LLM verifier to compare a citation against retrieved candidates, and maps the resulting scores to Exact/Minor/Major labels. The authors also construct a 982-citation physics benchmark whose positive and negative examples are generated by applying controlled corruptions (subtle metadata drift and fully fabricated references). On a held-out test split, CiteCheck reports 88.7 macro-F1 and 88.9% accuracy, outperforming GPT, Claude, and Gemini baselines (including web-search and few-shot variants). The central claim is that this combination yields reliable detection of citation hallucinations in scientific text.

Significance. If the benchmark distribution is representative, the work supplies a practical, retrieval-grounded method for citation verification that improves on pure LLM prompting. The explicit construction of a controlled benchmark and the head-to-head comparison against multiple LLM variants constitute a concrete, reproducible contribution that other researchers can extend.

major comments (2)

[Benchmark Construction] Benchmark Construction (abstract and §4): the headline 88.7 macro-F1 is obtained on a held-out split of 982 synthetic examples created by a fixed set of controlled corruptions. The manuscript does not report any validation that these corruptions reproduce the error distribution of actual LLM-generated citations (e.g., citing a real but topically unrelated paper, fabricating plausible DOIs that survive retrieval, or mixing author lists across multiple real works). Without such validation the transfer claim to “scientific text” is not yet supported.
[Evaluation] Evaluation (§5): the experimental setup, statistical tests, and potential selection biases in the 982-citation physics benchmark are not described in sufficient detail to allow independent verification of the reported macro-F1 and accuracy figures or to assess whether the held-out split preserves the corruption distribution.

minor comments (2)

[Abstract] The abstract states performance numbers but does not mention the size of the held-out test set or the train/test split ratio; these details should be added for completeness.
[Method] Notation for the three output labels (Exact, Minor, Major) is introduced without an explicit decision rule or threshold table; a small table or pseudocode would improve clarity.
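The kind of explicit decision rule the referee asks for might look like the sketch below. It is purely illustrative: the score range of 0 to 1 and the thresholds `exact_min` and `minor_min` are assumptions, since the material summarized here does not state the paper's actual thresholds.

```python
def score_to_label(score, exact_min=0.9, minor_min=0.5):
    """Map a verifier score in [0, 1] to Exact, Minor, or Major.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if not minor_min <= exact_min:
        raise ValueError("minor_min must not exceed exact_min")
    if score >= exact_min:
        return "Exact"
    if score >= minor_min:
        return "Minor"
    return "Major"


# Example scores for illustration only.
labels = [score_to_label(s) for s in (0.95, 0.7, 0.2)]
```

Stating the rule this way makes the thresholds visible as tunable quantities, which is what a threshold table in the manuscript would document.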

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the planned revisions.

Referee: [Benchmark Construction] Benchmark Construction (abstract and §4): the headline 88.7 macro-F1 is obtained on a held-out split of 982 synthetic examples created by a fixed set of controlled corruptions. The manuscript does not report any validation that these corruptions reproduce the error distribution of actual LLM-generated citations (e.g., citing a real but topically unrelated paper, fabricating plausible DOIs that survive retrieval, or mixing author lists across multiple real works). Without such validation the transfer claim to “scientific text” is not yet supported.

Authors: We agree that the absence of direct validation against real LLM-generated citation errors is a limitation. Our controlled corruptions were designed to isolate specific failure modes (metadata drift and full fabrication), but we did not collect or compare against a corpus of actual LLM outputs with verified hallucinations. In the revised manuscript we will add an explicit discussion of this gap in §4 and the Limitations section, including why constructing such a real-world validation set is non-trivial and outlining it as future work. We do not claim the current benchmark fully replicates real distributions. revision: partial
Referee: [Evaluation] Evaluation (§5): the experimental setup, statistical tests, and potential selection biases in the 982-citation physics benchmark are not described in sufficient detail to allow independent verification of the reported macro-F1 and accuracy figures or to assess whether the held-out split preserves the corruption distribution.

Authors: We will expand §5 with the requested details: the precise splitting procedure and checks confirming preservation of corruption-type distributions; any statistical tests or confidence intervals computed for the metrics; and an analysis of domain-specific biases (physics-only corpus and synthetic generation). We will also release the benchmark dataset and code upon acceptance to support independent verification. revision: yes
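One way a split could preserve corruption-type proportions is stratified sampling, sketched below. This is an assumption about how such a check might be done, not the authors' procedure: the function `stratified_split`, the 20% test fraction, and the toy corruption-type names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(items, key, test_fraction=0.2, seed=0):
    """Split items into (train, test), sampling each key group separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    train, test = [], []
    for group_key in sorted(groups):  # sorted for reproducibility
        members = groups[group_key][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy dataset: 10 examples for each of three hypothetical corruption types.
data = [
    {"id": i, "corruption": kind}
    for i, kind in enumerate(["none", "year_shift", "fabricated"] * 10)
]
train, test = stratified_split(data, key=lambda d: d["corruption"])
test_counts = Counter(d["corruption"] for d in test)
```

Comparing `test_counts` against the full-dataset counts is a simple check that the held-out portion mirrors the corruption distribution.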

Circularity Check

0 steps flagged

No circularity; evaluation uses external retrieval and held-out split of author-constructed benchmark

The paper defines CiteCheck via external scholarly retrieval plus structured LLM comparison, then reports macro-F1 on a held-out test split of its 982-citation benchmark. No equations, parameters, or labels are fitted on the test data and then re-presented as predictions. No self-citations appear in the provided text, let alone load-bearing ones. The benchmark construction (controlled corruptions) is an input to evaluation rather than a derived output that loops back. The reported accuracy is a direct measurement on the held-out portion and does not reduce to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities used in the work.
