BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
Pith reviewed 2026-05-13 18:11 UTC · model grok-4.3
The pith
Two-stage integration of LLM BibTeX outputs with Zotero and CrossRef records raises field accuracy to 91.5 percent and fully correct entries to 78.3 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier models generate BibTeX entries with 83.6 percent field accuracy across nine fields yet only 50.9 percent of entries are fully correct; a two-stage integration that revises these outputs against deterministic retrieval from the Zotero Translation Server and CrossRef raises field accuracy to 91.5 percent and fully correct entries to 78.3 percent with 0.8 percent regression, while separating search from revision produces larger gains and lower regression than combined single-stage processing.
What carries the argument
clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, used in a two-stage integration to revise baseline LLM-generated entries against authoritative records, with results scored against version-aware ground truth.
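As a reading aid, here is a minimal sketch of the two-stage pattern: a search-enabled model first drafts the entry, then the draft is revised against a deterministically retrieved record. The sketch queries the public CrossRef REST API directly and substitutes fields mechanically; the actual pipeline routes retrieval through clibib (Zotero Translation Server first, CrossRef fallback), and the paper's own revision step is not reproduced here. Function names and the field list are illustrative assumptions.

```python
import requests

SCORED_FIELDS = ["title", "author", "year", "journal", "volume",
                 "pages", "doi", "publisher", "booktitle"]  # illustrative nine fields

def fetch_authoritative_record(doi: str) -> dict:
    """Stage 2a: deterministic retrieval. Queries the public CrossRef REST API;
    clibib would try the Zotero Translation Server first and fall back to CrossRef."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    msg = resp.json()["message"]
    return {
        "title": (msg.get("title") or [""])[0],
        "author": " and ".join(
            f"{a.get('family', '')}, {a.get('given', '')}" for a in msg.get("author", [])
        ),
        "year": str((msg.get("issued", {}).get("date-parts") or [[None]])[0][0] or ""),
        "journal": (msg.get("container-title") or [""])[0],
        "volume": msg.get("volume", ""),
        "pages": msg.get("page", ""),
        "doi": msg.get("DOI", ""),
        "publisher": msg.get("publisher", ""),
        "booktitle": "",  # not populated for journal articles
    }

def revise_entry(llm_entry: dict, record: dict) -> dict:
    """Stage 2b: revise the baseline entry field by field against the retrieved record,
    keeping the LLM's value only where the record has nothing. A deterministic
    stand-in for the paper's revision step."""
    revised = dict(llm_entry)
    for field in SCORED_FIELDS:
        if record.get(field):
            revised[field] = record[field]
    return revised

# Stage 1 (not shown): a search-enabled LLM drafts `llm_entry` for the target paper.
# Stage 2: look up the DOI it proposed and revise the draft against the record, e.g.
#   revised = revise_entry(llm_entry, fetch_authoritative_record(llm_entry["doi"]))
```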
If this is right
- Accuracy improves eight percentage points when baseline LLM entries are revised against authoritative records.
- Fully correct entries rise from 50.9 percent to 78.3 percent with only 0.8 percent regression.
- Separating search from revision yields larger gains and lower regression than single-stage integration.
- Field-error co-occurrence reveals two distinct failure modes: wholesale entry substitution and isolated field errors (see the sketch after this list).
- Accuracy drops 27.7 points for recent post-cutoff papers, showing heavy reliance on parametric memory even when search is available.
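A short sketch of how such a co-occurrence analysis can be run over a boolean per-entry error matrix: count how often pairs of fields fail together, and flag entries whose identity fields fail jointly. The field ordering and the identity-field indices are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

FIELDS = ["title", "author", "year", "journal", "volume",
          "pages", "doi", "publisher", "booktitle"]  # illustrative nine fields

def cooccurrence(errors: np.ndarray) -> np.ndarray:
    """errors: (n_entries, n_fields) boolean matrix, True where a field is wrong.
    Returns P(field j wrong | field i wrong) at position [i, j]."""
    errs = errors.astype(float)
    joint = errs.T @ errs                 # co-failure counts per field pair
    marginal = errs.sum(axis=0)           # per-field failure counts
    return joint / np.maximum(marginal[:, None], 1)

def substitution_like(errors: np.ndarray, identity=(0, 1, 6)) -> np.ndarray:
    """Entries whose identity fields (here title, author, doi) all fail together,
    suggestive of wholesale entry substitution; rows with a single failure
    correspond to the other mode, an isolated field error."""
    return errors[:, list(identity)].all(axis=1)
```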
Where Pith is reading between the lines
- The two-stage revision pattern could apply to other structured outputs such as reference lists or data tables in scientific writing.
- Writing-assistant tools and journal submission systems could embed similar retrieval steps to reduce citation errors at scale.
- The released benchmark enables direct comparison of future models and mitigation techniques on the same 931-paper set.
- Extending the benchmark to additional languages or citation styles would test whether the observed failure modes and mitigation gains generalize.
Load-bearing premise
The version-aware ground truth built from Zotero and CrossRef is accurate and complete for the 931 papers, and the nine-field scoring plus six-way error taxonomy captures real citation-quality needs.
What would settle it
Independent manual inspection of a random sample of the constructed ground-truth entries against the original paper sources, checking for mismatches in any of the nine scored fields.
Original abstract
Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field error. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
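To make the abstract's headline metrics concrete, here is a minimal sketch of field accuracy, fully-correct rate, and a field-level reading of regression rate, computed from per-field correctness labels. The data structures are assumptions, and the paper does not specify here whether regression is counted per field or per entry; the field-level version is one reasonable reading.

```python
from typing import Dict, List

Entry = Dict[str, bool]  # field name -> whether the generated value matched ground truth

def field_accuracy(entries: List[Entry]) -> float:
    """Fraction of all (entry, field) observations that are correct, e.g. 83.6%."""
    flags = [ok for e in entries for ok in e.values()]
    return sum(flags) / len(flags)

def fully_correct_rate(entries: List[Entry]) -> float:
    """Fraction of entries in which every scored field is correct, e.g. 50.9%."""
    return sum(all(e.values()) for e in entries) / len(entries)

def regression_rate(baseline: List[Entry], revised: List[Entry]) -> float:
    """Fraction of (entry, field) observations that were correct at baseline
    but wrong after revision, e.g. the reported 0.8% for two-stage integration."""
    regressed = total = 0
    for b, r in zip(baseline, revised):
        for field, was_ok in b.items():
            total += 1
            if was_ok and not r.get(field, False):
                regressed += 1
    return regressed / total
```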
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates BibTeX entry generation by three search-enabled frontier LLMs on a new benchmark of 931 papers spanning four domains and three citation tiers (popular, low-citation, recent post-cutoff). It reports 83.6% field-level accuracy but only 50.9% fully correct entries, with a 27.7pp accuracy drop for recent papers, and introduces the clibib tool for deterministic retrieval from Zotero Translation Server with CrossRef fallback. A two-stage integration of baseline LLM output with clibib revision raises accuracy to 91.5%, fully correct entries to 78.3%, and keeps regression at 0.8%; an ablation shows separating search from revision outperforms single-stage integration.
Significance. If the evaluation holds, the work supplies a concrete, released benchmark and error taxonomy for citation hallucinations in publishing agents, plus an open-source mitigation whose +8.0pp gain and low regression are tied to integration architecture rather than model scale. The ~23,000 field observations and tiered design that isolates parametric memory from search are useful contributions for the field.
major comments (2)
- [Benchmark Construction] Benchmark Construction section: The version-aware ground truth is built exclusively from Zotero Translation Server + CrossRef; no independent cross-check against other databases or manual verification is reported for the recent post-cutoff tier, where the largest accuracy drop (27.7pp) occurs. This leaves open the possibility that database lag or missing versions inflate the reported gains and understate regression.
- [Error Taxonomy and Scoring] Error Taxonomy and Scoring section: The six-way taxonomy and nine-field scoring contain no explicit category for version-selection or database-lag mismatches. Because clibib retrieves from the identical sources used to construct ground truth, this omission means the evaluation cannot distinguish retrieval fidelity from ground-truth incompleteness.
minor comments (2)
- [Abstract] Abstract: The derivation of the ~23,000 field-level observations should be stated explicitly: a naive count of 931 papers × 9 fields × 3 models gives 25,137, so the exclusions or missing entries that reduce the total to ~23,000 need to be spelled out for readers to verify the count.
- [Results] Table or figure presenting per-tier results: Add confidence intervals or per-model breakdowns to the accuracy and regression numbers to support the cross-tier claims.
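In the spirit of that last minor comment, a minimal percentile-bootstrap sketch for attaching a confidence interval to a per-tier accuracy estimate; the resample count and the variable names in the usage note are illustrative.

```python
import random

def bootstrap_ci(correct_flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an accuracy estimated from boolean correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct_flags) / n, (lo, hi)

# Usage with one boolean per scored field observation in a tier, e.g.
#   acc, (lo, hi) = bootstrap_ci(recent_tier_flags)
```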
Simulated Author's Rebuttal
We thank the referee for these targeted comments on benchmark construction and error taxonomy. Both points highlight important considerations for the reliability of our evaluation, particularly for the recent post-cutoff tier. We address each below and indicate where revisions will be made.
Point-by-point responses
- Referee: [Benchmark Construction] Benchmark Construction section: The version-aware ground truth is built exclusively from Zotero Translation Server + CrossRef; no independent cross-check against other databases or manual verification is reported for the recent post-cutoff tier, where the largest accuracy drop (27.7pp) occurs. This leaves open the possibility that database lag or missing versions inflate the reported gains and understate regression.
Authors: We acknowledge that exclusive reliance on Zotero Translation Server and CrossRef without reported independent verification for the recent tier represents a limitation, as database lag could affect the measured accuracy drop. These sources were selected for their direct provision of structured, version-aware BibTeX records that align with our evaluation needs. In the revised manuscript we will add a limitations paragraph discussing potential coverage gaps for post-cutoff papers and include results from a manual cross-check of a random sample of 50 recent papers against publisher sites and Google Scholar to quantify any discrepancies. revision: partial
- Referee: [Error Taxonomy and Scoring] Error Taxonomy and Scoring section: The six-way taxonomy and nine-field scoring contain no explicit category for version-selection or database-lag mismatches. Because clibib retrieves from the identical sources used to construct ground truth, this omission means the evaluation cannot distinguish retrieval fidelity from ground-truth incompleteness.
Authors: We agree this is a substantive gap: the taxonomy does not isolate version-selection or lag mismatches, and shared sources between clibib and ground truth mean retrieval success cannot be fully separated from source completeness. We will revise the taxonomy section to introduce an explicit 'version mismatch' error category and add a short discussion clarifying that clibib performance is measured relative to the chosen authoritative sources rather than an absolute ground truth. This change will not alter the reported accuracy numbers but will improve interpretability. revision: yes
Circularity Check
No significant circularity in empirical evaluation
full rationale
The paper is a direct empirical evaluation that constructs a benchmark of 931 papers using external Zotero and CrossRef sources, scores model-generated BibTeX entries on nine fields, and measures accuracy gains from integrating the clibib retrieval tool. No load-bearing steps involve mathematical derivations, fitted parameters renamed as predictions, self-definitional claims, or self-citations that justify uniqueness or ansatzes. The reported +8.0pp accuracy lift and 0.8% regression are measured experimental outcomes against an independently sourced ground truth, not reductions by construction. This matches the default expectation for non-circular empirical work with released artifacts.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: large language models with web search are increasingly used in scientific publishing agents