BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

2026 · cs.DL · arXiv 2604.03159

1 Pith paper cites this work.

abstract

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field error. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
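The gap between field-level accuracy (83.6%) and fully correct entries (50.9%) follows from how the two metrics aggregate: one error in any field disqualifies the whole entry. A minimal Python sketch of that distinction follows. It is not the authors' scoring code. The field list, the `normalize` helper, and exact-match comparison are all assumptions made for illustration; the abstract says nine fields are scored but does not name them.

```python
from typing import Dict, List, Tuple

# Hypothetical subset of fields; the abstract mentions nine but does not list them.
FIELDS = ["title", "author", "year"]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(value.lower().split())


def score_entry(generated: Dict[str, str], truth: Dict[str, str]) -> Dict[str, bool]:
    """Return a per-field correctness map for one generated entry (exact match, assumed)."""
    return {
        field: normalize(generated.get(field, "")) == normalize(truth.get(field, ""))
        for field in FIELDS
    }


def summarize(scores: List[Dict[str, bool]]) -> Tuple[float, float]:
    """Return (field-level accuracy, share of entries with every field correct)."""
    total_fields = sum(len(s) for s in scores)
    correct_fields = sum(sum(s.values()) for s in scores)
    fully_correct = sum(all(s.values()) for s in scores)
    return correct_fields / total_fields, fully_correct / len(scores)


# Toy data, invented for illustration only.
truth = {"title": "A Study of Things", "author": "Doe, Jane", "year": "2024"}
generated = [
    {"title": "A study of things", "author": "Doe, Jane", "year": "2024"},  # all fields match
    {"title": "A Study of Things", "author": "Doe, Jane", "year": "2023"},  # wrong year
]

scores = [score_entry(entry, truth) for entry in generated]
field_accuracy, fully_correct_rate = summarize(scores)
print(f"field accuracy: {field_accuracy:.3f}, fully correct: {fully_correct_rate:.3f}")
```

In this toy case five of six fields are right, yet only one of two entries is fully correct, which is the same pattern the abstract reports at scale.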

representative citing papers

Phantom References: Hallucinated Citations That Survive Peer Review at Top-Tier Conferences

cs.DL · 2026-07-01 · conditional · novelty 7.0

Empirical audit finds hallucinated citations in roughly 5% of 2025 NeurIPS and USENIX Security papers, with post-ChatGPT increases and failures even in award papers.
