Citation Failure: Definition, Analysis and Efficient Mitigation

Iryna Gurevych; Jan Buchmann

arxiv: 2510.20303 · v3 · submitted 2025-10-23 · 💻 cs.CL

Pith reviewed 2026-05-18 05:05 UTC · model grok-4.3

keywords citation failure · RAG systems · LLM citation · benchmark · citation generation · retrieval augmented generation · evidence linking · CITECONTROL

The pith

Citation failures in LLM RAG systems increase with relational complexity between response and evidence but can be substantially reduced by integrating generative, attention-based, and retrieval-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper separates citation failure, where an LLM produces a correct response but does not link it to complete evidence, from response failure where the answer itself is incorrect. The authors build CITECONTROL, a benchmark that alters how a response connects to its evidence to expose when citation problems arise. Experiments reveal that more complex relations cause more failures and that single citation techniques fall short. To fix this, they introduce CITENTION, a framework that merges generative, attention-based, and retrieval-based approaches, delivering clear gains on the benchmark and when moved to new tasks. Readers care because accurate citations let people verify AI claims without first doubting the entire answer.

Core claim

Citation failure occurs when a model generates a helpful response but fails to generate citations to complete evidence, in contrast to response failure where citing complete evidence is impossible because the response itself is flawed. The CITECONTROL benchmark systematically varies the relation between response and evidence to study failure modes, showing that failures increase with relational complexity. The CITENTION framework integrates generative, attention-based, and retrieval-based methods and demonstrates substantial citation improvements on CITECONTROL and in transfer settings.
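The distinction drawn above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Prediction` record, its field names, and the rule that a citation counts as complete when every gold evidence document is cited are hypothetical choices made here, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    # Hypothetical record for one model output (field names are assumptions).
    response_correct: bool   # did the response match the reference answer?
    cited: frozenset         # ids of the documents the model cited
    evidence: frozenset      # ids of the gold evidence documents


def classify(pred: Prediction) -> str:
    """Separate response failure from citation failure."""
    if not pred.response_correct:
        # The response itself is flawed, so citing complete evidence
        # for it is not possible in the first place.
        return "response_failure"
    if pred.evidence <= pred.cited:
        # Every evidence document is cited (extra citations are ignored here).
        return "success"
    # Helpful response, but the citations do not cover the full evidence.
    return "citation_failure"
```

Under this reading, a correct answer that cites only one of two required documents is a citation failure, while any incorrect answer is a response failure regardless of what it cites.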

What carries the argument

CITENTION, a framework that integrates generative, attention-based, and retrieval-based citation methods to address citation failure after response quality is assured.
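The Figure 4 caption reproduced on this page says that per-document scores come from generation-based, attention-based and retrieval-based methods, that attention scores are averaged over heads, and that the method scores are summed, with head weights θ and method weights w, b. A minimal sketch of that combination follows. The linear form `w * score + b` per method, the dictionary inputs, and the top-k selection step are assumptions made for illustration; the source does not spell them out.

```python
def attention_score(head_scores, theta=None):
    """Weighted average of one document's attention scores over heads.

    theta plays the role of the head weights; uniform weights are assumed
    when none are given.
    """
    if theta is None:
        theta = [1.0] * len(head_scores)
    return sum(t * s for t, s in zip(theta, head_scores)) / sum(theta)


def combine(gen, attn_heads, ret, w=(1.0, 1.0, 1.0), b=(0.0, 0.0, 0.0), theta=None):
    """Sum the three method scores per document.

    gen, ret: dict mapping document id -> score.
    attn_heads: dict mapping document id -> list of per-head scores.
    w, b: per-method weights and biases (the linear form is an assumption).
    """
    combined = {}
    for doc in gen:
        parts = (gen[doc], attention_score(attn_heads[doc], theta), ret[doc])
        combined[doc] = sum(wi * p + bi for wi, p, bi in zip(w, parts, b))
    return combined


def top_k(scores, k):
    """Return the k highest-scoring document ids (hypothetical selection step)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With uniform weights this reduces to a plain sum of the three scores, so a document that one method scores low can still be selected when the other two agree on it.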

If this is right

Citation quality drops as the link between a generated response and supporting evidence grows more complex.
Merging multiple citation techniques outperforms any one method alone on controlled tests.
Gains from the integrated framework carry over to tasks outside the original benchmark.
Varying response-evidence relations in a benchmark isolates distinct failure modes that single-method approaches miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems could detect and repair only the citation step without regenerating the main response.
The controlled variation technique could be applied to test other verification challenges such as factuality or source attribution.
Domain-specific adaptations of the framework might handle evidence relations common in legal or scientific queries.

Load-bearing premise

That citation failure can be cleanly separated from response failure in practice and that the synthetic variations in the CITECONTROL benchmark capture the dominant real-world causes of citation problems.

What would settle it

A human study on real RAG queries that finds many supposed citation failures are actually response flaws instead, or an evaluation showing that CITENTION produces no citation gains over single methods on a large set of out-of-domain queries.

Figures

Figures reproduced from arXiv: 2510.20303 by Iryna Gurevych, Jan Buchmann.

**Figure 1.** Citation Example: An LLM receives multiple documents and a question. The confusion matrix shows the possible outcomes for generated response and evidence. The response-evidence relation has reasoning type multi-hop. It is explicit for the response and [4], and implicit for the response and [2].

**Figure 2.** Evaluation strategy on CITECONTROL: For R_k, all predictions are evaluated for evidence recall @k, while for R_k^f, only predictions with correct responses are evaluated.

**Figure 3.** (A) Citation precision decreases with the order of appearance in generation. (B) R…

**Figure 4.** Overview of CITENTION. Left: Individual scores for each document are obtained from generation-based, attention-based and retrieval-based methods. Attention scores are averaged over individual heads. Right: Scores from individual methods are summed to obtain a final citation prediction. Attention head weights θ and method weights w, b are omitted.

**Figure 5.** (A) CITENTION methods improve citation R_k^f scores on CITECONTROL (§6.1). (B) CITENTION methods trained on CITECONTROL improve R_k^f scores on unseen tasks (§6.2). Bars show proportion of answer statements that are attributable to the evidence (averaged over train datasets). Whiskers show standard deviation. For the unaggregated data see Tab. 7.

**Figure 6.** Detailed results for NeoQA. Hop -1/0: R_k^f per hop on multi-hop instances. multi-hop / intersection: average R_k^f for the respective instance type. For analysis and discussion see §4. [Plot omitted: panels show R_k^f per hop (0 to 100) for Llama-3.2-1B, Llama-3.1-8B, Qwen3-1.7B and Qwen3-8B; panel (A) is titled "Recall per Hop on MuSiQue".]

**Figure 7.** Recall per hop on MuSiQue and NeoQA.
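The two scores named in the Figure 2 caption (evidence recall at k over all predictions, and the same recall restricted to predictions with correct responses) can be sketched as below. The instance layout with `ranked`, `evidence` and `response_correct` keys is a hypothetical format chosen for this example, and how the paper picks k is not stated on this page.

```python
def recall_at_k(ranked_docs, evidence, k):
    """Fraction of gold evidence documents found among the top-k cited documents."""
    top = set(ranked_docs[:k])
    return len(top & set(evidence)) / len(evidence)


def r_k(instances, k):
    """Mean evidence recall@k over all predictions."""
    values = [recall_at_k(i["ranked"], i["evidence"], k) for i in instances]
    return sum(values) / len(values)


def r_k_f(instances, k):
    """Mean evidence recall@k over predictions with correct responses only.

    Filtering out incorrect responses removes response failures, so what
    remains below 1.0 reflects citation failure.
    """
    kept = [i for i in instances if i["response_correct"]]
    if not kept:
        return None  # no correct responses, so the filtered score is undefined
    return r_k(kept, k)
```

The gap between the two functions is the point of the filtered score: an instance with a wrong answer can raise or lower the unfiltered recall without saying anything about citation behavior on answers that were right.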

Original abstract

Citations from LLM-based RAG systems are supposed to simplify response verification. However, this goal is undermined in cases of citation failure, where a model generates a helpful response, but fails to generate citations to complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated efficiently. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to enable the analysis of failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To study the efficient improvement of LLM citation, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper cleanly separates citation failure from response failure and supplies a benchmark plus a combined mitigation method, but the response-quality controls across complexity levels need explicit checks.

The paper separates citation failure from response failure in RAG and tests ways to fix the citation part. That distinction matters if you want to improve verifiability without fixing the whole answer first. They introduce CITECONTROL to vary the relation between the generated response and the evidence, such as number of supporting facts or chaining depth. Results indicate citation issues grow with that complexity. Then they propose CITENTION, which pulls together generative citation, attention mechanisms, and retrieval methods. It shows gains on the new benchmark and some transfer settings. Releasing the data and code is a plus for anyone who wants to build on this.

The main question is whether response quality stays steady as the relational complexity increases in CITECONTROL. If harder setups also make the base responses less accurate, the citation problems could be mixed in with general failures. The abstract does not include checks or metrics for that separation, so the full paper needs to show how they kept response correctness constant or measured it.

This work fits for groups focused on making retrieval-augmented generation more reliable, especially where citations need to be trustworthy. The benchmark adds a systematic test that prior work lacked, and the combined approach is worth testing further. It has enough substance to warrant peer review rather than a desk reject. I would send it to referees and ask them to look closely at the response quality controls and the exact scoring of citation success.

Referee Report

1 major / 1 minor

Summary. The paper defines citation failure in LLM-based RAG systems as the generation of a helpful response without citations to complete evidence, explicitly distinguishing this from response failure where the response itself is flawed. It introduces the CITECONTROL benchmark, which systematically varies the relational complexity (e.g., number of supporting facts, chaining depth) between response and evidence to analyze when citation failures occur. Experiments show that failures increase with relational complexity. The authors then propose CITENTION, a framework integrating generative, attention-based, and retrieval-based citation methods, and report substantial citation improvements on CITECONTROL and in transfer settings. Data and code are released publicly.

Significance. If the separation of citation failure from response failure holds under controlled conditions and the reported gains from CITENTION are robust, this work could meaningfully advance reliable, verifiable outputs in retrieval-augmented generation. The public release of data and code is a clear strength that supports reproducibility and community follow-up.

major comments (1)

[CITECONTROL benchmark description and experimental setup] The core premise that citation failures can be cleanly disentangled from response failures and that failures increase specifically due to relational complexity requires explicit validation that response correctness remains constant across CITECONTROL complexity levels. The benchmark description provides no indication of per-instance response-quality controls, accuracy metrics, or inter-annotator agreement confirming that response accuracy does not degrade with added facts, deeper chaining, or logical conjunctions. Without this, the observed failure increase and the gains attributed to combining citation methods could be partly driven by response degradation rather than citation-specific issues.
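The check requested in this comment amounts to reporting response accuracy separately for each complexity level. A minimal sketch, assuming a hypothetical list of `(level, response_correct)` pairs; the level names in the usage example are placeholders, not labels taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_level(records):
    """Response accuracy per complexity level.

    records: iterable of (level, response_correct) pairs (hypothetical format).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, ok in records:
        totals[level] += 1
        correct[level] += int(ok)
    return {level: correct[level] / totals[level] for level in totals}


# Placeholder data: accuracy falling from one level to the next would mean
# that citation failure and response failure are partly confounded.
records = [
    ("single-hop", True),
    ("single-hop", True),
    ("multi-hop", True),
    ("multi-hop", False),
]
print(accuracy_by_level(records))
```

If the per-level accuracies are close to each other, the rise in citation failures can be attributed to the response-evidence relation rather than to weaker responses.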

minor comments (1)

[Abstract and results presentation] The abstract states that failures increase with relational complexity and that CITENTION yields substantial gains, but provides no quantitative metrics, error bars, or details on citation quality scoring; the full results section should include these to allow assessment of effect sizes and statistical reliability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding validation of response correctness in the CITECONTROL benchmark below and will update the manuscript accordingly.

Referee: The core premise that citation failures can be cleanly disentangled from response failures and that failures increase specifically due to relational complexity requires explicit validation that response correctness remains constant across CITECONTROL complexity levels. The benchmark description provides no indication of per-instance response-quality controls, accuracy metrics, or inter-annotator agreement confirming that response accuracy does not degrade with added facts, deeper chaining, or logical conjunctions. Without this, the observed failure increase and the gains attributed to combining citation methods could be partly driven by response degradation rather than citation-specific issues.

Authors: We appreciate the referee highlighting this point. In the CITECONTROL benchmark, responses are constructed directly from the complete supporting evidence provided for each instance, ensuring factual accuracy and helpfulness by design, independent of relational complexity (number of facts, chaining depth, or conjunctions). Citation failure is measured only on instances where the response is correct but citations to the full evidence are missing, consistent with our separation from response failure. To explicitly address the concern and strengthen the paper, we will add a subsection detailing the response generation protocol, automated accuracy verification, and any manual checks or inter-annotator agreement performed during benchmark creation. This will confirm that response correctness remains stable across complexity levels.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark and framework are externally grounded

The paper defines citation failure by explicit contrast to response failure, introduces the CITECONTROL benchmark to vary relational complexity between response and evidence, and evaluates the CITENTION framework via direct experiments on that benchmark plus transfer settings. No derivation step reduces a reported result to a quantity defined by the authors' own prior fitted parameters, self-citation chain, or ansatz smuggled through citation. The central claims rest on observable performance deltas measured against the new benchmark rather than on any self-referential redefinition or renaming of known quantities. The analysis therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that citation quality can be measured independently of response correctness and that synthetic relational variations in a benchmark generalize to real RAG failures. No free parameters or invented physical entities are described.

axioms (1)

domain assumption: Citation failure can be isolated from response failure by construction in the benchmark design.
Source: stated in the abstract as the motivation for disentangling the two phenomena.
