
arxiv: 2510.20303 · v3 · submitted 2025-10-23 · 💻 cs.CL

Citation Failure: Definition, Analysis and Efficient Mitigation

Pith reviewed 2026-05-18 05:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords citation failure · RAG systems · LLM citation · benchmark · citation generation · retrieval augmented generation · evidence linking · CITECONTROL

The pith

Citation failures in LLM RAG systems increase with relational complexity between response and evidence but can be substantially reduced by integrating generative, attention-based, and retrieval-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper separates citation failure, where an LLM produces a correct response but does not link it to complete evidence, from response failure, where the answer itself is incorrect. The authors build CITECONTROL, a benchmark that varies how a response connects to its evidence in order to expose when citation problems arise. Experiments reveal that more complex relations cause more failures and that single citation techniques fall short. To address this, they introduce CITENTION, a framework that merges generative, attention-based, and retrieval-based approaches, delivering clear gains on the benchmark and in transfer to new tasks. Readers care because accurate citations let people verify AI claims without first doubting the entire answer.

Core claim

Citation failure occurs when a model generates a helpful response but fails to generate citations to complete evidence, in contrast to response failure where citing complete evidence is impossible because the response itself is flawed. The CITECONTROL benchmark systematically varies the relation between response and evidence to study failure modes, showing that failures increase with relational complexity. The CITENTION framework integrates generative, attention-based, and retrieval-based methods and demonstrates substantial citation improvements on CITECONTROL and in transfer settings.
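
As a rough illustration of the distinction drawn above, the following Python sketch sorts a single prediction into one of three outcomes. The `Outcome` enum, the `classify` function, and the reading of "complete evidence" as "every gold evidence document is cited" are assumptions made here for illustration; they are not taken from the paper's released code.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "correct response, complete citations"
    CITATION_FAILURE = "correct response, incomplete citations"
    RESPONSE_FAILURE = "flawed response"


def classify(response_correct: bool, cited: set, evidence: set) -> Outcome:
    """Assign one prediction to an outcome category.

    Assumption: citations are complete when every gold evidence
    document appears among the cited documents.
    """
    if not response_correct:
        # A flawed response cannot be backed by complete evidence,
        # so its citations are not inspected at all.
        return Outcome.RESPONSE_FAILURE
    if evidence <= cited:
        return Outcome.SUCCESS
    return Outcome.CITATION_FAILURE


# Hypothetical example: gold evidence is documents 2 and 4.
print(classify(True, {4}, {2, 4}).name)      # response is right, document 2 is missing
print(classify(False, {2, 4}, {2, 4}).name)  # response is wrong, citations are ignored
```

Checking response correctness first is the design choice that matters here: it keeps flawed responses out of the citation statistics, which is the separation the core claim depends on.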

What carries the argument

CITENTION, a framework that integrates generative, attention-based, and retrieval-based citation methods to address citation failure after response quality is assured.
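
The Figure 4 caption describes the combination step in outline: each method yields one score per document, attention scores are averaged over heads, and the per-method scores are summed into a final citation prediction using method weights w and b. The sketch below is one possible reading of that description, not the authors' implementation. The function names, the uniform head average, the hand-set weights, and the threshold rule for turning scores into citations are all assumptions.

```python
def mean_over_heads(head_scores):
    """Average per-document attention scores over heads.

    head_scores[h][d] is the score head h gives document d.
    Assumption: heads are weighted uniformly (no learned head weights).
    """
    n_heads = len(head_scores)
    return [sum(column) / n_heads for column in zip(*head_scores)]


def combine(method_scores, w, b):
    """Weighted sum of per-document scores across methods, plus a bias b."""
    n_docs = len(next(iter(method_scores.values())))
    return [
        b + sum(w[name] * scores[d] for name, scores in method_scores.items())
        for d in range(n_docs)
    ]


def predict_citations(scores, threshold=0.0):
    """Cite every document whose combined score exceeds the threshold (an assumed rule)."""
    return [d for d, s in enumerate(scores) if s > threshold]


# Hypothetical scores for three documents.
attention = mean_over_heads([[0.1, 0.7, 0.2], [0.3, 0.5, 0.2]])
method_scores = {
    "generation": [0.0, 1.0, 0.0],  # 1.0 if the model generated a citation to the document
    "attention": attention,
    "retrieval": [0.2, 0.9, 0.4],
}
weights = {"generation": 1.0, "attention": 1.0, "retrieval": 1.0}  # hand-set, not learned
combined = combine(method_scores, weights, b=-1.0)
print(predict_citations(combined))
```

In this toy setup only the second document clears the threshold, because all three methods agree on it. Weights fitted on training data would take the place of the hand-set values.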

If this is right

  • Citation quality drops as the link between a generated response and supporting evidence grows more complex.
  • Merging multiple citation techniques outperforms any one method alone on controlled tests.
  • Gains from the integrated framework carry over to tasks outside the original benchmark.
  • Varying response-evidence relations in a benchmark isolates distinct failure modes that single-method approaches miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems could detect and repair only the citation step without regenerating the main response.
  • The controlled variation technique could be applied to test other verification challenges such as factuality or source attribution.
  • Domain-specific adaptations of the framework might handle evidence relations common in legal or scientific queries.

Load-bearing premise

That citation failure can be cleanly separated from response failure in practice and that the synthetic variations in the CITECONTROL benchmark capture the dominant real-world causes of citation problems.

What would settle it

A human study on real RAG queries that finds many supposed citation failures are actually response flaws instead, or an evaluation showing that CITENTION produces no citation gains over single methods on a large set of out-of-domain queries.

Figures

Figures reproduced from arXiv: 2510.20303 by Iryna Gurevych, Jan Buchmann.

Figure 1. Citation Example: An LLM receives multiple documents and a question. The confusion matrix shows the possible outcomes for generated response and evidence. The response-evidence relation has reasoning type multi-hop. It is explicit for the response and [4], and implicit for the response and [2].
Figure 2. Evaluation strategy on CITECONTROL: For Rk, all predictions are evaluated for evidence recall@k, while for Rkf, only predictions with correct responses are evaluated.
Figure 3. (A) Citation precision decreases with the order of appearance in generation. (B) R…
Figure 4. Overview of CITENTION. Left: Individual scores for each document are obtained from generation-based, attention-based and retrieval-based methods. Attention scores are averaged over individual heads. Right: Scores from individual methods are summed to obtain a final citation prediction. Attention head weights θ and method weights w, b are omitted.
Figure 5. (A) CITENTION methods improve citation Rkf scores on CITECONTROL (§6.1). (B) CITENTION methods trained on CITECONTROL improve Rkf scores on unseen tasks (§6.2). Bars show proportion of answer statements that are attributable to the evidence (averaged over train datasets). Whiskers show standard deviation. For the unaggregated data see Tab. 7.
Figure 6. Detailed results for NeoQA. Hop -1/0: Rkf per hop on multi-hop instances. multi-hop / intersection: average Rkf for the respective instance type. For analysis and discussion see §4.
Figure 7. Recall per hop on MuSiQue and NeoQA.
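
The evaluation strategy in the Figure 2 caption can be sketched in a few lines of Python: Rk averages evidence recall@k over all predictions, while its filtered counterpart uses only predictions whose response is correct. The record layout (`ranked_docs`, `evidence`, `response_correct`), the function names, and the use of a plain mean are assumptions made for this sketch; the paper's own scoring code may differ.

```python
from typing import Optional


def recall_at_k(ranked_docs, evidence, k):
    """Fraction of gold evidence documents found among the top-k ranked documents."""
    if not evidence:
        raise ValueError("evidence set must not be empty")
    return len(evidence & set(ranked_docs[:k])) / len(evidence)


def evaluate(predictions, k):
    """Return (Rk, filtered Rk).

    Rk uses every prediction; the filtered score uses only predictions
    with a correct response and is None when there are none.
    """
    scores = [recall_at_k(p["ranked_docs"], p["evidence"], k) for p in predictions]
    kept = [s for s, p in zip(scores, predictions) if p["response_correct"]]
    rk = sum(scores) / len(scores)
    rk_filtered: Optional[float] = sum(kept) / len(kept) if kept else None
    return rk, rk_filtered


# Hypothetical predictions: documents are ranked by citation score, best first.
predictions = [
    {"ranked_docs": [4, 2, 7], "evidence": {2, 4}, "response_correct": True},
    {"ranked_docs": [4, 7, 2], "evidence": {2, 4}, "response_correct": True},
    {"ranked_docs": [1, 3, 5], "evidence": {2}, "response_correct": False},
]
rk, rk_filtered = evaluate(predictions, k=2)
print(rk, rk_filtered)  # 0.5 0.75
```

The gap between the two numbers comes entirely from the third prediction, whose response is wrong. Filtering it out keeps a response failure from being counted as a citation failure.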
Original abstract

Citations from LLM-based RAG systems are supposed to simplify response verification. However, this goal is undermined in cases of citation failure, where a model generates a helpful response, but fails to generate citations to complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated efficiently. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to enable the analysis of failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To study the efficient improvement of LLM citation, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper defines citation failure in LLM-based RAG systems as the generation of a helpful response without citations to complete evidence, explicitly distinguishing this from response failure where the response itself is flawed. It introduces the CITECONTROL benchmark, which systematically varies the relational complexity (e.g., number of supporting facts, chaining depth) between response and evidence to analyze when citation failures occur. Experiments show that failures increase with relational complexity. The authors then propose CITENTION, a framework integrating generative, attention-based, and retrieval-based citation methods, and report substantial citation improvements on CITECONTROL and in transfer settings. Data and code are released publicly.

Significance. If the separation of citation failure from response failure holds under controlled conditions and the reported gains from CITENTION are robust, this work could meaningfully advance reliable, verifiable outputs in retrieval-augmented generation. The public release of data and code is a clear strength that supports reproducibility and community follow-up.

major comments (1)
  1. [CITECONTROL benchmark description and experimental setup] The core premise that citation failures can be cleanly disentangled from response failures and that failures increase specifically due to relational complexity requires explicit validation that response correctness remains constant across CITECONTROL complexity levels. The benchmark description provides no indication of per-instance response-quality controls, accuracy metrics, or inter-annotator agreement confirming that response accuracy does not degrade with added facts, deeper chaining, or logical conjunctions. Without this, the observed failure increase and the gains attributed to combining citation methods could be partly driven by response degradation rather than citation-specific issues.
minor comments (1)
  1. [Abstract and results presentation] The abstract states that failures increase with relational complexity and that CITENTION yields substantial gains, but provides no quantitative metrics, error bars, or details on citation quality scoring; the full results section should include these to allow assessment of effect sizes and statistical reliability.
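
The check requested in the major comment could be run as a per-level tabulation of response accuracy before any citation metric is computed. The sketch below assumes each benchmark instance carries a complexity label and a response-correctness flag; the field names `level` and `response_correct` and the level labels are hypothetical.

```python
from collections import defaultdict


def accuracy_by_level(instances):
    """Response accuracy for each complexity level."""
    counts = defaultdict(lambda: [0, 0])  # level -> [number correct, number total]
    for inst in instances:
        counts[inst["level"]][0] += int(inst["response_correct"])
        counts[inst["level"]][1] += 1
    return {level: correct / total for level, (correct, total) in counts.items()}


# Hypothetical instances with two complexity levels.
instances = [
    {"level": "single-hop", "response_correct": True},
    {"level": "single-hop", "response_correct": True},
    {"level": "multi-hop", "response_correct": True},
    {"level": "multi-hop", "response_correct": False},
]
print(accuracy_by_level(instances))  # {'single-hop': 1.0, 'multi-hop': 0.5}
```

If accuracy falls as complexity rises, as in this toy data, citation scores computed over all predictions would mix the two failure types, which is the confound the referee describes.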

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding validation of response correctness in the CITECONTROL benchmark below and will update the manuscript accordingly.

point-by-point responses
  1. Referee: The core premise that citation failures can be cleanly disentangled from response failures and that failures increase specifically due to relational complexity requires explicit validation that response correctness remains constant across CITECONTROL complexity levels. The benchmark description provides no indication of per-instance response-quality controls, accuracy metrics, or inter-annotator agreement confirming that response accuracy does not degrade with added facts, deeper chaining, or logical conjunctions. Without this, the observed failure increase and the gains attributed to combining citation methods could be partly driven by response degradation rather than citation-specific issues.

    Authors: We appreciate the referee highlighting this point. In the CITECONTROL benchmark, responses are constructed directly from the complete supporting evidence provided for each instance, ensuring factual accuracy and helpfulness by design, independent of relational complexity (number of facts, chaining depth, or conjunctions). Citation failure is measured only on instances where the response is correct but citations to the full evidence are missing, consistent with our separation from response failure. To explicitly address the concern and strengthen the paper, we will add a subsection detailing the response generation protocol, automated accuracy verification, and any manual checks or inter-annotator agreement performed during benchmark creation. This will confirm that response correctness remains stable across complexity levels.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark and framework are externally grounded

full rationale

The paper defines citation failure by explicit contrast to response failure, introduces the CITECONTROL benchmark to vary relational complexity between response and evidence, and evaluates the CITENTION framework via direct experiments on that benchmark plus transfer settings. No derivation step reduces a reported result to a quantity defined by the authors' own prior fitted parameters, self-citation chain, or ansatz smuggled through citation. The central claims rest on observable performance deltas measured against the new benchmark rather than on any self-referential redefinition or renaming of known quantities. The analysis therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that citation quality can be measured independently of response correctness and that synthetic relational variations in a benchmark generalize to real RAG failures. No free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: Citation failure can be isolated from response failure by construction in the benchmark design.
    Stated in the abstract as the motivation for disentangling the two phenomena.

pith-pipeline@v0.9.0 · 5722 in / 1106 out tokens · 24940 ms · 2026-05-18T05:05:37.087574+00:00 · methodology

