Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models
Pith reviewed 2026-05-18 22:11 UTC · model grok-4.3
The pith
A survey of 134 papers creates a unified taxonomy for evidence-based text generation with large language models that relies on citations, attribution, and quotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By systematically reviewing 134 papers, the authors establish a unified taxonomy of evidence-based text generation with large language models that focuses on methods using citations, attribution, or quotations, while surveying 300 evaluation metrics in seven key dimensions to describe distinctive characteristics, representative methods, open challenges, and promising future directions.
What carries the argument
The unified taxonomy that classifies approaches to evidence-based text generation relying on citations, attribution, or quotations.
If this is right
- Researchers gain a common structure for organizing and comparing work on verifiable language model outputs.
- Evaluation can draw on the surveyed set of 300 metrics across seven dimensions for more consistent testing.
- Open challenges highlighted in the survey can direct future efforts toward improving traceability.
- Representative methods become easier to identify and extend in new systems.
Where Pith is reading between the lines
- The taxonomy could support creation of shared benchmarks that measure how effectively different methods link outputs to evidence.
- It might clarify links to related problems such as reducing unverified claims in language model responses.
- Later surveys in nearby areas like automated fact checking could adapt the same organizational structure.
- Testing the taxonomy against papers published after this review would check whether it remains comprehensive.
Load-bearing premise
The 134 selected papers sufficiently represent the field and the taxonomy organizes all relevant approaches without major gaps or selection biases.
What would settle it
A follow-up review that identifies many papers or methods on the topic that fall outside the proposed taxonomy categories would show the survey is incomplete.
Figures
read the original abstract
The increasing adoption of large language models (LLMs) has raised serious concerns about their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a systematic survey of evidence-based text generation with large language models, focusing on approaches that employ citations, attribution, or quotations. It analyzes 134 papers, proposes a unified taxonomy, catalogs and investigates 300 evaluation metrics across seven dimensions, reviews representative methods and their characteristics, and identifies open challenges along with future research directions.
Significance. If the paper selection and synthesis hold, this work provides a useful consolidation of a fragmented area by offering a shared taxonomy and an extensive catalog of metrics. The scale of the review (134 papers and 300 metrics) and the explicit scoping to citation/attribution/quotation methods constitute a concrete organizational contribution that can help standardize evaluation practices and highlight gaps in trustworthy LLM generation.
major comments (1)
- [§2] §2 (Survey Methodology): the description of the literature search strategy, databases, keywords, time range, and inclusion/exclusion criteria is insufficiently detailed to support the claim of a systematic analysis of 134 papers. Reproducibility and assessment of coverage bias require explicit reporting of these steps.
minor comments (2)
- [Abstract] The abstract states that metrics are examined 'across seven key dimensions' but does not enumerate them; adding the list would improve immediate readability.
- [Figures 2 and 4] Figure captions and axis labels in the taxonomy overview and metric summary figures should be expanded to be self-contained, as some rely on abbreviations defined only in the main text.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation for minor revision. We are encouraged by the recognition of the survey's scope and organizational contributions. We address the single major comment below and will incorporate the requested details in the revised manuscript.
read point-by-point responses
-
Referee: [§2] §2 (Survey Methodology): the description of the literature search strategy, databases, keywords, time range, and inclusion/exclusion criteria is insufficiently detailed to support the claim of a systematic analysis of 134 papers. Reproducibility and assessment of coverage bias require explicit reporting of these steps.
Authors: We thank the referee for highlighting this point. Section 2 currently provides a high-level overview of our paper selection process leading to the 134 papers, but we agree that greater specificity is needed for full reproducibility. In the revised manuscript, we will expand §2 with explicit details on: the databases and repositories searched (arXiv, Google Scholar, ACL Anthology, and NeurIPS/ICLR/EMNLP proceedings); the keyword combinations and Boolean queries employed (e.g., “large language model” AND (“citation” OR “attribution” OR “quotation” OR “evidence-based generation”)); the time range (primarily 2018–2024, with emphasis on post-2022 works); and the inclusion/exclusion criteria (must involve LLMs for text generation, incorporate explicit evidence mechanisms such as citations/attributions/quotations, be in English, and report empirical results; exclude non-LLM methods, purely theoretical papers without generation components, and works focused solely on retrieval without generation). These additions will enable readers to evaluate coverage and potential biases while preserving the integrity of the taxonomy and the analysis of 300 metrics. revision: yes
Circularity Check
No significant circularity
full rationale
This is a literature survey paper that systematically reviews 134 external papers, proposes a taxonomy for evidence-based LLM text generation via citations/attribution/quotations, and catalogs 300 metrics across seven dimensions. The contribution consists of descriptive synthesis and organization of existing work rather than any derivations, equations, fitted parameters, or predictions. No load-bearing step reduces to the paper's own inputs by construction, self-citation, or ansatz smuggling. The scoping and methodology are explicitly stated as a review process drawing on independent sources, rendering the analysis self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The field of evidence-based text generation with LLMs is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
focus on approaches that use citations, attribution, or quotations for evidence-based text generation
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
Enforcing sentence-level citations degrades LLM attribution quality by 16-276% versus paragraph-level, with larger models penalized more due to disrupted semantic synthesis.
Reference graph
Works this paper leans on
-
[1]
Attribute or abstain: Large language models as long document assistants. In Pro- ceedings of the 2024 Conference on Empiri- cal Methods in Natural Language Processing , pages 8113–8140, Miami, Florida, USA. Asso- ciation for Computational Linguistics. Courtni Byun, Piper Vasicek, and Kevin Seppi
work page 2024
-
[2]
This reference does not exist: An explo- ration of LLM citation accuracy and relevance. In Proceedings of the Third Workshop on Bridg- ing Human–Computer Interaction and Natu- ral Language Processing, pages 28–39, Mexico City, Mexico. Association for Computational Linguistics. Anthony Chen, Panupong Pasupat, Sameer Singh, Hongrae Lee, and Kelvin Guu. 2023...
work page 2023
-
[3]
CORAL: Benchmarking multi-turn con- versational retrieval-augmented generation. In Findings of the Association for Computational Linguistics: NAACL 2025 , pages 1308–1330, Albuquerque, New Mexico. Association for Computational Linguistics. Zheng Chu, Jingchang Chen, Zhongjie Wang, Guo Tang, Qianglong Chen, Ming Liu, and Bing Qin. 2025. Towards faithful mu...
work page 2025
-
[4]
EWEK-QA : Enhanced web and efficient knowledge graph retrieval for citation-based question answering systems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14169–14187, Bangkok, Thailand. Association for Computational Linguistics. Haolin Deng, Chang Wang, Li Xin, Dezhang Yuan, Jun...
work page 2024
-
[5]
In Interna- tional Conference on Learning Representations
Wizard of wikipedia: Knowledge- powered conversational agents. In Interna- tional Conference on Learning Representations. Hyo Jin Do, Rachel Ostrand, Justin D. Weisz, Casey Dugan, Prasanna Sattigeri, Dennis Wei, Keerthiram Murugesan, and Werner Geyer
-
[6]
Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter
Facilitating human-llm collaboration through factuality scores and source attribu- tions. Nouha Dziri, Hannah Rashkin, Tal Linzen, and David Reitter. 2022. Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computa- tional Linguistics, 10:1066–1083. Shahul Es, Jithin James, Luis Espinosa Anke, and Steven S...
work page 2022
-
[7]
ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 3558–3567, Florence, Italy. Association for Computational Linguistics. Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A survey on rag meeting llms: Towards retri...
work page 2024
-
[8]
HGOT: Hierarchical graph of thoughts for retrieval-augmented in-context learning in factuality evaluation. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024) , pages 118–144, Mexico City, Mexico. Association for Compu- tational Linguistics. Michael Färber and Adam Jatowt. 2020. Cita- tion recommendation: approach...
-
[9]
RAG-RewardBench: Benchmarking re- ward models in retrieval augmented generation for preference alignment. In Findings of the As- sociation for Computational Linguistics: ACL 2025, pages 17061–17090, Vienna, Austria. As- sociation for Computational Linguistics. Kristiina Jokinen. 2024. The need for grounding in LLM-based dialogue systems. In Proceed- ings ...
work page 2025
-
[10]
From matching to generation: A sur- vey on generative information retrieval. ACM Trans. Inf. Syst., 43(3). Xinze Li, Yixin Cao, Liangming Pan, Yubo Ma, and Aixin Sun. 2024d. Towards verifiable gen- eration: A benchmark for knowledge-aware language model attribution. In Findings of the Association for Computational Linguistics: ACL 2024, pages 493–516, Ban...
work page 2024
-
[11]
On learning text style transfer with di- rect rewards. In Proceedings of the 2021 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, pages 4262–4273, Online. Association for Computational Linguis- tics. Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC:...
work page 2021
-
[12]
ExpertQA: Expert-curated questions and attributed answers. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguis- tics: Human Language Technologies (Volume 1: Long Papers) , pages 3025–3045, Mexico City, Mexico. Association for Computational Linguistics. Alex Mallen, Akari Asai, Victor Zhong, Ra- ...
work page 2024
-
[13]
FActScore: Fine-grained atomic eval- uation of factual precision in long form text generation. In Proceedings of the 2023 Con- ference on Empirical Methods in Natural Lan- guage Processing, pages 12076–12100, Singa- pore. Association for Computational Linguis- tics. Mazda Moayeri, Elham Tabassi, and Soheil Feizi
work page 2023
-
[14]
Worldbench: Quantifying geographic disparities in llm factual recall. In Proceed- ings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, page 1211–1228, New York, NY , USA. Associ- ation for Computing Machinery. Benjamin Muller, John Wieting, Jonathan Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Soares, Roee Aharoni, J...
work page 2024
-
[15]
Gpt-4 technical report. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Chris- tiano, Jan Leike, and Ryan Lowe. 2022. Train- ing language models to foll...
work page 2022
-
[16]
Towards improved multi-source attribu- tion for long-form answer generation. In Pro- ceedings of the 2024 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Tech- nologies (Volume 1: Long Papers), pages 3906– 3919, Mexico City, Mexico. Association for Computational Linguistics. Vinzent Penzkofer and...
work page 2024
-
[17]
Association for Computational Linguistics
Are large language model temporally grounded? In Proceedings of the 2024 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: Long Papers), pages 7064–7083, Mexico City, Mex- ico. Association for Computational Linguistics. Pritika Ramu, Koustava Goswami, Apoorv Sax- ena, and ...
work page 2024
-
[18]
A comprehensive survey of hallucination in large language, image, video and audio foun- dation models. In Findings of the Association for Computational Linguistics: EMNLP 2024 , pages 11709–11724, Miami, Florida, USA. As- sociation for Computational Linguistics. Phillip Schneider, Tim Schopf, Juraj Vladika, Mikhail Galkin, Elena Simperl, and Florian Matth...
work page 2024
-
[19]
Adaptive question answering: Enhanc- ing language model proficiency for addressing knowledge conflicts with source citations. In Proceedings of the 2024 Conference on Empir- ical Methods in Natural Language Processing , pages 17226–17239, Miami, Florida, USA. As- sociation for Computational Linguistics. Jiajun Shen, Tong Zhou, Yubo Chen, and Kang Liu. 202...
work page 2024
-
[20]
Effective large language model adapta- tion for improved grounding and citation gener- ation. In Proceedings of the 2024 Conference of the North American Chapter of the Associ- ation for Computational Linguistics: Human Language Technologies (Volume 1: Long Pa- pers), pages 6237–6251, Mexico City, Mexico. Association for Computational Linguistics. Wenhao ...
work page 2024
-
[21]
A survey of knowledge-enhanced text generation. ACM Comput. Surv., 54(11s). Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, and Juanzi Li. 2025a. LongCite: Enabling LLMs to generate fine-grained citations in long-context QA. In Findings of the Association for Computational Linguistics: ACL ...
work page 2025
-
[22]
Chatgpt hallucinates when attributing answers. In Proceedings of the Annual Inter- national ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region , SIGIR-AP ’23, page 46–51, New York, NY , USA. Association for Computing Machinery. A Supplementary Material The appendix provides supplementary material supportin...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.