Explicit Evidence Grounding via Structured Inline Citation Generation
Pith reviewed 2026-06-27 21:55 UTC · model grok-4.3
The pith
LLMs can select relevant documents for citations but struggle to identify the precise evidence spans that support each claim.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FullCite produces structured inline citations that link each generated claim to its source document and to the exact evidence span within that document. The three tested strategies—prompt-based generation, constrained decoding over a citation grammar, and posthoc span alignment—are evaluated on ASQA, BioASQ, and ExpertQA along the dimensions of document-level correctness, evidence span identification, and claim-citation faithfulness. Results show reliable document selection but markedly weaker performance on identifying precise supporting spans, leading to the conclusion that faithful attributed QA requires stronger emphasis on span-level grounding.
What carries the argument
FullCite, a framework that generates structured inline citations linking each claim to both its source document and its supporting evidence span via prompt-based, constrained-decoding, or posthoc-alignment strategies.
If this is right
- Achieving faithful attributed QA will require methods that improve precise evidence span identification.
- All three citation-generation strategies face the same core difficulty with span accuracy.
- Future systems should prioritize explicit mechanisms for locating and citing exact supporting passages rather than relying solely on document retrieval.
Where Pith is reading between the lines
- Training or decoding objectives that directly optimize for span boundaries could narrow the observed performance gap.
- The same span-identification weakness is likely to appear in other attribution-heavy tasks such as long-form summarization.
- Automated metrics that score span precision without requiring human annotation would speed up iteration on these methods.
Load-bearing premise
The three chosen QA benchmarks and the three evaluation dimensions are sufficient to reveal the main limitations of current citation strategies.
What would settle it
A method that reaches high accuracy on precise evidence-span identification across the same three benchmarks while maintaining document-level performance would falsify the claim that span identification remains the primary bottleneck.
Figures
read the original abstract
As AI systems become more widely adopted, the demand for factual and faithful generation grows. Properly attributing information through citations becomes, therefore, crucial. This work introduces FullCite, a framework that, in contrast to most previous works, generates structured inline citations linking each claim to both its source document and supporting evidence. FullCite proposes three strategies to inline citation generation: prompt-based generation, constrained decoding over a citation grammar, and posthoc span alignment. Using three question answering benchmarks, namely, ASQA, BioASQ, and ExpertQA, we assess citation quality and faithfulness along three dimensions: document-level correctness, evidence span identification, and claim-citation faithfulness. Our evaluation shows that while LLMs are generally effective at identifying relevant documents, they struggle to identify the precise supporting spans within them. This gap suggests that achieving faithful attributed QA will require research to place greater emphasis on precise evidence span identification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FullCite, a framework for generating structured inline citations that explicitly link each generated claim to both its source document and the precise supporting evidence span. It defines three generation strategies (prompt-based, constrained decoding over a citation grammar, and posthoc span alignment) and evaluates citation quality on the ASQA, BioASQ, and ExpertQA benchmarks along three axes: document-level correctness, evidence span identification, and claim-citation faithfulness. The central empirical claim is that LLMs identify relevant documents effectively but struggle to identify the precise supporting spans.
Significance. If the evaluation is made rigorous with baselines and explicit metrics, the work would usefully document a concrete limitation in current LLM attribution capabilities and motivate targeted research on span-level grounding. The structured citation format itself is a clear contribution over document-only citation.
major comments (3)
- [Evaluation] Evaluation section (and abstract): no baselines are reported for the evidence-span identification task (e.g., sentence retrieval, TF-IDF overlap, or prior citation-generation methods). Without them the headline gap between document-level and span-level performance cannot be distinguished from inherent task difficulty or metric choice.
- [Evaluation] Evaluation section: the concrete scoring function used for "precise supporting spans" (exact match, token-level F1, ROUGE, etc.) and the precise definition of each of the three evaluation dimensions are not stated, rendering the quantitative claim that models "struggle" impossible to interpret or reproduce.
- [§3] §3 (Strategies): the three generation strategies are described at a high level only; no implementation details, grammar definition for constrained decoding, or alignment procedure for the posthoc method are supplied. These details are load-bearing for assessing whether the observed span-identification failures are strategy-specific or fundamental.
minor comments (2)
- [Abstract] Abstract: the sentence "Our evaluation shows…" asserts a quantitative finding but supplies no numbers, tables, or error bars; this should be moved to the results section or qualified.
- Notation: the term "structured inline citation" is used without an explicit formal definition or example early in the paper; a short illustrative figure or example would improve clarity.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We agree that the evaluation section requires strengthening with baselines and explicit metric definitions, and that Section 3 needs additional implementation details. We will revise the manuscript to incorporate these changes, which will make the empirical claims more interpretable and reproducible while preserving the core contribution on structured inline citations.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section (and abstract): no baselines are reported for the evidence-span identification task (e.g., sentence retrieval, TF-IDF overlap, or prior citation-generation methods). Without them the headline gap between document-level and span-level performance cannot be distinguished from inherent task difficulty or metric choice.
Authors: We acknowledge this limitation. In the revised manuscript we will add baselines for evidence-span identification, including TF-IDF-based sentence retrieval, simple overlap methods, and any applicable prior citation-generation approaches. These will be reported alongside the main results to better contextualize the observed gap between document-level and span-level performance. revision: yes
-
Referee: [Evaluation] Evaluation section: the concrete scoring function used for "precise supporting spans" (exact match, token-level F1, ROUGE, etc.) and the precise definition of each of the three evaluation dimensions are not stated, rendering the quantitative claim that models "struggle" impossible to interpret or reproduce.
Authors: We will expand the Evaluation section to explicitly define all three dimensions (document-level correctness, evidence span identification, and claim-citation faithfulness) and state the precise scoring functions used for each (including token-level F1 and any other metrics for span identification). This will ensure full reproducibility of the quantitative results. revision: yes
-
Referee: [§3] §3 (Strategies): the three generation strategies are described at a high level only; no implementation details, grammar definition for constrained decoding, or alignment procedure for the posthoc method are supplied. These details are load-bearing for assessing whether the observed span-identification failures are strategy-specific or fundamental.
Authors: We agree that the current description is insufficiently detailed. The revised version will include the full citation grammar used for constrained decoding, the exact prompt templates for the prompt-based strategy, and the complete procedure (including any similarity thresholds or alignment algorithms) for the posthoc span alignment method, along with all relevant hyperparameters. revision: yes
Circularity Check
No significant circularity: empirical evaluation of proposed citation strategies on external benchmarks.
full rationale
The paper introduces FullCite and three generation strategies, then reports direct empirical measurements on the independent ASQA, BioASQ, and ExpertQA benchmarks across three explicitly defined dimensions. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. The central claim is an observation from the evaluation results rather than a reduction to any self-referential definition or prior author work. The evaluation chain is therefore self-contained against external data.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Scientific data , volume=
BioASQ-QA: A manually curated corpus for Biomedical Question Answering , author=. Scientific data , volume=. 2023 , publisher=
2023
-
[2]
Findings of the Association for Computational Linguistics: NAACL 2025 , pages=
Ground every sentence: Improving retrieval-augmented llms with interleaved reference-claim generation , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=
2025
-
[3]
Findings of the Association for Computational Linguistics: ACL 2025 , pages=
MedCite: Can Language Models Generate Verifiable Text for Medicine? , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=
2025
-
[4]
Proceedings of the 17th International Natural Language Generation Conference , pages=
Towards fine-grained citation evaluation in generated text: A comparative analysis of faithfulness metrics , author=. Proceedings of the 17th International Natural Language Generation Conference , pages=
-
[5]
Nature Communications , volume=
An automated framework for assessing how well LLMs cite relevant medical references , author=. Nature Communications , volume=. 2025 , publisher=
2025
-
[6]
arXiv preprint arXiv:2212.08037 , year=
Attributed question answering: Evaluation and modeling for attributed large language models , author=. arXiv preprint arXiv:2212.08037 , year=
-
[7]
BMC bioinformatics , volume=
An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition , author=. BMC bioinformatics , volume=. 2015 , publisher=
2015
-
[8]
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=
Enabling large language models to generate text with citations , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=
2023
-
[9]
Wallat, Jonas and Heuss, Maria and de Rijke, Maarten and Anand, Avishek , journal=
-
[10]
Stelmakh, Ivan and Luan, Yi and Dhingra, Bhuwan and Chang, Ming-Wei , booktitle=
-
[11]
doi:10.18653/v1/2022.emnlp-main.566
Chaitanya Malaviya and Subin Lee and Sihao Chen and Elizabeth Sieber and Mark Yatskar and Dan Roth , booktitle =. doi:10.18653/v1/2022.emnlp-main.566
-
[12]
arXiv preprint arXiv:2601.03267 , year=
Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=
-
[13]
2025 , eprint=
Qwen3 Technical Report , author=. 2025 , eprint=
2025
-
[14]
https://arxiv.org/abs/2403.08295
Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , url="https://arxiv.org/abs/2403.08295", year=
-
[15]
Text summarization branches out , pages=
Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=
-
[16]
Proceedings of the international multiconference of engineers and computer scientists , volume=
Using of Jaccard coefficient for keywords similarity , author=. Proceedings of the international multiconference of engineers and computer scientists , volume=
-
[17]
Proceedings of the second conference on machine translation , pages=
Popovi. Proceedings of the second conference on machine translation , pages=
-
[18]
Zhang, Jiajie and Bai, Yushi and Lv, Xin and Gu, Wanjun and Liu, Danqing and Zou, Minhao and Cao, Shulin and Hou, Lei and Dong, Yuxiao and Feng, Ling and others , booktitle=
-
[19]
and Li, Shang-Wen and Yih, Wen-Tau , booktitle =
Chuang, Yung-Sung and Cohen-Wang, Benjamin and Shen, Zejiang and Wu, Zhaofeng and Xu, Hu and Lin, Xi Victoria and Glass, James R. and Li, Shang-Wen and Yih, Wen-Tau , booktitle =. 2025 , editor =
2025
-
[20]
arXiv preprint arXiv:2307.09702 , year=
Efficient guided generation for large language models , author=. arXiv preprint arXiv:2307.09702 , year=
-
[21]
arXiv preprint arXiv:2112.09332 , year=
Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=
-
[22]
RARR : Researching and Revising What Language Models Say, Using Language Models
Gao, Luyu and Dai, Zhuyun and Pasupat, Panupong and Chen, Anthony and Chaganty, Arun Tejasvi and Fan, Yicheng and Zhao, Vincent and Lao, Ni and Lee, Hongrae and Juan, Da-Cheng and Guu, Kelvin. RARR : Researching and Revising What Language Models Say, Using Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguis...
-
[23]
Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...
2020
-
[24]
arXiv preprint arXiv:2303.18223 , volume=
A survey of large language models , author=. arXiv preprint arXiv:2303.18223 , volume=
-
[25]
Decision Support Systems , volume=
Know where to go: Make LLM a relevant, responsible, and trustworthy searchers , author=. Decision Support Systems , volume=. 2025 , publisher=
2025
-
[26]
Citation: A Key to Building Responsible and Accountable Large Language Models
Huang, Jie and Chang, Kevin. Citation: A Key to Building Responsible and Accountable Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.31
-
[27]
arXiv preprint arXiv:2311.03731 , year=
A survey of large language models attribution , author=. arXiv preprint arXiv:2311.03731 , year=
-
[28]
Med-HALT : Medical domain hallucination test for large language models
Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan. M ed- HALT : Medical Domain Hallucination Test for Large Language Models. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL). 2023. doi:10.18653/v1/2023.conll-1.21
-
[29]
arXiv preprint arXiv:2311.01463 , year=
Creating trustworthy llms: Dealing with hallucinations in healthcare ai , author=. arXiv preprint arXiv:2311.01463 , year=
-
[30]
Survey of hallucination in natural language generation,
Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Ye Jin and Madotto, Andrea and Fung, Pascale , title =. 2023 , issue_date =. doi:10.1145/3571730 , journal =
-
[31]
arXiv preprint arXiv:2508.15396 , year=
Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models , author=. arXiv preprint arXiv:2508.15396 , year=
-
[32]
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , url =
Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K\". Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , url =. Advances in Neural Information Processing Systems , editor =
-
[33]
Proceedings of the 37th International Conference on Machine Learning , articleno =
Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Ming-Wei , title =. Proceedings of the 37th International Conference on Machine Learning , articleno =. 2020 , publisher =
2020
-
[34]
arXiv preprint arXiv:2312.10997 , volume=
Retrieval-augmented generation for large language models: A survey , author=. arXiv preprint arXiv:2312.10997 , volume=. 2023 , url=
Pith/arXiv arXiv 2023
-
[35]
Evaluating Verifiability in Generative Search Engines
Liu, Nelson and Zhang, Tianyi and Liang, Percy. Evaluating Verifiability in Generative Search Engines. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.467
-
[36]
Measuring Attribution in Natural Language Generation Models
Rashkin, Hannah and Nikolaev, Vitaly and Lamm, Matthew and Aroyo, Lora and Collins, Michael and Das, Dipanjan and Petrov, Slav and Tomar, Gaurav Singh and Turc, Iulia and Reitter, David. Measuring Attribution in Natural Language Generation Models. Computational Linguistics. 2023. doi:10.1162/coli_a_00486
-
[37]
Context-faithful Prompting for Large Language Models
Zhou, Wenxuan and Zhang, Sheng and Poon, Hoifung and Chen, Muhao. Context-faithful Prompting for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.968
-
[38]
Entity-Based Knowledge Conflicts in Question Answering
Longpre, Shayne and Perisetla, Kartik and Chen, Anthony and Ramesh, Nikhil and DuBois, Chris and Singh, Sameer. Entity-Based Knowledge Conflicts in Question Answering. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.565
-
[39]
Xu, Rongwu and Qi, Zehan and Guo, Zhijiang and Wang, Cunxiang and Wang, Hongru and Zhang, Yue and Xu, Wei. Knowledge Conflicts for LLM s: A Survey. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.486
-
[40]
2022 , eprint=
Teaching language models to support answers with verified quotes , author=. 2022 , eprint=
2022
-
[41]
Learning Fine-Grained Grounded Citations for Attributed Large Language Models
Huang, Lei and Feng, Xiaocheng and Ma, Weitao and Gu, Yuxuan and Zhong, Weihong and Feng, Xiachong and Yu, Weijiang and Peng, Weihua and Tang, Duyu and Tu, Dandan and Qin, Bing. Learning Fine-Grained Grounded Citations for Attributed Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.find...
-
[42]
Verifiable Generation with Subsentence-Level Fine-Grained Citations
Cao, Shuyang and Wang, Lu. Verifiable Generation with Subsentence-Level Fine-Grained Citations. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.920
-
[43]
AL ii CE : Evaluating Positional Fine-grained Citation Generation
Xu, Yilong and Gao, Jinhua and Yu, Xiaoming and Bi, Baolong and Shen, Huawei and Cheng, Xueqi. AL ii CE : Evaluating Positional Fine-grained Citation Generation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653...
-
[44]
International Conference on Learning Representations , year=
Recitation-Augmented Language Models , author=. International Conference on Learning Representations , year=
-
[45]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638
-
[46]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
Reimers, Nils and Gurevych, Iryna. Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1410
-
[47]
Robertson, Stephen and Zaragoza, Hugo , title =. Found. Trends Inf. Retr. , month = apr, pages =. 2009 , issue_date =. doi:10.1561/1500000019 , abstract =
-
[48]
and Zhang, Hao and Gonzalez, Joseph E
Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.