Explicit Evidence Grounding via Structured Inline Citation Generation

Amelie W\"uhrl; Anar Yeginbergen; Anna Rogers; Rodrigo Agerri

arxiv: 2606.07130 · v1 · pith:MVWLYXOBnew · submitted 2026-06-05 · 💻 cs.CL

Explicit Evidence Grounding via Structured Inline Citation Generation

Anar Yeginbergen , Amelie W\"uhrl , Anna Rogers , Rodrigo Agerri This is my paper

Pith reviewed 2026-06-27 21:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords inline citationsevidence groundingattributed generationspan identificationquestion answeringLLM evaluation

0 comments

The pith

LLMs can select relevant documents for citations but struggle to identify the precise evidence spans that support each claim.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FullCite, a framework that generates structured inline citations tying each claim to both its source document and the specific supporting evidence inside it. It tests three generation approaches on three question-answering benchmarks and measures performance on document selection, span identification, and faithfulness. The central finding is that models handle document-level choices adequately yet consistently fall short at locating exact supporting passages. This gap points to evidence-span detection as the main remaining obstacle for producing reliably attributed answers.

Core claim

FullCite produces structured inline citations that link each generated claim to its source document and to the exact evidence span within that document. The three tested strategies—prompt-based generation, constrained decoding over a citation grammar, and posthoc span alignment—are evaluated on ASQA, BioASQ, and ExpertQA along the dimensions of document-level correctness, evidence span identification, and claim-citation faithfulness. Results show reliable document selection but markedly weaker performance on identifying precise supporting spans, leading to the conclusion that faithful attributed QA requires stronger emphasis on span-level grounding.

What carries the argument

FullCite, a framework that generates structured inline citations linking each claim to both its source document and its supporting evidence span via prompt-based, constrained-decoding, or posthoc-alignment strategies.

If this is right

Achieving faithful attributed QA will require methods that improve precise evidence span identification.
All three citation-generation strategies face the same core difficulty with span accuracy.
Future systems should prioritize explicit mechanisms for locating and citing exact supporting passages rather than relying solely on document retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training or decoding objectives that directly optimize for span boundaries could narrow the observed performance gap.
The same span-identification weakness is likely to appear in other attribution-heavy tasks such as long-form summarization.
Automated metrics that score span precision without requiring human annotation would speed up iteration on these methods.

Load-bearing premise

The three chosen QA benchmarks and the three evaluation dimensions are sufficient to reveal the main limitations of current citation strategies.

What would settle it

A method that reaches high accuracy on precise evidence-span identification across the same three benchmarks while maintaining document-level performance would falsify the claim that span identification remains the primary bottleneck.

Figures

Figures reproduced from arXiv: 2606.07130 by Amelie W\"uhrl, Anar Yeginbergen, Anna Rogers, Rodrigo Agerri.

**Figure 2.** Figure 2: Overall quality comparison of citations across settings and datasets. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Position of the cited document in BioASQ. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 5.** Figure 5: Ratio of unique cited documents. 3. LLM and Human evaluation. We use LLMas-a-Judge and human evaluation to evaluate the same set of examples in order to assess how LLMs align or diverge in the citation evaluation task by asking two questions: (Q1): Does the citation support the claim that is being cited? and (Q2) Is the citation relevant to the claim?. The answer is provided on a scale between 1 and 5. T… view at source ↗

**Figure 4.** Figure 4: Average scores for Citation Support (Q1) and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 7.** Figure 7: Overall citation rate per method [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Quality of chosen citation under different [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Citation rate for each question type under each setting [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Citation support and relevance performance over the full set with GPT-5.4 [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

read the original abstract

As AI systems become more widely adopted, the demand for factual and faithful generation grows. Properly attributing information through citations becomes, therefore, crucial. This work introduces FullCite, a framework that, in contrast to most previous works, generates structured inline citations linking each claim to both its source document and supporting evidence. FullCite proposes three strategies to inline citation generation: prompt-based generation, constrained decoding over a citation grammar, and posthoc span alignment. Using three question answering benchmarks, namely, ASQA, BioASQ, and ExpertQA, we assess citation quality and faithfulness along three dimensions: document-level correctness, evidence span identification, and claim-citation faithfulness. Our evaluation shows that while LLMs are generally effective at identifying relevant documents, they struggle to identify the precise supporting spans within them. This gap suggests that achieving faithful attributed QA will require research to place greater emphasis on precise evidence span identification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FullCite names a framework for span-linked citations and shows LLMs retrieve docs better than exact spans, but the evaluation has no baselines so the gap size is unclear.

read the letter

The main point is that this paper defines FullCite to generate inline citations that tie each claim to both a source document and a specific evidence span inside it. They test three generation approaches—prompting, constrained decoding with a grammar, and post-hoc alignment—on ASQA, BioASQ, and ExpertQA, and report that models handle document selection reasonably but do worse on pinpointing the supporting text.

The work is useful because it isolates evidence localization as a distinct sub-problem in attributed generation. The three strategies give a practical way to compare prompting versus more structured methods, and the three evaluation axes (document correctness, span ID, claim faithfulness) are reasonable dimensions to track.

The soft spot is the missing baselines. The headline result on span identification difficulty would be stronger if the paper had compared against simple methods like sentence-level TF-IDF overlap or prior citation-generation techniques. Without those, it is hard to separate model limitations from task difficulty or metric choice. The abstract also leaves the exact span scoring function unspecified, which makes the numbers harder to interpret.

This is for people working on faithful LLM output and citation mechanisms. A reader who cares about high-stakes attribution would find the gap worth thinking about, even if the methods need more comparison.

It deserves peer review because it surfaces a concrete, actionable weakness in current systems. The authors should expect questions on baselines and metric details, but the core framing is worth referee time.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FullCite, a framework for generating structured inline citations that explicitly link each generated claim to both its source document and the precise supporting evidence span. It defines three generation strategies (prompt-based, constrained decoding over a citation grammar, and posthoc span alignment) and evaluates citation quality on the ASQA, BioASQ, and ExpertQA benchmarks along three axes: document-level correctness, evidence span identification, and claim-citation faithfulness. The central empirical claim is that LLMs identify relevant documents effectively but struggle to identify the precise supporting spans.

Significance. If the evaluation is made rigorous with baselines and explicit metrics, the work would usefully document a concrete limitation in current LLM attribution capabilities and motivate targeted research on span-level grounding. The structured citation format itself is a clear contribution over document-only citation.

major comments (3)

[Evaluation] Evaluation section (and abstract): no baselines are reported for the evidence-span identification task (e.g., sentence retrieval, TF-IDF overlap, or prior citation-generation methods). Without them the headline gap between document-level and span-level performance cannot be distinguished from inherent task difficulty or metric choice.
[Evaluation] Evaluation section: the concrete scoring function used for "precise supporting spans" (exact match, token-level F1, ROUGE, etc.) and the precise definition of each of the three evaluation dimensions are not stated, rendering the quantitative claim that models "struggle" impossible to interpret or reproduce.
[§3] §3 (Strategies): the three generation strategies are described at a high level only; no implementation details, grammar definition for constrained decoding, or alignment procedure for the posthoc method are supplied. These details are load-bearing for assessing whether the observed span-identification failures are strategy-specific or fundamental.

minor comments (2)

[Abstract] Abstract: the sentence "Our evaluation shows…" asserts a quantitative finding but supplies no numbers, tables, or error bars; this should be moved to the results section or qualified.
Notation: the term "structured inline citation" is used without an explicit formal definition or example early in the paper; a short illustrative figure or example would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We agree that the evaluation section requires strengthening with baselines and explicit metric definitions, and that Section 3 needs additional implementation details. We will revise the manuscript to incorporate these changes, which will make the empirical claims more interpretable and reproducible while preserving the core contribution on structured inline citations.

read point-by-point responses

Referee: [Evaluation] Evaluation section (and abstract): no baselines are reported for the evidence-span identification task (e.g., sentence retrieval, TF-IDF overlap, or prior citation-generation methods). Without them the headline gap between document-level and span-level performance cannot be distinguished from inherent task difficulty or metric choice.

Authors: We acknowledge this limitation. In the revised manuscript we will add baselines for evidence-span identification, including TF-IDF-based sentence retrieval, simple overlap methods, and any applicable prior citation-generation approaches. These will be reported alongside the main results to better contextualize the observed gap between document-level and span-level performance. revision: yes
Referee: [Evaluation] Evaluation section: the concrete scoring function used for "precise supporting spans" (exact match, token-level F1, ROUGE, etc.) and the precise definition of each of the three evaluation dimensions are not stated, rendering the quantitative claim that models "struggle" impossible to interpret or reproduce.

Authors: We will expand the Evaluation section to explicitly define all three dimensions (document-level correctness, evidence span identification, and claim-citation faithfulness) and state the precise scoring functions used for each (including token-level F1 and any other metrics for span identification). This will ensure full reproducibility of the quantitative results. revision: yes
Referee: [§3] §3 (Strategies): the three generation strategies are described at a high level only; no implementation details, grammar definition for constrained decoding, or alignment procedure for the posthoc method are supplied. These details are load-bearing for assessing whether the observed span-identification failures are strategy-specific or fundamental.

Authors: We agree that the current description is insufficiently detailed. The revised version will include the full citation grammar used for constrained decoding, the exact prompt templates for the prompt-based strategy, and the complete procedure (including any similarity thresholds or alignment algorithms) for the posthoc span alignment method, along with all relevant hyperparameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical evaluation of proposed citation strategies on external benchmarks.

full rationale

The paper introduces FullCite and three generation strategies, then reports direct empirical measurements on the independent ASQA, BioASQ, and ExpertQA benchmarks across three explicitly defined dimensions. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. The central claim is an observation from the evaluation results rather than a reduction to any self-referential definition or prior author work. The evaluation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5687 in / 990 out tokens · 17930 ms · 2026-06-27T21:55:40.370643+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 16 canonical work pages

[1]

Scientific data , volume=

BioASQ-QA: A manually curated corpus for Biomedical Question Answering , author=. Scientific data , volume=. 2023 , publisher=

2023
[2]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Ground every sentence: Improving retrieval-augmented llms with interleaved reference-claim generation , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

2025
[3]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

MedCite: Can Language Models Generate Verifiable Text for Medicine? , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

2025
[4]

Proceedings of the 17th International Natural Language Generation Conference , pages=

Towards fine-grained citation evaluation in generated text: A comparative analysis of faithfulness metrics , author=. Proceedings of the 17th International Natural Language Generation Conference , pages=
[5]

Nature Communications , volume=

An automated framework for assessing how well LLMs cite relevant medical references , author=. Nature Communications , volume=. 2025 , publisher=

2025
[6]

arXiv preprint arXiv:2212.08037 , year=

Attributed question answering: Evaluation and modeling for attributed large language models , author=. arXiv preprint arXiv:2212.08037 , year=

arXiv
[7]

BMC bioinformatics , volume=

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition , author=. BMC bioinformatics , volume=. 2015 , publisher=

2015
[8]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Enabling large language models to generate text with citations , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[9]

Wallat, Jonas and Heuss, Maria and de Rijke, Maarten and Anand, Avishek , journal=
[10]

Stelmakh, Ivan and Luan, Yi and Dhingra, Bhuwan and Chang, Ming-Wei , booktitle=
[11]

doi:10.18653/v1/2022.emnlp-main.566

Chaitanya Malaviya and Subin Lee and Sihao Chen and Elizabeth Sieber and Mark Yatskar and Dan Roth , booktitle =. doi:10.18653/v1/2022.emnlp-main.566

work page doi:10.18653/v1/2022.emnlp-main.566 2022
[12]

arXiv preprint arXiv:2601.03267 , year=

Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=

Pith/arXiv arXiv
[13]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[14]

https://arxiv.org/abs/2403.08295

Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , url="https://arxiv.org/abs/2403.08295", year=

Pith/arXiv arXiv
[15]

Text summarization branches out , pages=

Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=
[16]

Proceedings of the international multiconference of engineers and computer scientists , volume=

Using of Jaccard coefficient for keywords similarity , author=. Proceedings of the international multiconference of engineers and computer scientists , volume=
[17]

Proceedings of the second conference on machine translation , pages=

Popovi. Proceedings of the second conference on machine translation , pages=
[18]

Zhang, Jiajie and Bai, Yushi and Lv, Xin and Gu, Wanjun and Liu, Danqing and Zou, Minhao and Cao, Shulin and Hou, Lei and Dong, Yuxiao and Feng, Ling and others , booktitle=
[19]

and Li, Shang-Wen and Yih, Wen-Tau , booktitle =

Chuang, Yung-Sung and Cohen-Wang, Benjamin and Shen, Zejiang and Wu, Zhaofeng and Xu, Hu and Lin, Xi Victoria and Glass, James R. and Li, Shang-Wen and Yih, Wen-Tau , booktitle =. 2025 , editor =

2025
[20]

arXiv preprint arXiv:2307.09702 , year=

Efficient guided generation for large language models , author=. arXiv preprint arXiv:2307.09702 , year=

Pith/arXiv arXiv
[21]

arXiv preprint arXiv:2112.09332 , year=

Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

Pith/arXiv arXiv
[22]

RARR : Researching and Revising What Language Models Say, Using Language Models

Gao, Luyu and Dai, Zhuyun and Pasupat, Panupong and Chen, Anthony and Chaganty, Arun Tejasvi and Fan, Yicheng and Zhao, Vincent and Lao, Ni and Lee, Hongrae and Juan, Da-Cheng and Guu, Kelvin. RARR : Researching and Revising What Language Models Say, Using Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguis...

work page doi:10.18653/v1/2023.acl-long.910 2023
[23]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

2020
[24]

arXiv preprint arXiv:2303.18223 , volume=

A survey of large language models , author=. arXiv preprint arXiv:2303.18223 , volume=

Pith/arXiv arXiv
[25]

Decision Support Systems , volume=

Know where to go: Make LLM a relevant, responsible, and trustworthy searchers , author=. Decision Support Systems , volume=. 2025 , publisher=

2025
[26]

Citation: A Key to Building Responsible and Accountable Large Language Models

Huang, Jie and Chang, Kevin. Citation: A Key to Building Responsible and Accountable Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.31

work page doi:10.18653/v1/2024.findings-naacl.31 2024
[27]

arXiv preprint arXiv:2311.03731 , year=

A survey of large language models attribution , author=. arXiv preprint arXiv:2311.03731 , year=

arXiv
[28]

Med-HALT : Medical domain hallucination test for large language models

Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan. M ed- HALT : Medical Domain Hallucination Test for Large Language Models. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL). 2023. doi:10.18653/v1/2023.conll-1.21

work page doi:10.18653/v1/2023.conll-1.21 2023
[29]

arXiv preprint arXiv:2311.01463 , year=

Creating trustworthy llms: Dealing with hallucinations in healthcare ai , author=. arXiv preprint arXiv:2311.01463 , year=

arXiv
[30]

Survey of hallucination in natural language generation,

Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Ye Jin and Madotto, Andrea and Fung, Pascale , title =. 2023 , issue_date =. doi:10.1145/3571730 , journal =

work page doi:10.1145/3571730 2023
[31]

arXiv preprint arXiv:2508.15396 , year=

Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models , author=. arXiv preprint arXiv:2508.15396 , year=

Pith/arXiv arXiv
[32]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , url =

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K\". Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , url =. Advances in Neural Information Processing Systems , editor =
[33]

Proceedings of the 37th International Conference on Machine Learning , articleno =

Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Ming-Wei , title =. Proceedings of the 37th International Conference on Machine Learning , articleno =. 2020 , publisher =

2020
[34]

arXiv preprint arXiv:2312.10997 , volume=

Retrieval-augmented generation for large language models: A survey , author=. arXiv preprint arXiv:2312.10997 , volume=. 2023 , url=

Pith/arXiv arXiv 2023
[35]

Evaluating Verifiability in Generative Search Engines

Liu, Nelson and Zhang, Tianyi and Liang, Percy. Evaluating Verifiability in Generative Search Engines. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.467

work page doi:10.18653/v1/2023.findings-emnlp.467 2023
[36]

Measuring Attribution in Natural Language Generation Models

Rashkin, Hannah and Nikolaev, Vitaly and Lamm, Matthew and Aroyo, Lora and Collins, Michael and Das, Dipanjan and Petrov, Slav and Tomar, Gaurav Singh and Turc, Iulia and Reitter, David. Measuring Attribution in Natural Language Generation Models. Computational Linguistics. 2023. doi:10.1162/coli_a_00486

work page doi:10.1162/coli_a_00486 2023
[37]

Context-faithful Prompting for Large Language Models

Zhou, Wenxuan and Zhang, Sheng and Poon, Hoifung and Chen, Muhao. Context-faithful Prompting for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.968

work page doi:10.18653/v1/2023.findings-emnlp.968 2023
[38]

Entity-Based Knowledge Conflicts in Question Answering

Longpre, Shayne and Perisetla, Kartik and Chen, Anthony and Ramesh, Nikhil and DuBois, Chris and Singh, Sameer. Entity-Based Knowledge Conflicts in Question Answering. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.565

work page doi:10.18653/v1/2021.emnlp-main.565 2021
[39]

Knowledge Conflicts for

Xu, Rongwu and Qi, Zehan and Guo, Zhijiang and Wang, Cunxiang and Wang, Hongru and Zhang, Yue and Xu, Wei. Knowledge Conflicts for LLM s: A Survey. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.486

work page doi:10.18653/v1/2024.emnlp-main.486 2024
[40]

2022 , eprint=

Teaching language models to support answers with verified quotes , author=. 2022 , eprint=

2022
[41]

Learning Fine-Grained Grounded Citations for Attributed Large Language Models

Huang, Lei and Feng, Xiaocheng and Ma, Weitao and Gu, Yuxuan and Zhong, Weihong and Feng, Xiachong and Yu, Weijiang and Peng, Weihua and Tang, Duyu and Tu, Dandan and Qin, Bing. Learning Fine-Grained Grounded Citations for Attributed Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.find...

work page doi:10.18653/v1/2024.findings-acl.838 2024
[42]

Verifiable Generation with Subsentence-Level Fine-Grained Citations

Cao, Shuyang and Wang, Lu. Verifiable Generation with Subsentence-Level Fine-Grained Citations. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.920

work page doi:10.18653/v1/2024.findings-acl.920 2024
[43]

AL ii CE : Evaluating Positional Fine-grained Citation Generation

Xu, Yilong and Gao, Jinhua and Yu, Xiaoming and Bi, Baolong and Shen, Huawei and Cheng, Xueqi. AL ii CE : Evaluating Positional Fine-grained Citation Generation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653...

work page doi:10.18653/v1/2025.naacl-long.23 2025
[44]

International Conference on Learning Representations , year=

Recitation-Augmented Language Models , author=. International Conference on Learning Representations , year=
[45]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024
[46]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

Reimers, Nils and Gurevych, Iryna. Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1410

work page doi:10.18653/v1/d19-1410 2019
[47]

Robertson, Stephen and Zaragoza, Hugo , title =. Found. Trends Inf. Retr. , month = apr, pages =. 2009 , issue_date =. doi:10.1561/1500000019 , abstract =

work page doi:10.1561/1500000019 2009
[48]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

2023

[1] [1]

Scientific data , volume=

BioASQ-QA: A manually curated corpus for Biomedical Question Answering , author=. Scientific data , volume=. 2023 , publisher=

2023

[2] [2]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Ground every sentence: Improving retrieval-augmented llms with interleaved reference-claim generation , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

2025

[3] [3]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

MedCite: Can Language Models Generate Verifiable Text for Medicine? , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

2025

[4] [4]

Proceedings of the 17th International Natural Language Generation Conference , pages=

Towards fine-grained citation evaluation in generated text: A comparative analysis of faithfulness metrics , author=. Proceedings of the 17th International Natural Language Generation Conference , pages=

[5] [5]

Nature Communications , volume=

An automated framework for assessing how well LLMs cite relevant medical references , author=. Nature Communications , volume=. 2025 , publisher=

2025

[6] [6]

arXiv preprint arXiv:2212.08037 , year=

Attributed question answering: Evaluation and modeling for attributed large language models , author=. arXiv preprint arXiv:2212.08037 , year=

arXiv

[7] [7]

BMC bioinformatics , volume=

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition , author=. BMC bioinformatics , volume=. 2015 , publisher=

2015

[8] [8]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Enabling large language models to generate text with citations , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023

[9] [9]

Wallat, Jonas and Heuss, Maria and de Rijke, Maarten and Anand, Avishek , journal=

[10] [10]

Stelmakh, Ivan and Luan, Yi and Dhingra, Bhuwan and Chang, Ming-Wei , booktitle=

[11] [11]

doi:10.18653/v1/2022.emnlp-main.566

Chaitanya Malaviya and Subin Lee and Sihao Chen and Elizabeth Sieber and Mark Yatskar and Dan Roth , booktitle =. doi:10.18653/v1/2022.emnlp-main.566

work page doi:10.18653/v1/2022.emnlp-main.566 2022

[12] [12]

arXiv preprint arXiv:2601.03267 , year=

Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=

Pith/arXiv arXiv

[13] [13]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[14] [14]

https://arxiv.org/abs/2403.08295

Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , url="https://arxiv.org/abs/2403.08295", year=

Pith/arXiv arXiv

[15] [15]

Text summarization branches out , pages=

Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

[16] [16]

Proceedings of the international multiconference of engineers and computer scientists , volume=

Using of Jaccard coefficient for keywords similarity , author=. Proceedings of the international multiconference of engineers and computer scientists , volume=

[17] [17]

Proceedings of the second conference on machine translation , pages=

Popovi. Proceedings of the second conference on machine translation , pages=

[18] [18]

Zhang, Jiajie and Bai, Yushi and Lv, Xin and Gu, Wanjun and Liu, Danqing and Zou, Minhao and Cao, Shulin and Hou, Lei and Dong, Yuxiao and Feng, Ling and others , booktitle=

[19] [19]

and Li, Shang-Wen and Yih, Wen-Tau , booktitle =

Chuang, Yung-Sung and Cohen-Wang, Benjamin and Shen, Zejiang and Wu, Zhaofeng and Xu, Hu and Lin, Xi Victoria and Glass, James R. and Li, Shang-Wen and Yih, Wen-Tau , booktitle =. 2025 , editor =

2025

[20] [20]

arXiv preprint arXiv:2307.09702 , year=

Efficient guided generation for large language models , author=. arXiv preprint arXiv:2307.09702 , year=

Pith/arXiv arXiv

[21] [21]

arXiv preprint arXiv:2112.09332 , year=

Webgpt: Browser-assisted question-answering with human feedback , author=. arXiv preprint arXiv:2112.09332 , year=

Pith/arXiv arXiv

[22] [22]

RARR : Researching and Revising What Language Models Say, Using Language Models

Gao, Luyu and Dai, Zhuyun and Pasupat, Panupong and Chen, Anthony and Chaganty, Arun Tejasvi and Fan, Yicheng and Zhao, Vincent and Lao, Ni and Lee, Hongrae and Juan, Da-Cheng and Guu, Kelvin. RARR : Researching and Revising What Language Models Say, Using Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguis...

work page doi:10.18653/v1/2023.acl-long.910 2023

[23] [23]

Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel M. and Wu, Jeffrey and W...

2020

[24] [24]

arXiv preprint arXiv:2303.18223 , volume=

A survey of large language models , author=. arXiv preprint arXiv:2303.18223 , volume=

Pith/arXiv arXiv

[25] [25]

Decision Support Systems , volume=

Know where to go: Make LLM a relevant, responsible, and trustworthy searchers , author=. Decision Support Systems , volume=. 2025 , publisher=

2025

[26] [26]

Citation: A Key to Building Responsible and Accountable Large Language Models

Huang, Jie and Chang, Kevin. Citation: A Key to Building Responsible and Accountable Large Language Models. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.31

work page doi:10.18653/v1/2024.findings-naacl.31 2024

[27] [27]

arXiv preprint arXiv:2311.03731 , year=

A survey of large language models attribution , author=. arXiv preprint arXiv:2311.03731 , year=

arXiv

[28] [28]

Med-HALT : Medical domain hallucination test for large language models

Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan. M ed- HALT : Medical Domain Hallucination Test for Large Language Models. Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL). 2023. doi:10.18653/v1/2023.conll-1.21

work page doi:10.18653/v1/2023.conll-1.21 2023

[29] [29]

arXiv preprint arXiv:2311.01463 , year=

Creating trustworthy llms: Dealing with hallucinations in healthcare ai , author=. arXiv preprint arXiv:2311.01463 , year=

arXiv

[30] [30]

Survey of hallucination in natural language generation,

Ji, Ziwei and Lee, Nayeon and Frieske, Rita and Yu, Tiezheng and Su, Dan and Xu, Yan and Ishii, Etsuko and Bang, Ye Jin and Madotto, Andrea and Fung, Pascale , title =. 2023 , issue_date =. doi:10.1145/3571730 , journal =

work page doi:10.1145/3571730 2023

[31] [31]

arXiv preprint arXiv:2508.15396 , year=

Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models , author=. arXiv preprint arXiv:2508.15396 , year=

Pith/arXiv arXiv

[32] [32]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , url =

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K\". Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , url =. Advances in Neural Information Processing Systems , editor =

[33] [33]

Proceedings of the 37th International Conference on Machine Learning , articleno =

Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Ming-Wei , title =. Proceedings of the 37th International Conference on Machine Learning , articleno =. 2020 , publisher =

2020

[34] [34]

arXiv preprint arXiv:2312.10997 , volume=

Retrieval-augmented generation for large language models: A survey , author=. arXiv preprint arXiv:2312.10997 , volume=. 2023 , url=

Pith/arXiv arXiv 2023

[35] [35]

Evaluating Verifiability in Generative Search Engines

Liu, Nelson and Zhang, Tianyi and Liang, Percy. Evaluating Verifiability in Generative Search Engines. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.467

work page doi:10.18653/v1/2023.findings-emnlp.467 2023

[36] [36]

Measuring Attribution in Natural Language Generation Models

Rashkin, Hannah and Nikolaev, Vitaly and Lamm, Matthew and Aroyo, Lora and Collins, Michael and Das, Dipanjan and Petrov, Slav and Tomar, Gaurav Singh and Turc, Iulia and Reitter, David. Measuring Attribution in Natural Language Generation Models. Computational Linguistics. 2023. doi:10.1162/coli_a_00486

work page doi:10.1162/coli_a_00486 2023

[37] [37]

Context-faithful Prompting for Large Language Models

Zhou, Wenxuan and Zhang, Sheng and Poon, Hoifung and Chen, Muhao. Context-faithful Prompting for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.968

work page doi:10.18653/v1/2023.findings-emnlp.968 2023

[38] [38]

Entity-Based Knowledge Conflicts in Question Answering

Longpre, Shayne and Perisetla, Kartik and Chen, Anthony and Ramesh, Nikhil and DuBois, Chris and Singh, Sameer. Entity-Based Knowledge Conflicts in Question Answering. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.565

work page doi:10.18653/v1/2021.emnlp-main.565 2021

[39] [39]

Knowledge Conflicts for

Xu, Rongwu and Qi, Zehan and Guo, Zhijiang and Wang, Cunxiang and Wang, Hongru and Zhang, Yue and Xu, Wei. Knowledge Conflicts for LLM s: A Survey. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.486

work page doi:10.18653/v1/2024.emnlp-main.486 2024

[40] [40]

2022 , eprint=

Teaching language models to support answers with verified quotes , author=. 2022 , eprint=

2022

[41] [41]

Learning Fine-Grained Grounded Citations for Attributed Large Language Models

Huang, Lei and Feng, Xiaocheng and Ma, Weitao and Gu, Yuxuan and Zhong, Weihong and Feng, Xiachong and Yu, Weijiang and Peng, Weihua and Tang, Duyu and Tu, Dandan and Qin, Bing. Learning Fine-Grained Grounded Citations for Attributed Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.find...

work page doi:10.18653/v1/2024.findings-acl.838 2024

[42] [42]

Verifiable Generation with Subsentence-Level Fine-Grained Citations

Cao, Shuyang and Wang, Lu. Verifiable Generation with Subsentence-Level Fine-Grained Citations. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.920

work page doi:10.18653/v1/2024.findings-acl.920 2024

[43] [43]

AL ii CE : Evaluating Positional Fine-grained Citation Generation

Xu, Yilong and Gao, Jinhua and Yu, Xiaoming and Bi, Baolong and Shen, Huawei and Cheng, Xueqi. AL ii CE : Evaluating Positional Fine-grained Citation Generation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653...

work page doi:10.18653/v1/2025.naacl-long.23 2025

[44] [44]

International Conference on Learning Representations , year=

Recitation-Augmented Language Models , author=. International Conference on Learning Representations , year=

[45] [45]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024

[46] [46]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

Reimers, Nils and Gurevych, Iryna. Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1410

work page doi:10.18653/v1/d19-1410 2019

[47] [47]

Robertson, Stephen and Zaragoza, Hugo , title =. Found. Trends Inf. Retr. , month = apr, pages =. 2009 , issue_date =. doi:10.1561/1500000019 , abstract =

work page doi:10.1561/1500000019 2009

[48] [48]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2023 , publisher =

2023