Measuring Attribution in Natural Language Generation Models

David Reitter; Dipanjan Das; Gaurav Singh Tomar; Hannah Rashkin; Iulia Turc; Lora Aroyo; Matthew Lamm; Michael Collins; Slav Petrov; Vitaly Nikolaev

arxiv: 2112.12870 · v2 · pith:4VYHAM7Hnew · submitted 2021-12-23 · 💻 cs.CL

keywords: generation, output, evaluation, language, models, natural, dataset, datasets

With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output shares only verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline that allows annotators to evaluate model output appropriately according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies, which suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL 2022-01 accept novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning
cs.CL 2026-06 unverdicted novelty 7.0

WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.

Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA
cs.CL 2026-06 unverdicted novelty 7.0

Re-ranking retrieval candidates via a cross-encoder trained on continuous perturbation-based attribution scores improves citation faithfulness and gold-answer alignment in legal QA over semantic similarity.

ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 unverdicted novelty 6.0

ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.

ZeroSearch: Incentivize the Search Capability of LLMs without Searching
cs.CL 2025-05 conditional novelty 6.0

ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...

LaMDA: Language Models for Dialog Applications
cs.CL 2022-01 unverdicted novelty 6.0

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

How Large Language Models Source Brand Reputation Across Languages and Markets
cs.IR 2026-06 unverdicted novelty 5.0

LLMs cite third-party domains for 85.7% of brand attributions, with Wikipedia dominant in most languages, a long-tailed domain distribution, and market-specific shifts such as YouTube and HR sites in Poland.