arxiv: 2112.12870 · v2 · pith:4VYHAM7Hnew · submitted 2021-12-23 · 💻 cs.CL

Measuring Attribution in Natural Language Generation Models

classification 💻 cs.CL
keywords generation, output, evaluation, language, models, natural, dataset, datasets

With recent improvements in natural language generation (NLG) models for various applications, it has become imperative to have the means to identify and evaluate whether NLG output is only sharing verifiable information about the external world. In this work, we present a new evaluation framework entitled Attributable to Identified Sources (AIS) for assessing the output of natural language generation models, when such output pertains to the external world. We first define AIS and introduce a two-stage annotation pipeline for allowing annotators to appropriately evaluate model output according to AIS guidelines. We empirically validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset) via human evaluation studies that suggest that AIS could serve as a common framework for measuring whether model-generated statements are supported by underlying sources. We release guidelines for the human evaluation studies.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

    cs.CL 2026-06 unverdicted novelty 7.0

    WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.

  3. Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

    cs.CL 2026-06 unverdicted novelty 7.0

    Re-ranking retrieval candidates via a cross-encoder trained on continuous perturbation-based attribution scores improves citation faithfulness and gold-answer alignment in legal QA over semantic similarity.

  4. ZeroSearch: Incentivize the Search Capability of LLMs without Searching

    cs.CL 2025-05 unverdicted novelty 6.0

    ZeroSearch uses supervised fine-tuning to create a simulated retrieval module and curriculum-based RL rollouts that degrade document quality to train LLMs on search capabilities without real search API calls.

  5. ZeroSearch: Incentivize the Search Capability of LLMs without Searching

    cs.CL 2025-05 conditional novelty 6.0

    ZeroSearch simulates search engine interactions via supervised fine-tuning of a retrieval module and curriculum-based RL degradation of document quality, achieving comparable or superior performance to real search eng...

  6. LaMDA: Language Models for Dialog Applications

    cs.CL 2022-01 unverdicted novelty 6.0

    LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

  7. How Large Language Models Source Brand Reputation Across Languages and Markets

    cs.IR 2026-06 unverdicted novelty 5.0

    LLMs cite third-party domains for 85.7% of brand attributions, with Wikipedia dominant in most languages, a long-tailed domain distribution, and market-specific shifts such as YouTube and HR sites in Poland.