NLG Evaluation: Past, Present, Future

Ehud Reiter

arxiv: 2605.23715 · v1 · pith:WOBTOO2Vnew · submitted 2026-05-22 · 💻 cs.CL

Pith reviewed 2026-05-25 04:14 UTC · model grok-4.3

classification 💻 cs.CL

keywords NLG evaluation · natural language generation · evaluation methods · machine learning · impact evaluation · safety evaluation · qualitative evaluation

The pith

NLG evaluation has evolved from minimal formal testing in the linguistics era to a core requirement today, and will increasingly emphasize impact, qualitative, and safety aspects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces how natural language generation evaluation practices have transformed over more than three decades. In the linguistics-dominated era around 1990, formal experiments were rare, but as NLG became tied to machine learning, rigorous evaluation became standard. The author expects this evolution to continue, with greater attention to how systems affect users in practice, their qualitative performance, and safety issues, because NLG tools are now used by large numbers of people. A sympathetic reader would care because the choice of evaluation methods determines whether widely deployed systems help or harm their users.

Core claim

Natural Language Generation evaluation has changed dramatically since 1990, when there was very little formal experimental evaluation in the modern sense due to its close ties to linguistics. By 2026, with close links to machine learning, experimental evaluation is expected and fundamental. Many techniques have been developed, including LLM-as-Judge, and future priorities will include impact, qualitative, and safety evaluation as NLG technology sees routine use by many people.

What carries the argument

The shift in NLG's primary connections from linguistics to machine learning, which has driven and will continue to drive changes in what kinds of evaluation are prioritized.

If this is right

Evaluation will move beyond automatic metrics to include assessments of real-world user impact.
Qualitative methods will play a larger role in understanding system performance.
Safety evaluations will become essential to prevent harms from widely deployed NLG systems.
Research practices will adapt to reflect these new priorities in experimental design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation may require more human-subject studies and long-term monitoring of deployed systems.
This shift could lead to new standards or regulations for NLG applications in public-facing tools.
Similar evaluation changes might appear in other generative AI areas such as image or code generation.

Load-bearing premise

The assumption that the shift from linguistics-linked NLG to machine-learning-linked NLG will continue and that this will make impact, qualitative, and safety evaluation the dominant future priorities.

What would settle it

A continued dominance of automatic metrics or linguistic analyses in NLG papers even after widespread public deployment, or a reversal back to linguistics-focused evaluation methods.

Original abstract

Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a short historical survey of NLG evaluation trends by a long-time researcher, with forward-looking opinions but no new data or methods.

Reiter's paper is a short essay on how NLG evaluation moved from almost no formal experiments in the linguistics-linked era around 1990 to a required part of ML-based work today, plus a prediction that impact, qualitative, and safety evaluation will grow as the technology reaches more users. The author draws on personal experience to lay out that timeline in plain terms, which gives some useful context on why methods like LLM-as-Judge emerged when they did. That framing is the main thing the piece offers. It organizes known developments without claiming to solve open problems.

The soft spots are straightforward. There are no new experiments, datasets, or derivations, so the forward claims rest on the assumption that the ML linkage will keep driving priorities without much argument or counter-consideration. That is typical for this kind of survey, but it means the paper does not test or strengthen any specific technical position. The historical narrative is descriptive rather than evidence-heavy.

This is mainly for people already in or entering the NLG evaluation discussion who want a compact reminder of the field's shifts. It could help frame reading groups or student overviews, but it will not change how anyone designs an experiment or measures a system.

I would send it to peer review as a perspective piece. A senior author with decades in the area can still contribute by organizing the record, even when the novelty is low and the predictions stay at the level of informed expectation.

Referee Report

1 major / 1 minor

Summary. The manuscript surveys the evolution of NLG evaluation practices, contrasting the limited formal experimental evaluation in 1990 (when NLG was closely tied to linguistics) with the expected centrality of such evaluation in 2026 (when NLG is closely linked to machine learning). It notes the development of various techniques over this period, including LLM-as-Judge, and predicts that impact, qualitative, and safety evaluations will grow in importance as NLG sees routine use by large numbers of people.

Significance. If the historical narrative is accurate, the paper offers a compact synthesis of how evaluation norms in NLG have shifted with the field's methodological orientation. Its forward-looking claim could help orient researchers toward emerging priorities, though the absence of new data, controlled comparisons, or formal arguments means its contribution is primarily reflective rather than generative.

major comments (1)

[Abstract] Abstract: the prediction that impact, qualitative, and safety evaluation 'will become more important' rests on the unexamined premise that the linguistics-to-ML shift 'will continue'; no supporting usage statistics, deployment case studies, or discussion of countervailing trends are supplied to ground this load-bearing forward claim.

minor comments (1)

The manuscript would benefit from explicit citations to representative papers or evaluation protocols from the 1990s versus the 2020s to make the claimed 'dramatic change' more concrete and verifiable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our reflective survey. We address the major comment below.

Referee: [Abstract] Abstract: the prediction that impact, qualitative, and safety evaluation 'will become more important' rests on the unexamined premise that the linguistics-to-ML shift 'will continue'; no supporting usage statistics, deployment case studies, or discussion of countervailing trends are supplied to ground this load-bearing forward claim.

Authors: We agree that the forward claim in the abstract is an extrapolation from the historical pattern documented in the paper rather than a data-supported forecast. The manuscript is a short reflective synthesis and does not contain usage statistics, deployment studies, or analysis of counter-trends. To address the concern, we will revise the abstract to make the conditional nature of the prediction explicit (i.e., 'assuming the current methodological orientation persists') and add a brief clause acknowledging that alternative trajectories remain possible.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; survey and opinion piece with no derivations

The paper is a historical survey plus forward-looking opinion on NLG evaluation trends. It contains no equations, fitted parameters, formal derivations, or load-bearing self-citations that reduce any claim to its own inputs by construction. The central statements are descriptive timelines and an expectation that impact/qualitative/safety evaluation will rise in importance; these are presented as premises and observations rather than results derived from internal logic or self-referential fits. No circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is a historical review without new technical machinery.

pith-pipeline@v0.9.0 · 5622 in / 969 out tokens · 24284 ms · 2026-05-25T04:14:29.345420+00:00 · methodology


[1] [1]

2026 , eprint=

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation , author=. 2026 , eprint=

work page 2026

[2] [2]

2026 , eprint=

MIRAGE: The Illusion of Visual Understanding , author=. 2026 , eprint=

work page 2026

[3] [3]

Asher and Gillian Gold and Eason Chen and Paulo F

Michael W. Asher and Gillian Gold and Eason Chen and Paulo F. Carvalho , title =. Advances in Methods and Practices in Psychological Science , volume =. 2026 , doi =

work page 2026

[4] [4]

The Guardian , year =

Nadeem Badshah , title =. The Guardian , year =

work page

[5] [5]

Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source LLMs , author=. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page

[6] [6]

Evaluation Metrics for Generation

Bangalore, Srinivas and Rambow, Owen and Whittaker, Steve. Evaluation Metrics for Generation. INLG ' 2000 Proceedings of the First International Conference on Natural Language Generation. 2000. doi:10.3115/1118253.1118255

work page doi:10.3115/1118253.1118255 2000

[7] [7]

Distributional Memory: A General Framework for Corpus-Based Semantics

Baroni, Marco and Lenci, Alessandro. Distributional Memory: A General Framework for Corpus-Based Semantics. Computational Linguistics. 2010. doi:10.1162/coli_a_00016

work page doi:10.1162/coli_a_00016 2010

[8] [8]

arXiv preprint arXiv:2511.04703; presented at Neurips 2025 , year=

Measuring what matters: Construct validity in large language model benchmarks , author=. arXiv preprint arXiv:2511.04703; presented at Neurips 2025 , year=

work page arXiv 2025

[9] [9]

Nature Medicine , pages=

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study , author=. Nature Medicine , pages=. 2026 , publisher=

work page 2026

[10] [10]

The GREC Challenges 2010: Overview and Evaluation Results

Belz, Anja and Kow, Eric. The GREC Challenges 2010: Overview and Evaluation Results. Proceedings of the 6th International Natural Language Generation Conference. 2010

work page 2010

[11] [11]

Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing

Belz, Anya and Mille, Simon and Howcroft, David M. Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing. Proceedings of the 13th International Conference on Natural Language Generation. 2020. doi:10.18653/v1/2020.inlg-1.24

work page doi:10.18653/v1/2020.inlg-1.24 2020

[12] [12]

Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP

Belz, Anya and Thomson, Craig and Reiter, Ehud and Mille, Simon. Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.226

work page doi:10.18653/v1/2023.findings-acl.226 2023

[13] [13]

HEDS 3.0: The Human Evaluation Data Sheet Version 3.0

Belz, Anya and Thomson, Craig. HEDS 3.0: The Human Evaluation Data Sheet Version 3.0. Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ). 2025

work page 2025

[14] [14]

2026 , eprint=

International AI Safety Report 2026 , author=. 2026 , eprint=

work page 2026

[15] [15]

Journal of medical Internet research , volume=

Patient and consumer safety risks when using conversational assistants for medical information: an observational study of Siri, Alexa, and Google Assistant , author=. Journal of medical Internet research , volume=. 2018 , publisher=

work page 2018

[16] [16]

2026 , eprint=

A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic , author=. 2026 , eprint=

work page 2026

[17] [17]

and Cocke, John and Della Pietra, Stephen A

Brown, Peter F. and Cocke, John and Della Pietra, Stephen A. and Della Pietra, Vincent J. and Jelinek, Fredrick and Lafferty, John D. and Mercer, Robert L. and Roossin, Paul S. A Statistical Approach to Machine Translation. Computational Linguistics. 1990

work page 1990

[18] [18]

Evaluation Metrics for Generation

Carenini, Giuseppe. A Task-based Framework to Evaluate Evaluative Arguments. INLG ' 2000 Proceedings of the First International Conference on Natural Language Generation. 2000. doi:10.3115/1118253.1118256

work page doi:10.3115/1118253.1118256 2000

[19] [19]

Using Argumentation Strategies in Automated Argument Generation

Cheng, Hua and Mellish, Chris. An Empirical Analysis of Constructing Non-restrictive NP Modifiers to Express Semantic Relations. INLG ' 2000 Proceedings of the First International Conference on Natural Language Generation. 2000. doi:10.3115/1118253.1118269

work page doi:10.3115/1118253.1118269 2000

[20] [20]

arXiv preprint arXiv:2510.07575, presented at Neurips2025 , year=

Benchmarking is Broken--Don't Let AI be its Own Judge , author=. arXiv preprint arXiv:2510.07575, presented at Neurips2025 , year=

work page arXiv

[21] [21]

Hierarchical Reinforcement Learning for Adaptive Text Generation

Dethlefs, Nina and Cuay \'a huitl, Heriberto. Hierarchical Reinforcement Learning for Adaptive Text Generation. Proceedings of the 6th International Natural Language Generation Conference. 2010

work page 2010

[22] [22]

and Gervase, Julietta and Schoenbaum, Anna and Hanson, William and Howell, John T., III and Sheinberg, Michael and Johnson, Kevin B

Duggan, Matthew J. and Gervase, Julietta and Schoenbaum, Anna and Hanson, William and Howell, John T., III and Sheinberg, Michael and Johnson, Kevin B. , title =. JAMA Network Open , volume =. 2025 , month =. doi:10.1001/jamanetworkopen.2024.60637 , url =

work page doi:10.1001/jamanetworkopen.2024.60637 2025

[23] [23]

Evaluating Semantic Accuracy of Data-to-Text Generation with Natural Language Inference

Du s ek, Ond r ej and Kasner, Zden e k. Evaluating Semantic Accuracy of Data-to-Text Generation with Natural Language Inference. Proceedings of the 13th International Conference on Natural Language Generation. 2020. doi:10.18653/v1/2020.inlg-1.19

work page doi:10.18653/v1/2020.inlg-1.19 2020

[24] [24]

LLM -based NLG Evaluation: Current Status and Challenges

Gao, Mingqi and Hu, Xinyu and Yin, Xunjian and Ruan, Jie and Pu, Xiao and Wan, Xiaojun. LLM -based NLG Evaluation: Current Status and Challenges. Computational Linguistics. 2025. doi:10.1162/coli_a_00561

work page doi:10.1162/coli_a_00561 2025

[25] [25]

Automatic Labeling of Semantic Roles

Gildea, Daniel and Jurafsky, Daniel. Automatic Labeling of Semantic Roles. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics. 2000. doi:10.3115/1075218.1075283

work page doi:10.3115/1075218.1075283 2000

[26] Papers that go Beyond Numbers (Qualitative Research). British Medical Journal.

[27] Applied thematic analysis. 2011.

[28] Howcroft, David M. and Belz, Anya and Clinciu, Miruna-Adriana and Gkatzia, Dimitra and Hasan, Sadid A. and Mahamood, Saad and Mille, Simon and van Miltenburg, Emiel and Santhanam, Sashank and Rieser, Verena. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions. Proceedings of the 13th International Confer... 2020. doi:10.18653/v1/2020.inlg-1.23

[29] Mani, Inderjeet and House, David and Klein, Gary and Hirschman, Lynette and Firmin, Therese and Sundheim, Beth. The TIPSTER SUMMAC Text Summarization Evaluation. Ninth Conference of the European Chapter of the Association for Computational Linguistics. 1999.

[30] Mathur, Nitika and Baldwin, Timothy and Cohn, Trevor. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.448

[31] McCoy, Kathleen F. and Vijay-Shanker, K. and Yang, Gijoo. Using Tree Adjoining Grammars in the Systemic Framework. Proceedings of the Fifth International Workshop on Natural Language Generation. 1990.

[32] van Miltenburg, Emiel and Clinciu, Miruna and Du. Barriers and enabling factors for error analysis in NLG research. Northern European Journal of Language Technology. 2023. doi:10.3384/nejlt.2000-1533.2023.4529

[33] Minnen, Guido and Carroll, John and Pearce, Darren. Robust, applied morphological generation. INLG'2000 Proceedings of the First International Conference on Natural Language Generation. 2000. doi:10.3115/1118253.1118281

[34] Murray, Gabriel and Carenini, Giuseppe and Ng, Raymond. Generating and Validating Abstracts of Meeting Conversations: a User Study. Proceedings of the 6th International Natural Language Generation Conference. 2010.

[35] Rambow, Owen. Domain Communication Knowledge. Proceedings of the Fifth International Workshop on Natural Language Generation. 1990.

[36] Recent Frontier Models Are Reward Hacking. 2025.

[37] Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health & Care Informatics.

[38] Reiter, Ehud and Belz, Anja. An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems. Computational Linguistics. 2009. doi:10.1162/coli.2009.35.4.35405

[39] Reiter, Ehud and Robertson, Roma and Lennox, A. Scott and Osman, Liesl. Using a Randomised Controlled Clinical Trial to Evaluate an NLG System. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. 2001. doi:10.3115/1073012.1073069

[40] Reiter, Ehud.

[41] Reiter, Ehud. We Should Evaluate Real-World Impact. Computational Linguistics. 2025. doi:10.1162/coli.a.18

[42] Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.442

[43] Sambaraju, Rahul and Reiter, Ehud and Logie, Robert and Mckinlay, Andy and McVittie, Chris and Gatt, Albert and Sykes, Cindy. What is in a text and what does it do: Qualitative Evaluations of an NLG system -- the BT-Nurse -- using content analysis and discourse analysis. Proceedings of the 13th European Workshop on Natural Language Generation. 2011.

[44] Sun, Mengxuan and Reiter, Ehud and Murchie, Peter and Kiltie, Anne E and Ramsay, George and Duncan, Lisa and Adam, Rosalind. 2026. https://www.medrxiv.org/content/early/2026/02/03/2026.02.02.26345346.full.pdf

[45] Thomson, Craig and Reiter, Ehud. A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems. Proceedings of the 13th International Conference on Natural Language Generation. 2020. doi:10.18653/v1/2020.inlg-1.22

[46] Thomson, Craig and Reiter, Ehud and Sundararajan, Barkavi. Evaluating factual accuracy in complex data-to-text. 2023. doi:10.1016/j.csl.2023.101482

[47] Thomson, Craig and Reiter, Ehud and Belz, Anya. Common Flaws in Running Human Evaluation Experiments in NLP. Computational Linguistics. 2024. doi:10.1162/coli_a_00508

[48] Turian, Joseph and Ratinov, Lev-Arie and Bengio, Yoshua. Word Representations: A Simple and General Method for Semi-Supervised Learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. 2010.

[49] Qualitative Research: A Guide to Design and Implementation. 2025.

[50] Chris. Human evaluation of automatically generated text: Current trends and best practice guidelines. 2021. doi:10.1016/j.csl.2020.101151

[51] Readings in speech recognition. 1990.

[52] Zhou, Kaitlyn and Blodgett, Su Lin and Trischler, Adam and Daumé III, Hal and Suleman, Kaheer and Olteanu, Alexandra. Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.24

[53] Zukerman, Ingrid and McConachy, Richard and George, Sarah. Using Argumentation Strategies in Automated Argument Generation. INLG'2000 Proceedings of the First International Conference on Natural Language Generation. 2000. doi:10.3115/1118253.1118262