NLG Evaluation: Past, Present, Future
Pith reviewed 2026-05-25 04:14 UTC · model grok-4.3
The pith
NLG evaluation has gone from minimal formal testing in the linguistics era to a core research requirement today, and is expected to increasingly emphasize impact, qualitative, and safety evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Natural Language Generation evaluation has changed dramatically since 1990, when the field's close ties to linguistics meant there was very little formal experimental evaluation in the modern sense. By 2026, with NLG closely linked to machine learning, experimental evaluation is expected and fundamental to research. Many techniques have been developed over this period, most recently LLM-as-Judge, and the author expects impact, qualitative, and safety evaluation to become more important as large numbers of people routinely use NLG technology.
What carries the argument
The shift in NLG's primary connections from linguistics to machine learning, which has driven and will continue to drive changes in what kinds of evaluation are prioritized.
If this is right
- Evaluation will move beyond automatic metrics to include assessments of real-world user impact.
- Qualitative methods will play a larger role in understanding system performance.
- Safety evaluations will become essential to prevent harms from widely deployed NLG systems.
- Research practices will adapt to reflect these new priorities in experimental design.
Where Pith is reading between the lines
- Evaluation may require more human-subject studies and long-term monitoring of deployed systems.
- This shift could lead to new standards or regulations for NLG applications in public-facing tools.
- Similar evaluation changes might appear in other generative AI areas such as image or code generation.
Load-bearing premise
The assumption that the shift from linguistics-linked NLG to machine-learning-linked NLG will continue and that this will make impact, qualitative, and safety evaluation the dominant future priorities.
What would settle it
A continued dominance of automatic metrics or linguistic analyses in NLG papers even after widespread public deployment, or a reversal back to linguistics-focused evaluation methods.
Original abstract
Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys the evolution of NLG evaluation practices, contrasting the limited formal experimental evaluation in 1990 (when NLG was closely tied to linguistics) with the expected centrality of such evaluation in 2026 (when NLG is closely linked to machine learning). It notes the development of various techniques over this period, including LLM-as-Judge, and predicts that impact, qualitative, and safety evaluations will grow in importance as NLG sees routine use by large numbers of people.
Significance. If the historical narrative is accurate, the paper offers a compact synthesis of how evaluation norms in NLG have shifted with the field's methodological orientation. Its forward-looking claim could help orient researchers toward emerging priorities, though the absence of new data, controlled comparisons, or formal arguments means its contribution is primarily reflective rather than generative.
major comments (1)
- [Abstract] The prediction that impact, qualitative, and safety evaluation 'will become more important' rests on the unexamined premise that the linguistics-to-ML shift 'will continue'; no supporting usage statistics, deployment case studies, or discussion of countervailing trends are supplied to ground this load-bearing forward claim.
minor comments (1)
- The manuscript would benefit from explicit citations to representative papers or evaluation protocols from the 1990s versus the 2020s to make the claimed 'dramatic change' more concrete and verifiable.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our reflective survey. We address the major comment below.
- Referee: [Abstract] The prediction that impact, qualitative, and safety evaluation 'will become more important' rests on the unexamined premise that the linguistics-to-ML shift 'will continue'; no supporting usage statistics, deployment case studies, or discussion of countervailing trends are supplied to ground this load-bearing forward claim.
  Authors: We agree that the forward claim in the abstract is an extrapolation from the historical pattern documented in the paper rather than a data-supported forecast. The manuscript is a short reflective synthesis and does not contain usage statistics, deployment studies, or analysis of counter-trends. To address the concern, we will revise the abstract to make the conditional nature of the prediction explicit (i.e., 'assuming the current methodological orientation persists') and add a brief clause acknowledging that alternative trajectories remain possible.
  revision: yes
Circularity Check
No significant circularity; survey and opinion piece with no derivations
The paper is a historical survey plus forward-looking opinion on NLG evaluation trends. It contains no equations, fitted parameters, formal derivations, or load-bearing self-citations that reduce any claim to its own inputs by construction. The central statements are descriptive timelines and an expectation that impact/qualitative/safety evaluation will rise in importance; these are presented as premises and observations rather than results derived from internal logic or self-referential fits. No circular steps exist.