
arxiv: 2605.23715 · v1 · pith:WOBTOO2Vnew · submitted 2026-05-22 · 💻 cs.CL

NLG Evaluation: Past, Present, Future

Pith reviewed 2026-05-25 04:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords NLG evaluation · natural language generation · evaluation methods · machine learning · impact evaluation · safety evaluation · qualitative evaluation

The pith

NLG evaluation has evolved from minimal formal testing in the linguistics era to a core requirement today, and will increasingly emphasize impact, qualitative, and safety aspects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces how natural language generation evaluation practices have transformed over more than three decades. In the linguistics-dominated era around 1990, formal experiments were rare, but as NLG became tied to machine learning, rigorous evaluation became standard. The author expects this evolution to continue, with greater attention to how systems affect users in practice, their qualitative performance, and safety issues, because NLG tools are now used by large numbers of people. A sympathetic reader would care because the choice of evaluation methods determines whether widely deployed systems help or harm their users.

Core claim

Natural Language Generation evaluation has changed dramatically since 1990, when NLG had close ties to linguistics and there was very little formal experimental evaluation in the modern sense. By 2026, with NLG closely linked to machine learning, experimental evaluation is expected and fundamental to research. Many techniques have been developed over this period, most recently LLM-as-Judge, and future priorities will include impact, qualitative, and safety evaluation as NLG technology sees routine use by many people.

What carries the argument

The shift in NLG's primary connections from linguistics to machine learning, which has driven and will continue to drive changes in what kinds of evaluation are prioritized.

If this is right

  • Evaluation will move beyond automatic metrics to include assessments of real-world user impact.
  • Qualitative methods will play a larger role in understanding system performance.
  • Safety evaluations will become essential to prevent harms from widely deployed NLG systems.
  • Research practices will adapt to reflect these new priorities in experimental design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation may require more human-subject studies and long-term monitoring of deployed systems.
  • This shift could lead to new standards or regulations for NLG applications in public-facing tools.
  • Similar evaluation changes might appear in other generative AI areas such as image or code generation.

Load-bearing premise

The assumption that the shift from linguistics-linked NLG to machine-learning-linked NLG will continue and that this will make impact, qualitative, and safety evaluation the dominant future priorities.

What would settle it

A continued dominance of automatic metrics or linguistic analyses in NLG papers even after widespread public deployment, or a reversal back to linguistics-focused evaluation methods.

Original abstract

Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript surveys the evolution of NLG evaluation practices, contrasting the limited formal experimental evaluation in 1990 (when NLG was closely tied to linguistics) with the expected centrality of such evaluation in 2026 (when NLG is closely linked to machine learning). It notes the development of various techniques over this period, including LLM-as-Judge, and predicts that impact, qualitative, and safety evaluations will grow in importance as NLG sees routine use by large numbers of people.

Significance. If the historical narrative is accurate, the paper offers a compact synthesis of how evaluation norms in NLG have shifted with the field's methodological orientation. Its forward-looking claim could help orient researchers toward emerging priorities, though the absence of new data, controlled comparisons, or formal arguments means its contribution is primarily reflective rather than generative.

major comments (1)
  1. [Abstract] The prediction that impact, qualitative, and safety evaluation 'will become more important' rests on the unexamined premise that the linguistics-to-ML shift 'will continue'; no supporting usage statistics, deployment case studies, or discussion of countervailing trends are supplied to ground this load-bearing forward claim.
minor comments (1)
  1. The manuscript would benefit from explicit citations to representative papers or evaluation protocols from the 1990s versus the 2020s to make the claimed 'dramatic change' more concrete and verifiable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our reflective survey. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract] The prediction that impact, qualitative, and safety evaluation 'will become more important' rests on the unexamined premise that the linguistics-to-ML shift 'will continue'; no supporting usage statistics, deployment case studies, or discussion of countervailing trends are supplied to ground this load-bearing forward claim.

    Authors: We agree that the forward claim in the abstract is an extrapolation from the historical pattern documented in the paper rather than a data-supported forecast. The manuscript is a short reflective synthesis and does not contain usage statistics, deployment studies, or analysis of counter-trends. To address the concern, we will revise the abstract to make the conditional nature of the prediction explicit (i.e., 'assuming the current methodological orientation persists') and add a brief clause acknowledging that alternative trajectories remain possible.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; survey and opinion piece with no derivations

Full rationale

The paper is a historical survey plus forward-looking opinion on NLG evaluation trends. It contains no equations, fitted parameters, formal derivations, or load-bearing self-citations that reduce any claim to its own inputs by construction. The central statements are descriptive timelines and an expectation that impact/qualitative/safety evaluation will rise in importance; these are presented as premises and observations rather than results derived from internal logic or self-referential fits. No circular steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is a historical review without new technical machinery.

pith-pipeline@v0.9.0 · 5622 in / 969 out tokens · 24284 ms · 2026-05-25T04:14:29.345420+00:00 · methodology

