arxiv: 2604.25923 · v1 · submitted 2026-04-01 · 💻 cs.CL

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

Pith reviewed 2026-05-13 22:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords evaluation methodology · NLP · taxonomy · scoping review · methodological concerns · trade-offs · LLM evaluation · historical context

The pith

A taxonomy organizes recurring evaluation concerns across the history of NLP research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper performs a scoping review of evaluation concerns in natural language processing and synthesizes them into a taxonomy. The taxonomy groups positions and trade-offs from both past and present literature on how to assess NLP systems. It places recent questions about large language model evaluations into a longer historical conversation. From this, the authors derive a checklist to help guide more thoughtful choices in evaluation design and interpretation.

Core claim

The paper's core claim is that a scoping review reveals a set of recurring evaluation concerns in NLP that can be structured into a taxonomy, allowing current debates to be understood as part of established methodological discussions rather than entirely new issues.

What carries the argument

The taxonomy of evaluation concerns, which groups recurring positions and trade-offs from the literature into categories that support structured reasoning about evaluation choices.

If this is right

  • Evaluation design can become more deliberate through use of the derived checklist.
  • Trade-offs in evaluation choices become clearer when viewed through the taxonomy categories.
  • Contemporary critiques of LLM evaluations align with long-standing positions in the field.
  • Researchers gain a consolidated reference for reasoning about evaluation practices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could reduce repeated critiques by showing which concerns have already been addressed in prior work.
  • Similar structured reviews might clarify evaluation practices in adjacent areas such as computer vision or reinforcement learning.
  • Adoption of the checklist could improve consistency in how evaluation results are reported and compared across studies.

Load-bearing premise

A scoping review can identify and organize all major recurring positions and trade-offs in the NLP evaluation literature without significant gaps.

What would settle it

Discovery of a substantial body of NLP evaluation literature that introduces concerns not covered by any category in the taxonomy.

Figures

Figures reproduced from arXiv: 2604.25923 by Anders Søgaard, Ruchira Dhar.

Figure 1: Temporal distribution of papers addressing [...]
Figure 2: Distribution of surveyed papers across specific [...]
Figure 3: A taxonomy of evaluation concerns in NLP: [...]
Figure 4: Evolution of evaluation concerns across 257 [...]
Figure 5: A checklist for evaluating NLP evaluation studies, organized by concern category.
Original abstract

Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper conducts a scoping review of research on evaluation concerns in NLP and develops a taxonomy synthesizing recurring positions and trade-offs within each area. It also discusses practical implications, including a structured checklist to support more deliberate evaluation design and interpretation, with the aim of situating contemporary debates within the field's historical context to provide a consolidated reference.

Significance. If the taxonomy is comprehensive and the synthesis accurate, this manuscript would offer a useful historical and organizational reference for NLP evaluation practices, helping researchers navigate trade-offs without reinventing prior critiques. The structured checklist stands out as a concrete contribution that could directly improve evaluation design and interpretation in the field.

major comments (1)
  1. [Methods section] The scoping review lacks a clear description of the search strategy, databases used, keywords or queries, inclusion/exclusion criteria, time frame, and the number of papers screened or included. This information is necessary to evaluate the completeness of the taxonomy and the claim of capturing recurring positions across the NLP literature.
minor comments (1)
  1. [Taxonomy section] The taxonomy categories would benefit from a summary table or diagram to make the structure and interconnections more immediately accessible to readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We address the single major comment below and will revise the manuscript to improve methodological transparency.

Point-by-point responses
  1. Referee: [Methods section] The scoping review lacks a clear description of the search strategy, databases used, keywords or queries, inclusion/exclusion criteria, time frame, and the number of papers screened or included. This information is necessary to evaluate the completeness of the taxonomy and the claim of capturing recurring positions across the NLP literature.

    Authors: We agree that explicit methodological details are essential for a scoping review to allow readers to assess the scope and potential biases in the synthesized literature. The original manuscript presented the review process at a higher level of abstraction, focusing on the resulting taxonomy rather than procedural specifics. In the revised version, we will add a dedicated subsection in Methods that details: (1) the databases and repositories searched (ACL Anthology, arXiv, Google Scholar, and selected journals), (2) the Boolean keyword queries and iterative refinement process used to identify papers on evaluation concerns, (3) inclusion/exclusion criteria (e.g., peer-reviewed or preprint works explicitly addressing evaluation methodology in NLP, excluding purely application papers), (4) the time frame (primarily 2010–2024 with key foundational works from earlier decades), and (5) screening numbers (initial hits, duplicates removed, papers screened at title/abstract and full-text stages, and final included set). These additions will directly support the claim of capturing recurring positions without altering the taxonomy itself. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper is a scoping review that synthesizes recurring positions on evaluation concerns from the NLP literature into a taxonomy, accompanied by a practical checklist. Its derivation chain consists of standard literature-search and synthesis steps rather than any equations, fitted parameters, or predictions. No load-bearing step reduces by construction to the authors' own prior results, self-citations, or imported uniqueness theorems; the central claim of historical contextualization rests on the review methodology itself, which is externally verifiable and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a scoping review and taxonomy paper. No free parameters are fitted, no new axioms are introduced beyond standard assumptions of literature synthesis, and no invented entities are postulated.

pith-pipeline@v0.9.0 · 5397 in / 1120 out tokens · 25447 ms · 2026-05-13T22:49:37.196349+00:00 · methodology

