Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
Pith reviewed 2026-05-13 22:49 UTC · model grok-4.3
The pith
A taxonomy organizes recurring evaluation concerns across the history of NLP research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A scoping review reveals a set of recurring evaluation concerns in NLP that can be structured into a taxonomy, so that current debates can be understood as part of established methodological discussions rather than as entirely new issues.
What carries the argument
The taxonomy of evaluation concerns, which groups recurring positions and trade-offs from the literature into categories that support structured reasoning about evaluation choices.
If this is right
- Evaluation design can become more deliberate through use of the derived checklist.
- Trade-offs in evaluation choices become clearer when viewed through the taxonomy categories.
- Contemporary critiques of LLM evaluations align with long-standing positions in the field.
- Researchers gain a consolidated reference for reasoning about evaluation practices.
Where Pith is reading between the lines
- The taxonomy could reduce repeated critiques by showing which concerns have already been addressed in prior work.
- Similar structured reviews might clarify evaluation practices in adjacent areas such as computer vision or reinforcement learning.
- Adoption of the checklist could improve consistency in how evaluation results are reported and compared across studies.
Load-bearing premise
A scoping review can identify and organize all major recurring positions and trade-offs in the NLP evaluation literature without significant gaps.
What would settle it
Discovery of a substantial body of NLP evaluation literature that introduces concerns not covered by any category in the taxonomy.
Original abstract
Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a scoping review of research on evaluation concerns in NLP and develops a taxonomy synthesizing recurring positions and trade-offs within each area. It also discusses practical implications, including a structured checklist to support more deliberate evaluation design and interpretation, with the aim of situating contemporary debates within the field's historical context to provide a consolidated reference.
Significance. If the taxonomy is comprehensive and the synthesis accurate, this manuscript would offer a useful historical and organizational reference for NLP evaluation practices, helping researchers navigate trade-offs without reinventing prior critiques. The structured checklist stands out as a concrete contribution that could directly improve evaluation design and interpretation in the field.
major comments (1)
- [Methods section] The scoping review lacks a clear description of the search strategy, databases used, keywords or queries, inclusion/exclusion criteria, time frame, and the number of papers screened or included. This information is necessary to evaluate the completeness of the taxonomy and the claim of capturing recurring positions across the NLP literature.
minor comments (1)
- [Taxonomy section] The taxonomy categories would benefit from a summary table or diagram to make the structure and interconnections more immediately accessible to readers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation for major revision. We address the single major comment below and will revise the manuscript to improve methodological transparency.
- Referee: [Methods section] The scoping review lacks a clear description of the search strategy, databases used, keywords or queries, inclusion/exclusion criteria, time frame, and the number of papers screened or included. This information is necessary to evaluate the completeness of the taxonomy and the claim of capturing recurring positions across the NLP literature.
  Authors: We agree that explicit methodological details are essential for a scoping review to allow readers to assess the scope and potential biases in the synthesized literature. The original manuscript presented the review process at a higher level of abstraction, focusing on the resulting taxonomy rather than procedural specifics. In the revised version, we will add a dedicated subsection in Methods that details: (1) the databases and repositories searched (ACL Anthology, arXiv, Google Scholar, and selected journals), (2) the Boolean keyword queries and iterative refinement process used to identify papers on evaluation concerns, (3) inclusion/exclusion criteria (e.g., peer-reviewed or preprint works explicitly addressing evaluation methodology in NLP, excluding purely application papers), (4) the time frame (primarily 2010–2024 with key foundational works from earlier decades), and (5) screening numbers (initial hits, duplicates removed, papers screened at title/abstract and full-text stages, and final included set). These additions will directly support the claim of capturing recurring positions without altering the taxonomy itself.
  Revision: yes
Circularity Check
No significant circularity detected
The paper is a scoping review that synthesizes recurring positions on evaluation concerns from the NLP literature into a taxonomy, accompanied by a practical checklist. Its derivation chain consists of standard literature-search and synthesis steps rather than any equations, fitted parameters, or predictions. No load-bearing step reduces by construction to the authors' own prior results, self-citations, or imported uniqueness theorems; the central claim of historical contextualization rests on the review methodology itself, which is externally verifiable and does not exhibit any of the enumerated circularity patterns.