
arxiv: 2606.13111 · v1 · pith:6APDGR2Qnew · submitted 2026-06-11 · 💻 cs.CL

MÖVE: A Holistic LLM Benchmark for the German Public Sector

Pith reviewed 2026-06-27 07:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · German public sector · benchmark · hallucination · energy consumption · constitutional alignment · multi-metric evaluation

The pith

No single LLM leads on all German public-sector tasks, and model size does not predict quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MÖVE, a benchmark that scores LLMs both on performance tasks such as summarization and question answering and on governance factors such as energy use, hallucination rates, and alignment with German constitutional values. It tests 39 models on ten German-language datasets built or adapted for public-administration content. Results show that different models rank highest on different criteria and that a larger parameter count alone does not produce better scores. The benchmark is released as a living resource with public rankings and ongoing validation of its own reliability.

Core claim

MÖVE evaluates LLMs for German public administration by combining performance metrics on summarization, question answering, and topic extraction with governance metrics on hallucination, energy consumption, provider transparency, and alignment with constitutional values and party positions. Using ten German datasets including newly created gold and silver standards, the evaluation of 39 models finds that no model leads on every dimension and that parameter count is a weak predictor of overall suitability.

What carries the argument

The MÖVE benchmark, which pairs performance criteria with governance criteria across ten German-language datasets and multi-metric scoring that includes classical NLP metrics, embeddings, and LLM-as-judge methods.

If this is right

  • Public agencies can use the dual performance-governance scores to select models rather than defaulting to the largest available LLM.
  • Model rankings shift when governance criteria such as energy use or constitutional alignment are added to pure task accuracy.
  • The living benchmark structure allows new models and updated datasets to be added without redesigning the evaluation protocol.
  • Prompt sensitivity tests in the paper indicate that small wording changes can alter rankings, so agencies should re-validate before deployment.
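The first two bullets can be illustrated with a minimal sketch. It assumes each model has one score per criterion, already scaled so that higher is better (an energy figure would need inverting first). The model names, criteria, and numbers are invented for illustration and are not taken from the paper; `dominates`, `pareto_front`, and `leaders` are hypothetical helpers, not part of MÖVE.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]

# Hypothetical scores in [0, 1], higher is better on every criterion.
# Names and numbers are illustrative only, not results from the paper.
SCORES: Scores = {
    "model_a": {"summarization": 0.82, "energy": 0.40, "alignment": 0.70},
    "model_b": {"summarization": 0.75, "energy": 0.85, "alignment": 0.65},
    "model_c": {"summarization": 0.70, "energy": 0.60, "alignment": 0.60},
}


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if a is at least as good as b on all criteria and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(scores: Scores) -> List[str]:
    """Models that no other model dominates, in alphabetical order."""
    return sorted(
        name
        for name, own in scores.items()
        if not any(
            dominates(other, own)
            for other_name, other in scores.items()
            if other_name != name
        )
    )


def leaders(scores: Scores) -> Dict[str, str]:
    """Best-scoring model for each criterion."""
    criteria = next(iter(scores.values())).keys()
    return {c: max(scores, key=lambda m: scores[m][c]) for c in criteria}


if __name__ == "__main__":
    print(leaders(SCORES))
    print(pareto_front(SCORES))
```

In this toy data the per-criterion leaders differ and two models remain on the Pareto front, which is the situation the paper describes: an agency has to choose according to its own priorities rather than pick one universal winner.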

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agencies might combine the benchmark scores into a single weighted index tailored to their own risk tolerance on energy or transparency.
  • The finding that size is a poor predictor suggests that testing smaller, specialized models could yield better cost-performance trade-offs within constrained public-sector budgets.
  • If the datasets prove representative, the same dual-criteria approach could be adapted for other languages or other regulated domains such as healthcare or legal services.
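The weighted index mentioned in the first bullet could look roughly like the sketch below. Everything in it is an assumption made for illustration: the min-max normalization, the criterion names (`qa_accuracy`, `energy_wh`), the weights, and the numbers. The paper does not define such an index.

```python
from typing import Dict, Iterable

PerModel = Dict[str, float]


def min_max(values: PerModel) -> PerModel:
    """Rescale one criterion to [0, 1] across models."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All models tie, so the criterion cannot separate them.
        return {m: 0.0 for m in values}
    return {m: (v - lo) / (hi - lo) for m, v in values.items()}


def weighted_index(
    raw: Dict[str, PerModel],
    weights: Dict[str, float],
    lower_is_better: Iterable[str] = (),
) -> PerModel:
    """Combine normalized criteria into one score per model."""
    inverted = set(lower_is_better)
    total = sum(weights.values())
    models = next(iter(raw.values())).keys()
    index = {m: 0.0 for m in models}
    for criterion, per_model in raw.items():
        for model, value in min_max(per_model).items():
            if criterion in inverted:
                value = 1.0 - value  # e.g. less energy is better
            index[model] += weights[criterion] / total * value
    return index


# Hypothetical raw measurements, not taken from the paper.
RAW = {
    "qa_accuracy": {"model_a": 90.0, "model_b": 80.0, "model_c": 70.0},
    "energy_wh": {"model_a": 5.0, "model_b": 1.0, "model_c": 3.0},
}

if __name__ == "__main__":
    for weights in ({"qa_accuracy": 3, "energy_wh": 1},
                    {"qa_accuracy": 1, "energy_wh": 3}):
        index = weighted_index(RAW, weights, lower_is_better=["energy_wh"])
        print(weights, max(index, key=index.get), index)
```

With these invented numbers, the accuracy-heavy weighting favours a different model than the energy-heavy one. The choice of weights is therefore a policy decision for each agency, and any collapsed index hides the trade-offs that the separate criteria expose.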

Load-bearing premise

The ten German datasets, including the new gold and silver ones, accurately capture the content and needs of actual public-administration work.

What would settle it

Re-running the full evaluation on a fresh set of public-administration documents drawn from a different German federal agency and checking whether the same models remain top-ranked on the same criteria.
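One way to quantify whether "the same models remain top-ranked" is to compare the two rankings directly. The sketch below is an assumption about how such a check might be done, not the paper's procedure: it computes Spearman's rank correlation (the simple formula, valid only when there are no tied scores) and the overlap of the top-k models. The scores are invented.

```python
from typing import Dict

PerModel = Dict[str, float]


def ranks(scores: PerModel) -> Dict[str, int]:
    """Rank 1 is the highest score. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def spearman(a: PerModel, b: PerModel) -> float:
    """Spearman rank correlation for the same models scored twice.

    Uses 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which requires at least
    two models and no ties.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n**2 - 1))


def top_k_overlap(a: PerModel, b: PerModel, k: int) -> float:
    """Fraction of the top-k models shared by both rankings."""
    ra, rb = ranks(a), ranks(b)
    top_a = {m for m, r in ra.items() if r <= k}
    top_b = {m for m, r in rb.items() if r <= k}
    return len(top_a & top_b) / k


# Hypothetical scores on the original and on a fresh document set.
ORIGINAL = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70, "model_d": 0.60}
FRESH = {"model_a": 0.85, "model_b": 0.60, "model_c": 0.75, "model_d": 0.50}

if __name__ == "__main__":
    print(spearman(ORIGINAL, FRESH))
    print(top_k_overlap(ORIGINAL, FRESH, k=2))
```

A correlation near 1 and a high top-k overlap on each criterion would support the load-bearing premise; low values would suggest that the rankings depend on the particular documents chosen.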

read the original abstract

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and knowledge about positions by German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MÖVE, a holistic benchmark evaluating 39 LLMs for the German public sector across performance criteria (summarization, question answering, topic extraction) and governance criteria (hallucination, energy consumption, provider transparency, alignment with German constitutional values and political party positions). It uses ten German-language datasets, including newly constructed gold- and silver-standard sets intended to reflect public-administration domains, and applies a multi-metric strategy with classical NLP metrics, embeddings, and LLM-as-a-judge. Results indicate no single model dominates across criteria and that model size is a poor predictor of quality; the work also includes self-evaluation of benchmark properties such as statistical precision, judge reliability, prompt sensitivity, and private-dataset impact, and positions MÖVE as a living benchmark with public results.

Significance. If the datasets are representative, the multi-dimensional evaluation and finding that performance and governance criteria produce different top models would provide actionable guidance for German public-sector LLM selection, moving beyond English-centric or performance-only benchmarks. The public leaderboard, energy estimates, and constitutional-alignment checks add practical value; the living-benchmark design and self-evaluation checks are also positive features.

major comments (2)
  1. [dataset construction and evaluation sections] The manuscript states that the newly constructed gold- and silver-standard datasets were 'constructed to reflect public-administration domains,' yet supplies no external validation (practitioner review, distributional comparison to authentic Verwaltungsrecht or administrative corpora, or coverage analysis of domain-specific features such as formal legal phrasing and bureaucratic constraints). This directly affects the robustness of the central claim that task-wise rank reversals and the decoupling of model size from quality reflect real deployment rather than benchmark artifacts.
  2. [self-evaluation of the benchmark] The additional benchmark-validity checks (LLM-judge reliability, prompt sensitivity, private-dataset impact) presuppose that the underlying ten datasets already capture the target domain; they do not test that presupposition. Without domain-representativeness evidence, these checks cannot fully substantiate the generalizability of the reported rankings.
minor comments (2)
  1. Notation for the ten datasets and the distinction between gold- and silver-standard items could be clarified with an explicit table listing sources, sizes, and construction procedures.
  2. The abstract and introduction use 'M"OVE' with escaped umlauts; consistent rendering of the acronym and German terms throughout would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on dataset construction and the scope of the benchmark self-evaluations. We respond to each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [dataset construction and evaluation sections] The manuscript states that the newly constructed gold- and silver-standard datasets were 'constructed to reflect public-administration domains,' yet supplies no external validation (practitioner review, distributional comparison to authentic Verwaltungsrecht or administrative corpora, or coverage analysis of domain-specific features such as formal legal phrasing and bureaucratic constraints). This directly affects the robustness of the central claim that task-wise rank reversals and the decoupling of model size from quality reflect real deployment rather than benchmark artifacts.

    Authors: We agree that the manuscript provides no external validation of the newly constructed datasets. Construction relied on internal expertise and selection of texts exhibiting administrative characteristics, but no practitioner review, distributional comparisons, or systematic coverage analysis of features such as formal legal phrasing was performed. This is a genuine limitation that weakens claims linking observed rank reversals and size-quality decoupling directly to real-world deployment. In the revised manuscript we will (i) expand the dataset-construction subsection with a more detailed account of the internal process and concrete examples of incorporated features, (ii) add an explicit limitations paragraph stating the absence of external validation, and (iii) qualify all statements about real-deployment implications to refer only to the evaluated datasets. revision: yes

  2. Referee: [self-evaluation of the benchmark] The additional benchmark-validity checks (LLM-judge reliability, prompt sensitivity, private-dataset impact) presuppose that the underlying ten datasets already capture the target domain; they do not test that presupposition. Without domain-representativeness evidence, these checks cannot fully substantiate the generalizability of the reported rankings.

    Authors: We agree that the self-evaluation checks address internal properties of the evaluation pipeline but do not test the domain-representativeness assumption. In the revision we will (i) explicitly delineate the scope of these checks as methodological robustness measures rather than domain-validation evidence, (ii) cross-reference the new limitations paragraph on dataset construction, and (iii) adjust the discussion of ranking generalizability to reflect that the reported results apply to the specific datasets employed. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation or self-referential reduction

full rationale

The paper is a direct empirical benchmark of 39 LLMs on ten German datasets (including author-constructed gold/silver standards) using standard NLP metrics, embeddings, and LLM judges. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented; the central claim that no model dominates and size is a poor predictor follows from tabulated performance numbers on the chosen tasks. The construction of datasets to 'reflect public-administration domains' is an input assumption whose validity is external to any derivation chain, and no self-citation is invoked as load-bearing evidence for the reported rankings. The work therefore contains no circular steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution rests on the assumption that the constructed datasets and chosen governance criteria validly represent public-administration needs; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5801 in / 1093 out tokens · 18267 ms · 2026-06-27T07:05:08.668688+00:00 · methodology


Reference graph

Works this paper leans on

146 extracted references · 37 canonical work pages · 1 internal anchor

  1. [1]

    Künstliche intelligenz in der öffentlichen verwaltung.Digitalakademie@ BW & Fraunhofer IAO, Stuttgart, pages 11–12, 2020

    Jan Etscheid, Jörn von Lucke, and Felix Stroh. Künstliche intelligenz in der öffentlichen verwaltung.Digitalakademie@ BW & Fraunhofer IAO, Stuttgart, pages 11–12, 2020

  2. [2]

    Öffentlicher Dienst 2024: Mehr Beschäftigte für Bildung und Kinderbetreuung

    Statistisches Bundesamt (Destatis). Öffentlicher Dienst 2024: Mehr Beschäftigte für Bildung und Kinderbetreuung. https://www.destatis.de/DE/Presse/ Pressemitteilungen/2025/06/PD25_212_741.html, 2025. Pressemitteilung Nr. 212 vom 23. Juni 2025. Accessed: 2026-05-07

  3. [3]

    Wie öffentliche Institutionen ihre Beschäftigten strategisch binden sollten

    PwC Deutschland. Wie öffentliche Institutionen ihre Beschäftigten strategisch binden sollten. https://blogs. pwc.de/de/oeffentlicher-sektor-zukunft-gestalten/article/252701/ wie-oeffentliche-institutionen-ihre-beschaeftigten-strategisch-binden-sollten/, January 2026. Blog post, Öffentlicher Sektor – Zukunft gestalten. Accessed: 2026-05-07

  4. [4]

    Measuring Massive Multitask Lan- guage Understanding, January 2021

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Lan- guage Understanding, January 2021. URL http://arxiv.org/abs/2009.03300. arXiv:2009.03300 [cs]

  5. [5]

    Beyond the Imitation Game: Quantifying and extrap- olating the capabilities of language models, June 2023

    Aarohi Srivastava et al. Beyond the Imitation Game: Quantifying and extrap- olating the capabilities of language models, June 2023. URL http://arxiv.org/ abs/2206.04615. arXiv:2206.04615 [cs]. 80

  6. [6]

    GLUE : A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors,Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brus...

  7. [7]

    Bowman.SuperGLUE: a stick- ier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.SuperGLUE: a stick- ier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA, 2019

  8. [8]

    Star-sql: Self-taught reasoner for text-to-sql

    ShivalikaSinghetal. GlobalMMLU:Understandingandaddressingculturaland linguistic biases in multilingual evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria, July ...

  9. [9]

    Towards multilingual llm eval- uation for european languages, 2024

    Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, and Mehdi Ali. Towards multilingual llm eval- uation for european languages, 2024. URL https://arxiv.org/abs/2410.08928. preprint

  10. [10]

    Holistic Evaluation of Language Models, October 2023

    Percy Liang et al. Holistic Evaluation of Language Models, October 2023. URL http://arxiv.org/abs/2211.09110. arXiv:2211.09110 [cs]

  11. [11]

    NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

    Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Compu- tational Linguistics: EMNLP 2023, pages 10776–10787, Singapore, December

  12. [12]

    doi: 10.18653/v1/2023

    Association for Computational Linguistics. doi: 10.18653/v1/2023. findings-emnlp.722. URL https://aclanthology.org/2023.findings-emnlp.722/

  13. [13]

    Generalization or memorization: Data contamination and trustworthy evaluation for large language models

    Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 12039–12050, Bangkok, Thailand, August 20...

  14. [14]

    Healthy llms? benchmarking llm knowledge of uk government public health information, 2025

    Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Non- nenmacher, Oliver Higgins, Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, and Michael Borowitz. Healthy llms? benchmarking llm knowledge of uk government public health information, 2025. URL https://arxiv.org/abs/ 2505.06046. preprint. 81

  15. [15]

    The citizenquery benchmark: A novel dataset and evaluation pipeline for measuring llm performance in citizen query tasks, 2026

    Neil Majithia, Rajat Shinde, Zo Chapman, Prajun Trital, Jordan Decker, Manil Maskey, Elena Simperl, and Nigel Shadbolt. The citizenquery benchmark: A novel dataset and evaluation pipeline for measuring llm performance in citizen query tasks, 2026. URL https://arxiv.org/abs/2602.04064

  16. [16]

    The evaluation framework and benchmark for large language models in the government affairs domain.ACM Trans

    Shuo Liu, Lin Zhang, Weidong Liu, Jianfeng Zhang, Donghui Gao, and Xiaofeng Jia. The evaluation framework and benchmark for large language models in the government affairs domain.ACM Trans. Intell. Syst. Technol., 16(6), November

  17. [17]

    doi: 10.1145/3716854

    ISSN 2157-6904. doi: 10.1145/3716854. URL https://doi.org/10.1145/ 3716854

  18. [18]

    Agent benchmarks fail public sector requirements, 2026

    Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, and Chris Russell. Agent benchmarks fail public sector requirements, 2026. URL https: //arxiv.org/abs/2601.20617

  19. [19]

    Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down har- monised rules on artificial intelligence (AI Act), 2024

    European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down har- monised rules on artificial intelligence (AI Act), 2024. URL https://eur-lex. europa.eu/eli/reg/2024/1689/oj

  20. [20]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InProceedings of the 31st International Conference on Neural Information Pro- cessing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

  21. [21]

    BERT: pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805

  22. [22]

    Brown et al

    Tom B. Brown et al. Language models are few-shot learners. InProceedings of the 34th International Conference on Neural Information Processing Sys- tems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

  23. [23]

    SWAG: A large- scale adversarial dataset for grounded commonsense inference

    Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large- scale adversarial dataset for grounded commonsense inference. In Ellen Riloff, DavidChiang,JuliaHockenmaier,andJun’ichiTsujii,editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium, October-November 2018. Assoc...

  24. [24]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguisti...

  25. [25]

    Winogrande: an adversarial winograd schema challenge at scale.Commun

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale.Commun. ACM, 64(9):99–106, August 2021. ISSN 0001-0782. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381. 82

  26. [26]

    CLUE: A Chinese language understanding evaluation benchmark

    Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Ri...

  27. [27]

    Mmlu-pro: a more robust and challenging multi-task language understanding benchmark

    Yubo Wang et al. Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. InProceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA,

  28. [28]

    ISBN 9798331314385

    Curran Associates Inc. ISBN 9798331314385

  29. [29]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  30. [30]

    Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Health- bench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/2505.08775

  31. [31]

    Bennett, Daniel Hoyer, Pieter Francois, Peter Turchin, and R

    Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Fed- erica Villa, James S. Bennett, Daniel Hoyer, Pieter Francois, Peter Turchin, and R. Maria del Rio-Chanona. Large language models’ expert-level global history knowledge benchmark (hist-llm). InProceedings of the 38th International Con- ference on Neural Information Processing Syste...

  32. [32]

    Detecting linguistic bias in government documents using large language models

    Milena de Swart, Floris Den Hengst, and Jieying Chen. Detecting linguistic bias in government documents using large language models. InProceedings of the ACM on Web Conference 2025, WWW ’25, page 5034–5044, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400712746. doi: 10.1145/3696410.3714526. URL https://doi.org/10.1145/3696410.3714526

  33. [33]

    I roko B ench: A New Benchmark for A frican Languages in the Age of Large Language Models

    David Ifeoluwa Adelani et al. IrokoBench: A new benchmark for African languages in the age of large language models. In Luis Chiruzzo, Alan Rit- ter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 273...

  34. [34]

    SEA - HELM : S outheast A sian Holistic Evaluation of Language Models

    Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, 83 Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. SEA-HELM: Southeast Asian holistic evaluation of language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors...

  35. [35]

    Should we respect LLMs? a cross-lingual study on the influence of prompt politeness on LLM performance

    Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, and Satoshi Sekine. Should we respect LLMs? a cross-lingual study on the influence of prompt politeness on LLM performance. In James Hale, Kushal Chawla, and Muskan Garg, editors,Proceedings of the Second Workshop on Social Influence in Con- versations (SICon 2024), pages 9–35, Miami, Florida, USA, Novemb...

  36. [36]

    TurkishMMLU: Measuring massive multitask language under- standing in Turkish

    Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Senel, Anna Korhonen, and Hinrich Schuetze. TurkishMMLU: Measuring massive multitask language under- standing in Turkish. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7035–7055, Miami, Florida, USA, November 2024. Assoc...

  37. [37]

    KMMLU: Measuring massive multitask language understanding in Korean

    Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, TaekyoonChoi,CheonbokPark,KangMinYoo,andStellaBiderman. KMMLU: Measuring massive multitask language understanding in Korean. In Luis Chiruzzo,AlanRitter,andLuWang,editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Li...

  38. [38]

    URL https://aclanthology.org/2025.naacl-long.206/

  39. [39]

    On the measure of intelligence, 2019

    François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/ abs/1911.01547

  40. [40]

    Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025

    Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025. URL https://arxiv.org/abs/2505.11831

  41. [41]

    In benchmarks we trust

    Ine Gevers, Victor De Marez, Jens Van Nooten, Jens Lemmens, Andriy Kosar, Ehsan Lotfi, Nikolay Banar, Pieter Fivez, Luna De Bruyne, and Walter Daele- mans. In benchmarks we trust ... or not? In Christos Christodoulopoulos, TanmoyChakraborty,CarolynRose,andVioletPeng,editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proces...

  42. [42]

    Bender, Alex Hanna, and Amandalynne Paullada

    Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, and Amandalynne Paullada. Ai and the everything in the whole wide world benchmark. In J. Van- schoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

  43. [43]

    Bean et al

    Andrew M. Bean et al. Measuring what matters: Construct validity in large language model benchmarks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=mdA5lVvNcU

  44. [44]

    Retrieval Augmentation Reduces Hallucination in Conversation,

    Samuel R. Bowman and George Dahl. What will it take to fix benchmarking in natural language understanding? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cot- terell, Tanmoy Chakraborty, and Yichao Zhou, editors,Proceedings of the 2021 Conference of the North American Chapter of the Association...

  45. [46]

    Chatterji, Faisal Ladhak, and Tat- sunori Hashimoto

    Yonatan Oren, Nicole Meister, Niladri S. Chatterji, Faisal Ladhak, and Tat- sunori Hashimoto. Provingtest set contamination inblack-boxlanguage models. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KS8mIvetg2

  46. [47]

    Benchmarking is broken – don’t let ai be its own judge, 2025

    Zerui Cheng et al. Benchmarking is broken – don’t let ai be its own judge, 2025. URL https://arxiv.org/abs/2510.07575

  47. [48]

    Betterbench: Assessing AI benchmarks, uncovering issues,andestablishingbestpractices

    Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel Kochenderfer. Betterbench: Assessing AI benchmarks, uncovering issues,andestablishingbestpractices. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=hcOq2buakM

  48. [49]

    A trainable document sum- marizer

    Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document sum- marizer. InProceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 68–73, 1995

  49. [50]

    Generic text summarization using relevance measure and latent semantic analysis

    Yihong Gong and Xin Liu. Generic text summarization using relevance measure and latent semantic analysis. InProceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’01, pages 19–25, New York, NY, USA, September 2001. Association for Computing Machinery. ISBN 978-1-58113-331-8. doi: 10.1...

  50. [51]

    Neural Summarization by Extract- ing Sentences and Words, July 2016

    Jianpeng Cheng and Mirella Lapata. Neural Summarization by Extract- ing Sentences and Words, July 2016. URL http://arxiv.org/abs/1603.07252. 85 arXiv:1603.07252 [cs]

  51. [52]

    Summarization beyond sentence extraction: A probabilistic approach to sentence compression.Artificial Intelligence, 139(1): 91–107, 2002

    Kevin Knight and Daniel Marcu. Summarization beyond sentence extraction: A probabilistic approach to sentence compression.Artificial Intelligence, 139(1): 91–107, 2002

  52. [53]

    Rush, Sumit Chopra, and Jason Weston

    Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization, September 2015. URL http: //arxiv.org/abs/1509.00685. arXiv:1509.00685 [cs]

  53. [54]

    Liu, and Christopher D

    Abigail See, Peter J. Liu, and Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks, April 2017. URL http:// arxiv.org/abs/1704.04368. arXiv:1704.04368 [cs]

  54. [55]

    ACM Comput

    Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics.ACM Com- puting Surveys, 55(8):1–35, August 2023. ISSN 0360-0300, 1557-7341. doi: 10.1145/3545176

  55. [56]

    A compre- hensive survey on automatic text summarization with exploration of llm-based methods.Neurocomputing, page 131928, 2025

    Yang Zhang, Hanlei Jin, Dan Meng, Jun Wang, and Jinghua Tan. A compre- hensive survey on automatic text summarization with exploration of llm-based methods.Neurocomputing, page 131928, 2025

  56. [57]

    Abstractive text summarization using sequence-to-sequence rnns and beyond

    Ramesh Nallapati, Bowen Zhou, Cicero Dos Santos, Çağlar Gulçehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence rnns and beyond. InProceedings of the 20th SIGNLL conference on computational natural language learning, pages 280–290, 2016

  57. [58]

    Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

    Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. pages 1797–1807, 2018

  58. [59]

    Efficient attentions for long document summarization

    Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. pages 1419–1436, 2021

  59. [60]

    BillSum: A corpus for automatic summarization of US legislation

    Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. pages 48–56, 2019

  60. [61]

    Eur-lex-sum: A multi- and cross-lingual dataset for long-form summarization in the legal domain. arXiv preprint arXiv:2210.13448, 2022

    Dennis Aumiller, Ashish Chouhan, and Michael Gertz. Eur-lex-sum: A multi- and cross-lingual dataset for long-form summarization in the legal domain. arXiv preprint arXiv:2210.13448, 2022

  61. [62]

    Scale: Scaling up the complexity for advanced language model evaluation

    Vishvaksenan Rasiah, Ronja Stern, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho, and Joel Niklaus. Scale: Scaling up the complexity for advanced language model evaluation, 2023

  62. [63]

    ROUGE: A Package for Automatic Evaluation of Summaries

    Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

  64. [65]

    BLEU: A method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02, page 311, Philadelphia, Pennsylvania, 2001. Association for Computational Linguistics. doi: 10.3115/1073083.1073135

  65. [66]

    Summeval: Re-evaluating summarization evaluation

    Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409, 2021

  66. [67]

    BERTScore: Evaluating Text Generation with BERT, February 2020

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT, February 2020. URL http://arxiv.org/abs/1904.09675. arXiv:1904.09675 [cs]

  67. [68]

    SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity, January 2024

    Ansar Aynetdinov and Alan Akbik. SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity, January 2024. URL http://arxiv.org/abs/2401.17072. arXiv:2401.17072 [cs]

  68. [69]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, December 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, December 2023. URL http://arxiv.org/abs/2306.05685. arXiv:2306.05685 [cs]

  69. [70]

    G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

  70. [71]

    Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024

  71. [72]

    News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022

    Tanya Goyal, Junyi Jessy Li, and Greg Durrett. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022

  72. [73]

    Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57, 2024

    Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57, 2024. ISSN 2307-387X

  73. [74]

    QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

    Anna Rogers, Matt Gardner, and Isabelle Augenstein. QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension. ACM Computing Surveys, 55:1–45, 2023. doi: 10.1145/3560260

  74. [75]

    Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

    Penghao Zhu, Zhiwei Lin, Zijian Wang, Xiaodan Liang, et al. Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

  75. [76]

    Teaching machines to read and comprehend

    Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28, 2015

  76. [78]

    Reading Wikipedia to answer open-domain questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics, 2017.

  77. [79]

    Latent retrieval for weakly supervised open domain question answering

    Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–

  78. [81]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods, May 2022

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods, May 2022. URL http://arxiv.org/abs/2109.07958. arXiv:2109.07958 [cs]

  79. [82]

    Know what you don’t know: Unanswerable questions for SQuAD

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789. Association for Computational Linguistics, 2018

  80. [83]

    Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

    Tom Kwiatkowski et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Showing first 80 references.