MÖVE: A Holistic LLM Benchmark for the German Public Sector

Camilla Dalerci; Daniel Weinland; Robin Schaefer; Thilo Michael

arxiv: 2606.13111 · v1 · pith:6APDGR2Qnew · submitted 2026-06-11 · 💻 cs.CL

Pith reviewed 2026-06-27 07:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluation · German public sector · benchmark · hallucination · energy consumption · constitutional alignment · multi-metric evaluation

The pith

No single LLM leads on all German public-sector tasks, and model size does not predict quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MÖVE, a benchmark that scores LLMs on both performance tasks such as summarization and question answering and on governance factors such as energy use, hallucination rates, and alignment with German constitutional values. It tests 39 models on ten German-language datasets built or adapted for public-administration content. Results show that different models rank highest on different criteria and that larger parameter counts alone do not produce better scores. The benchmark is released as a living resource with public rankings and ongoing validation of its own reliability.

Core claim

MÖVE evaluates LLMs for German public administration by combining performance metrics on summarization, question answering, and topic extraction with governance metrics on hallucination, energy consumption, provider transparency, and alignment with constitutional values and party positions. Using ten German datasets including newly created gold and silver standards, the evaluation of 39 models finds that no model leads on every dimension and that parameter count is a weak predictor of overall suitability.

What carries the argument

The MÖVE benchmark, which pairs performance criteria with governance criteria across ten German-language datasets and multi-metric scoring that includes classical NLP metrics, embeddings, and LLM-as-judge methods.

If this is right

Public agencies can use the dual performance-governance scores to select models rather than defaulting to the largest available LLM.
Model rankings shift when governance criteria such as energy use or constitutional alignment are added to pure task accuracy.
The living benchmark structure allows new models and updated datasets to be added without redesigning the evaluation protocol.
Prompt sensitivity tests in the paper indicate that small wording changes can alter rankings, so agencies should re-validate before deployment.
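
The first two points can be illustrated with a small dominance check. This is a minimal sketch, not the paper's scoring code: the model names, criterion names, and all numbers are hypothetical, and energy is entered as an efficiency score so that higher is better on every axis.

```python
from typing import Dict, List, Optional

# Hypothetical scores in [0, 1]; higher is better on every criterion.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"summarization": 0.82, "qa": 0.74, "energy_efficiency": 0.40},
    "model_b": {"summarization": 0.78, "qa": 0.80, "energy_efficiency": 0.65},
    "model_c": {"summarization": 0.70, "qa": 0.69, "energy_efficiency": 0.90},
}


def leaders(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return the best-scoring model for each criterion."""
    criteria = next(iter(scores.values())).keys()
    return {c: max(scores, key=lambda m: scores[m][c]) for c in criteria}


def dominant_model(scores: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return the model that leads every criterion, or None if there is none."""
    winners = set(leaders(scores).values())
    return winners.pop() if len(winners) == 1 else None


def rank_by(scores: Dict[str, Dict[str, float]], criteria: List[str]) -> List[str]:
    """Rank models (best first) by the unweighted mean over the chosen criteria."""
    def mean_score(model: str) -> float:
        return sum(scores[model][c] for c in criteria) / len(criteria)
    return sorted(scores, key=mean_score, reverse=True)


performance_only = rank_by(SCORES, ["summarization", "qa"])
with_governance = rank_by(SCORES, ["summarization", "qa", "energy_efficiency"])
print(leaders(SCORES))
print(dominant_model(SCORES))
print(performance_only, with_governance)
```

With these made-up numbers each criterion has a different leader, so no model dominates, and adding the energy criterion changes which model comes first.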

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agencies might combine the benchmark scores into a single weighted index tailored to their own risk tolerance on energy or transparency.
The finding that size is a poor predictor suggests testing smaller, specialized models could yield better cost-performance trade-offs in constrained public-sector budgets.
If the datasets prove representative, the same dual-criteria approach could be adapted for other languages or other regulated domains such as healthcare or legal services.
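
The weighted index mentioned in the first extension could look roughly like the sketch below. It is an editorial illustration and not something the paper defines: the criterion names, the raw values, the two weight profiles, and the choice of min-max normalization are all assumptions.

```python
from typing import Dict, Iterable

Scores = Dict[str, Dict[str, float]]  # criterion -> model -> raw value


def min_max(values: Dict[str, float], higher_is_better: bool) -> Dict[str, float]:
    """Rescale one criterion to [0, 1] so that 1 is always the best model."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # all models tie on this criterion
        return {model: 1.0 for model in values}
    scaled = {model: (v - lo) / (hi - lo) for model, v in values.items()}
    if higher_is_better:
        return scaled
    return {model: 1.0 - s for model, s in scaled.items()}


def weighted_index(raw: Scores, weights: Dict[str, float],
                   lower_is_better: Iterable[str] = ()) -> Dict[str, float]:
    """Weighted mean of normalized criteria; weights need not sum to 1."""
    lower = set(lower_is_better)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = {c: min_max(raw[c], c not in lower) for c in weights}
    models = next(iter(raw.values())).keys()
    return {m: sum(weights[c] * norm[c][m] for c in weights) / total for m in models}


# Hypothetical raw measurements for three models.
RAW: Scores = {
    "qa_accuracy": {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60},
    "hallucination_rate": {"model_a": 0.20, "model_b": 0.10, "model_c": 0.15},
    "energy_wh_per_query": {"model_a": 4.0, "model_b": 2.0, "model_c": 1.0},
}
LOWER = ("hallucination_rate", "energy_wh_per_query")

# Two hypothetical agency profiles with different risk tolerance.
PERFORMANCE_FIRST = {"qa_accuracy": 0.8, "hallucination_rate": 0.1, "energy_wh_per_query": 0.1}
GOVERNANCE_FIRST = {"qa_accuracy": 0.2, "hallucination_rate": 0.4, "energy_wh_per_query": 0.4}

index_perf = weighted_index(RAW, PERFORMANCE_FIRST, LOWER)
index_gov = weighted_index(RAW, GOVERNANCE_FIRST, LOWER)
print(max(index_perf, key=index_perf.get), max(index_gov, key=index_gov.get))
```

Min-max scaling makes the index relative to the models being compared, so adding or removing a model can change every score; an agency that needs stable numbers over time might prefer fixed reference ranges instead.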

Load-bearing premise

The ten German datasets, including the new gold and silver ones, accurately capture the content and needs of actual public-administration work.

What would settle it

Re-running the full evaluation on a fresh set of public-administration documents drawn from a different German federal agency and checking whether the same models remain top-ranked on the same criteria.
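
One way to quantify "remain top-ranked" is to compare the two rankings with a rank correlation and a top-k overlap. The sketch below assumes rankings without ties and uses hypothetical model names; neither measure is taken from the paper.

```python
from typing import List


def spearman_rho(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same models."""
    if len(ranking_a) < 2 or len(set(ranking_a)) != len(ranking_a):
        raise ValueError("need at least two distinct models")
    if set(ranking_a) != set(ranking_b) or len(ranking_a) != len(ranking_b):
        raise ValueError("both rankings must contain the same models")
    position_b = {model: i for i, model in enumerate(ranking_b)}
    n = len(ranking_a)
    d_squared = sum((i - position_b[model]) ** 2 for i, model in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def top_k_overlap(ranking_a: List[str], ranking_b: List[str], k: int) -> float:
    """Share of the top-k models that both rankings have in common."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


# Hypothetical rankings (best first) on the original and on fresh documents.
original = ["model_a", "model_b", "model_c", "model_d", "model_e"]
fresh = ["model_b", "model_a", "model_c", "model_e", "model_d"]

rho = spearman_rho(original, fresh)
print(round(rho, 3), top_k_overlap(original, fresh, 2), top_k_overlap(original, fresh, 1))
```

In this example the overall ordering is largely preserved and the top two models are the same, yet the single best model differs, which is why a replication should report more than one stability measure.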

Original abstract

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and knowledge about positions by German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MÖVE gives a practical German public-sector LLM benchmark but its new datasets need external validation to back the ranking claims.

The main takeaway is that this paper creates MÖVE, a benchmark that tests 39 LLMs on German public-administration tasks while also scoring governance factors like energy use, hallucination, transparency, and alignment with German constitutional values and party positions.

It does a few things cleanly. The combination of performance metrics (summarization, QA, topic extraction) with those governance criteria is a useful framing for the sector. They run a multi-metric setup that mixes standard NLP scores, embeddings, and LLM judges, then test the benchmark itself for statistical precision, judge reliability, prompt sensitivity, and the effect of their private datasets. Making results public and labeling it a living benchmark is straightforward and usable.
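
For readers unfamiliar with what a statistical-precision check can involve, a percentile bootstrap over per-item scores is one common option. The abstract does not say which procedure the authors use, so the sketch below is only an assumption-laden illustration with hypothetical scores.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the items with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-item scores of one model on one dataset.
item_scores = [0.9, 0.7, 0.8, 0.6, 1.0, 0.4, 0.8, 0.7, 0.9, 0.5,
               0.6, 0.8, 0.7, 0.9, 0.3, 0.8, 0.7, 0.6, 0.9, 0.8]

low, high = bootstrap_ci(item_scores)
print(f"mean={mean(item_scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If two models' intervals overlap substantially on a dataset, a difference in their leaderboard position on that dataset should be read with caution.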

The soft spot sits with the ten German datasets, especially the new gold- and silver-standard ones. The abstract states they were constructed to reflect public-administration domains, yet supplies no external validation from practitioners, no distributional check against real administrative corpora, and no detail on how domain-specific features like legal phrasing or bureaucratic constraints were captured. The other robustness checks they performed assume the data already match the target setting; they do not test that assumption. If the items lean too generic, the headline result—no single model dominates and size is a poor predictor—could be an artifact of the test items.

This is for researchers or administrators focused on German or similar regulated public-sector LLM use. A reader looking for domain-specific evaluation tools will find the framing and the scale of the model comparison helpful.

It deserves peer review because the gap is real and the work is transparent enough to be critiqued on the data-construction side.

Referee Report

2 major / 2 minor

Summary. The paper presents MÖVE, a holistic benchmark evaluating 39 LLMs for the German public sector across performance criteria (summarization, question answering, topic extraction) and governance criteria (hallucination, energy consumption, provider transparency, alignment with German constitutional values and political party positions). It uses ten German-language datasets, including newly constructed gold- and silver-standard sets intended to reflect public-administration domains, and applies a multi-metric strategy with classical NLP metrics, embeddings, and LLM-as-a-judge. Results indicate no single model dominates across criteria and that model size is a poor predictor of quality; the work also includes self-evaluation of benchmark properties such as statistical precision, judge reliability, prompt sensitivity, and private-dataset impact, and positions MÖVE as a living benchmark with public results.

Significance. If the datasets are representative, the multi-dimensional evaluation and finding that performance and governance criteria produce different top models would provide actionable guidance for German public-sector LLM selection, moving beyond English-centric or performance-only benchmarks. The public leaderboard, energy estimates, and constitutional-alignment checks add practical value; the living-benchmark design and self-evaluation checks are also positive features.

major comments (2)

[dataset construction and evaluation sections] The manuscript states that the newly constructed gold- and silver-standard datasets were 'constructed to reflect public-administration domains,' yet supplies no external validation (practitioner review, distributional comparison to authentic Verwaltungsrecht or administrative corpora, or coverage analysis of domain-specific features such as formal legal phrasing and bureaucratic constraints). This directly affects the robustness of the central claim that task-wise rank reversals and the decoupling of model size from quality reflect real deployment rather than benchmark artifacts.
[self-evaluation of the benchmark] The additional benchmark-validity checks (LLM-judge reliability, prompt sensitivity, private-dataset impact) presuppose that the underlying ten datasets already capture the target domain; they do not test that presupposition. Without domain-representativeness evidence, these checks cannot fully substantiate the generalizability of the reported rankings.
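
As an illustration of the kind of distributional comparison requested in the first major comment, the sketch below computes the Jensen-Shannon divergence between unigram distributions of two text collections. The tokenizer, the choice of unigrams, and the three German sample sentences are hypothetical; a real check would use the benchmark items and an authentic administrative corpus.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def unigram_distribution(texts: List[str]) -> Dict[str, float]:
    """Relative frequency of lowercased word tokens across all texts."""
    counts = Counter(tok for text in texts for tok in re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("texts contain no tokens")
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    mixture = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl_to_mixture(dist: Dict[str, float]) -> float:
        return sum(prob * math.log2(prob / mixture[t]) for t, prob in dist.items() if prob > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical stand-ins for benchmark items, an administrative corpus, and generic text.
benchmark = unigram_distribution(
    ["Der Antrag ist schriftlich bei der zuständigen Behörde einzureichen."])
administrative = unigram_distribution(
    ["Der Bescheid ergeht schriftlich durch die zuständige Behörde."])
generic = unigram_distribution(["Das Wetter ist heute schön und sonnig."])

print(js_divergence(benchmark, administrative), js_divergence(benchmark, generic))
```

A lower divergence to the administrative reference than to generic text would be weak evidence of lexical fit only; it says nothing about task realism or about legal and procedural constraints.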

minor comments (2)

Notation for the ten datasets and the distinction between gold- and silver-standard items could be clarified with an explicit table listing sources, sizes, and construction procedures.
The abstract and introduction use 'M"OVE' with escaped umlauts; consistent rendering of the acronym and German terms throughout would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on dataset construction and the scope of the benchmark self-evaluations. We respond to each major comment below and will revise the manuscript accordingly.

Referee: [dataset construction and evaluation sections] The manuscript states that the newly constructed gold- and silver-standard datasets were 'constructed to reflect public-administration domains,' yet supplies no external validation (practitioner review, distributional comparison to authentic Verwaltungsrecht or administrative corpora, or coverage analysis of domain-specific features such as formal legal phrasing and bureaucratic constraints). This directly affects the robustness of the central claim that task-wise rank reversals and the decoupling of model size from quality reflect real deployment rather than benchmark artifacts.

Authors: We agree that the manuscript provides no external validation of the newly constructed datasets. Construction relied on internal expertise and selection of texts exhibiting administrative characteristics, but no practitioner review, distributional comparisons, or systematic coverage analysis of features such as formal legal phrasing was performed. This is a genuine limitation that weakens claims linking observed rank reversals and size-quality decoupling directly to real-world deployment. In the revised manuscript we will (i) expand the dataset-construction subsection with a more detailed account of the internal process and concrete examples of incorporated features, (ii) add an explicit limitations paragraph stating the absence of external validation, and (iii) qualify all statements about real-deployment implications to refer only to the evaluated datasets.
revision: yes

Referee: [self-evaluation of the benchmark] The additional benchmark-validity checks (LLM-judge reliability, prompt sensitivity, private-dataset impact) presuppose that the underlying ten datasets already capture the target domain; they do not test that presupposition. Without domain-representativeness evidence, these checks cannot fully substantiate the generalizability of the reported rankings.

Authors: We agree that the self-evaluation checks address internal properties of the evaluation pipeline but do not test the domain-representativeness assumption. In the revision we will (i) explicitly delineate the scope of these checks as methodological robustness measures rather than domain-validation evidence, (ii) cross-reference the new limitations paragraph on dataset construction, and (iii) adjust the discussion of ranking generalizability to reflect that the reported results apply to the specific datasets employed.
revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation or self-referential reduction

The paper is a direct empirical benchmark of 39 LLMs on ten German datasets (including author-constructed gold/silver standards) using standard NLP metrics, embeddings, and LLM judges. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented; the central claim that no model dominates and size is a poor predictor follows from tabulated performance numbers on the chosen tasks. The construction of datasets to 'reflect public-administration domains' is an input assumption whose validity is external to any derivation chain, and no self-citation is invoked as load-bearing evidence for the reported rankings. The work therefore contains no circular steps of the enumerated kinds.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central contribution rests on the assumption that the constructed datasets and chosen governance criteria validly represent public-administration needs; no free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

146 extracted references · 37 canonical work pages · 1 internal anchor

[1]

Künstliche intelligenz in der öffentlichen verwaltung.Digitalakademie@ BW & Fraunhofer IAO, Stuttgart, pages 11–12, 2020

Jan Etscheid, Jörn von Lucke, and Felix Stroh. Künstliche intelligenz in der öffentlichen verwaltung.Digitalakademie@ BW & Fraunhofer IAO, Stuttgart, pages 11–12, 2020

2020
[2]

Öffentlicher Dienst 2024: Mehr Beschäftigte für Bildung und Kinderbetreuung

Statistisches Bundesamt (Destatis). Öffentlicher Dienst 2024: Mehr Beschäftigte für Bildung und Kinderbetreuung. https://www.destatis.de/DE/Presse/ Pressemitteilungen/2025/06/PD25_212_741.html, 2025. Pressemitteilung Nr. 212 vom 23. Juni 2025. Accessed: 2026-05-07

2024
[3]

Wie öffentliche Institutionen ihre Beschäftigten strategisch binden sollten

PwC Deutschland. Wie öffentliche Institutionen ihre Beschäftigten strategisch binden sollten. https://blogs. pwc.de/de/oeffentlicher-sektor-zukunft-gestalten/article/252701/ wie-oeffentliche-institutionen-ihre-beschaeftigten-strategisch-binden-sollten/, January 2026. Blog post, Öffentlicher Sektor – Zukunft gestalten. Accessed: 2026-05-07

2026
[4]

Measuring Massive Multitask Lan- guage Understanding, January 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Lan- guage Understanding, January 2021. URL http://arxiv.org/abs/2009.03300. arXiv:2009.03300 [cs]

Pith/arXiv arXiv 2021
[5]

Beyond the Imitation Game: Quantifying and extrap- olating the capabilities of language models, June 2023

Aarohi Srivastava et al. Beyond the Imitation Game: Quantifying and extrap- olating the capabilities of language models, June 2023. URL http://arxiv.org/ abs/2206.04615. arXiv:2206.04615 [cs]. 80

Pith/arXiv arXiv 2023
[6]

GLUE : A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors,Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brus...

work page doi:10.18653/v1/w18-5446 2018
[7]

Bowman.SuperGLUE: a stick- ier benchmark for general-purpose language understanding systems

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.SuperGLUE: a stick- ier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA, 2019

2019
[8]

Star-sql: Self-taught reasoner for text-to-sql

ShivalikaSinghetal. GlobalMMLU:Understandingandaddressingculturaland linguistic biases in multilingual evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria, July ...

work page doi:10.18653/v1/2025 2025
[9]

Towards multilingual llm eval- uation for european languages, 2024

Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, and Mehdi Ali. Towards multilingual llm eval- uation for european languages, 2024. URL https://arxiv.org/abs/2410.08928. preprint

arXiv 2024
[10]

Holistic Evaluation of Language Models, October 2023

Percy Liang et al. Holistic Evaluation of Language Models, October 2023. URL http://arxiv.org/abs/2211.09110. arXiv:2211.09110 [cs]

Pith/arXiv arXiv 2023
[11]

NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Compu- tational Linguistics: EMNLP 2023, pages 10776–10787, Singapore, December

2023
[12]

doi: 10.18653/v1/2023

Association for Computational Linguistics. doi: 10.18653/v1/2023. findings-emnlp.722. URL https://aclanthology.org/2023.findings-emnlp.722/

work page doi:10.18653/v1/2023 2023
[13]

Generalization or memorization: Data contamination and trustworthy evaluation for large language models

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 12039–12050, Bangkok, Thailand, August 20...

work page doi:10.18653/v1/2024.findings-acl.716 2024
[14]

Healthy llms? benchmarking llm knowledge of uk government public health information, 2025

Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Non- nenmacher, Oliver Higgins, Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, and Michael Borowitz. Healthy llms? benchmarking llm knowledge of uk government public health information, 2025. URL https://arxiv.org/abs/ 2505.06046. preprint. 81

arXiv 2025
[15]

The citizenquery benchmark: A novel dataset and evaluation pipeline for measuring llm performance in citizen query tasks, 2026

Neil Majithia, Rajat Shinde, Zo Chapman, Prajun Trital, Jordan Decker, Manil Maskey, Elena Simperl, and Nigel Shadbolt. The citizenquery benchmark: A novel dataset and evaluation pipeline for measuring llm performance in citizen query tasks, 2026. URL https://arxiv.org/abs/2602.04064

arXiv 2026
[16]

The evaluation framework and benchmark for large language models in the government affairs domain.ACM Trans

Shuo Liu, Lin Zhang, Weidong Liu, Jianfeng Zhang, Donghui Gao, and Xiaofeng Jia. The evaluation framework and benchmark for large language models in the government affairs domain.ACM Trans. Intell. Syst. Technol., 16(6), November
[17]

doi: 10.1145/3716854

ISSN 2157-6904. doi: 10.1145/3716854. URL https://doi.org/10.1145/ 3716854

work page doi:10.1145/3716854
[18]

Agent benchmarks fail public sector requirements, 2026

Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, and Chris Russell. Agent benchmarks fail public sector requirements, 2026. URL https: //arxiv.org/abs/2601.20617

arXiv 2026
[19]

Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down har- monised rules on artificial intelligence (AI Act), 2024

European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down har- monised rules on artificial intelligence (AI Act), 2024. URL https://eur-lex. europa.eu/eli/reg/2024/1689/oj

2024
[20]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InProceedings of the 31st International Conference on Neural Information Pro- cessing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

2017
[21]

BERT: pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805

Pith/arXiv arXiv 2018
[22]

Brown et al

Tom B. Brown et al. Language models are few-shot learners. InProceedings of the 34th International Conference on Neural Information Processing Sys- tems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

2020
[23]

SWAG: A large- scale adversarial dataset for grounded commonsense inference

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large- scale adversarial dataset for grounded commonsense inference. In Ellen Riloff, DavidChiang,JuliaHockenmaier,andJun’ichiTsujii,editors,Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium, October-November 2018. Assoc...

work page doi:10.18653/v1/d18-1009 2018
[24]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguisti...

work page doi:10.18653/v1/p19-1472 2019
[25]

Winogrande: an adversarial winograd schema challenge at scale.Commun

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale.Commun. ACM, 64(9):99–106, August 2021. ISSN 0001-0782. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381. 82

work page doi:10.1145/3474381 2021
[26]

CLUE: A Chinese language understanding evaluation benchmark

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Ri...

work page doi:10.18653/v1/2020.coling-main.419 2020
[27]

Mmlu-pro: a more robust and challenging multi-task language understanding benchmark

Yubo Wang et al. Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. InProceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA,
[28]

ISBN 9798331314385

Curran Associates Inc. ISBN 9798331314385
[29]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

2024
[30]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Health- bench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/2505.08775

Pith/arXiv arXiv 2025
[31]

Bennett, Daniel Hoyer, Pieter Francois, Peter Turchin, and R

Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Fed- erica Villa, James S. Bennett, Daniel Hoyer, Pieter Francois, Peter Turchin, and R. Maria del Rio-Chanona. Large language models’ expert-level global history knowledge benchmark (hist-llm). InProceedings of the 38th International Con- ference on Neural Information Processing Syste...

2024
[32]

Detecting linguistic bias in government documents using large language models

Milena de Swart, Floris Den Hengst, and Jieying Chen. Detecting linguistic bias in government documents using large language models. InProceedings of the ACM on Web Conference 2025, WWW ’25, page 5034–5044, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400712746. doi: 10.1145/3696410.3714526. URL https://doi.org/10.1145/3696410.3714526

work page doi:10.1145/3696410.3714526 2025
[33]

I roko B ench: A New Benchmark for A frican Languages in the Age of Large Language Models

David Ifeoluwa Adelani et al. IrokoBench: A new benchmark for African languages in the age of large language models. In Luis Chiruzzo, Alan Rit- ter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 273...

work page doi:10.18653/v1/2025.naacl-long.139 2025
[34]

SEA - HELM : S outheast A sian Holistic Evaluation of Language Models

Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, 83 Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. SEA-HELM: Southeast Asian holistic evaluation of language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors...

work page doi:10.18653/v1/2025.findings-acl.636 2025
[35]

Should we respect LLMs? a cross-lingual study on the influence of prompt politeness on LLM performance

Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, and Satoshi Sekine. Should we respect LLMs? a cross-lingual study on the influence of prompt politeness on LLM performance. In James Hale, Kushal Chawla, and Muskan Garg, editors,Proceedings of the Second Workshop on Social Influence in Con- versations (SICon 2024), pages 9–35, Miami, Florida, USA, Novemb...

work page doi:10.18653/v1/2024.sicon-1.2 2024
[36]

TurkishMMLU: Measuring massive multitask language under- standing in Turkish

Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Senel, Anna Korhonen, and Hinrich Schuetze. TurkishMMLU: Measuring massive multitask language under- standing in Turkish. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7035–7055, Miami, Florida, USA, November 2024. Assoc...

work page doi:10.18653/v1/2024.findings-emnlp.413 2024
[37]

KMMLU: Measuring massive multitask language understanding in Korean

Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, TaekyoonChoi,CheonbokPark,KangMinYoo,andStellaBiderman. KMMLU: Measuring massive multitask language understanding in Korean. In Luis Chiruzzo,AlanRitter,andLuWang,editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Li...

work page doi:10.18653/v1/2025.naacl-long 2025
[38]

URL https://aclanthology.org/2025.naacl-long.206/

2025
[39]

On the measure of intelligence, 2019

François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/ abs/1911.01547

Pith/arXiv arXiv 2019
[40]

Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025. URL https://arxiv.org/abs/2505.11831

Pith/arXiv arXiv 2025
[41]

In benchmarks we trust

Ine Gevers, Victor De Marez, Jens Van Nooten, Jens Lemmens, Andriy Kosar, Ehsan Lotfi, Nikolay Banar, Pieter Fivez, Luna De Bruyne, and Walter Daele- mans. In benchmarks we trust ... or not? In Christos Christodoulopoulos, TanmoyChakraborty,CarolynRose,andVioletPeng,editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proces...

work page doi:10.18653/v1/2025.emnlp-main.1208 2025
[42]

Bender, Alex Hanna, and Amandalynne Paullada

Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, and Amandalynne Paullada. Ai and the everything in the whole wide world benchmark. In J. Van- schoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

2021
[43]

Bean et al

Andrew M. Bean et al. Measuring what matters: Construct validity in large language model benchmarks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=mdA5lVvNcU

2025
[44]

Retrieval Augmentation Reduces Hallucination in Conversation,

Samuel R. Bowman and George Dahl. What will it take to fix benchmarking in natural language understanding? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cot- terell, Tanmoy Chakraborty, and Yichao Zhou, editors,Proceedings of the 2021 Conference of the North American Chapter of the Association...

work page doi:10.18653/v1/2021 2021
[46]

Chatterji, Faisal Ladhak, and Tat- sunori Hashimoto

Yonatan Oren, Nicole Meister, Niladri S. Chatterji, Faisal Ladhak, and Tat- sunori Hashimoto. Provingtest set contamination inblack-boxlanguage models. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KS8mIvetg2

2024
[47]

Benchmarking is broken – don’t let ai be its own judge, 2025

Zerui Cheng et al. Benchmarking is broken – don’t let ai be its own judge, 2025. URL https://arxiv.org/abs/2510.07575

arXiv 2025
[48]

Betterbench: Assessing AI benchmarks, uncovering issues,andestablishingbestpractices

Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel Kochenderfer. Betterbench: Assessing AI benchmarks, uncovering issues,andestablishingbestpractices. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=hcOq2buakM

2024
[49]

A trainable document sum- marizer

Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document sum- marizer. InProceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 68–73, 1995

1995
[50]

Generic text summarization using relevance measure and latent semantic analysis

Yihong Gong and Xin Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’01, pages 19–25, New York, NY, USA, September 2001. Association for Computing Machinery. ISBN 978-1-58113-331-8. doi: 10.1...

work page doi:10.1145/383952.383955 2001
[51]

Neural Summarization by Extracting Sentences and Words, July 2016

Jianpeng Cheng and Mirella Lapata. Neural Summarization by Extracting Sentences and Words, July 2016. URL http://arxiv.org/abs/1603.07252. arXiv:1603.07252 [cs]

Pith/arXiv arXiv 2016
[52]

Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107, 2002

Kevin Knight and Daniel Marcu. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107, 2002

2002
[53]

Rush, Sumit Chopra, and Jason Weston

Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization, September 2015. URL http://arxiv.org/abs/1509.00685. arXiv:1509.00685 [cs]

Pith/arXiv arXiv 2015
[54]

Liu, and Christopher D

Abigail See, Peter J. Liu, and Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks, April 2017. URL http://arxiv.org/abs/1704.04368. arXiv:1704.04368 [cs]

Pith/arXiv arXiv 2017
[55]

ACM Comput

Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics. ACM Computing Surveys, 55(8):1–35, August 2023. ISSN 0360-0300, 1557-7341. doi: 10.1145/3545176

work page doi:10.1145/3545176 2023
[56]

A comprehensive survey on automatic text summarization with exploration of llm-based methods. Neurocomputing, page 131928, 2025

Yang Zhang, Hanlei Jin, Dan Meng, Jun Wang, and Jinghua Tan. A comprehensive survey on automatic text summarization with exploration of llm-based methods. Neurocomputing, page 131928, 2025

2025
[57]

Abstractive text summarization using sequence-to-sequence rnns and beyond

Ramesh Nallapati, Bowen Zhou, Cicero Dos Santos, Çağlar Gulçehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL conference on computational natural language learning, pages 280–290, 2016

2016
[58]

Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. pages 1797–1807, 2018

2018
[59]

Efficient attentions for long document summarization

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. pages 1419–1436, 2021

2021
[60]

BillSum: A corpus for automatic summarization of US legislation

Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. pages 48–56, 2019

2019
[61]

Eur-lex-sum: A multi- and cross-lingual dataset for long-form summarization in the legal domain. arXiv preprint arXiv:2210.13448, 2022

Dennis Aumiller, Ashish Chouhan, and Michael Gertz. Eur-lex-sum: A multi- and cross-lingual dataset for long-form summarization in the legal domain. arXiv preprint arXiv:2210.13448, 2022

arXiv 2022
[62]

Ho, and Joel Niklaus

Vishvaksenan Rasiah, Ronja Stern, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho, and Joel Niklaus. Scale: Scaling up the complexity for advanced language model evaluation, 2023

2023
[63]

ROUGE: A Package for Automatic Evaluation of Summaries

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/
[65]

BLEU: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02, page 311, Philadelphia, Pennsylvania, 2001. Association for Computational Linguistics. doi: 10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2001
[66]

Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409, 2021

2021
[67]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT, February 2020. URL http://arxiv.org/abs/1904.09675. arXiv:1904.09675 [cs]

Pith/arXiv arXiv 2020
[68]

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity, January 2024

Ansar Aynetdinov and Alan Akbik. SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity, January 2024. URL http://arxiv.org/abs/2401.17072. arXiv:2401.17072 [cs]

arXiv 2024
[69]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, December 2023. URL http://arxiv.org/abs/2306.05685. arXiv:2306.05685 [cs]

Pith/arXiv arXiv 2023
[70]

G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

Pith/arXiv arXiv 2023
[71]

Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024

Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024

Pith/arXiv arXiv 2024
[72]

News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022

Tanya Goyal, Junyi Jessy Li, and Greg Durrett. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022

arXiv 2022
[73]

Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57, 2024

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57, 2024. ISSN 2307-387X

2024
[74]

QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

Anna Rogers, Matt Gardner, and Isabelle Augenstein. QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension. ACM Computing Surveys, 55:1–45, 2023. doi: 10.1145/3560260

work page doi:10.1145/3560260 2023
[75]

Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

Penghao Zhu, Zhiwei Lin, Zijian Wang, Xiaodan Liang, et al. Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

Pith/arXiv arXiv 2024
[76]

Teaching machines to read and comprehend

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28, 2015

2015
[78]

Reading Wikipedia to answer open-domain questions

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics, 2017.

2017
[79]

Latent retrieval for weakly supervised open domain question answering

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–
[81]

TruthfulQA: Measuring How Models Mimic Human Falsehoods, May 2022

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods, May 2022. URL http://arxiv.org/abs/2109.07958. arXiv:2109.07958 [cs]

Pith/arXiv arXiv 2022
[82]

Know what you don’t know: Unanswerable questions for SQuAD

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789. Association for Computational Linguistics, 2018

2018
[83]

Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019


[1] [1]

Künstliche Intelligenz in der öffentlichen Verwaltung. Digitalakademie@BW & Fraunhofer IAO, Stuttgart, pages 11–12, 2020

Jan Etscheid, Jörn von Lucke, and Felix Stroh. Künstliche Intelligenz in der öffentlichen Verwaltung. Digitalakademie@BW & Fraunhofer IAO, Stuttgart, pages 11–12, 2020

2020

[2] [2]

Öffentlicher Dienst 2024: Mehr Beschäftigte für Bildung und Kinderbetreuung

Statistisches Bundesamt (Destatis). Öffentlicher Dienst 2024: Mehr Beschäftigte für Bildung und Kinderbetreuung. https://www.destatis.de/DE/Presse/Pressemitteilungen/2025/06/PD25_212_741.html, 2025. Press release no. 212 of 23 June 2025. Accessed: 2026-05-07

2024

[3] [3]

Wie öffentliche Institutionen ihre Beschäftigten strategisch binden sollten

PwC Deutschland. Wie öffentliche Institutionen ihre Beschäftigten strategisch binden sollten. https://blogs.pwc.de/de/oeffentlicher-sektor-zukunft-gestalten/article/252701/wie-oeffentliche-institutionen-ihre-beschaeftigten-strategisch-binden-sollten/, January 2026. Blog post, Öffentlicher Sektor – Zukunft gestalten. Accessed: 2026-05-07

2026

[4] [4]

Measuring Massive Multitask Language Understanding, January 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding, January 2021. URL http://arxiv.org/abs/2009.03300. arXiv:2009.03300 [cs]

Pith/arXiv arXiv 2021

[5] [5]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, June 2023

Aarohi Srivastava et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, June 2023. URL http://arxiv.org/abs/2206.04615. arXiv:2206.04615 [cs]

Pith/arXiv arXiv 2023

[6] [6]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brus...

work page doi:10.18653/v1/w18-5446 2018

[7] [7]

SuperGLUE: a stickier benchmark for general-purpose language understanding systems

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA, 2019

2019

[8] [8]

Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation

Shivalika Singh et al. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria, July ...

work page doi:10.18653/v1/2025 2025

[9] [9]

Towards multilingual llm evaluation for european languages, 2024

Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, and Mehdi Ali. Towards multilingual llm evaluation for european languages, 2024. URL https://arxiv.org/abs/2410.08928. preprint

arXiv 2024

[10] [10]

Holistic Evaluation of Language Models, October 2023

Percy Liang et al. Holistic Evaluation of Language Models, October 2023. URL http://arxiv.org/abs/2211.09110. arXiv:2211.09110 [cs]

Pith/arXiv arXiv 2023

[11] [11]

NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.722. URL https://aclanthology.org/2023.findings-emnlp.722/

work page doi:10.18653/v1/2023 2023

[13] [13]

Generalization or memorization: Data contamination and trustworthy evaluation for large language models

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 12039–12050, Bangkok, Thailand, August 20...

work page doi:10.18653/v1/2024.findings-acl.716 2024

[14] [14]

Healthy llms? benchmarking llm knowledge of uk government public health information, 2025

Joshua Harris, Fan Grayson, Felix Feldman, Timothy Laurence, Toby Nonnenmacher, Oliver Higgins, Leo Loman, Selina Patel, Thomas Finnie, Samuel Collins, and Michael Borowitz. Healthy llms? benchmarking llm knowledge of uk government public health information, 2025. URL https://arxiv.org/abs/2505.06046. preprint.

arXiv 2025

[15] [15]

The citizenquery benchmark: A novel dataset and evaluation pipeline for measuring llm performance in citizen query tasks, 2026

Neil Majithia, Rajat Shinde, Zo Chapman, Prajun Trital, Jordan Decker, Manil Maskey, Elena Simperl, and Nigel Shadbolt. The citizenquery benchmark: A novel dataset and evaluation pipeline for measuring llm performance in citizen query tasks, 2026. URL https://arxiv.org/abs/2602.04064

arXiv 2026

[16] [16]

The evaluation framework and benchmark for large language models in the government affairs domain

Shuo Liu, Lin Zhang, Weidong Liu, Jianfeng Zhang, Donghui Gao, and Xiaofeng Jia. The evaluation framework and benchmark for large language models in the government affairs domain. ACM Trans. Intell. Syst. Technol., 16(6), November 2025. ISSN 2157-6904. doi: 10.1145/3716854. URL https://doi.org/10.1145/3716854

work page doi:10.1145/3716854

[18] [18]

Agent benchmarks fail public sector requirements, 2026

Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, and Chris Russell. Agent benchmarks fail public sector requirements, 2026. URL https://arxiv.org/abs/2601.20617

arXiv 2026

[19] [19]

Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (AI Act), 2024

European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (AI Act), 2024. URL https://eur-lex.europa.eu/eli/reg/2024/1689/oj

2024

[20] [20]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

2017

[21] [21]

BERT: pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805

Pith/arXiv arXiv 2018

[22] [22]

Brown et al

Tom B. Brown et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

2020

[23] [23]

SWAG: A large-scale adversarial dataset for grounded commonsense inference

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium, October-November 2018. Assoc...

work page doi:10.18653/v1/d18-1009 2018

[24] [24]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguisti...

work page doi:10.18653/v1/p19-1472 2019

[25] [25]

Winogrande: an adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021. ISSN 0001-0782. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381.

work page doi:10.1145/3474381 2021

[26] [26]

CLUE: A Chinese language understanding evaluation benchmark

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Ri...

work page doi:10.18653/v1/2020.coling-main.419 2020

[27] [27]

Mmlu-pro: a more robust and challenging multi-task language understanding benchmark

Yubo Wang et al. Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9798331314385

[29] [29]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

2024

[30] [30]

Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/2505.08775

Pith/arXiv arXiv 2025

[31] [31]

Bennett, Daniel Hoyer, Pieter Francois, Peter Turchin, and R

Jakob Hauser, Daniel Kondor, Jenny Reddish, Majid Benam, Enrico Cioni, Federica Villa, James S. Bennett, Daniel Hoyer, Pieter Francois, Peter Turchin, and R. Maria del Rio-Chanona. Large language models’ expert-level global history knowledge benchmark (hist-llm). In Proceedings of the 38th International Conference on Neural Information Processing Syste...

2024

[32] [32]

Detecting linguistic bias in government documents using large language models

Milena de Swart, Floris Den Hengst, and Jieying Chen. Detecting linguistic bias in government documents using large language models. In Proceedings of the ACM on Web Conference 2025, WWW ’25, page 5034–5044, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400712746. doi: 10.1145/3696410.3714526. URL https://doi.org/10.1145/3696410.3714526

work page doi:10.1145/3696410.3714526 2025

[33] [33]

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

David Ifeoluwa Adelani et al. IrokoBench: A new benchmark for African languages in the age of large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 273...

work page doi:10.18653/v1/2025.naacl-long.139 2025

[34] [34]

SEA-HELM: Southeast Asian Holistic Evaluation of Language Models

Yosephine Susanto, Adithya Venkatadri Hulagadri, Jann Railey Montalan, Jian Gang Ngui, Xianbin Yong, Wei Qi Leong, Hamsawardhini Rengarajan, Peerat Limkonchotiwat, Yifan Mai, and William Chandra Tjhi. SEA-HELM: Southeast Asian holistic evaluation of language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors...

work page doi:10.18653/v1/2025.findings-acl.636 2025

[35] [35]

Should we respect LLMs? a cross-lingual study on the influence of prompt politeness on LLM performance

Ziqi Yin, Hao Wang, Kaito Horio, Daisuke Kawahara, and Satoshi Sekine. Should we respect LLMs? a cross-lingual study on the influence of prompt politeness on LLM performance. In James Hale, Kushal Chawla, and Muskan Garg, editors, Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024), pages 9–35, Miami, Florida, USA, Novemb...

work page doi:10.18653/v1/2024.sicon-1.2 2024

[36] [36]

TurkishMMLU: Measuring massive multitask language understanding in Turkish

Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Senel, Anna Korhonen, and Hinrich Schuetze. TurkishMMLU: Measuring massive multitask language understanding in Turkish. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7035–7055, Miami, Florida, USA, November 2024. Assoc...

work page doi:10.18653/v1/2024.findings-emnlp.413 2024

[37] [37]

KMMLU: Measuring massive multitask language understanding in Korean

Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, and Stella Biderman. KMMLU: Measuring massive multitask language understanding in Korean. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Li...

work page doi:10.18653/v1/2025.naacl-long 2025

[38] [38]

URL https://aclanthology.org/2025.naacl-long.206/

2025

[39] [39]

On the measure of intelligence, 2019

François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/1911.01547

Pith/arXiv arXiv 2019

[40] [40]

Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025. URL https://arxiv.org/abs/2505.11831

Pith/arXiv arXiv 2025

[41] [41]

In benchmarks we trust

Ine Gevers, Victor De Marez, Jens Van Nooten, Jens Lemmens, Andriy Kosar, Ehsan Lotfi, Nikolay Banar, Pieter Fivez, Luna De Bruyne, and Walter Daelemans. In benchmarks we trust ... or not? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proces...

work page doi:10.18653/v1/2025.emnlp-main.1208 2025

[42] [42]

Bender, Alex Hanna, and Amandalynne Paullada

Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, and Amandalynne Paullada. Ai and the everything in the whole wide world benchmark. In J. Van- schoren and S. Yeung, editors,Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021

2021

[43] [43]

Bean et al

Andrew M. Bean et al. Measuring what matters: Construct validity in large language model benchmarks. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=mdA5lVvNcU

2025

[44] [44]

Retrieval Augmentation Reduces Hallucination in Conversation,

Samuel R. Bowman and George Dahl. What will it take to fix benchmarking in natural language understanding? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cot- terell, Tanmoy Chakraborty, and Yichao Zhou, editors,Proceedings of the 2021 Conference of the North American Chapter of the Association...

work page doi:10.18653/v1/2021 2021

[45] [46]

Chatterji, Faisal Ladhak, and Tat- sunori Hashimoto

Yonatan Oren, Nicole Meister, Niladri S. Chatterji, Faisal Ladhak, and Tat- sunori Hashimoto. Provingtest set contamination inblack-boxlanguage models. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KS8mIvetg2

2024

[46] [47]

Benchmarking is broken – don’t let ai be its own judge, 2025

Zerui Cheng et al. Benchmarking is broken – don’t let ai be its own judge, 2025. URL https://arxiv.org/abs/2510.07575

arXiv 2025

[47] [48]

Betterbench: Assessing AI benchmarks, uncovering issues,andestablishingbestpractices

Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, and Mykel Kochenderfer. Betterbench: Assessing AI benchmarks, uncovering issues,andestablishingbestpractices. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=hcOq2buakM

2024

[48] [49]

A trainable document sum- marizer

Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document sum- marizer. InProceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 68–73, 1995

1995

[49] [50]

Generic text summarization using relevance measure and latent semantic analysis

Yihong Gong and Xin Liu. Generic text summarization using relevance measure and latent semantic analysis. InProceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’01, pages 19–25, New York, NY, USA, September 2001. Association for Computing Machinery. ISBN 978-1-58113-331-8. doi: 10.1...

work page doi:10.1145/383952.383955 2001

[50] [51]

Neural Summarization by Extract- ing Sentences and Words, July 2016

Jianpeng Cheng and Mirella Lapata. Neural Summarization by Extract- ing Sentences and Words, July 2016. URL http://arxiv.org/abs/1603.07252. 85 arXiv:1603.07252 [cs]

Pith/arXiv arXiv 2016

[51] [52]

Summarization beyond sentence extraction: A probabilistic approach to sentence compression.Artificial Intelligence, 139(1): 91–107, 2002

Kevin Knight and Daniel Marcu. Summarization beyond sentence extraction: A probabilistic approach to sentence compression.Artificial Intelligence, 139(1): 91–107, 2002

2002

[52] [53]

Rush, Sumit Chopra, and Jason Weston

Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization, September 2015. URL http: //arxiv.org/abs/1509.00685. arXiv:1509.00685 [cs]

Pith/arXiv arXiv 2015

[53] [54]

Liu, and Christopher D

Abigail See, Peter J. Liu, and Christopher D. Manning. Get To The Point: Summarization with Pointer-Generator Networks, April 2017. URL http:// arxiv.org/abs/1704.04368. arXiv:1704.04368 [cs]

Pith/arXiv arXiv 2017

[54] [55]

ACM Comput

Huan Yee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics.ACM Com- puting Surveys, 55(8):1–35, August 2023. ISSN 0360-0300, 1557-7341. doi: 10.1145/3545176

work page doi:10.1145/3545176 2023

[55] [56]

A compre- hensive survey on automatic text summarization with exploration of llm-based methods.Neurocomputing, page 131928, 2025

Yang Zhang, Hanlei Jin, Dan Meng, Jun Wang, and Jinghua Tan. A compre- hensive survey on automatic text summarization with exploration of llm-based methods.Neurocomputing, page 131928, 2025

2025

[56] [57]

Abstractive text summarization using sequence-to-sequence rnns and beyond

Ramesh Nallapati, Bowen Zhou, Cicero Dos Santos, Çağlar Gulçehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence rnns and beyond. InProceedings of the 20th SIGNLL conference on computational natural language learning, pages 280–290, 2016

2016

[57] [58]

Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. pages 1797–1807, 2018

2018

[58] [59]

Efficient attentions for long document summarization

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. pages 1419–1436, 2021

2021

[59] [60]

BillSum: A corpus for automatic summarization of US legislation

Anastassia Kornilova and Vladimir Eidelman. BillSum: A corpus for automatic summarization of US legislation. pages 48–56, 2019

2019

[60] [61]

Eur-lex-sum: A multi- andcross-lingualdatasetforlong-formsummarizationinthelegaldomain.arXiv preprint arXiv:2210.13448, 2022

Dennis Aumiller, Ashish Chouhan, and Michael Gertz. Eur-lex-sum: A multi- andcross-lingualdatasetforlong-formsummarizationinthelegaldomain.arXiv preprint arXiv:2210.13448, 2022

arXiv 2022

[61] [62]

Ho, and Joel Niklaus

Vishvaksenan Rasiah, Ronja Stern, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho, and Joel Niklaus. Scale: Scaling up the complexity for advanced language model evaluation, 2023

2023

[62] [63]

ROUGE: A Package for Automatic Evaluation of Summaries

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July

[63] [64]

URL https://aclanthology

Association for Computational Linguistics. URL https://aclanthology. org/W04-1013/

[64] [65]

B leu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02, page 311, Philadelphia, Pennsylvania, 2001. Association for Computational Linguistics. doi: 10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2001

[65] [66]

Summeval: Re-evaluating summarization evaluation

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. Summeval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409, 2021

2021

[66] [67]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT, February 2020. URL http://arxiv.org/abs/1904.09675. arXiv:1904.09675 [cs]

Pith/arXiv arXiv 2020

[67] [68]

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity, January 2024

Ansar Aynetdinov and Alan Akbik. SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity, January 2024. URL http://arxiv.org/abs/2401.17072. arXiv:2401.17072 [cs]

arXiv 2024

[68] [69]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, December 2023. URL http://arxiv.org/abs/2306.05685. arXiv:2306.05685 [cs]

Pith/arXiv arXiv 2023

[69] [70]

G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023

Pith/arXiv arXiv 2023

[70] [71]

Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024

Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024

Pith/arXiv arXiv 2024

[71] [72]

News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022

Tanya Goyal, Junyi Jessy Li, and Greg Durrett. News summarization and evaluation in the era of gpt-3. arXiv preprint arXiv:2209.12356, 2022

arXiv 2022

[72] [73]

Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57, 2024

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57, 2024. ISSN 2307-387X

2024

[73] [74]

QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension

Anna Rogers, Matt Gardner, and Isabelle Augenstein. QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension. ACM Computing Surveys, 55:1–45, 2023. doi: 10.1145/3560260

work page doi:10.1145/3560260 2023

[74] [75]

Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

Penghao Zhu, Zhiwei Lin, Zijian Wang, Xiaodan Liang, et al. Retrieval-augmented generation for AI-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

Pith/arXiv arXiv 2024

[75] [76]

Teaching machines to read and comprehend

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28, 2015

2015

[76] [78]

Reading Wikipedia to answer open-domain questions

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879. Association for Computational Linguistics, 2017.

2017

[77] [79]

Latent retrieval for weakly supervised open domain question answering

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–

[78] [81]

TruthfulQA: Measuring How Models Mimic Human Falsehoods, May 2022

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods, May 2022. URL http://arxiv.org/abs/2109.07958. arXiv:2109.07958 [cs]

Pith/arXiv arXiv 2022

[79] [82]

Know what you don’t know: Unanswerable questions for SQuAD

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789. Association for Computational Linguistics, 2018

2018

[80] [83]

Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

Tom Kwiatkowski et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019