ltzGLUE: Luxembourgish General Language Understanding Evaluation

Alistair Plum; Anne-Marie Lutgen; Barbara Plank; C\'edric Lothritz; Christoph Purschke; Emilia Milano; Felicia K\"orner; Fred Philippy; Laura Bernardy; Nils Rehlinger

arxiv: 2604.17976 · v1 · submitted 2026-04-20 · 💻 cs.CL

ltzGLUE: Luxembourgish General Language Understanding Evaluation

Alistair Plum , Felicia K\"orner , Anne-Marie Lutgen , Laura Bernardy , Fred Philippy , Emilia Milano , Nils Rehlinger , C\'edric Lothritz

show 3 more authors

Tharindu Ranasinghe Barbara Plank Christoph Purschke

This is my paper

Pith reviewed 2026-05-10 04:29 UTC · model grok-4.3

classification 💻 cs.CL

keywords LuxembourgishNLU benchmarkGLUEnamed entity recognitiontopic classificationintent classificationlanguage modelslow-resource languages

0 comments

The pith

ltzGLUE introduces the first benchmark for measuring natural language understanding in Luxembourgish.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors construct a set of tasks covering named entity recognition, topic classification, and intent classification to create a standardized evaluation for Luxembourgish. This matters because the language, an official national tongue in Europe, has lacked any dedicated way to test how well models grasp its structure and meaning. They then run multiple pre-trained encoder models on these tasks to map current performance levels. A sympathetic reader would view the result as a practical starting point for developing AI systems that can reliably process Luxembourgish text.

Core claim

The paper presents ltzGLUE, the first NLU benchmark for Luxembourgish, built by constructing new tasks and reusing existing ones in binary and multi-class classification settings, then evaluates various pre-trained language models to give an overview of current capabilities on the language.

What carries the argument

The ltzGLUE benchmark suite, which adapts GLUE-style tasks to Luxembourgish through new dataset construction for named entity recognition, topic classification, and intent classification.

If this is right

Sets baseline performance scores for pre-trained models on Luxembourgish classification and recognition tasks.
Supplies a common standard that future models for the language can be measured against.
Allows systematic tracking of improvements in Luxembourgish language technology over time.
Identifies specific gaps in how current models handle this under-resourced official language.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same task-construction approach could be applied to create benchmarks for other official languages that currently lack them.
Models that score well on ltzGLUE could be prioritized for practical tools such as automated translation or public information systems in Luxembourg.
Adding tasks beyond classification and recognition would give a more complete view of model strengths and weaknesses on the language.

Load-bearing premise

The constructed tasks serve as valid and representative measures of natural language understanding for Luxembourgish speakers.

What would settle it

A comparison showing that human Luxembourgish speakers achieve accuracy patterns on these tasks that diverge sharply from the error patterns of the tested models would indicate the benchmark may not capture genuine understanding.

read the original abstract

This paper presents ltzGLUE, the first Natural Language Understanding (NLU) benchmark for Luxembourgish (LTZ) based on the popular GLUE benchmark for English. Although NLU tasks are available for many European languages nowadays, LTZ is one of the official national languages that is often overlooked. We construct new tasks and reuse existing ones to introduce the first official NLU benchmark and accompanying evaluation of encoder models for the language. Our tasks include common natural language processing tasks in binary and multi-class classification settings, including named entity recognition, topic classification, and intent classification. We evaluate various pre-trained language models for LTZ to present an overview of the current capabilities of these models on the LTZ language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This paper introduces ltzGLUE, the first NLU benchmark for Luxembourgish (LTZ), adapting the GLUE framework by constructing new tasks (NER, topic classification, intent classification) and reusing existing ones, then evaluating pre-trained encoder models to overview current LTZ capabilities.

Significance. If the tasks prove valid and the evaluations robust, this fills a clear resource gap for an official but low-resource European language, enabling standardized model comparison and future LTZ NLP work.

major comments (2)

[Task construction] Task construction section: no inter-annotator agreement scores, native-speaker validation steps, data provenance details, or dataset statistics (e.g., size, domain, label distribution) are reported for the newly constructed tasks. This directly undermines the claim that these tasks constitute valid, representative NLU measures rather than translation artifacts or surface patterns.
[Evaluation] Evaluation and results sections: the manuscript contains no results tables, performance metrics, baseline comparisons, or dataset statistics, making it impossible to verify the claimed overview of model capabilities or to assess whether the selected pre-trained models are appropriate.

minor comments (1)

[Abstract] The abstract states tasks are 'constructed' and 'reused' but does not clarify which specific tasks fall into binary vs. multi-class settings or how many examples each contains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing ltzGLUE. We agree that additional details are needed to strengthen the claims about task validity and model evaluation, and we will revise the paper accordingly to address both major comments.

read point-by-point responses

Referee: [Task construction] Task construction section: no inter-annotator agreement scores, native-speaker validation steps, data provenance details, or dataset statistics (e.g., size, domain, label distribution) are reported for the newly constructed tasks. This directly undermines the claim that these tasks constitute valid, representative NLU measures rather than translation artifacts or surface patterns.

Authors: We acknowledge that the current manuscript lacks these explicit details in the task construction section, which is a valid concern for establishing task quality. In the revised version, we will add inter-annotator agreement scores from multiple native Luxembourgish speakers, describe the native-speaker validation process used to ensure tasks reflect genuine NLU phenomena, provide full data provenance for all sources, and report dataset statistics including sizes, domains, and label distributions. These additions will clarify that the tasks were not mere translation artifacts. revision: yes
Referee: [Evaluation] Evaluation and results sections: the manuscript contains no results tables, performance metrics, baseline comparisons, or dataset statistics, making it impossible to verify the claimed overview of model capabilities or to assess whether the selected pre-trained models are appropriate.

Authors: We agree that the submitted version omitted the full results tables, metrics, and comparisons, which prevents verification. The revised manuscript will include complete results tables with performance metrics (e.g., accuracy, F1) for all evaluated pre-trained encoder models, explicit baseline comparisons, and the associated dataset statistics. This will enable readers to assess model appropriateness and the current state of LTZ NLU capabilities. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no derivations or self-referential fits

full rationale

The paper constructs and releases an NLU benchmark for Luxembourgish by adapting/creating tasks (NER, topic/intent classification) and evaluating pre-trained models. No equations, parameter fitting, or derivation chain exists. All claims rest on dataset provenance, model evaluations, and external verifiability rather than any reduction to self-defined inputs or self-citations. This matches the default non-circular outcome for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that GLUE-style tasks remain meaningful when adapted to Luxembourgish and that the chosen pre-trained models constitute a fair current baseline. No free parameters or invented entities are introduced.

axioms (1)

domain assumption GLUE tasks and similar classification problems are appropriate proxies for natural language understanding in Luxembourgish
The paper reuses and constructs tasks based on the English GLUE benchmark without additional justification in the abstract.

pith-pipeline@v0.9.0 · 5452 in / 1249 out tokens · 41265 ms · 2026-05-10T04:29:27.852817+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

[1]

InProceedings of the Second PAS- CAL Challenges Workshop on Recognising Textual Entailment

The second pascal recognising textual entail- ment challenge. InProceedings of the Second PAS- CAL Challenges Workshop on Recognising Textual Entailment. Momchil Hardalov, Todor Mihaylov, Kiril Simov, and Preslav Nakov. 2023. BgGLUE: A Bulgarian Gen- eral Language Understanding Evaluation Benchmark. InProceedings of RANLP. Hansi Hettiarachchi, Tharindu Ra...

work page 2023
[2]

InProceedings of ICML

XTREME: A Massively Multilingual Multi- task Benchmark for Evaluating Cross-lingual Gener- alization. InProceedings of ICML. Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zha...

work page 2024
[3]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

The Winograd Schema Challenge. InProceed- ings of KR. Yaobo Liang, Yeyun Gong, Weizhen Bian, Nan Jiang, Guoqing Xie, Ruize Lin, Jiuhai Feng, Ruochen Xu, Wenjie Wang, Zhifang Chen, et al. 2020. XGLUE: A New Benchmark Dataset for Cross-lingual Pre- training, Understanding and Generation. InProceed- ings of EMNLP. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengba...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

InProceedings of LREC

LuxemBERT: Simple and Practical Data Aug- mentation in Language Model Pre-Training for Lux- embourgish. InProceedings of LREC. Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, and Barbara Plank. 2025. Neural Text Normalization for Luxembourgish Using Real-Life Variation Data. InProceedings of VarDial. Hyunji Hayley Park, Katherine J. Zhang, Coleman H...

work page 2025
[5]

InProceedings of SIGUL

Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish. InProceedings of SIGUL. Alistair Plum, Laura Bernardy, and Tharindu Ranas- inghe. 2026. Do LLMs Judge Distantly Super- vised Named Entity Labels Well? Constructing the JudgeWEL Dataset. InProceedings of LREC. Alistair Plum, Caroline D...

work page 2026
[6]

InProceedings of EMNLP

XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation. InProceedings of EMNLP. Tatiana Shavrina, Denis Shevelev, Alena Fenogenova, Irina Nikishina, et al. 2020. RussianSuperGLUE: A Russian Language Understanding Evaluation Bench- mark. InProceedings of EMNLP. Joshgun Sirajzade, Daniela Gierschek, and Christoph Schommer. 2020. An Annotatio...

work page 2020
[7]

InProceedings of ICLR

Seq vs Seq: An Open Suite of Paired Encoders and Decoders. InProceedings of ICLR. Adina Williams, Nikita Nangia, and Samuel R. Bow- man. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of NAACL-HLT. Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. 2023. Sta...

work page 2018
[8]

O", "B-PER

XLNet: Generalized Autoregressive Pretrain- ing for Language Understanding. InProceedings of NeurIPS. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2022. Scaling Vision Transformers. In Proceedings of CVPR. Xiang Zhang, Senyu Li, Bradley Hauer, Ning Shi, and Grzegorz Kondrak. 2023. Don’t Trust ChatGPT when Your Question is not in Engl...

work page 2022
[9]

Respond ONLY in valid JSON

work page
[10]

Do NOT add explanations, comments or text outside of JSON

work page
[11]

Use field: "output": <model_answer>

work page
[12]

task": "<task_name>

Use field: "task": "<task_name>"

work page
[13]

input":

Use field: "input": "<input example text>"

work page
[14]

Predict only the requested outputs and 14 label(s) in the given formats

work page
[15]

If determined labels are 0 and 1 then 0 is used for False, 1 is used for True. Here is the NLP task definition: TASK: {task_name} DESCRIPTION: {task_description} 7.8 Task descriptions for zero-shot testing of LLMs headline_classification: Decide if the given title/headline fits the text. Output True or False. sentiment_analysis: Classify sentiment of the ...

work page

[1] [1]

InProceedings of the Second PAS- CAL Challenges Workshop on Recognising Textual Entailment

The second pascal recognising textual entail- ment challenge. InProceedings of the Second PAS- CAL Challenges Workshop on Recognising Textual Entailment. Momchil Hardalov, Todor Mihaylov, Kiril Simov, and Preslav Nakov. 2023. BgGLUE: A Bulgarian Gen- eral Language Understanding Evaluation Benchmark. InProceedings of RANLP. Hansi Hettiarachchi, Tharindu Ra...

work page 2023

[2] [2]

InProceedings of ICML

XTREME: A Massively Multilingual Multi- task Benchmark for Evaluating Cross-lingual Gener- alization. InProceedings of ICML. Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zheng Leng Thai, Kaihuo Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zha...

work page 2024

[3] [3]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

The Winograd Schema Challenge. InProceed- ings of KR. Yaobo Liang, Yeyun Gong, Weizhen Bian, Nan Jiang, Guoqing Xie, Ruize Lin, Jiuhai Feng, Ruochen Xu, Wenjie Wang, Zhifang Chen, et al. 2020. XGLUE: A New Benchmark Dataset for Cross-lingual Pre- training, Understanding and Generation. InProceed- ings of EMNLP. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengba...

work page internal anchor Pith review Pith/arXiv arXiv 2020

[4] [4]

InProceedings of LREC

LuxemBERT: Simple and Practical Data Aug- mentation in Language Model Pre-Training for Lux- embourgish. InProceedings of LREC. Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, and Barbara Plank. 2025. Neural Text Normalization for Luxembourgish Using Real-Life Variation Data. InProceedings of VarDial. Hyunji Hayley Park, Katherine J. Zhang, Coleman H...

work page 2025

[5] [5]

InProceedings of SIGUL

Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish. InProceedings of SIGUL. Alistair Plum, Laura Bernardy, and Tharindu Ranas- inghe. 2026. Do LLMs Judge Distantly Super- vised Named Entity Labels Well? Constructing the JudgeWEL Dataset. InProceedings of LREC. Alistair Plum, Caroline D...

work page 2026

[6] [6]

InProceedings of EMNLP

XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation. InProceedings of EMNLP. Tatiana Shavrina, Denis Shevelev, Alena Fenogenova, Irina Nikishina, et al. 2020. RussianSuperGLUE: A Russian Language Understanding Evaluation Bench- mark. InProceedings of EMNLP. Joshgun Sirajzade, Daniela Gierschek, and Christoph Schommer. 2020. An Annotatio...

work page 2020

[7] [7]

InProceedings of ICLR

Seq vs Seq: An Open Suite of Paired Encoders and Decoders. InProceedings of ICLR. Adina Williams, Nikita Nangia, and Samuel R. Bow- man. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of NAACL-HLT. Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. 2023. Sta...

work page 2018

[8] [8]

O", "B-PER

XLNet: Generalized Autoregressive Pretrain- ing for Language Understanding. InProceedings of NeurIPS. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2022. Scaling Vision Transformers. In Proceedings of CVPR. Xiang Zhang, Senyu Li, Bradley Hauer, Ning Shi, and Grzegorz Kondrak. 2023. Don’t Trust ChatGPT when Your Question is not in Engl...

work page 2022

[9] [9]

Respond ONLY in valid JSON

work page

[10] [10]

Do NOT add explanations, comments or text outside of JSON

work page

[11] [11]

Use field: "output": <model_answer>

work page

[12] [12]

task": "<task_name>

Use field: "task": "<task_name>"

work page

[13] [13]

input":

Use field: "input": "<input example text>"

work page

[14] [14]

Predict only the requested outputs and 14 label(s) in the given formats

work page

[15] [15]

If determined labels are 0 and 1 then 0 is used for False, 1 is used for True. Here is the NLP task definition: TASK: {task_name} DESCRIPTION: {task_description} 7.8 Task descriptions for zero-shot testing of LLMs headline_classification: Decide if the given title/headline fits the text. Output True or False. sentiment_analysis: Classify sentiment of the ...

work page