LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Anna Murphy Høgenhaug; Asta S. Stage Jarlner; Desmond Elliott; Galadrielle Humblot-Renaux; Maria Vlachou; Marieke Anne Heyl; Mohammad N. S. Jahromi; Rohat Bakuri-Jørgensen; Thomas B. Moeslund; Thomas Gammeltoft-Hansen

arxiv: 2605.13412 · v1 · pith:BTFLURS4new · submitted 2026-05-13 · 💻 cs.CL · cs.AI

Galadrielle Humblot-Renaux, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Marieke Anne Heyl, Asta S. Stage Jarlner, Maria Vlachou, Anna Murphy Høgenhaug, Desmond Elliott, Thomas Gammeltoft-Hansen, Thomas B. Moeslund

Pith reviewed 2026-05-14 19:42 UTC · model grok-4.3

keywords: LLMs as annotators · credibility assessment · asylum decisions · Danish text classification · legal NLP · error analysis · zero-shot prompting · dataset creation

The pith

Large language models can annotate credibility assessments in Danish asylum decisions at moderate accuracy but show inconsistent errors that vary by model and prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether off-the-shelf LLMs can serve as reliable annotators for a specialized legal task: detecting the presence and sentiment of credibility assessments in Danish asylum decision texts. It introduces the RAB-Cred dataset, which supplies expert-annotated examples along with metadata on annotator confidence and case outcomes. By benchmarking 21 open-weight models and 30 system-user prompt combinations in zero-shot and few-shot settings, the authors move past overall accuracy scores to examine specific error types, inter-model disagreement, and alignment with human difficulty signals. The work indicates that LLMs offer a practical route to cheaper labeling of large legal corpora in underrepresented languages, yet their unreliability demands scrutiny beyond any single, arbitrarily chosen model.

Core claim

The central claim is that LLMs demonstrate viable performance for cost-effective labeling of credibility assessments in asylum texts, yet their outputs remain imperfect and inconsistent, with error patterns that differ across models and prompts, requiring evaluation methods that extend past aggregated metrics to include error consistency, class confusion, and correlation with human confidence.
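One way to picture "correlation with human confidence" is to stratify LLM accuracy by the confidence the expert annotator reported. The snippet below is a minimal sketch under stated assumptions: the label names (`none`, `positive`, `negative`), the confidence levels (`high`, `low`), and the toy data are hypothetical and are not taken from RAB-Cred or from the paper's code.

```python
from collections import defaultdict


def accuracy_by_confidence(gold, pred, confidence):
    """Return {confidence_level: accuracy} for three parallel lists."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for g, p, c in zip(gold, pred, confidence):
        totals[c] += 1
        hits[c] += int(g == p)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical toy data (assumption, for illustration only).
gold = ["negative", "negative", "positive", "none", "positive", "none"]
pred = ["negative", "positive", "positive", "none", "negative", "none"]
confidence = ["high", "low", "high", "high", "low", "high"]

# The aggregate score hides where the mistakes fall.
overall = sum(g == p for g, p in zip(gold, pred)) / len(gold)
per_level = accuracy_by_confidence(gold, pred, confidence)

print(f"overall accuracy: {overall:.2f}")
for level, acc in sorted(per_level.items()):
    print(f"confidence={level}: accuracy={acc:.2f}")
```

In this toy case every mistake falls on a sample the annotator marked as low confidence, which a single overall accuracy figure would not reveal.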

What carries the argument

The RAB-Cred dataset of expert-annotated Danish asylum decisions, paired with systematic error analysis that tracks inter-model consistency, inter-class confusion, and sample-level difficulty beyond aggregate accuracy.
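The two quantities named here, inter-class confusion and inter-model consistency, can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the Jaccard overlap of error sets is one possible consistency measure chosen here for simplicity, and the labels and predictions are hypothetical.

```python
from collections import Counter


def confusion_counts(gold, pred):
    """Count (gold_label, predicted_label) pairs."""
    return Counter(zip(gold, pred))


def error_overlap(gold, pred_a, pred_b):
    """Jaccard overlap of the sample indices that two models get wrong."""
    errors_a = {i for i, (g, p) in enumerate(zip(gold, pred_a)) if g != p}
    errors_b = {i for i, (g, p) in enumerate(zip(gold, pred_b)) if g != p}
    union = errors_a | errors_b
    if not union:
        return 1.0  # neither model errs: treat as fully consistent
    return len(errors_a & errors_b) / len(union)


# Hypothetical toy data (assumption, for illustration only).
gold = ["none", "positive", "negative", "negative", "positive", "none"]
model_a = ["none", "negative", "negative", "positive", "positive", "none"]
model_b = ["none", "positive", "negative", "positive", "none", "none"]

acc_a = sum(g == p for g, p in zip(gold, model_a)) / len(gold)
acc_b = sum(g == p for g, p in zip(gold, model_b)) / len(gold)
overlap = error_overlap(gold, model_a, model_b)
confusion_a = confusion_counts(gold, model_a)

print(acc_a, acc_b)  # identical aggregate accuracy
print(overlap)       # yet the two error sets only partly coincide
print(confusion_a[("positive", "negative")])
```

Both toy models reach the same accuracy while sharing only one of their three distinct errors, which is the kind of pattern an aggregate metric cannot show.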

If this is right

LLMs could scale labeling of large volumes of asylum decisions for research or policy oversight at lower cost than full expert review.
Reliable use would require combining predictions from multiple models rather than relying on any single one.
Error analysis beyond accuracy metrics becomes essential for legal NLP tasks that involve nuanced judgments.
Prompt and model selection must be validated specifically for low-resource languages and specialized legal domains.
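Combining predictions from multiple models, as the second point suggests, could look like a simple majority vote that also records how strongly the models agree. This is a hedged sketch only: the source does not specify the ensembling rule used in the paper, and the function name, labels, and data below are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (all the same length).

    Returns one (label, agreement) tuple per sample, where agreement is
    the fraction of models that voted for the winning label. Ties go to
    the label that occurs first among the models' votes.
    """
    results = []
    for labels in zip(*predictions):
        label, votes = Counter(labels).most_common(1)[0]
        results.append((label, votes / len(labels)))
    return results


# Hypothetical predictions from three models on three samples.
m1 = ["none", "negative", "positive"]
m2 = ["none", "positive", "positive"]
m3 = ["none", "negative", "negative"]

ensemble = majority_vote([m1, m2, m3])

# Samples without unanimous agreement could be routed to a human expert.
flagged = [i for i, (_, agreement) in enumerate(ensemble) if agreement < 1.0]

print(ensemble)
print(flagged)
```

Keeping the agreement fraction alongside the label is a design choice that lets low-agreement samples be set aside for expert review instead of being accepted silently.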

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Human oversight would likely remain necessary for any high-stakes application involving asylum outcomes.
The same benchmarking approach could be applied to other legal or administrative texts in underrepresented languages once comparable expert datasets exist.
Inconsistency across LLMs may indicate limits in their grasp of implicit legal reasoning, pointing toward targeted fine-tuning as a potential improvement.

Load-bearing premise

The expert annotations in the RAB-Cred dataset constitute reliable ground truth for the subtle legal concept of credibility assessment, and the task can be adequately captured by the chosen classification labels without deeper domain-specific legal context.

What would settle it

A new round of independent expert annotations on the same RAB-Cred texts that produces substantially different credibility labels than the original set, or a demonstration that LLM errors align predictably with actual case outcomes rather than varying randomly across models.

Figures

Figures reproduced from arXiv: 2605.13412 by Anna Murphy Høgenhaug, Asta S. Stage Jarlner, Desmond Elliott, Galadrielle Humblot-Renaux, Maria Vlachou, Marieke Anne Heyl, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Thomas B. Moeslund, Thomas Gammeltoft-Hansen.

**Figure 3.** Confusion matrix showing inter-annotator …

**Figure 4.** Inter-class hesitation of the human annotators, …

**Figure 2.** Label (top) and confidence (bottom) distribu…

**Figure 5.** Relation between case outcome and gold cred…

**Figure 6.** Validation set classification performance per model for different user prompts. Each boxplot is across 6 …

**Figure 7.** Best-case classification performance (taking …

**Figure 8.** Validation set classification performance …

**Figure 9.** Individual LLM and ensemble mistakes, color-coded by class confusion (cf. Appendix …

**Figure 10.** LLM agreement vs. LLM correctness vs. human confidence. Each point corresponds to a single case in …

Original abstract

Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New Danish dataset for LLM annotation of asylum credibility plus detailed error breakdowns, but label reliability needs more scrutiny.

The paper's main contribution is the RAB-Cred dataset of expert-annotated Danish asylum decisions, paired with benchmarks across 21 models and 30 prompt setups that go beyond aggregate accuracy to examine error patterns. The authors release the data and code, which should make follow-up work on low-resource legal text tasks straightforward. The error analysis covers consistency across models, class confusion, links to human confidence scores, and sample difficulty, which gives a clearer picture than most LLM annotation studies that stop at F1 scores. This setup shows LLMs can handle the task at moderate levels in zero- and few-shot settings but vary enough that no single model or prompt is reliably best. The work is grounded in direct comparison to expert labels rather than circular claims.

The soft spot is the treatment of those expert annotations as stable ground truth for a subtle legal judgment. The abstract describes them as high-quality with confidence metadata, yet the provided details do not include inter-annotator agreement figures or checks against case outcomes, so any reported LLM errors could partly reflect label noise instead of model limits. That issue is real but not fatal given the public artifacts.

Readers working on legal NLP, asylum processing tools, or LLM evaluation in specialized domains will find the dataset and error breakdowns useful. The paper deserves peer review because the new data and systematic testing add concrete evidence even if annotation validation could be strengthened.

Referee Report

1 major / 3 minor

Summary. The paper introduces the RAB-Cred dataset of Danish asylum decision texts annotated by experts for the presence and sentiment of credibility assessments, along with metadata on annotator confidence and case outcomes. It benchmarks 21 open-weight LLMs with 30 system-user prompt combinations in zero-shot and few-shot settings, then performs detailed error analysis covering consistency across models, inter-class confusion, correlations with human confidence, sample difficulty, and mistake severity. The central claim is that LLMs show promise for cost-effective annotation in this specialized legal domain but remain imperfect and inconsistent, so practitioners should avoid relying on any single arbitrarily chosen model. The dataset and code are released publicly.

Significance. If the empirical results hold, the work is significant for legal NLP and low-resource language applications: it provides a reproducible benchmark and error taxonomy for a nuanced classification task, demonstrates concrete limitations of current LLMs as annotators, and releases public resources that enable follow-on research on reliable automated labeling in high-stakes settings.

major comments (1)

§3 (RAB-Cred Dataset): The manuscript describes the expert annotations as high-quality and supplies annotator-confidence metadata, yet reports no inter-annotator agreement statistics, no details on the annotation guidelines or adjudication process, and no validation against case outcomes. Because the central evaluation metrics (accuracy, error consistency, confusion matrices) are computed against these labels, the absence of IAA or sensitivity analysis leaves the reliability of the ground-truth target unverified for a subtle legal concept.

minor comments (3)

§5.2 (Error Analysis): The discussion of LLM mistake severity would be strengthened by an explicit comparison to the distribution of human annotator disagreements on the same samples.
Table 2 (Model and Prompt Results): Ensure every model size and prompt template is listed with the exact hyper-parameters used; a few entries appear to omit the precise few-shot example selection method.
Abstract: The phrase 'high-quality, expert annotations' is used without a quantitative qualifier; a single sentence summarizing IAA or confidence statistics would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major comment point-by-point below and have made revisions to incorporate additional details on the dataset creation process.

Referee: §3 (RAB-Cred Dataset): The manuscript describes the expert annotations as high-quality and supplies annotator-confidence metadata, yet reports no inter-annotator agreement statistics, no details on the annotation guidelines or adjudication process, and no validation against case outcomes. Because the central evaluation metrics (accuracy, error consistency, confusion matrices) are computed against these labels, the absence of IAA or sensitivity analysis leaves the reliability of the ground-truth target unverified for a subtle legal concept.

Authors: We agree with the referee that providing inter-annotator agreement (IAA) statistics and more details on the annotation process would enhance the transparency and reliability of the ground-truth labels. In the revised manuscript, we will add a new subsection in §3 that includes: (1) the annotation guidelines used by the experts, (2) the adjudication process for resolving any disagreements, and (3) IAA statistics (e.g., Cohen's kappa or Fleiss' kappa) calculated on the subset of texts annotated by multiple experts. Additionally, we will include a sensitivity analysis and a correlation analysis relating the credibility labels, annotator confidence, and the available case outcome metadata, to validate the labels against external indicators. These changes address the concern about the unverified reliability of the ground truth.

Revision: yes
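For reference, Cohen's kappa for two annotators is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance from each annotator's label frequencies. A minimal sketch of that computation follows; the label names and the two annotation lists are hypothetical and do not come from RAB-Cred.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten texts by two experts.
expert_a = ["positive"] * 4 + ["negative"] * 4 + ["none"] * 2
expert_b = ["positive", "positive", "positive", "negative",
            "negative", "negative", "negative", "positive",
            "none", "none"]

kappa = cohens_kappa(expert_a, expert_b)
print(f"kappa = {kappa:.4f}")
```

Here the experts agree on 8 of 10 items ($p_o = 0.8$) and chance agreement is $p_e = 0.36$, so kappa is lower than the raw agreement rate. Fleiss' kappa would be the analogous choice when more than two annotators label each text.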

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking against external expert labels

The paper introduces the RAB-Cred dataset with expert annotations and directly measures LLM classification performance (zero-shot/few-shot accuracy, error consistency, inter-class confusion) against those fixed labels. No derivations, equations, fitted parameters, or predictions are claimed; all results are computed from direct comparison to the provided ground truth. No self-citations are load-bearing, no ansatzes are smuggled, and no uniqueness theorems are invoked. The evaluation chain is self-contained and externally falsifiable via the released dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that expert labels are gold-standard ground truth and that prompt-based classification adequately captures the legal concept; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Expert annotations provide reliable ground truth for credibility assessment labels.
The paper uses these annotations to evaluate all LLM performance and error patterns.

pith-pipeline@v0.9.0 · 5583 in / 1190 out tokens · 118416 ms · 2026-05-14T19:42:52.760692+00:00 · methodology

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 1 internal anchor

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[2]

Large Language Models in Legal Systems: A Survey , volume =

Dehghani, Fatemeh and Dehghani, Roya and Ardebili, Yazdan and Rahnamayan, Shahryar , year =. Large Language Models in Legal Systems: A Survey , volume =. Humanities and Social Sciences Communications , doi =

work page
[3]

2025 , MONTH = Nov, KEYWORDS =

Berghegger, Christina and Philippe, C. 2025 , MONTH = Nov, KEYWORDS =

work page 2025
[4]

arXiv preprint arXiv:2410.07504 , year=

Using llms to discover legal factors , author=. arXiv preprint arXiv:2410.07504 , year=

work page arXiv
[5]

Computer Law & Security Review , volume=

LLMs for legal reasoning: A unified framework and future perspectives , author=. Computer Law & Security Review , volume=. 2025 , publisher=

work page 2025
[6]

Gray, Morgan and Savelka, Jaromir and Oliver, Wesley and Ashley, Kevin , booktitle=. Can. 2023 , publisher=

work page 2023
[7]

Frontiers in Artificial Intelligence , volume=

The Unreasonable Effectiveness of Large Language Models in Zero-shot Semantic Annotation of Legal Texts , author=. Frontiers in Artificial Intelligence , volume=. 2023 , doi=

work page 2023
[8]

Unlocking Practical Applications in Legal Domain: Evaluation of

Savelka, Jaromir and Ashley, Kevin , booktitle=. Unlocking Practical Applications in Legal Domain: Evaluation of. 2023 , publisher=

work page 2023
[9]

Discovering the Potential of

Berghegger, Christina and Philippe, C. Discovering the Potential of. International Workshop on Argumentation and Applications (Arg&App 2025) , year=

work page 2025
[10]

Artificial Intelligence and Law , year=

Classifying Legal Interpretations Using Large Language Models , author=. Artificial Intelligence and Law , year=. doi:10.1007/s10506-025-09447-9 , url=

work page doi:10.1007/s10506-025-09447-9
[11]

arXiv preprint arXiv:2407.02039 , year=

Prompt Stability Scoring for Text Annotation with Large Language Models , author=. arXiv preprint arXiv:2407.02039 , year=

work page internal anchor Pith review arXiv
[12]

What Did I Do Wrong? Quantifying

Errica, Federico and others , booktitle=. What Did I Do Wrong? Quantifying. 2025 , publisher=

work page 2025
[13]

2024 , eprint=

Best Practices for Text Annotation with Large Language Models , author=. 2024 , eprint=

work page 2024
[14]

The Use of

Carlson, Kevin and others , journal=. The Use of. 2025 , doi=

work page 2025
[15]

2024 , url=

Political Bias in Large Language Models , author=. 2024 , url=

work page 2024
[16]

2025 , doi=

Ennser-Jedenastik, Laurenz and others , journal=. 2025 , doi=

work page 2025
[17]

Computational Linguistics , year=

Bias and Fairness in Large Language Models: A Survey , author=. Computational Linguistics , year=

work page
[18]

D., Ngo, N., Pouran Ben Veyseh, A., Man, H., Dernoncourt, F., Bui, T., and Nguyen, T

Lai, Viet Dac and Ngo, Nghia and Pouran Ben Veyseh, Amir and Man, Hieu and Dernoncourt, Franck and Bui, Trung and Nguyen, Thien Huu. C hat GPT Beyond E nglish: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.878

work page doi:10.18653/v1/2023.findings-emnlp.878 2023
[19]

Journal of Multilingual and Multicultural Development , volume =

Charlotte Gooskens , title =. Journal of Multilingual and Multicultural Development , volume =. 2007 , publisher =. doi:10.2167/jmmd511.0 , URL =

work page doi:10.2167/jmmd511.0 2007
[20]

The Effectiveness of LLM s as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation

Pavlovic, Maja and Poesio, Massimo. The Effectiveness of LLM s as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation. Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024. 2024

work page 2024
[21]

A r MIS - The A rabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements

Almanea, Dina and Poesio, Massimo. A r MIS - The A rabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements. Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022

work page 2022
[22]

Semeval-2023 task 11: Learning with disagreements (lewidi),

Leonardelli, Elisa and Abercrombie, Gavin and Almanea, Dina and Basile, Valerio and Fornaciari, Tommaso and Plank, Barbara and Rieser, Verena and Uma, Alexandra and Poesio, Massimo. S em E val-2023 Task 11: Learning with Disagreements ( L e W i D i). Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). 2023. doi:10.18653/v...

work page doi:10.18653/v1/2023.semeval-1.314 2023
[23]

Large Language Models As Annotators: A Preliminary Evaluation For Annotating Low-Resource Language Content

Bhat, Savita and Varma, Vasudeva. Large Language Models As Annotators: A Preliminary Evaluation For Annotating Low-Resource Language Content. Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems. 2023. doi:10.18653/v1/2023.eval4nlp-1.8

work page doi:10.18653/v1/2023.eval4nlp-1.8 2023
[24]

A GPT among Annotators: LLM -based Entity-Level Sentiment Annotation

R nningstad, Egil and Velldal, Erik and vrelid, Lilja. A GPT among Annotators: LLM -based Entity-Level Sentiment Annotation. Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII). 2024

work page 2024
[25]

LLM s as annotators of argumentation

Lindahl, Anna. LLM s as annotators of argumentation. Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025). 2025. doi:10.18653/v1/2025.starsem-1.19

work page doi:10.18653/v1/2025.starsem-1.19 2025
[26]

In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T

Calderon, Nitay and Reichart, Roi and Dror, Rotem. The Alternative Annotator Test for LLM -as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLM s. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.782

work page doi:10.18653/v1/2025.acl-long.782 2025
[27]

Can Large Language Models Transform Computational Social Science?

Ziems, Caleb and Held, William and Shaikh, Omar and Chen, Jiaao and Zhang, Zhehao and Yang, Diyi. Can Large Language Models Transform Computational Social Science?. Computational Linguistics. 2024. doi:10.1162/coli_a_00502

work page doi:10.1162/coli_a_00502 2024
[28]

LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks

Bavaresco, Anna and Bernardi, Raffaella and Bertolazzi, Leonardo and Elliott, Desmond and Fern \'a ndez, Raquel and Gatt, Albert and Ghaleb, Esam and Giulianelli, Mario and Hanna, Michael and Koller, Alexander and Martins, Andre and Mondorf, Philipp and Neplenbroek, Vera and Pezzelle, Sandro and Plank, Barbara and Schlangen, David and Suglia, Alessandro a...

work page doi:10.18653/v1/2025.acl-short.20 2025
[29]

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLM s-as-Judges

Thakur, Aman Singh and Choudhary, Kartik and Ramayapally, Venkat Srinik and Vaidyanathan, Sankaran and Hupkes, Dieuwke. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLM s-as-Judges. Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM ). 2025

work page 2025
[30]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

Llms instead of human judges? a large scale empirical study across 20 nlp evaluation tasks , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

work page
[31]

Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd workshop on knowledge extraction and integration for deep learning architectures , pages=

What makes good in-context examples for GPT-3? , author=. Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd workshop on knowledge extraction and integration for deep learning architectures , pages=

work page 2022
[32]

Active learning principles for in-context learning with large language models

Margatina, Katerina and Schick, Timo and Aletras, Nikolaos and Dwivedi-Yu, Jane. Active Learning Principles for In-Context Learning with Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.334

work page doi:10.18653/v1/2023.findings-emnlp.334 2023
[33]

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Min, Sewon and Lyu, Xinxi and Holtzman, Ari and Artetxe, Mikel and Lewis, Mike and Hajishirzi, Hannaneh and Zettlemoyer, Luke. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.759

work page doi:10.18653/v1/2022.emnlp-main.759 2022
[34]

2025 , howpublished=

Poro 2: Continued Pretraining for Language Acquisition , author=. 2025 , howpublished=

work page 2025
[35]

2024 , eprint=

Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier , author=. 2024 , eprint=

work page 2024
[36]

2026 , eprint=

EuroLLM-22B: Technical Report , author=. 2026 , eprint=

work page 2026
[37]

2025 , url =

Bielik-11B-v3.0-Instruct model card , author =. 2025 , url =

work page 2025
[38]

2025 , eprint=

Gemma 3 Technical Report , author=. 2025 , eprint=

work page 2025
[39]

2024 , eprint=

Phi-4 Technical Report , author=. 2024 , eprint=

work page 2024
[40]

Qwen2.5: A Party of Foundation Models , url =

Qwen Team , month =. Qwen2.5: A Party of Foundation Models , url =

work page
[41]

microsoft/phi-4 Hugging Face model card , author =

work page
[42]

2023 , eprint=

Efficient Guided Generation for Large Language Models , author=. 2023 , eprint=

work page 2023
[43]

Contemporary LLM s struggle with extracting formal legal arguments

Held, Lena and Habernal, Ivan. Contemporary LLM s struggle with extracting formal legal arguments. Proceedings of the Natural Legal Language Processing Workshop 2025. 2025. doi:10.18653/v1/2025.nllp-1.20

work page doi:10.18653/v1/2025.nllp-1.20 2025
[44]

2026 , isbn =

Blair-Stanek, Andrew and Van Durme, Benjamin , title =. 2026 , isbn =. doi:10.1145/3769126.3769245 , booktitle =

work page doi:10.1145/3769126.3769245 2026
[45]

Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks

Saattrup Nielsen, Dan and Enevoldsen, Kenneth and Schneider-Kamp, Peter. Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks. Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025). 2025

work page 2025
[46]

2024 , eprint=

SnakModel: Lessons Learned from Training an Open Danish Large Language Model , author=. 2024 , eprint=

work page 2024
[47]

2023 , eprint=

Danish Foundation Models , author=. 2023 , eprint=

work page 2023
[48]

GPT - SW 3: An Autoregressive Language Model for the S candinavian Languages

Ekgren, Ariel and Cuba Gyllensten, Amaru and Stollenwerk, Felix and. GPT - SW 3: An Autoregressive Language Model for the S candinavian Languages. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

work page 2024
[49]

Global MMLU : Understanding and addressing cultural and linguistic biases in multilingual evaluation

Singh, Shivalika and Romanou, Angelika and Fourrier, Cl \'e mentine and Adelani, David Ifeoluwa and Ngui, Jian Gang and Vila-Suero, Daniel and Limkonchotiwat, Peerat and Marchisio, Kelly and Leong, Wei Qi and Susanto, Yosephine and Ng, Raymond and Longpre, Shayne and Ruder, Sebastian and Ko, Wei-Yin and Bosselut, Antoine and Oh, Alice and Martins, Andre a...

work page doi:10.18653/v1/2025.acl-long.919 2025
[50]

MEGA : Multilingual evaluation of generative AI

Ahuja, Kabir and Diddee, Harshita and Hada, Rishav and Ochieng, Millicent and Ramesh, Krithika and Jain, Prachi and Nambi, Akshay and Ganu, Tanuja and Segal, Sameer and Ahmed, Mohamed and Bali, Kalika and Sitaram, Sunayana. MEGA : Multilingual Evaluation of Generative AI. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processi...

work page doi:10.18653/v1/2023.emnlp-main.258 2023
[51]

Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) , pages=

DaNLP: An open-source toolkit for Danish Natural Language Processing , author=. Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) , pages=

work page
[52]

MMLU - P ro X : A Multilingual Benchmark for Advanced Large Language Model Evaluation

Xuan, Weihao and Yang, Rui and Qi, Heli and Zeng, Qingcheng and Xiao, Yunze and Feng, Aosong and Liu, Dairui and Xing, Yun and Wang, Junjue and Gao, Fan and Lu, Jinghui and Jiang, Yuang and Li, Huitao and Li, Xin and Yu, Kunyu and Dong, Ruihai and Gu, Shangding and Li, Yuekang and Xie, Xiaofei and Juefei-Xu, Felix and Khomh, Foutse and Yoshie, Osamu and C...

work page doi:10.18653/v1/2025.emnlp-main.79 2025
[53]

2024 , eprint=

A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks , author=. 2024 , eprint=

work page 2024
[54]

Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =

Kojima, Takeshi and Gu, Shixiang Shane and Reid, Machel and Matsuo, Yutaka and Iwasawa, Yusuke , title =. Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =. 2022 , isbn =

work page 2022
[55]

Metacognitive Prompting Improves Understanding in Large Language Models

Wang, Yuqing and Zhao, Yun. Metacognitive Prompting Improves Understanding in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.106

work page doi:10.18653/v1/2024.naacl-long.106 2024
[56]

Statistics of Common Crawl Monthly Archives - Distribution of Languages , author =

work page
[57]

The F lores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc ' Aurelio and Guzm \'a n, Francisco and Fan, Angela. The F lores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Transactions of the Association for Computational Linguistics. 2022...

work page doi:10.1162/tacl_a_00474 2022
[58]

The Thirteenth International Conference on Learning Representations , year=

Lawma: The Power of Specialization for Legal Annotation , author=. The Thirteenth International Conference on Learning Representations , year=

work page
[59]

M ulti EURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

Chalkidis, Ilias and Fergadiotis, Manos and Androutsopoulos, Ion. M ulti EURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.559

work page doi:10.18653/v1/2021.emnlp-main.559 2021
[60]

WELL-FOUNDED FEAR – CREDIBILITY AND RISK ASSESSMENT IN DANISH ASYLUM CASES

Michala Clante Bendixen. WELL-FOUNDED FEAR – CREDIBILITY AND RISK ASSESSMENT IN DANISH ASYLUM CASES

work page
[61]

Journal of Ethnic and Migration Studies , pages=

Credibility as a fuzzy concept in refugee law: a systematic literature review , author=. Journal of Ethnic and Migration Studies , pages=. 2026 , publisher=. doi:10.1080/1369183X.2026.2619660 , url=

work page doi:10.1080/1369183x.2026.2619660 2026
[62]

P ro SA : Assessing and understanding the prompt sensitivity of LLM s

Zhuo, Jingming and Zhang, Songyang and Fang, Xinyu and Duan, Haodong and Lin, Dahua and Chen, Kai. P ro SA : Assessing and Understanding the Prompt Sensitivity of LLM s. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.108

work page doi:10.18653/v1/2024.findings-emnlp.108 2024
[63]

POSIX: A Prompt Sensitivity Index For Large Language Models

Chatterjee, Anwoy and Renduchintala, H S V N S Kowndinya and Bhatia, Sumit and Chakraborty, Tanmoy. POSIX : A Prompt Sensitivity Index For Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.852

work page doi:10.18653/v1/2024.findings-emnlp.852 2024
[64]

Codebook LLMs: Evaluating LLMs as Measurement Tools for Political Science Concepts. Political Analysis. 2025.
[65]

Evaluating an LLM’s Performance in Annotating Discourse Strategies. Corpus Pragmatics. 2026.
[66]

Start generating: Harnessing generative artificial intelligence for sociological research. Socius. 2024.
[67]

Jacomy, Mathieu and Borra, Erik. Measuring
[68]

Leveraging large language models for thematic analysis: a case study in the charity sector. AI & SOCIETY. 2025.
[69]

Rask Nielsen, Trine and Holten M. Data as a Lens for Understanding what Constitutes Credibility in Asylum Decision-making. Proc. ACM Hum.-Comput. Interact., Jan. doi:10.1145/3492825
[70]

Ruckdeschel, Mattes. Just Read the Codebook! Make Use of Quality Codebooks in Zero-Shot Classification of Multilabel Frame Datasets. Proceedings of the 31st International Conference on Computational Linguistics. 2025
[71]

Using Large Language Models to Support Thematic Analysis in Empirical Legal Studies. 2023.
[72]

Bielik 11B v3: Multilingual Large Language Model for European Languages. 2025.
[73]

Ariai, Farid and Mackenzie, Joel and Demartini, Gianluca. ACM Comput. Surv., Dec. 2025. doi:10.1145/3777009
[74]

Barale, Claire and Klaisoongnoen, Mark and Minervini, Pasquale and Rovatsos, Michael and Bhuta, Nehal. AsyLex: A Dataset for Legal Language Processing of Refugee Claims. Proceedings of the Natural Legal Language Processing Workshop 2023. 2023. doi:10.18653/v1/2023.nllp-1.24
[75]

Mizrahi, Moran and Kaplan, Guy and Malkin, Dan and Dror, Rotem and Shahaf, Dafna and Stanovsky, Gabriel. State of What Art? A Call for Multi-Prompt LLM Evaluation. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00681
[76]

Hertz, Maya Ellen and Jarlner, Asta Sofie Stage. Frontiers in Human Dynamics. 2025. doi:10.3389/fhumd.2025.1625988
[77]

Hua, Andong and Tang, Kenan and Gu, Chenhe and Gu, Jindong and Wong, Eric and Qin, Yao. Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1006
[78]

Nordic Asylum Practice in Relation to Religious Conversion: Insights from Denmark, Norway and Sweden. Legal and Protection Policy Research Series.
[79]

The Llama 3 Herd of Models. 2024.
[80]

Majer, Laura and Šnajder, Jan. Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines? Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER). 2024. doi:10.18653/v1/2024.fever-1.27
