arxiv: 2605.13412 · v1 · pith:BTFLURS4new · submitted 2026-05-13 · 💻 cs.CL · cs.AI

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

Pith reviewed 2026-05-14 19:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: LLMs as annotators · credibility assessment · asylum decisions · Danish text classification · legal NLP · error analysis · zero-shot prompting · dataset creation

The pith

Large language models can annotate credibility assessments in Danish asylum decisions with moderate accuracy, but their errors are inconsistent and vary by model and prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether off-the-shelf LLMs can serve as reliable annotators for a specialized legal task: detecting the presence and sentiment of credibility assessments in Danish asylum decision texts. It introduces the RAB-Cred dataset, which supplies expert-annotated examples along with metadata on annotator confidence and case outcomes. Through benchmarks of 21 open-weight models and 30 prompt variations in zero-shot and few-shot settings, the authors move past overall accuracy scores to examine specific error types, inter-model disagreement, and alignment with human difficulty signals. The work establishes that LLMs offer a practical route to cheaper labeling of large legal corpora in underrepresented languages, yet their unreliability demands scrutiny beyond any single arbitrary model choice.

Core claim

The central claim is that LLMs perform well enough to make cost-effective labeling of credibility assessments in asylum texts viable, yet their outputs remain imperfect and inconsistent, with error patterns that differ across models and prompts. Evaluation therefore has to extend past aggregated metrics to include error consistency, class confusion, and correlation with human confidence.

What carries the argument

The RAB-Cred dataset of expert-annotated Danish asylum decisions, paired with systematic error analysis that tracks inter-model consistency, inter-class confusion, and sample-level difficulty beyond aggregate accuracy.
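
To illustrate why aggregate accuracy can hide the differences this kind of analysis looks for, here is a minimal sketch in Python. The label names (`none`, `positive`, `negative`), the toy predictions, and the use of Jaccard overlap as a stand-in for "error consistency" are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import Counter

def accuracy(gold, pred):
    """Fraction of samples where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def confusion_counts(gold, pred):
    """Count (gold_label, predicted_label) pairs, i.e. a sparse confusion matrix."""
    return Counter(zip(gold, pred))

def error_overlap(gold, pred_a, pred_b):
    """Jaccard overlap between the sets of samples that two models get wrong."""
    wrong_a = {i for i, (g, p) in enumerate(zip(gold, pred_a)) if g != p}
    wrong_b = {i for i, (g, p) in enumerate(zip(gold, pred_b)) if g != p}
    union = wrong_a | wrong_b
    return len(wrong_a & wrong_b) / len(union) if union else 1.0

# Hypothetical labels and predictions, for illustration only.
gold    = ["none", "positive", "negative", "negative", "positive"]
model_a = ["none", "negative", "negative", "positive", "positive"]
model_b = ["none", "positive", "negative", "positive", "none"]

acc_a = accuracy(gold, model_a)
acc_b = accuracy(gold, model_b)
overlap = error_overlap(gold, model_a, model_b)
confusion_a = confusion_counts(gold, model_a)
```

In this toy case both models reach the same accuracy, yet they share only one of their three distinct mistakes and confuse different class pairs, which is the kind of pattern a single aggregated score would not reveal.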

If this is right

  • LLMs could scale labeling of large volumes of asylum decisions for research or policy oversight at lower cost than full expert review.
  • Reliable use would require combining predictions from multiple models rather than relying on any single one.
  • Error analysis beyond accuracy metrics becomes essential for legal NLP tasks that involve nuanced judgments.
  • Prompt and model selection must be validated specifically for low-resource languages and specialized legal domains.
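
The second implication, combining predictions from several models, can be sketched as a simple majority vote that also reports how strongly the models agree. This is an assumed illustration rather than the paper's ensemble method; the label names, the `min_agreement` threshold, and the function names are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return (label, agreement) for one sample, given one label per model.

    Ties are resolved in favor of the label seen first, which follows from
    Counter.most_common preserving first-encountered order among equal counts.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)

def label_samples(per_model_preds, min_agreement=0.75):
    """Aggregate per-model prediction lists into per-sample ensemble labels.

    Each result is (label, agreement, needs_review); samples whose agreement
    falls below min_agreement are flagged for human review.
    """
    results = []
    for sample_preds in zip(*per_model_preds):
        label, agreement = majority_vote(sample_preds)
        results.append((label, agreement, agreement < min_agreement))
    return results

# Hypothetical predictions from three models on three samples.
preds = [
    ["none", "positive", "negative"],  # model 1
    ["none", "negative", "negative"],  # model 2
    ["none", "positive", "positive"],  # model 3
]
results = label_samples(preds)
```

Keeping the agreement score alongside the label makes it possible to route low-agreement cases to an expert instead of accepting the vote blindly.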

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Human oversight would likely remain necessary for any high-stakes application involving asylum outcomes.
  • The same benchmarking approach could be applied to other legal or administrative texts in underrepresented languages once comparable expert datasets exist.
  • Inconsistency across LLMs may indicate limits in their grasp of implicit legal reasoning, pointing toward targeted fine-tuning as a potential improvement.

Load-bearing premise

The expert annotations in the RAB-Cred dataset constitute reliable ground truth for the subtle legal concept of credibility assessment, and the task can be adequately captured by the chosen classification labels without deeper domain-specific legal context.

What would settle it

A new round of independent expert annotations on the same RAB-Cred texts that produces substantially different credibility labels than the original set, or a demonstration that LLM errors align predictably with actual case outcomes rather than varying randomly across models.

Figures

Figures reproduced from arXiv: 2605.13412 by Anna Murphy Høgenhaug, Asta S. Stage Jarlner, Desmond Elliott, Galadrielle Humblot-Renaux, Maria Vlachou, Marieke Anne Heyl, Mohammad N. S. Jahromi, Rohat Bakuri-Jørgensen, Thomas B. Moeslund, Thomas Gammeltoft-Hansen.

Figure 1: Distribution of case lengths in RAB-Cred.
Figure 2: Label (top) and confidence (bottom) distribu…
Figure 3: Confusion matrix showing inter-annotator …
Figure 4: Inter-class hesitation of the human annotators, …
Figure 5: Relation between case outcome and gold cred…
Figure 6: Validation set classification performance per model for different user prompts. Each boxplot is across 6 …
Figure 7: Best-case classification performance (taking …
Figure 8: Validation set classification performance …
Figure 9: Individual LLM and ensemble mistakes, color-coded by class confusion (cf. Appendix …
Figure 10: LLM agreement vs. LLM correctness vs. human confidence. Each point corresponds to a single case in …

Original abstract

Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces the RAB-Cred dataset of Danish asylum decision texts annotated by experts for the presence and sentiment of credibility assessments, along with metadata on annotator confidence and case outcomes. It benchmarks 21 open-weight LLMs across 30 zero-shot and few-shot prompt combinations, then performs detailed error analysis covering consistency across models, inter-class confusion, correlations with human confidence, sample difficulty, and mistake severity. The central claim is that LLMs show promise for cost-effective annotation in this specialized legal domain but remain imperfect and inconsistent, so practitioners should avoid relying on any single arbitrarily chosen model; the dataset and code are released publicly.

Significance. If the empirical results hold, the work is significant for legal NLP and low-resource language applications: it supplies a reproducible benchmark and error taxonomy for a nuanced classification task, demonstrates concrete limitations of current LLMs as annotators, and supplies public resources that enable follow-on research on reliable automated labeling in high-stakes settings.

major comments (1)
  1. §3 (RAB-Cred Dataset): The manuscript describes the expert annotations as high-quality and supplies annotator-confidence metadata, yet reports no inter-annotator agreement statistics, no details on the annotation guidelines or adjudication process, and no validation against case outcomes. Because the central evaluation metrics (accuracy, error consistency, confusion matrices) are computed against these labels, the absence of IAA or sensitivity analysis leaves the reliability of the ground-truth target unverified for a subtle legal concept.
minor comments (3)
  1. §5.2 (Error Analysis): The discussion of LLM mistake severity would be strengthened by an explicit comparison to the distribution of human annotator disagreements on the same samples.
  2. Table 2 (Model and Prompt Results): Ensure every model size and prompt template is listed with the exact hyper-parameters used; a few entries appear to omit the precise few-shot example selection method.
  3. Abstract: The phrase 'high-quality, expert annotations' is used without a quantitative qualifier; a single sentence summarizing IAA or confidence statistics would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major comment point-by-point below and have made revisions to incorporate additional details on the dataset creation process.

Point-by-point responses
  1. Referee: §3 (RAB-Cred Dataset): The manuscript describes the expert annotations as high-quality and supplies annotator-confidence metadata, yet reports no inter-annotator agreement statistics, no details on the annotation guidelines or adjudication process, and no validation against case outcomes. Because the central evaluation metrics (accuracy, error consistency, confusion matrices) are computed against these labels, the absence of IAA or sensitivity analysis leaves the reliability of the ground-truth target unverified for a subtle legal concept.

    Authors: We agree with the referee that providing inter-annotator agreement (IAA) statistics and more details on the annotation process would enhance the transparency and reliability of the ground-truth labels. In the revised manuscript, we will add a new subsection in §3 that includes: (1) the annotation guidelines used by the experts, (2) the adjudication process for resolving any disagreements, and (3) IAA statistics (e.g., Cohen's kappa or Fleiss' kappa) calculated on the subset of texts annotated by multiple experts. Additionally, we will include a sensitivity analysis and report the correlation between the credibility labels, annotator confidence, and the available case outcome metadata to validate the labels against external indicators. These changes address the concern about the unverified reliability of the ground truth.

    Revision: yes
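
For readers unfamiliar with the agreement statistic mentioned in the response, Cohen's kappa for two annotators is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. The following is a minimal sketch in Python; the label names and toy annotations are hypothetical, and nothing here reflects the actual RAB-Cred annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical annotations from two experts on six decisions.
annotator_1 = ["positive", "positive", "negative", "negative", "none", "none"]
annotator_2 = ["positive", "negative", "negative", "negative", "none", "positive"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa, which the response also mentions, would be the usual choice instead.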

Circularity Check

0 steps flagged

No circularity: pure empirical benchmarking against external expert labels

The paper introduces the RAB-Cred dataset with expert annotations and directly measures LLM classification performance (zero-shot/few-shot accuracy, error consistency, inter-class confusion) against those fixed labels. No derivations, equations, fitted parameters, or predictions are claimed; all results are computed from direct comparison to the provided ground truth. No self-citations are load-bearing, no ansatzes are smuggled, and no uniqueness theorems are invoked. The evaluation chain is self-contained and externally falsifiable via the released dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on the assumption that expert labels are gold-standard ground truth and that prompt-based classification adequately captures the legal concept; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: Expert annotations provide reliable ground truth for credibility assessment labels.
    The paper uses these annotations to evaluate all LLM performance and error patterns.

pith-pipeline@v0.9.0 · 5583 in / 1190 out tokens · 118416 ms · 2026-05-14T19:42:52.760692+00:00 · methodology

