LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
Pith reviewed 2026-05-14 19:42 UTC · model grok-4.3
The pith
Large language models can annotate credibility assessments in Danish asylum decisions at moderate accuracy but show inconsistent errors that vary by model and prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLMs perform well enough to make cost-effective labeling of credibility assessments in asylum texts viable, but their outputs remain imperfect and inconsistent. Error patterns differ across models and prompts, so evaluation must go beyond aggregated metrics to cover error consistency, class confusion, and correlation with human confidence.
What carries the argument
The RAB-Cred dataset of expert-annotated Danish asylum decisions, paired with systematic error analysis that tracks inter-model consistency, inter-class confusion, and sample-level difficulty beyond aggregate accuracy.
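To make the two error views named above concrete, here is a minimal sketch of inter-class confusion counts and a pairwise error-overlap score. The label names (`none`, `positive`, `negative`), the toy predictions, and the choice of Jaccard overlap as the consistency measure are all assumptions for illustration; the paper's own label set and metrics may differ.

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold_label, predicted_label) pairs, i.e. a sparse confusion matrix."""
    return Counter(zip(gold, pred))

def error_overlap(gold, pred_a, pred_b):
    """Jaccard overlap between the sets of samples that each model labels wrongly.

    This is an illustrative consistency measure, not necessarily the paper's.
    """
    errs_a = {i for i, (g, p) in enumerate(zip(gold, pred_a)) if g != p}
    errs_b = {i for i, (g, p) in enumerate(zip(gold, pred_b)) if g != p}
    union = errs_a | errs_b
    # If neither model makes an error, the (empty) error sets are identical.
    return len(errs_a & errs_b) / len(union) if union else 1.0

# Hypothetical labels and predictions, invented for this example.
gold    = ["none", "positive", "negative", "negative", "none", "positive"]
model_a = ["none", "negative", "negative", "none", "none", "positive"]
model_b = ["none", "positive", "negative", "none", "positive", "positive"]

print(confusion_counts(gold, model_a))
print(error_overlap(gold, model_a, model_b))
```

A low overlap between two models with similar accuracy means they fail on different samples, which a single aggregate score would hide.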
If this is right
- LLMs could scale labeling of large volumes of asylum decisions for research or policy oversight at lower cost than full expert review.
- Reliable use would require combining predictions from multiple models rather than relying on any single one.
- Error analysis beyond accuracy metrics becomes essential for legal NLP tasks that involve nuanced judgments.
- Prompt and model selection must be validated specifically for low-resource languages and specialized legal domains.
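One simple way to combine predictions from multiple models, as the second implication above suggests, is a majority vote that routes ties to a human. This is a sketch under assumptions: the paper does not prescribe this aggregation rule, and the function name, label strings, and `needs_review` fallback are hypothetical.

```python
from collections import Counter

def majority_vote(predictions, fallback="needs_review"):
    """Return the most common label, or `fallback` when the top labels tie."""
    if not predictions:
        raise ValueError("predictions must contain at least one label")
    ranked = Counter(predictions).most_common()
    # A tie between the two most frequent labels is left for human review.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]

# Hypothetical per-sample predictions from three models.
print(majority_vote(["positive", "positive", "none"]))
print(majority_vote(["positive", "none", "negative"]))
```

Sending ties to review rather than breaking them arbitrarily keeps the disagreement signal visible, which fits the point that human oversight stays necessary in high-stakes settings.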
Where Pith is reading between the lines
- Human oversight would likely remain necessary for any high-stakes application involving asylum outcomes.
- The same benchmarking approach could be applied to other legal or administrative texts in underrepresented languages once comparable expert datasets exist.
- Inconsistency across LLMs may indicate limits in their grasp of implicit legal reasoning, pointing toward targeted fine-tuning as a potential improvement.
Load-bearing premise
The expert annotations in the RAB-Cred dataset constitute reliable ground truth for the subtle legal concept of credibility assessment, and the task can be adequately captured by the chosen classification labels without deeper domain-specific legal context.
What would settle it
A new round of independent expert annotations on the same RAB-Cred texts that produces substantially different credibility labels than the original set, or a demonstration that LLM errors align predictably with actual case outcomes rather than varying randomly across models.
Original abstract
Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the RAB-Cred dataset of Danish asylum decision texts annotated by experts for the presence and sentiment of credibility assessments, along with metadata on annotator confidence and case outcomes. It benchmarks 21 open-weight LLMs across 30 zero-shot and few-shot prompt combinations, then performs detailed error analysis covering consistency across models, inter-class confusion, correlations with human confidence, sample difficulty, and mistake severity. The central claim is that LLMs show promise for cost-effective annotation in this specialized legal domain but remain imperfect and inconsistent, so practitioners should avoid relying on any single arbitrarily chosen model; the dataset and code are released publicly.
Significance. If the empirical results hold, the work is significant for legal NLP and low-resource language applications: it supplies a reproducible benchmark and error taxonomy for a nuanced classification task, demonstrates concrete limitations of current LLMs as annotators, and releases public resources that enable follow-on research on reliable automated labeling in high-stakes settings.
major comments (1)
- §3 (RAB-Cred Dataset): The manuscript describes the expert annotations as high-quality and supplies annotator-confidence metadata, yet reports no inter-annotator agreement statistics, no details on the annotation guidelines or adjudication process, and no validation against case outcomes. Because the central evaluation metrics (accuracy, error consistency, confusion matrices) are computed against these labels, the absence of IAA or sensitivity analysis leaves the reliability of the ground-truth target unverified for a subtle legal concept.
minor comments (3)
- §5.2 (Error Analysis): The discussion of LLM mistake severity would be strengthened by an explicit comparison to the distribution of human annotator disagreements on the same samples.
- Table 2 (Model and Prompt Results): Ensure every model size and prompt template is listed with the exact hyper-parameters used; a few entries appear to omit the precise few-shot example selection method.
- Abstract: The phrase 'high-quality, expert annotations' is used without a quantitative qualifier; a single sentence summarizing IAA or confidence statistics would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address the major comment below and will revise the manuscript to add details on the dataset creation process.
- Referee: §3 (RAB-Cred Dataset): The manuscript describes the expert annotations as high-quality and supplies annotator-confidence metadata, yet reports no inter-annotator agreement statistics, no details on the annotation guidelines or adjudication process, and no validation against case outcomes. Because the central evaluation metrics (accuracy, error consistency, confusion matrices) are computed against these labels, the absence of IAA or sensitivity analysis leaves the reliability of the ground-truth target unverified for a subtle legal concept.
Authors: We agree with the referee that providing inter-annotator agreement (IAA) statistics and more details on the annotation process would enhance the transparency and reliability of the ground-truth labels. In the revised manuscript, we will add a new subsection in §3 that includes: (1) the annotation guidelines used by the experts, (2) the adjudication process for resolving any disagreements, and (3) IAA statistics (e.g., Cohen's kappa or Fleiss' kappa) calculated on the subset of texts annotated by multiple experts. Additionally, we will include a sensitivity analysis and a correlation analysis relating the credibility labels, annotator confidence, and the available case outcome metadata, to validate the labels against external indicators. These changes address the concern about the unverified reliability of the ground truth.
Revision: yes
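For readers unfamiliar with the agreement statistic mentioned in the response, the following is a minimal sketch of Cohen's kappa for two annotators, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The function name and the toy labels are assumptions for illustration and are not taken from the paper or its code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations, invented for this example.
annotator_1 = ["positive", "positive", "negative", "negative"]
annotator_2 = ["positive", "negative", "negative", "negative"]
print(cohens_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, so kappa is 0.5. Fleiss' kappa, also mentioned above, generalizes the idea to more than two annotators.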
Circularity Check
No circularity: pure empirical benchmarking against external expert labels
The paper introduces the RAB-Cred dataset with expert annotations and directly measures LLM classification performance (zero-shot/few-shot accuracy, error consistency, inter-class confusion) against those fixed labels. No derivations, equations, fitted parameters, or predictions are claimed; all results are computed from direct comparison to the provided ground truth. No self-citations are load-bearing, no ansatzes are smuggled, and no uniqueness theorems are invoked. The evaluation chain is self-contained and externally falsifiable via the released dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert annotations provide reliable ground truth for credibility assessment labels.