LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

Amogh Raina; Daniel Hershcovich; Henrik Palmer Olsen; Johan Lindholm; Shanshan Xu

arxiv: 2605.19815 · v1 · pith:KV2JJHXWnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI

Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

Pith reviewed 2026-05-20 06:23 UTC · model grok-4.3

keywords legal proposition generation · legal NLP · LLM evaluation · rubric design · CJEU decisions · expert annotation · quality assessment · proposition evaluation

The pith

A three-step rubric co-designed with legal experts shows LLMs generate mostly high-quality propositions from CJEU decisions, with better results from established cases and closer expert alignment when the rubric guides LLM judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LP-Eval, a rubric that breaks legal proposition quality into formal validity and substantive dimensions through a three-step process developed with experts. It releases expert annotations on 100 LLM-generated propositions drawn from Court of Justice of the European Union decisions. Findings indicate LLMs produce predominantly well-formed outputs, though quality scores are higher for propositions from well-established cases than from recent ones. Rubric-guided LLM evaluations match human expert judgments more closely than direct overall scoring, but still miss some fine-grained distinctions that experts detect.

Core claim

Large language models can generate predominantly well-formed and high-quality legal propositions from CJEU decisions. Expert evaluation using the LP-Eval rubric finds higher quality for propositions derived from well-established cases than from recent ones. Rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, though they remain insensitive to finer-grained distinctions captured by human experts.
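The alignment claim can be made concrete with a rank correlation between expert scores and each LLM scoring mode. The sketch below is a minimal illustration, not the paper's analysis: the choice of Spearman correlation, the function names, and the six made-up ratings are all assumptions introduced here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings, invented for illustration only (not from the dataset).
expert = [5, 4, 2, 5, 3, 1]
rubric_guided = [5, 4, 3, 4, 3, 2]
direct = [4, 4, 4, 5, 4, 3]

print("rubric-guided vs expert:", round(spearman(expert, rubric_guided), 3))
print("direct vs expert:", round(spearman(expert, direct), 3))
```

On real data, the same comparison would be run per expert and per rubric dimension, since a single pooled correlation can hide disagreement on individual dimensions.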

What carries the argument

LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions.
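As a rough picture of what such a decomposition might look like in data form, the sketch below records a formal-validity decision and a set of substantive scores for one proposition. The summary above does not spell out the three steps, so everything here is an assumption: the gating of substantive scores on formal validity, the 1-5 scale, the averaging step, and the dimension names are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, Optional


@dataclass
class RubricAssessment:
    """One annotator's assessment of one generated proposition (hypothetical schema)."""

    proposition: str
    formally_valid: bool  # assumed first step: a pass/fail formal check
    # assumed second step: per-dimension scores on a hypothetical 1-5 scale
    substantive_scores: Dict[str, int] = field(default_factory=dict)

    def overall(self) -> Optional[float]:
        """Assumed third step: average substantive scores, but only for valid items."""
        if not self.formally_valid or not self.substantive_scores:
            return None
        return mean(self.substantive_scores.values())


# Placeholder dimension names, invented for illustration.
example = RubricAssessment(
    proposition="(text of a generated proposition)",
    formally_valid=True,
    substantive_scores={"dimension_a": 4, "dimension_b": 5},
)
print(example.overall())
```

Keeping the formal check separate from the substantive scores means an ill-formed proposition is never averaged into quality statistics, which is one plausible reason to structure a rubric in steps.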

If this is right

LLMs can serve as a practical starting point for drafting legal propositions in doctrinal work and legal reasoning tasks.
Structured rubrics improve the reliability of both human and LLM-based evaluation compared with simple overall scores.
Quality of generated propositions varies with the maturity of the underlying legal cases, suggesting selective use of LLM output.
Rubric-guided LLM evaluators can reduce the annotation burden in legal NLP datasets while preserving reasonable alignment with experts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same rubric approach could be adapted to decisions from other courts or national legal systems to test generalizability.
Hybrid pipelines that combine LLM generation with targeted human review on recent or novel cases may be more efficient than full manual drafting.
Training or fine-tuning LLMs explicitly on the rubric steps might reduce the observed gap in sensitivity to fine distinctions.

Load-bearing premise

The three-step rubric accurately and comprehensively captures the key dimensions of legal proposition quality for the selected CJEU decisions without overlooking important aspects or introducing expert bias.

What would settle it

A follow-up study in which legal experts rate the same propositions using only holistic scoring or an alternative rubric and show substantially lower agreement with the original LP-Eval annotations.

Figures

Figures reproduced from arXiv: 2605.19815 by Amogh Raina, Daniel Hershcovich, Henrik Palmer Olsen, Johan Lindholm, Shanshan Xu.

Original abstract

Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LP-Eval gives legal NLP a new rubric and small annotated dataset for judging LLM-generated propositions, but the claims rest on just two experts.

Hi there,

The main thing this paper delivers is LP-Eval, a three-step rubric built with legal experts to score formal validity and substantive quality of propositions drawn from CJEU decisions, plus a dataset of two experts' annotations on 100 LLM outputs. That is a concrete addition in an area where evaluation has mostly used generic metrics. They show LLMs mostly produce well-formed propositions, that quality looks higher for established cases than recent ones, and that rubric-guided LLM judgments track the experts better than plain overall scores. The setup is straightforward and the results are reported clearly enough to be usable as a starting point.

The soft spot is the narrow expert base. Two annotators on 100 items without reported agreement figures or details on how the cases and propositions were picked leaves the quality differences and the alignment results open to questions about coverage or bias. The abstract also skips statistical tests, so the support for the central claims stays moderate until more validation appears.

This paper is for people working on legal generation or domain-specific evaluation who need something more tailored than standard benchmarks. Anyone testing LLMs on doctrinal tasks would get immediate use from the rubric and the released annotations. It deserves peer review because the new artifacts are practical and the initial findings are worth checking with larger expert pools and clearer selection methods.

Referee Report

2 major / 2 minor

Summary. The paper introduces LP-Eval, a three-step rubric co-designed with legal experts to evaluate LLM-generated legal propositions from CJEU decisions. It releases a dataset of annotations by two experts on 100 propositions and reports empirical results showing that LLMs produce predominantly well-formed, high-quality propositions (with higher quality for established cases than recent ones) and that rubric-guided LLM-as-judge evaluations align more closely with expert assessments than direct overall scoring.

Significance. If the central claims hold, the work supplies a new rubric and annotated dataset for an under-examined task in Legal NLP, offers concrete evidence on LLM proposition quality differentiated by case recency, and demonstrates a practical improvement in LLM evaluation via structured rubrics. These elements could serve as a reproducible starting point for future legal-reasoning benchmarks.

major comments (2)

[§4] §4 (Dataset and Annotations): Annotations rely on only two legal experts for the full set of 100 items, with no inter-annotator agreement statistic (e.g., Cohen’s κ or percentage agreement) reported. This directly affects the reliability of the human gold labels that support all quality comparisons in §5 and the LLM-as-judge alignment results in §6.
[§3] §3 (Rubric Design): The three-step rubric is presented as comprehensively decomposing formal validity and substantive dimensions, yet the manuscript provides no external validation (e.g., comparison against a larger expert panel or alternative doctrinal criteria). This leaves open the possibility that systematic gaps in coverage undermine both the human labels and the subsequent alignment claims.
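The agreement statistics requested in the first major comment are straightforward to compute for two annotators. The following is a minimal sketch with made-up labels: the function names and the three-level label set are assumptions for illustration, and the paper's actual label scheme may differ.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two annotators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = percent_agreement(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels, invented for illustration only.
expert_1 = ["high", "high", "low", "high", "mid", "high"]
expert_2 = ["high", "high", "low", "mid", "mid", "high"]

print("percent agreement:", round(percent_agreement(expert_1, expert_2), 3))
print("Cohen's kappa:", round(cohens_kappa(expert_1, expert_2), 3))
```

Reporting both numbers matters because kappa can be low even when raw agreement is high if the label distribution is heavily skewed, which is plausible when most propositions are rated as high quality.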

minor comments (2)

[§5.1] Table 1 or §5.1: Clarify the exact sampling procedure used to select the 100 propositions and the distribution across established vs. recent cases to improve reproducibility.
[§6] Abstract and §6: The phrase “insensitive to finer-grained distinctions” would benefit from a concrete example or quantitative measure of the granularity gap between rubric-guided and direct LLM scoring.
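One possible way to quantify the granularity gap raised in the last minor comment is to count how often a judge assigns identical scores to two items that an expert rated differently. The metric below, including its name `collapsed_pair_rate`, is a hypothetical suggestion and not a measure defined in the paper; the ratings are invented.

```python
from itertools import combinations


def collapsed_pair_rate(expert, judge):
    """Share of item pairs the expert separates but the judge scores identically."""
    if len(expert) != len(judge):
        raise ValueError("rating lists must have equal length")
    separated = 0
    collapsed = 0
    for i, j in combinations(range(len(expert)), 2):
        if expert[i] != expert[j]:
            separated += 1
            if judge[i] == judge[j]:
                collapsed += 1
    if separated == 0:
        raise ValueError("expert ratings contain no distinctions to compare against")
    return collapsed / separated


# Hypothetical ratings, invented for illustration only.
expert = [5, 4, 2, 5, 3, 1]
coarse_judge = [4, 4, 4, 5, 4, 3]

print("collapsed pair rate:", round(collapsed_pair_rate(expert, coarse_judge), 3))
```

A rate near 0 means the judge preserves the expert's distinctions, while a rate near 1 means the judge flattens them into the same score. Computing it separately for rubric-guided and direct scoring would give the quantitative comparison the comment asks for.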

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing LP-Eval. We address each of the major comments below and specify the revisions planned for the next version of the paper.

Referee: §4 (Dataset and Annotations): Annotations rely on only two legal experts for the full set of 100 items, with no inter-annotator agreement statistic (e.g., Cohen’s κ or percentage agreement) reported. This directly affects the reliability of the human gold labels that support all quality comparisons in §5 and the LLM-as-judge alignment results in §6.

Authors: We concur that reporting inter-annotator agreement would improve the assessment of annotation reliability. The annotations were carried out by two domain experts using the established rubric, yet the agreement statistics were omitted from the original manuscript. In the revised version, we will include Cohen’s κ and percentage agreement figures to better support the validity of the gold labels used in our analyses.

Revision: yes

Referee: §3 (Rubric Design): The three-step rubric is presented as comprehensively decomposing formal validity and substantive dimensions, yet the manuscript provides no external validation (e.g., comparison against a larger expert panel or alternative doctrinal criteria). This leaves open the possibility that systematic gaps in coverage undermine both the human labels and the subsequent alignment claims.

Authors: Our rubric was developed in close collaboration with legal experts to systematically address formal validity and substantive quality. Although we did not conduct a separate validation study with an expanded expert panel, the co-design process incorporated expert input to minimize coverage gaps. We will update the manuscript with additional details on the rubric's iterative development and will add a limitations section acknowledging the absence of broader external validation, proposing this as an avenue for future work.

Revision: partial

Circularity Check

0 steps flagged

Purely empirical study with new annotations and no derivations

The paper introduces LP-Eval as a new three-step rubric co-designed with legal experts, releases a fresh dataset of annotations by two experts on 100 LLM-generated propositions from CJEU decisions, and reports direct empirical comparisons of quality scores, case recency effects, and LLM-as-judge alignment. No equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the provided text. All central claims rest on the newly collected human judgments rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study introducing an evaluation rubric and dataset; it contains no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5690 in / 1134 out tokens · 51206 ms · 2026-05-20T06:23:22.997427+00:00 · methodology

