LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation
Pith reviewed 2026-05-20 06:23 UTC · model grok-4.3
The pith
A three-step rubric co-designed with legal experts shows LLMs generate mostly high-quality propositions from CJEU decisions, with better results from established cases and closer expert alignment when the rubric guides LLM judgments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models can generate predominantly well-formed and high-quality legal propositions from CJEU decisions. Expert evaluation using the LP-Eval rubric finds higher quality for propositions derived from well-established cases than from recent ones. Rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, though they remain insensitive to finer-grained distinctions captured by human experts.
What carries the argument
LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions.
If this is right
- LLMs can serve as a practical starting point for drafting legal propositions in doctrinal work and legal reasoning tasks.
- Structured rubrics improve the reliability of both human and LLM-based evaluation compared with simple overall scores.
- Quality of generated propositions varies with the maturity of the underlying legal cases, suggesting selective use of LLM output.
- Rubric-guided LLM evaluators can reduce the annotation burden in legal NLP datasets while preserving reasonable alignment with experts.
Where Pith is reading between the lines
- The same rubric approach could be adapted to decisions from other courts or national legal systems to test generalizability.
- Hybrid pipelines that combine LLM generation with targeted human review on recent or novel cases may be more efficient than full manual drafting.
- Training or fine-tuning LLMs explicitly on the rubric steps might reduce the observed gap in sensitivity to fine distinctions.
Load-bearing premise
The three-step rubric accurately and comprehensively captures the key dimensions of legal proposition quality for the selected CJEU decisions without overlooking important aspects or introducing expert bias.
What would settle it
A follow-up study in which legal experts rate the same propositions using only holistic scoring or an alternative rubric and show substantially lower agreement with the original LP-Eval annotations.
Original abstract
Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well-established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LP-Eval, a three-step rubric co-designed with legal experts to evaluate LLM-generated legal propositions from CJEU decisions. It releases a dataset of annotations by two experts on 100 propositions and reports empirical results showing that LLMs produce predominantly well-formed, high-quality propositions (with higher quality for established cases than recent ones) and that rubric-guided LLM-as-judge evaluations align more closely with expert assessments than direct overall scoring.
Significance. If the central claims hold, the work supplies a new rubric and annotated dataset for an under-examined task in Legal NLP, offers concrete evidence on LLM proposition quality differentiated by case recency, and demonstrates a practical improvement in LLM evaluation via structured rubrics. These elements could serve as a reproducible starting point for future legal-reasoning benchmarks.
major comments (2)
- [§4] §4 (Dataset and Annotations): Annotations rely on only two legal experts for the full set of 100 items, with no inter-annotator agreement statistic (e.g., Cohen’s κ or percentage agreement) reported. This directly affects the reliability of the human gold labels that support all quality comparisons in §5 and the LLM-as-judge alignment results in §6.
- [§3] §3 (Rubric Design): The three-step rubric is presented as comprehensively decomposing formal validity and substantive dimensions, yet the manuscript provides no external validation (e.g., comparison against a larger expert panel or alternative doctrinal criteria). This leaves open the possibility that systematic gaps in coverage undermine both the human labels and the subsequent alignment claims.
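The agreement statistics named in the first major comment can be computed in a few lines. The following is a minimal sketch, not the authors' procedure: the function names, the label values, and the four-item toy data are all hypothetical, and it assumes each expert assigns one categorical label per proposition.

```python
import math
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label rates.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        return math.nan
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy labels for four propositions (not data from the paper).
expert_1 = ["valid", "valid", "invalid", "valid"]
expert_2 = ["valid", "invalid", "invalid", "valid"]
agreement = percent_agreement(expert_1, expert_2)  # 0.75
kappa = cohens_kappa(expert_1, expert_2)           # 0.5
```

Reporting both numbers is useful because kappa can be low even when raw agreement is high, which happens when one label dominates the annotations.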
minor comments (2)
- [§5.1] Table 1 or §5.1: Clarify the exact sampling procedure used to select the 100 propositions and the distribution across established vs. recent cases to improve reproducibility.
- [§6] Abstract and §6: The phrase “insensitive to finer-grained distinctions” would benefit from a concrete example or quantitative measure of the granularity gap between rubric-guided and direct LLM scoring.
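One way to put a number on the granularity gap raised in the last minor comment is a tie rate: among pairs of propositions that the experts score differently, how often does the LLM judge give both the same score? This is a hypothetical measure offered as a sketch, not one reported in the paper; the function name, the integer score scale, and the toy scores are assumptions.

```python
import math
from itertools import combinations


def tie_rate_on_expert_distinct_pairs(expert_scores, judge_scores):
    """Share of expert-distinguished item pairs that the judge scores identically.

    A higher value means the judge collapses more distinctions the experts draw.
    """
    if len(expert_scores) != len(judge_scores):
        raise ValueError("score lists must have equal length")
    distinct_pairs = 0
    tied_pairs = 0
    for i, j in combinations(range(len(expert_scores)), 2):
        if expert_scores[i] != expert_scores[j]:
            distinct_pairs += 1
            if judge_scores[i] == judge_scores[j]:
                tied_pairs += 1
    if distinct_pairs == 0:
        # The experts drew no distinctions, so the rate is undefined.
        return math.nan
    return tied_pairs / distinct_pairs


# Hypothetical toy scores for four propositions (not data from the paper).
expert = [1, 2, 3, 3]
judge = [2, 2, 3, 3]
rate = tie_rate_on_expert_distinct_pairs(expert, judge)  # 1 tied of 5 distinct pairs
```

Computing the rate separately for rubric-guided and direct scoring would show whether the rubric narrows the gap, under the assumption that both judges score on the same scale as the experts.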
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing LP-Eval. We address each of the major comments below and specify the revisions planned for the next version of the paper.
- Referee: §4 (Dataset and Annotations): Annotations rely on only two legal experts for the full set of 100 items, with no inter-annotator agreement statistic (e.g., Cohen’s κ or percentage agreement) reported. This directly affects the reliability of the human gold labels that support all quality comparisons in §5 and the LLM-as-judge alignment results in §6.
  Authors: We concur that reporting inter-annotator agreement would improve the assessment of annotation reliability. The annotations were carried out by two domain experts using the established rubric, yet the agreement statistics were omitted from the original manuscript. In the revised version, we will include Cohen’s κ and percentage agreement figures to better support the validity of the gold labels used in our analyses.
  Revision: yes
- Referee: §3 (Rubric Design): The three-step rubric is presented as comprehensively decomposing formal validity and substantive dimensions, yet the manuscript provides no external validation (e.g., comparison against a larger expert panel or alternative doctrinal criteria). This leaves open the possibility that systematic gaps in coverage undermine both the human labels and the subsequent alignment claims.
  Authors: Our rubric was developed in close collaboration with legal experts to systematically address formal validity and substantive quality. Although we did not conduct a separate validation study with an expanded expert panel, the co-design process incorporated expert input to minimize coverage gaps. We will update the manuscript with additional details on the rubric's iterative development and will add a limitations section acknowledging the absence of broader external validation, proposing this as an avenue for future work.
  Revision: partial
Circularity Check
Purely empirical study with new annotations and no derivations
The paper introduces LP-Eval as a new three-step rubric co-designed with legal experts, releases a fresh dataset of annotations by two experts on 100 LLM-generated propositions from CJEU decisions, and reports direct empirical comparisons of quality scores, case recency effects, and LLM-as-judge alignment. No equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the provided text. All central claims rest on the newly collected human judgments rather than any self-referential reduction.
ICAIL 2026, June 03–05, 2026, Singapore