arxiv: 2605.21076 · v1 · pith:IALLUQIBnew · submitted 2026-05-20 · 💻 cs.CL

GradeLegal: Automated Grading for German Legal Cases

Pith reviewed 2026-05-21 05:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords automated grading · large language models · German legal exams · public law · criminal law · prompt engineering · quadratic weighted kappa

The pith

Reasoning-oriented LLMs match human experts at grading German public law exam solutions when given sample answers and rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can ease the bottleneck in grading high-stakes German legal exams by providing scalable, timely feedback to students. It systematically evaluates 27 models on criminal and public law cases and finds that adding a sample solution and a grading rubric lifts quadratic weighted kappa agreement with expert grades to as much as 0.91 in public law but only 0.60 in criminal law. This matters because state exam grades shape careers in Germany, and manual grading delays feedback as submission volumes grow and qualified graders remain scarce. The results indicate that prompt design and model choice determine whether such automation becomes reliable enough for real use.

Core claim

The authors show that reasoning-oriented LLMs, when supplied with both a sample solution and a grading rubric, reach quadratic weighted kappa (QWK) agreement of up to 0.91 with expert grades on public law cases, while criminal law cases remain harder at 0.60. Ensembling several models raises agreement by as much as 0.15 over the ensemble's best member and can serve as an alternative to stronger closed-source single models.
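
For readers unfamiliar with the metric, the standard textbook definition of quadratic weighted kappa is sketched below. The symbols are assumptions introduced here for illustration: K is the number of grade levels, O_ij counts the answers graded i by the expert and j by the model, and N is the total number of answers. This summary does not state which implementation or variant the authors used.

```latex
\[
\kappa_w \;=\; 1 - \frac{\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij}\, O_{ij}}
                        {\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i-j)^2}{(K-1)^2},
\qquad
E_{ij} = \frac{\bigl(\sum_{k} O_{ik}\bigr)\bigl(\sum_{k} O_{kj}\bigr)}{N}
\]
```

A value of 1 means perfect agreement, 0 means agreement no better than chance given the two grade distributions, and negative values mean agreement worse than chance. Because the weights grow with the squared distance between grades, a model that is off by one point is penalized far less than one that is off by five.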

What carries the argument

Incremental prompting that supplies a sample solution and then a grading rubric to steer the LLM's evaluation of student answers against expert standards.
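
The incremental setup can be pictured as a prompt builder whose optional blocks are switched on one at a time. The following is a minimal sketch under stated assumptions: the function name, the section labels, and the instruction wording are hypothetical and are not the authors' actual prompts.

```python
from typing import Optional


def build_grading_prompt(
    case_text: str,
    student_answer: str,
    sample_solution: Optional[str] = None,
    rubric: Optional[str] = None,
) -> str:
    """Assemble a grading prompt, adding task information only when supplied.

    All wording below is illustrative (hypothetical), not taken from the paper.
    """
    parts = [
        "You are grading a student's written solution to a German legal exam case.",
        "Case:\n" + case_text,
    ]
    # Optional block 1: a sample solution the grader can compare against.
    if sample_solution is not None:
        parts.append("Sample solution:\n" + sample_solution)
    # Optional block 2: a rubric that says how points are awarded.
    if rubric is not None:
        parts.append("Grading rubric:\n" + rubric)
    # The text to be graded always comes last.
    parts.append("Student solution:\n" + student_answer)
    parts.append("Reply with a single integer grade.")
    return "\n\n".join(parts)


# Instruction-only variant versus the variant with both additions.
instruction_only = build_grading_prompt("CASE", "ANSWER")
with_both = build_grading_prompt("CASE", "ANSWER", sample_solution="SOL", rubric="RUB")
```

Keeping the blocks optional makes it straightforward to run the same student answer through each variant and compare agreement with the expert grade per variant.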

If this is right

  • Public law grading tasks appear more suitable for LLM support than criminal law ones.
  • Ensembling provides a direct way to improve reliability beyond any single model.
  • Careful prompt construction with sample solutions and rubrics becomes essential for practical deployment.
  • Automated systems could reduce feedback delays and support student self-testing in law programs.
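
To make the ensembling point concrete, here is a minimal sketch that combines the grades of several models into one grade per student answer. The aggregation rule (per-answer median, rounded half up) and all names are assumptions for illustration; this summary does not say how the authors combine model outputs.

```python
from statistics import median
from typing import List, Mapping, Sequence


def ensemble_grades(grades_by_model: Mapping[str, Sequence[int]]) -> List[int]:
    """Return one grade per answer: the median across models, rounded half up.

    Assumes non-negative integer grades and that every model graded the same
    answers in the same order. The median rule is a hypothetical choice.
    """
    columns = list(grades_by_model.values())
    if not columns:
        raise ValueError("need at least one model")
    n_answers = len(columns[0])
    if any(len(col) != n_answers for col in columns):
        raise ValueError("all models must grade the same number of answers")
    # zip(*columns) yields, for each answer, the tuple of grades from all models.
    return [int(median(per_answer) + 0.5) for per_answer in zip(*columns)]


# Three hypothetical models grading two answers.
combined = ensemble_grades({
    "model_a": [4, 10],
    "model_b": [6, 12],
    "model_c": [5, 2],
})
```

The median is used here because a single model's outlier grade (the 2 on the second answer) does not drag the combined grade down the way a mean would.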

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar rubric-plus-sample approaches might transfer to grading in other structured professional fields like medicine or accounting.
  • Criminal law cases may need hybrid human-AI workflows where the model flags ambiguous sections for expert review.
  • Students could run private versions of these prompts on their own drafts to identify weaknesses before official exams.

Load-bearing premise

Human expert grades used as ground truth are consistent across different graders and the selected exam cases capture the full complexity of real legal reasoning.

What would settle it

A new set of independent human experts re-grading the same student solutions and producing markedly lower agreement scores than the original experts.

Figures

Figures reproduced from arXiv: 2605.21076 by Abdullah Al Zubaer, Jelena Mitrovic, Lorenz Wendlinger, Michael Granitzer, Simon Alexander Nonn.

Figure 1: Performance improvement (Δ) over the Task Agnostic baseline for each model, averaged across the criminal and public dataset, comparing Instr.+Rubric, Instr.+Solution, and Instr.+Rubric+Sol
Original abstract

Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a systematic evaluation of 27 proprietary and open-source LLMs for automated grading of German legal exam solutions in criminal and public law. It benchmarks prompting strategies that incrementally incorporate task information such as sample solutions and grading rubrics, reporting quadratic weighted kappa (QWK) scores of up to 0.91 in public law and 0.60 in criminal law when both are provided. Ensembling is shown to improve agreement by up to 0.15 over the ensemble's best member.

Significance. If the results hold after addressing evaluation gaps, the work could have practical value in addressing grader shortages for high-stakes German state exams by enabling scalable feedback and self-testing. The comprehensive scope across 27 models, multiple prompting variants, and ensembling experiments constitutes a strength in experimental breadth.

major comments (2)
  1. [Abstract] The reported QWK scores (up to 0.91 public law, 0.60 criminal law) are presented without any information on dataset size, number of expert graders, inter-rater agreement among humans, statistical tests, or error analysis. This omission leaves the central claim that LLMs 'can approximate expert grading' difficult to evaluate.
  2. [Results] The headline results treat single-expert grades as ground truth. Without reported pairwise or multi-rater QWK among human experts on the same cases, it is impossible to determine whether the LLM scores exceed, match, or fall below typical human consistency; legal grading is known to be subjective, so the public/criminal gap may reflect task ambiguity rather than model capability.
minor comments (1)
  1. [Abstract] The phrasing 'reasoning-oriented LLMs can approximate expert grading' should be qualified to reflect the substantial performance difference between public and criminal law.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful for the referee's insightful comments, which highlight important aspects of our evaluation methodology. We address each major comment below and propose revisions to strengthen the manuscript.

  1. Referee: [Abstract] The reported QWK scores (up to 0.91 public law, 0.60 criminal law) are presented without any information on dataset size, number of expert graders, inter-rater agreement among humans, statistical tests, or error analysis. This omission leaves the central claim that LLMs 'can approximate expert grading' difficult to evaluate.

    Authors: We agree that the abstract would benefit from additional context to make the claims more evaluable. The full paper details the dataset consisting of legal exam solutions in criminal and public law, graded by expert legal professionals. We will revise the abstract to include information on the number of cases, the grading setup, and references to the statistical analyses and error analysis provided in the results section. Regarding inter-rater agreement, only single-expert grades were available for this study, which we will explicitly note as a limitation in the revised abstract and discussion.

    Revision: yes

  2. Referee: [Results] The headline results treat single-expert grades as ground truth. Without reported pairwise or multi-rater QWK among human experts on the same cases, it is impossible to determine whether the LLM scores exceed, match, or fall below typical human consistency; legal grading is known to be subjective, so the public/criminal gap may reflect task ambiguity rather than model capability.

    Authors: This is a valid concern. Our evaluation relies on grades provided by a single expert per case, which serves as the reference standard in the absence of multi-rater data. We will update the results section to include a more detailed discussion of this limitation, clarifying that the QWK scores represent agreement with the available expert judgment rather than a direct comparison to human inter-rater reliability. We will also elaborate on how the differences in performance between public and criminal law cases may relate to the inherent subjectivity and complexity of the respective legal domains, as suggested.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking against external human grades

The paper reports direct experimental results from evaluating 27 LLMs on legal exam grading tasks using quadratic weighted kappa (QWK) against human expert grades as ground truth. Claims such as QWK up to 0.91 in public law with sample solutions and rubrics are measured outcomes from prompting strategies, not quantities derived from self-referential definitions, fitted parameters called predictions, or self-citation chains. The central evaluation chain relies on external annotations and standard agreement metrics, with no equations or steps that reduce to the paper's own inputs by construction. This is a standard empirical study whose results stand independently of any internal fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that quadratic weighted kappa is an appropriate agreement metric for this high-stakes domain and that the chosen prompting additions (sample solution plus rubric) are sufficient to elicit reliable legal reasoning from the models.

axioms (1)
  • Domain assumption: Quadratic weighted kappa reliably measures agreement between LLM-generated grades and expert human grades for legal exam solutions.
    Used as the primary reported metric for all model comparisons.
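
Since every comparison rests on this metric, a small dependency-free sketch of how quadratic weighted kappa can be computed from two grade lists may help. The function name and the default grade range are assumptions made here (German legal exams are conventionally graded on a 0 to 18 point scale); this is not the authors' evaluation code.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(
    expert: Sequence[int],
    model: Sequence[int],
    min_grade: int = 0,    # assumed lower end of the grading scale
    max_grade: int = 18,   # assumed upper end of the grading scale
) -> float:
    """Quadratic weighted kappa between two equally long lists of integer grades."""
    if len(expert) != len(model) or len(expert) == 0:
        raise ValueError("need two non-empty grade lists of equal length")
    span = max_grade - min_grade
    if span <= 0:
        raise ValueError("max_grade must be greater than min_grade")
    n = len(expert)
    hist_expert = Counter(expert)
    hist_model = Counter(model)
    # Mean squared (normalized) distance between the paired grades.
    observed = sum(((e - m) / span) ** 2 for e, m in zip(expert, model)) / n
    # The same quantity if the two grade distributions were independent.
    expected = sum(
        ((i - j) / span) ** 2 * hist_expert[i] * hist_model[j]
        for i in hist_expert
        for j in hist_model
    ) / (n * n)
    if expected == 0:
        # Both raters gave one and the same grade throughout; treated here
        # as perfect agreement by convention.
        return 1.0
    return 1.0 - observed / expected


# Toy example on a three-level scale (0, 1, 2).
kappa = quadratic_weighted_kappa([0, 1, 2], [0, 1, 1], min_grade=0, max_grade=2)
```

In the toy example the model misses the top grade by one level on a single answer, which gives a kappa of about 0.67. Swapping the extremes entirely, as in expert grades [0, 18] against model grades [18, 0], gives -1.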
