GradeLegal: Automated Grading for German Legal Cases

Abdullah Al Zubaer; Jelena Mitrovic; Lorenz Wendlinger; Michael Granitzer; Simon Alexander Nonn

arxiv: 2605.21076 · v1 · pith:IALLUQIBnew · submitted 2026-05-20 · 💻 cs.CL

Pith reviewed 2026-05-21 05:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords automated grading · large language models · German legal exams · public law · criminal law · prompt engineering · quadratic weighted kappa

The pith

Reasoning-oriented LLMs match human experts at grading German public law exam solutions when given sample answers and rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can ease the bottleneck in grading high-stakes German legal exams by providing scalable, timely feedback to students. It systematically evaluates 27 models on criminal and public law cases, showing that adding a sample solution and grading rubric lifts agreement with experts to 0.91 in public law but only 0.60 in criminal law. This matters because exam grades shape careers in Germany, and current manual grading creates delays amid rising student numbers. The results indicate that prompt design and model choice determine whether such automation becomes reliable enough for real use.

Core claim

The authors demonstrate that reasoning-oriented LLMs, when supplied with both a sample solution and a grading rubric, can reach quadratic weighted kappa agreement of up to 0.91 with expert graders on public law cases while criminal law cases remain harder at around 0.60. Ensembling multiple models further raises agreement by as much as 0.15 over the strongest single model, offering a practical route to better performance without switching to larger closed models.
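For readers unfamiliar with the metric, the following is a minimal Python sketch of quadratic weighted kappa between two lists of integer grades. The function name and the example grades are illustrative assumptions; this is not the authors' evaluation code or data.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(expert: Sequence[int], model: Sequence[int]) -> float:
    """Quadratic weighted kappa between two equal-length lists of integer grades.

    The usual 1 / (K - 1)^2 weight normalisation cancels in the ratio,
    so the size of the grading scale does not need to be known here.
    """
    if len(expert) != len(model) or len(expert) == 0:
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(expert)

    # Observed disagreement: mean squared distance between paired grades.
    observed = sum((a - b) ** 2 for a, b in zip(expert, model)) / n

    # Expected disagreement if the two graders were statistically independent,
    # computed from the marginal grade counts of each grader.
    expert_counts = Counter(expert)
    model_counts = Counter(model)
    expected = sum(
        expert_counts[i] * model_counts[j] * (i - j) ** 2
        for i in expert_counts
        for j in model_counts
    ) / (n * n)

    if expected == 0:
        # Both graders always gave the same single grade: kappa is undefined.
        raise ValueError("kappa is undefined when both graders use one constant grade")
    return 1.0 - observed / expected


# Made-up grades for four student solutions (illustrative only).
expert_grades = [5, 8, 10, 13]
model_grades = [6, 8, 9, 13]
print(quadratic_weighted_kappa(expert_grades, model_grades))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Because the penalty grows with the squared distance between grades, a model that is off by one point is punished far less than one that is off by five.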

What carries the argument

Incremental prompting that supplies a sample solution and then a grading rubric to steer the LLM's evaluation of student answers against expert standards.
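The incremental setup can be pictured as one prompt template with optional sections. The sketch below is a hypothetical illustration: the instruction wording, section labels, function names, and variant keys are all assumptions, and the paper's actual prompts are not reproduced here.

```python
from typing import Dict, Optional

# Hypothetical instruction text, used in every variant of this sketch.
INSTRUCTION = (
    "You are grading a student's solution to a German legal exam case. "
    "Reply with a single integer grade."
)


def build_prompt(
    case_text: str,
    student_answer: str,
    sample_solution: Optional[str] = None,
    rubric: Optional[str] = None,
) -> str:
    """Assemble a grading prompt, adding optional context only when provided."""
    sections = [INSTRUCTION, "Case:\n" + case_text]
    if sample_solution is not None:
        sections.append("Sample solution:\n" + sample_solution)
    if rubric is not None:
        sections.append("Grading rubric:\n" + rubric)
    sections.append("Student answer:\n" + student_answer)
    return "\n\n".join(sections)


def prompt_variants(
    case_text: str, student_answer: str, sample_solution: str, rubric: str
) -> Dict[str, str]:
    """Build four prompts that differ only in how much task context they carry."""
    return {
        "baseline": build_prompt(case_text, student_answer),
        "rubric": build_prompt(case_text, student_answer, rubric=rubric),
        "solution": build_prompt(
            case_text, student_answer, sample_solution=sample_solution
        ),
        "rubric_and_solution": build_prompt(
            case_text, student_answer, sample_solution=sample_solution, rubric=rubric
        ),
    }
```

Keeping everything except the optional sections fixed means any change in agreement between variants can be attributed to the added context rather than to incidental wording differences.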

If this is right

Public law grading tasks appear more suitable for LLM support than criminal law ones.
Ensembling provides a direct way to improve reliability beyond any single model.
Careful prompt construction with sample solutions and rubrics becomes essential for practical deployment.
Automated systems could reduce feedback delays and support student self-testing in law programs.
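As a rough picture of what ensembling could look like, the sketch below averages the grades several models assign to the same student solution. The aggregation rule is an assumption made for illustration (this summary does not say how the paper combines models), and the model names are placeholders.

```python
import math
from typing import Mapping


def ensemble_grade(grades_by_model: Mapping[str, int]) -> int:
    """Combine per-model grades for one solution into a single integer grade.

    Assumption: a plain mean, rounded half up. Other rules (median,
    weighted vote) would fit the same interface.
    """
    if not grades_by_model:
        raise ValueError("need at least one model grade")
    mean = sum(grades_by_model.values()) / len(grades_by_model)
    return math.floor(mean + 0.5)


# Placeholder model names and made-up grades.
print(ensemble_grade({"model_a": 8, "model_b": 10, "model_c": 9}))
```

Averaging tends to cancel errors that individual models make independently, which is one plausible reason an ensemble can agree with the expert more closely than its best member.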

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar rubric-plus-sample approaches might transfer to grading in other structured professional fields like medicine or accounting.
Criminal law cases may need hybrid human-AI workflows where the model flags ambiguous sections for expert review.
Students could run private versions of these prompts on their own drafts to identify weaknesses before official exams.

Load-bearing premise

Human expert grades used as ground truth are consistent across different graders and the selected exam cases capture the full complexity of real legal reasoning.

What would settle it

A new set of independent human experts re-grading the same student solutions and producing markedly lower agreement scores than the original experts.

Figures

Figures reproduced from arXiv: 2605.21076 by Abdullah Al Zubaer, Jelena Mitrovic, Lorenz Wendlinger, Michael Granitzer, Simon Alexander Nonn.

**Figure 1.** Performance improvement (Δ) over the Task Agnostic baseline for each model, averaged across the criminal and public dataset, comparing Instr.+Rubric, Instr.+Solution, and Instr.+Rubric+Sol

Original abstract

Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives the first broad LLM benchmark for grading German legal exams and shows prompting plus ensembling can help, but missing human inter-rater numbers make the headline QWK scores hard to read as true expert approximation.

Hi, the main thing to know is that this paper runs the first systematic test of 27 LLMs on grading German legal case solutions in criminal and public law. With a sample solution and rubric added to the prompt, reasoning models reach 0.91 QWK in public law and 0.60 in criminal law, and ensembling adds up to 0.15 over the best single model. That difference between the two areas is the clearest signal they report.

The work is new because prior literature had not done a head-to-head on this exact high-stakes task, and the incremental prompting setup is a clean way to show what extra context actually moves the needle. They also cover both open and closed models, which is useful for anyone thinking about deployment. The practical angle is real: German state exams create a grading bottleneck that affects careers, so even modest automation could matter for law schools.

The soft spot is the missing human baseline. The paper treats single-expert grades as ground truth without reporting how much two or more graders agree on the same answers. Legal grading is known to be subjective, so if typical human QWK sits around 0.65-0.75, the 0.91 result looks less like model strength and more like matching one grader's idiosyncrasies. The criminal-public gap could simply reflect differing ambiguity levels rather than model limits. Dataset size, number of cases, and grader count are also not stated in the abstract, which makes it harder to judge how stable the numbers are.

This is for researchers working on automated assessment in professional education or domain-specific LLM evaluation. A reader who needs concrete numbers on prompting for subjective legal tasks will find something usable here. It deserves peer review because the task is concrete, the comparison is broad, and the ensembling result is a practical takeaway, even if the authors will need to add inter-rater stats and error analysis to make the claims tighter. I would send it out with those requests rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a systematic evaluation of 27 proprietary and open-source LLMs for automated grading of German legal exam solutions in criminal and public law. It benchmarks prompting strategies that incrementally incorporate task information such as sample solutions and grading rubrics, reporting quadratic weighted kappa (QWK) scores up to 0.91 in public law and 0.60 in criminal law when both are provided. Ensembling is shown to improve agreement by up to 0.15 over the best single model.

Significance. If the results hold after addressing evaluation gaps, the work could have practical value in addressing grader shortages for high-stakes German state exams by enabling scalable feedback and self-testing. The comprehensive scope across 27 models, multiple prompting variants, and ensembling experiments constitutes a strength in experimental breadth.

major comments (2)

[Abstract] Abstract: The reported QWK scores (up to 0.91 public law, 0.60 criminal law) are presented without any information on dataset size, number of expert graders, inter-rater agreement among humans, statistical tests, or error analysis. This omission leaves the central claim that LLMs 'can approximate expert grading' difficult to evaluate.
[Results] Results section: The headline results treat single-expert grades as ground truth. Without reported pairwise or multi-rater QWK among human experts on the same cases, it is impossible to determine whether the LLM scores exceed, match, or fall below typical human consistency; legal grading is known to be subjective, so the public/criminal gap may reflect task ambiguity rather than model capability.

minor comments (1)

[Abstract] Abstract: The phrasing 'reasoning-oriented LLMs can approximate expert grading' should be qualified to reflect the substantial performance difference between public and criminal law.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful for the referee's insightful comments, which highlight important aspects of our evaluation methodology. We address each major comment below and propose revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: The reported QWK scores (up to 0.91 public law, 0.60 criminal law) are presented without any information on dataset size, number of expert graders, inter-rater agreement among humans, statistical tests, or error analysis. This omission leaves the central claim that LLMs 'can approximate expert grading' difficult to evaluate.

Authors: We agree that the abstract would benefit from additional context to make the claims more evaluable. The full paper details the dataset consisting of legal exam solutions in criminal and public law, graded by expert legal professionals. We will revise the abstract to include information on the number of cases, the grading setup, and references to the statistical analyses and error analysis provided in the results section. Regarding inter-rater agreement, only single-expert grades were available for this study, which we will explicitly note as a limitation in the revised abstract and discussion.

Revision: yes

Referee: [Results] Results section: The headline results treat single-expert grades as ground truth. Without reported pairwise or multi-rater QWK among human experts on the same cases, it is impossible to determine whether the LLM scores exceed, match, or fall below typical human consistency; legal grading is known to be subjective, so the public/criminal gap may reflect task ambiguity rather than model capability.

Authors: This is a valid concern. Our evaluation relies on grades provided by a single expert per case, which serves as the reference standard in the absence of multi-rater data. We will update the results section to include a more detailed discussion of this limitation, clarifying that the QWK scores represent agreement with the available expert judgment rather than a direct comparison to human inter-rater reliability. We will also elaborate on how the differences in performance between public and criminal law cases may relate to the inherent subjectivity and complexity of the respective legal domains, as suggested.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking against external human grades

The paper reports direct experimental results from evaluating 27 LLMs on legal exam grading tasks using quadratic weighted kappa (QWK) against human expert grades as ground truth. Claims such as QWK up to 0.91 in public law with sample solutions and rubrics are measured outcomes from prompting strategies, not quantities derived from self-referential definitions, fitted parameters called predictions, or self-citation chains. The central evaluation chain relies on external annotations and standard agreement metrics, with no equations or steps that reduce to the paper's own inputs by construction. This is a standard empirical study whose results stand independently of any internal fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central performance claims rest on the assumption that quadratic weighted kappa is an appropriate agreement metric for this high-stakes domain and that the chosen prompting additions (sample solution plus rubric) are sufficient to elicit reliable legal reasoning from the models.

axioms (1)

Domain assumption: Quadratic weighted kappa reliably measures agreement between LLM-generated grades and expert human grades for legal exam solutions.
Used as the primary reported metric for all model comparisons.
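
For reference, the standard definition of the metric can be written as follows. The symbols are introduced here for illustration and are not taken from the paper: K is the number of ordered grade categories, N the number of graded answers, and O_{ij} the number of answers that received grade i from the expert and grade j from the model.

```latex
\kappa_w = 1 - \frac{\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij}\, O_{ij}}
                    {\sum_{i=1}^{K}\sum_{j=1}^{K} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i-j)^2}{(K-1)^2},
\qquad
E_{ij} = \frac{\bigl(\sum_{k=1}^{K} O_{ik}\bigr)\bigl(\sum_{k=1}^{K} O_{kj}\bigr)}{N}
```

The quadratic weights make large grade gaps count much more than near misses, which is why the choice of this metric is itself an assumption: it rewards being roughly right and says little about whether the model's reasoning matches the expert's.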

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK)..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 6 internal anchors

[1]

Abdullah Al Zubaer, Stephan Geschwind, Deborah Voss, Lorenz Wendlinger, Johann Graf Lambsdorff, Michael Granitzer, and Jelena Mitrović. 2025. Compara- tive Study of Language Models and Prompt Paradigms in Short Answer Grading. In2025 IEEE 37th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, Athens, Greece, 321–329. doi:10.110...

work page doi:10.1109/ictai66417.2025.00049 2025
[2]

Abdullah Al Zubaer, Michael Granitzer, and Jelena Mitrović. 2023. Performance analysis of large language models in the domain of legal argument mining. Frontiers in artificial intelligence6 (2023), 1278796

work page 2023
[3]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long- Document Transformer. arXiv:2004.05150 [cs.CL] https://arxiv.org/abs/2004. 05150

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Kaplan, et al . 2020. Language Models are Few-Shot Learners. InAdvances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

work page 2020
[5]

Marius Büttner and Ivan Habernal. 2024. Answering legal questions from laymen in German civil law system. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, St. Julian’s, Malta, 2015–2027. doi:10. 18653/v1/2024.eacl-long.122

work page 2024
[6]

Chlapanis, Dimitris Galanis, Nikolaos Aletras, and Ion Androut- sopoulos

Odysseas S. Chlapanis, Dimitris Galanis, Nikolaos Aletras, and Ion Androut- sopoulos. 2025. GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations. InFindings of the Association for Computational Lin- guistics: EMNLP 2025. Association for Computational Linguistics, Suzhou, China, 25099–25119. doi:10.18653/v1/2025.findings-emnlp.1368

work page doi:10.18653/v1/2025.findings-emnlp.1368 2025
[7]

Pierre Colombo, Telmo Pires, Malik Boudiaf, Rui Melo, Dominic Culver, et al

work page
[8]

InAdvances in Neural Information Processing Systems, Vol

SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain. InAdvances in Neural Information Processing Systems, Vol. 37. Curran Associates, Inc., 129672–129695. doi:10.52202/079017-4120

work page doi:10.52202/079017-4120
[9]

Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, and Michael Desa. 2024. SaulLM-7B: A pioneering Large Language Model for Law. arXiv:2403.03883 [cs.CL] https://arxiv.org/abs/2403.03883

work page arXiv 2024
[10]

Cope, Jens Frankenreiter, Scott Hirst, Eric A

Kevin L. Cope, Jens Frankenreiter, Scott Hirst, Eric A. Posner, Daniel Schwarcz, and Dane Thorley. 2025. Grading Machines: Can AI Exam-Grading Replace Law Professors? doi:10.2139/ssrn.5851362

work page doi:10.2139/ssrn.5851362 2025
[11]

2025, Astronomy and Computing, 52, 100954, doi: 10.1016/j.ascom.2025.100954

Scott A. Crossley, Perpetual Baffour, L. Burleigh, and Jules King. 2025. A large- scale corpus for assessing source-based writing quality: ASAP 2.0.Assessing Writing65 (2025), 100954. doi:10.1016/j.asw.2025.100954

work page doi:10.1016/j.asw.2025.100954 2025
[12]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL] https://arxiv.org/abs/2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Comput...

work page doi:10.18653/v1/n19-1423 2019
[14]

Elmassry, Nazar Zaki, Negmeldin Alsheikh, and Mohammed Mediani

Ahmed M. Elmassry, Nazar Zaki, Negmeldin Alsheikh, and Mohammed Mediani

work page
[15]

IEEE Access13 (2025), 121902–121917

A Systematic Review of Pretrained Models in Automated Essay Scoring. IEEE Access13 (2025), 121902–121917. doi:10.1109/ACCESS.2025.3584784

work page doi:10.1109/access.2025.3584784 2025
[16]

Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, et al. 2025. LEXam: Benchmarking Legal Reasoning on 340 Law Exams. arXiv:2505.12864 [cs.CL] https://arxiv.org/abs/2505.12864

work page arXiv 2025
[17]

Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, Jidong Ge, and Vincent Ng. 2024. LawBench: Benchmarking Legal Knowledge of Large Language Models. Conference’17, July 2017, Washington, DC, USA Abdullah Al Zubaer et al. InProceedings of the 2024 Conference on Empirical Methods in Na...

work page 2024
[18]

doi:10.18653/v1/2024.emnlp-main.452

work page doi:10.18653/v1/2024.emnlp-main.452 2024
[19]

2026.Grading Scales for In- coming Exchange Students

Freie Universität Berlin, Department of Law. 2026.Grading Scales for In- coming Exchange Students. https://www.jura.fu-berlin.de/en/international/ studierendenaustausch/incomings/noten/

work page 2026
[20]

Shahriar Golchin, Nikhil Garuda, Christopher Impey, and Matthew Wenger

work page
[21]

InProceedings of the 31st International Conference on Computational Linguistics

Grading Massive Open Online Courses Using Large Language Models. InProceedings of the 31st International Conference on Computational Linguistics. Association for Computational Linguistics, Abu Dhabi, UAE, 3899–3912

work page
[22]

2025.Gemini 2.5

Google DeepMind. 2025.Gemini 2.5. https://blog.google/technology/google- deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking Accessed: 2026-01-03

work page 2025
[23]

Ho, Christopher Ré, Adam Chilton, Aditya Narayana, et al

Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, et al. 2023. LEGALBENCH: a collaboratively built benchmark for measuring legal reasoning in large language models. InProceedings of the 37th International Conference on Neural Information Processing Systems(New Orleans, LA, USA)(NIPS ’23). Curran Associates Inc., Red Ho...

work page 2023
[24]

Ali Hakimi Parizi, Yuyang Liu, Prudhvi Nokku, Sina Gholamian, and David Emerson. 2023. A Comparative Study of Prompting Strategies for Legal Text Classification. InProceedings of the Natural Legal Language Processing Workshop

work page 2023
[25]

Association for Computational Linguistics, Singapore, 258–265. doi:10. 18653/v1/2023.nllp-1.25

work page 2023
[26]

Ben Hamner, Jaison Morgan, lynnvandev, Mark Shermis, and Tom Vander Ark

work page
[27]

https://kaggle.com/ competitions/asap-aes

The Hewlett Foundation: Automated Essay Scoring. https://kaggle.com/ competitions/asap-aes. Kaggle

work page
[28]

Emaan Hariri and Daniel E Ho. 2026. AI for Statutory Simplification: A Com- prehensive State Legal Corpus and Labor Benchmark. InProceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL ’25). Association for Computing Machinery, New York, NY, USA, 177–187. doi:10.1145/3769126.3769256

work page doi:10.1145/3769126.3769256 2026
[29]

Yue Huang and Joshua Wilson. 2025. Evaluating LLM-Based Automated Essay Scoring: Accuracy, Fairness, and Validity. InProceedings of the Artificial Intelli- gence in Measurement and Education Conference (AIME-Con): Works in Progress. National Council on Measurement in Education (NCME), Wyndham Grand Pitts- burgh, Downtown, Pittsburgh, Pennsylvania, United ...

work page 2025
[30]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles(Koblenz, Germany)(SOSP ’23). Association for Computing Machinery, New York, N...

work page doi:10.1145/3600006.3613165 2023
[31]

2025.LeoLM/leo-hessianai-70b-chat

LAION and LeoLM. 2025.LeoLM/leo-hessianai-70b-chat. Hugging Face. https: //huggingface.co/LeoLM/leo-hessianai-70b-chat Accessed: 2026-01-25

work page 2025
[32]

Shengjie Li and Vincent Ng. 2024. Automated Essay Scoring: A Reflection on the State of the Art. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Miami, Florida, USA, 17876–17888. doi:10.18653/v1/2024.emnlp-main.991

work page doi:10.18653/v1/2024.emnlp-main.991 2024
[33]

Pei Yee Liew and Ian K. T. Tan. 2025. On Automated Essay Grading using Large Language Models. InProceedings of the 2024 8th International Conference on Computer Science and Artificial Intelligence (CSAI ’24). Association for Computing Machinery, New York, NY, USA, 204–211. doi:10.1145/3709026.3709030

work page doi:10.1145/3709026.3709030 2025
[34]

2024.Completion API Documentation

LiteLLM Contributors. 2024.Completion API Documentation. https://docs.litellm. ai/docs/completion Accessed January 9, 2026

work page 2024
[35]

Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, et al

Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, et al

work page
[36]

Ministral 3

Ministral 3. arXiv:2601.08584 [cs.CL] https://arxiv.org/abs/2601.08584

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Watheq Ahmad Mansour, Salam Albatarni, Sohaila Eltanbouly, and Tamer El- sayed. 2024. Can Large Language Models Automatically Score Proficiency of Writ- ten Essays?. InProceedings of the 2024 Joint International Conference on Computa- tional Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, Torino, Italia, 2777–2786. https:...

work page 2024
[38]

2024.LLaMA 3.3 Model Card

Meta AI. 2024.LLaMA 3.3 Model Card. Meta. https://github.com/meta-llama/ llama-models/blob/main/models/llama3_3/MODEL_CARD.md Accessed: 2026- 01-26

work page 2024
[39]

2025.Mistral 3

Mistral AI. 2025.Mistral 3. Mistral AI. https://mistral.ai/news/mistral-3 Accessed: 2026-01-26

work page 2025
[40]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI, :, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Apple- baum, Edwin Arbus, Rahul K. Arora, Yu Bai, et al. 2025. gpt-oss-120b & gpt-oss- 20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

2024.GPT-4o mini: Advancing Cost-Efficient Intelligence

OpenAI. 2024.GPT-4o mini: Advancing Cost-Efficient Intelligence. OpenAI. https:// openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ Accessed: 2026-01-26

work page 2024
[42]

2024.Hello GPT-4o

OpenAI. 2024.Hello GPT-4o. OpenAI. https://openai.com/index/hello-gpt-4o/ Accessed: 2026-01-03

work page 2024
[43]

2024.OpenAI API Reference: Chat

OpenAI. 2024.OpenAI API Reference: Chat. OpenAI. https://platform.openai. com/docs/api-reference/chat Accessed: 2026-01-26

work page 2024
[44]

OpenAI. 2025. GPT-5 System Card. https://cdn.openai.com/gpt-5-system-card. pdf. Accessed: 2026-01-03

work page 2025
[45]

2025.GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum

OpenAI. 2025.GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum. System card addendum. OpenAI. https://cdn.openai.com/pdf/4173ec8d-1229- 47db-96de-06d87147e07e/5_1_system_card.pdf Accessed: 2026-01-03

work page 2025
[46]

2025.Introducing GPT-4.1 in the API

OpenAI. 2025.Introducing GPT-4.1 in the API. OpenAI. https://openai.com/ index/gpt-4-1/ Accessed: 2026-01-03

work page 2025
[47]

2025.Update to GPT-5 System Card: GPT-5.2

OpenAI. 2025.Update to GPT-5 System Card: GPT-5.2. System card update. OpenAI. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/ oai_5_2_system-card.pdf Accessed: 2026-01-03

work page 2025
[48]

2026.OpenRouter

OpenRouter, Inc. 2026.OpenRouter. https://openrouter.ai/ Accessed Jan 9, 2026

work page 2026
[49]

Christopher Ormerod and Gitit Kehat. 2025. Long context Automated Essay Scoring with Language Models. InProceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers. National Council on Measurement in Education (NCME), Wyndham Grand Pittsburgh, Downtown, Pittsburgh, Pennsylvania, United States, 35–42

work page 2025
[50]

Ramon Pires, Roseval Malaquias Junior, and Rodrigo Nogueira. 2026. Automatic Legal Writing Evaluation of LLMs. InProceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL ’25). Association for Comput- ing Machinery, New York, NY, USA, 420–424. doi:10.1145/3769126.3769227

work page doi:10.1145/3769126.3769227 2026
[51]

2025.Qwen Blog: Research and Model Advancements

Qwen Team. 2025.Qwen Blog: Research and Model Advancements. Qwen. https: //qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd Accessed: 2026- 01-25

work page 2025
[52]

Dadi Ramesh and Suresh Kumar Sanampudi. 2022. An Automated Essay Scoring Systems: A Systematic Literature Review.Artificial Intelligence Review55, 3 (2022), 2495–2527. doi:10.1007/s10462-021-10068-2

work page doi:10.1007/s10462-021-10068-2 2022
[53]

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. 2023. NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark. InFindings of the Associ- ation for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, 10776–10787

work page 2023
[54]

Sergio Servantez, Joe Barrow, Kristian Hammond, and Rajiv Jain. 2024. Chain of Logic: Rule-Based Reasoning with Large Language Models. InFindings of the As- sociation for Computational Linguistics: ACL 2024. Association for Computational Linguistics, Bangkok, Thailand, 2721–2733. doi:10.18653/v1/2024.findings-acl.159

work page doi:10.18653/v1/2024.findings-acl.159 2024
[55]

Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, and Enkelejda Kasneci. 2025. Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring. InProceedings of the 15th International Learning Analytics and Knowledge Conference (LAK ’25). Association for Computing Machinery, New York, NY, ...

work page doi:10.1145/3706468 2025
[56]

Yashothara Shanmugarasa, Ming Ding, Chamikara Mahawaga Arachchige, and Thierry Rakotoarivelo. 2025. SoK: The Privacy Paradox of Large Language Models: Advancements, Privacy Risks, and Mitigation. InProceedings of the 20th ACM Asia Conference on Computer and Communications Security (ASIA CCS ’25). Association for Computing Machinery, New York, NY, USA, 425–441

work page 2025
[57]

Michael B. Strecker, Susanne Hähnchen, Martin Heidebach, Marie Herberger, Simon Alexander Nonn, Sarah Großkopf, Gökhan Erol, Menaf Erol, Ilja Garber, Constantin Höhmann, Enci Huang, Clemens Hufeld, and Louisa Zachmann

work page
[58]

KI-Unterstützung und Rohpunkteschemata: Die Zukunft der juristischen Klausurkorrektur? https://ordnungderwissenschaft.de/wp-content/uploads/ 2025/12/Herberger.pdf

work page 2025
[59]

2025.Apertus-70B-Instruct-2509

Swiss AI Initiative. 2025.Apertus-70B-Instruct-2509. Hugging Face. https: //huggingface.co/swiss-ai/Apertus-70B-Instruct-2509 Accessed: 2026-01-25

work page 2025
[60]

Gemma Team. 2025. Gemma 3 Technical Report. arXiv:2503.19786 [cs.CL] https://arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

Qwen Team. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https: //arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Qwen Team. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning. https://qwenlm.github.io/blog/qwq-32b/

work page 2025
[63]

Masaki Uto. 2021. A review of deep-neural automated essay scoring models. Behaviormetrika48, 2 (July 2021), 459–484. doi:10.1007/s41237-021-00142-y

work page doi:10.1007/s41237-021-00142-y 2021
[64]

2025.EuroLLM-22B-Instruct-2512

UTTER Project. 2025.EuroLLM-22B-Instruct-2512. Hugging Face. https:// huggingface.co/utter-project/EuroLLM-22B-Instruct-2512 Accessed: 2026-01-25

work page 2025
[65]

Yongjie Wang, Chuang Wang, Ruobing Li, and Hui Lin. 2022. On the Use of Bert for Automated Essay Scoring: Joint Learning of Multi-Scale Essay Repre- sentation. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle,...

[66]

Oliver Wardas and Florian Matthes. 2026. LLMs for Legal Subsumption in German Employment Contracts. In Proceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL ’25). Association for Computing Machinery, New York, NY, USA, 169–176. doi:10.1145/3769126.3769209

[67]

Lorenz Wendlinger, Christian Braun, Abdullah Al Zubaer, Simon Alexander Nonn, Sarah Großkopf, Christofer Fellicious, and Michael Granitzer. 2024. On the Suitability of pre-trained foundational LLMs for Analysis in German Legal Education. arXiv:2412.15902 [cs.CL] https://arxiv.org/abs/2412.15902

[68]

Yiquan Wu, Siying Zhou, Yifei Liu, Weiming Lu, Xiaozhong Liu, Yating Zhang, Changlong Sun, Fei Wu, and Kun Kuang. 2023. Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 12060–1...

[69]

Jiayi Xie, Kaiwei Cai, Li Kong, Junsheng Zhou, and Weiguang Qu. 2022. Automated Essay Scoring via Pairwise Contrastive Regression. In Proceedings of the 29th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2724–2733. https://aclanthology.org/2022.coling-1.240/

[70]

Fangyi Yu, Lee Quartey, and Frank Schilder. 2023. Exploring the Effectiveness of Prompt Engineering for Legal Reasoning Tasks. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, 13582–13596. doi:10.18653/v1/2023.findings-acl.858

[71]

Weikang Yuan, Junjie Cao, Zhuoren Jiang, Yangyang Kang, Jun Lin, et al. 2024. Can Large Language Models Grasp Legal Theories? Enhance Legal Reasoning with Insights from Multi-Agent Collaboration. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics, Miami, Florida, USA, 7577–7597. doi:10.1865...

[72]

Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. A Reasoning-Focused Legal Retrieval Benchmark. In Proceedings of the 2025 Symposium on Computer Science and Law (Munich, Germany) (CSLAW ’25). Association for Computing Machinery, New York, NY, USA, 169–193. doi:10.1145/370...

