Magis-Bench is a new benchmark of 74 magistrate-level legal writing tasks from Brazilian exams where the strongest LLMs reach only 6.97/10, showing judicial reasoning remains difficult for current models.
arXiv preprint arXiv:2505.12864
9 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (9)
roles: background (1)
polarities: background (1)
citing papers explorer
- Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
  Magis-Bench is a new benchmark of 74 magistrate-level legal writing tasks from Brazilian exams where the strongest LLMs reach only 6.97/10, showing judicial reasoning remains difficult for current models.
- Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
  Expert evaluation of LLMs on Japanese bar exam writing tasks shows clear limitations in open-ended legal reasoning and frequent hallucinations unsupported by law or precedent.
- LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
  LegalCiteBench reveals that current LLMs achieve under 7% accuracy on closed-book legal citation retrieval and completion tasks, with misleading answer rates above 94% for nearly all tested models.
- BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks
  BenGER is a collaborative web platform that integrates end-to-end workflows for creating, annotating, running, and evaluating benchmarks on German legal tasks with large language models.
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
- GradeLegal: Automated Grading for German Legal Cases
  Reasoning-oriented LLMs reach up to 0.91 quadratic weighted kappa agreement with experts on public law cases when given sample solutions and grading rubrics, but only 0.60 on criminal law cases.
- Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
  Automatic prompt optimization using lenient LLM judges improves performance and transferability in legal QA evaluations compared to human design or strict judges.
- Beyond Imperfect Alternatives with Rulemapping: A Neuro-Symbolic Case Study on Online Hate Speech
  Rulemapping uses expert symbolic scaffolds to constrain LLMs, raising precision on §130(1) German hate-speech classification from 0.34-0.49 to 0.80-0.86 while preserving recall of 0.82-0.89.
- NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System
  NyayaMind combines RAG retrieval with domain-specific LLMs to generate transparent, structured legal reasoning and judgment predictions for Indian court cases.