Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Pith reviewed 2026-05-12 02:44 UTC · model grok-4.3
The pith
Even the strongest LLMs score below 70 percent on tasks that require writing full judicial sentences and magistrate-level legal analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Magis-Bench consists of discursive legal analysis items and practical exercises that require LLMs to produce full judicial sentences grounded in Brazilian law. When 23 state-of-the-art models were evaluated by four independent frontier LLM judges, the highest average score was 6.97 out of 10, achieved by Gemini-3-Pro-Preview. All models remained below 70 percent of the maximum, and inter-judge agreement was strong, with Kendall's W at 0.984.
What carries the argument
Magis-Bench, a set of 74 exam-derived tasks that test the production of complete civil and criminal judicial sentences plus multi-turn doctrinal analysis.
If this is right
- LLMs cannot yet be trusted to draft court decisions without close human review.
- Training objectives for legal AI should emphasize weighing competing claims and applying doctrine to specific facts.
- Multi-turn legal analysis remains a distinct weakness even in the strongest models.
- Public release of the benchmark and model outputs allows direct comparison of future systems on the same magistrate tasks.
Where Pith is reading between the lines
- If performance on this benchmark improves, LLMs could begin assisting with routine judicial drafting in lower-stakes settings.
- Similar exam-derived benchmarks in other legal traditions could reveal whether the current performance ceiling is language- or jurisdiction-specific.
- Persistent gaps on sentence-writing tasks suggest that scaling alone may not close the distance to human magistrate competence.
Load-bearing premise
That scores assigned by four frontier LLMs accurately reflect the quality of legal reasoning and writing that human magistrates would accept.
What would settle it
A direct comparison in which practicing magistrates independently score the same model outputs and assign average scores substantially above or below the 6.97 reported by the LLM panel.
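The comparison described above could be summarized with two numbers: a rank correlation between the magistrates' scores and the LLM panel's scores, and the average gap between them. The sketch below is a minimal illustration under stated assumptions, not the paper's released code. The function names, model labels, and every score are hypothetical placeholders, and Kendall's tau is computed in its simple tau-a form, which does not correct for ties.

```python
from itertools import combinations
from statistics import mean


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / total_pairs


def compare_panels(llm_scores, human_scores):
    """Compare per-model average scores from two panels keyed by model name."""
    models = sorted(llm_scores)
    llm = [llm_scores[m] for m in models]
    human = [human_scores[m] for m in models]
    return {
        "tau": kendall_tau_a(llm, human),
        # Positive gap: humans score higher than the LLM panel on average.
        "mean_gap": mean(h - l for h, l in zip(human, llm)),
    }


# Hypothetical placeholder scores on a 0-10 scale; not data from the paper.
llm_panel = {"model_a": 7.0, "model_b": 6.5, "model_c": 6.0, "model_d": 5.0}
magistrates = {"model_a": 5.5, "model_b": 6.0, "model_c": 4.5, "model_d": 4.0}

result = compare_panels(llm_panel, magistrates)
print(f"tau = {result['tau']:.3f}, mean gap = {result['mean_gap']:+.3f}")
```

A high tau with a large mean gap would suggest the LLM panel orders models sensibly but is miscalibrated in absolute terms; a low tau would call the ordering itself into question.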
Original abstract
Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to judge such arguments (weighing competing claims, applying doctrine to facts, and rendering reasoned decisions) is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's W = 0.984; pairwise Kendall's τ ≥ 0.897), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.
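For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of Kendall's W using the standard formula W = 12S / (m²(n³ − n)) for m judges ranking n models, where S is the sum of squared deviations of the per-model rank sums from their mean. It omits the tie correction and is not the paper's evaluation code; the function names and the score matrix are invented for illustration.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kendalls_w(score_matrix):
    """Kendall's W for m judges (rows) scoring n models (columns), no tie correction."""
    m = len(score_matrix)
    n = len(score_matrix[0])
    rank_rows = [average_ranks(row) for row in score_matrix]
    rank_sums = [sum(row[col] for row in rank_rows) for col in range(n)]
    mean_sum = mean(rank_sums)
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical average scores: 4 judges (rows) x 5 models (columns).
judge_scores = [
    [6.9, 6.6, 6.4, 5.8, 5.1],
    [7.1, 6.7, 6.5, 5.9, 5.0],
    [6.8, 6.9, 6.3, 5.7, 5.2],  # this judge swaps the top two models
    [7.0, 6.5, 6.6, 5.6, 4.9],  # this judge swaps the second and third
]

print(f"Kendall's W = {kendalls_w(judge_scores):.4f}")
```

W equals 1 when every judge produces the same ranking and falls toward 0 as the rankings become unrelated. Because it is computed from ranks, it says nothing about whether the judges agree on absolute score levels.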
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Magis-Bench, a benchmark of 74 questions drawn from eight recent Brazilian judicial examinations (2023-2025). It covers discursive legal analysis tasks with multi-turn structure and practical exercises requiring full civil and criminal judicial sentences. The authors evaluate 23 state-of-the-art LLMs via an LLM-as-a-judge protocol using four independent frontier models, report strong inter-judge agreement (Kendall's W = 0.984; pairwise τ ≥ 0.897), and find that the best model (Gemini-3-Pro-Preview) scores 6.97/10 while all models remain below 70% of the maximum. They conclude that magistrate-level legal reasoning and writing remain challenging for current LLMs and release the benchmark, model outputs, and evaluation code.
Significance. If the LLM-as-a-judge scores are shown to track human magistrate standards, the benchmark would provide a valuable, reproducible resource for measuring progress on high-stakes legal decision-making tasks that go beyond argument generation. The explicit release of the full benchmark, all model outputs, and evaluation code is a clear strength that enables direct replication and extension. The headline result that even frontier models fall short of 70% would, if anchored to human ratings, usefully quantify remaining gaps in doctrine application, fact-weighing, and formal judicial writing.
major comments (1)
- [Evaluation protocol] Results and methodology sections: The central interpretive claim, that scores below 70% demonstrate that 'judicial-level legal reasoning and writing remain challenging', rests on the untested assumption that the four frontier LLM judges' ratings align with the standards actual Brazilian magistrates or exam graders would apply. The paper reports only agreement among the LLM judges (Kendall's W = 0.984) and provides no human-expert scoring of model outputs, no calibration set, and no correlation statistic with human ratings. In a domain where correctness depends on doctrine-specific reasoning and stylistic constraints, high LLM agreement alone does not rule out systematic shared bias.
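The referee's point about shared bias can be made concrete with a toy example. In the sketch below, every judge's score is a hypothetical human reference score plus a common leniency offset plus a small judge-specific offset. All names and numbers are assumptions chosen for illustration; nothing here is taken from the paper's data.

```python
def ranking(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores a human panel might assign (0-10 scale).
human_reference = {"model_a": 5.0, "model_b": 4.5, "model_c": 4.0, "model_d": 3.0}

SHARED_BIAS = 1.5  # hypothetical leniency common to every LLM judge
judge_offsets = [0.0, 0.1, -0.1, 0.05]  # hypothetical per-judge differences

# Each simulated judge = human reference + shared bias + its own small offset.
judges = [
    {model: score + SHARED_BIAS + offset for model, score in human_reference.items()}
    for offset in judge_offsets
]

# All judges produce the same ranking as each other (and as the reference) ...
same_ranking = all(ranking(judge) == ranking(human_reference) for judge in judges)

# ... yet the panel's average score is inflated relative to the reference.
panel_mean = {
    model: sum(judge[model] for judge in judges) / len(judges)
    for model in human_reference
}
inflation = sum(
    panel_mean[model] - human_reference[model] for model in human_reference
) / len(human_reference)

print(f"identical rankings: {same_ranking}, mean inflation: {inflation:.4f}")
```

In this construction any rank-based agreement statistic among the judges is perfect, while the absolute scores sit about 1.5 points above the reference. Only a comparison against human ratings would expose the offset.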
minor comments (2)
- [Abstract and Results] The abstract and results paragraphs should explicitly state the scoring rubric (e.g., what constitutes a 7/10 versus 9/10 on a judicial sentence) and whether the same rubric is applied uniformly across discursive analysis and sentence-writing tasks.
- [Benchmark description] Clarify the exact number of questions per examination and per task type (discursive vs. sentence composition) so readers can assess coverage and potential imbalance.
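The breakdown requested in the second minor comment amounts to a simple cross-tabulation. The sketch below assumes each benchmark question is available as a record with `exam` and `task_type` fields; those field names, the function name, and the sample records are hypothetical, since the released benchmark's schema is not described here.

```python
from collections import Counter


def coverage_table(questions):
    """Count questions per exam, per task type, and per (exam, task type) pair."""
    return {
        "per_exam": Counter(q["exam"] for q in questions),
        "per_task_type": Counter(q["task_type"] for q in questions),
        "per_cell": Counter((q["exam"], q["task_type"]) for q in questions),
    }


# Hypothetical sample records; not the actual benchmark contents.
sample_questions = [
    {"exam": "exam_1", "task_type": "discursive"},
    {"exam": "exam_1", "task_type": "discursive"},
    {"exam": "exam_1", "task_type": "sentence"},
    {"exam": "exam_2", "task_type": "discursive"},
    {"exam": "exam_2", "task_type": "sentence"},
]

table = coverage_table(sample_questions)
for (exam, task_type), count in sorted(table["per_cell"].items()):
    print(f"{exam:8s} {task_type:12s} {count}")
```

Reporting the per-cell counts alongside the totals would let readers see whether a few examinations or one task type dominate the average scores.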
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies a key methodological consideration in our evaluation approach. We respond to the major comment below.
Referee: [Evaluation protocol] Results and methodology sections: The central interpretive claim, that scores below 70% demonstrate that 'judicial-level legal reasoning and writing remain challenging', rests on the untested assumption that the four frontier LLM judges' ratings align with the standards actual Brazilian magistrates or exam graders would apply. The paper reports only agreement among the LLM judges (Kendall's W = 0.984) and provides no human-expert scoring of model outputs, no calibration set, and no correlation statistic with human ratings. In a domain where correctness depends on doctrine-specific reasoning and stylistic constraints, high LLM agreement alone does not rule out systematic shared bias.
Authors: We agree that the absence of a direct comparison between the LLM-as-a-judge scores and ratings from actual Brazilian magistrates or exam graders is a genuine limitation for the absolute interpretation of the 70% threshold. Recruiting qualified legal experts to re-score the outputs of all 23 models on 74 complex, multi-turn tasks would require resources and time beyond the scope of this benchmark introduction. The high inter-judge agreement (Kendall's W = 0.984) across four diverse frontier models offers evidence of evaluation consistency, and the tasks are taken verbatim from official examinations whose grading standards are public. We will revise the manuscript to (1) explicitly state this limitation in the discussion, (2) frame the headline result as evidence that current LLMs fall short of the high bar set by the benchmark rather than as a definitive human-aligned percentage, and (3) emphasize that the full release of model outputs and evaluation code is intended to enable precisely the human calibration studies the referee recommends.
Revision: partial
Circularity Check
No circularity: benchmark uses external exam questions and independent LLM judges
The paper sources its 74 questions directly from public Brazilian judicial examinations (2023-2025) and computes model scores via four separate frontier LLMs as judges, reporting only inter-judge agreement statistics. No equations, parameters, or derivations are present that reduce by construction to the paper's own inputs or self-citations. The central claim (best model at 6.97/10, all below 70% of maximum) is an empirical measurement against externally defined tasks rather than a self-referential loop, satisfying the self-contained benchmark criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-as-a-judge scoring with frontier models provides a reliable proxy for human expert evaluation of legal writing quality.