arxiv: 2605.08437 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Ramon Pires, Thales Sales Almeida, Celio Larcher Junior, Giovana Bonás, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Thiago Laitz, Rodrigo Nogueira

Pith reviewed 2026-05-12 02:44 UTC · model grok-4.3

keywords: legal AI, LLM benchmark, judicial reasoning, sentence writing, Brazilian law, magistrate tasks, LLM evaluation

The pith

Even the strongest LLMs score below 70 percent on tasks that require writing full judicial sentences and magistrate-level legal analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds Magis-Bench from 74 real questions taken from recent Brazilian competitive exams for judicial positions. These questions demand both multi-turn legal reasoning and the composition of complete civil or criminal court decisions. Twenty-three current LLMs were scored by four separate frontier models acting as judges, and the judges agreed closely with one another. The best result reached only 6.97 out of 10, or 69.7 percent of the maximum possible score, just under the 70 percent mark. The authors take this as evidence that advanced legal decision writing is not yet within the reliable reach of today's models.

Core claim

Magis-Bench consists of discursive legal analysis items and practical exercises that require LLMs to produce full judicial sentences grounded in Brazilian law. When 23 state-of-the-art models were evaluated by four independent frontier LLM judges, the highest average score was 6.97 out of 10, achieved by Gemini-3-Pro-Preview. All models remained below 70 percent of the maximum, and inter-judge agreement was strong, with Kendall's W at 0.984.
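The "below 70 percent" framing is simple arithmetic on the reported averages, and it is worth seeing how narrow the margin is for the top model. The sketch below uses only the three averages quoted in the abstract; the helper name `percent_of_max` and the rounding to one decimal are illustrative choices, not part of the paper's released code.

```python
# Reported top-three average scores on a 0-10 scale, as quoted in the abstract.
MAX_SCORE = 10.0
reported_averages = {
    "Gemini-3-Pro-Preview": 6.97,
    "Gemini-3-Flash-Preview": 6.67,
    "Claude-4.5-Opus": 6.46,
}


def percent_of_max(score: float, max_score: float = MAX_SCORE) -> float:
    """Express a score as a percentage of the maximum, rounded to one decimal.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    return round(score / max_score * 100, 1)


percentages = {name: percent_of_max(s) for name, s in reported_averages.items()}
all_below_70 = all(p < 70.0 for p in percentages.values())

for name, pct in percentages.items():
    print(f"{name}: {pct}% of maximum")
print("all below 70%:", all_below_70)
```

The top model lands 0.3 percentage points under the threshold, so the claim holds for the reported figures but is sensitive to small shifts in judge scoring.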

What carries the argument

Magis-Bench, a set of 74 exam-derived tasks that test the production of complete civil and criminal judicial sentences plus multi-turn doctrinal analysis.

If this is right

LLMs cannot yet be trusted to draft court decisions without close human review.
Training objectives for legal AI should emphasize weighing competing claims and applying doctrine to specific facts.
Multi-turn legal analysis remains a distinct weakness even in the strongest models.
Public release of the benchmark and model outputs allows direct comparison of future systems on the same magistrate tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If performance on this benchmark improves, LLMs could begin assisting with routine judicial drafting in lower-stakes settings.
Similar exam-derived benchmarks in other legal traditions could reveal whether the current performance ceiling is language- or jurisdiction-specific.
Persistent gaps on sentence-writing tasks suggest that scaling alone may not close the distance to human magistrate competence.

Load-bearing premise

That scores assigned by four frontier LLMs accurately reflect the quality of legal reasoning and writing that human magistrates would accept.

What would settle it

A direct comparison in which practicing magistrates independently score the same model outputs and assign average scores substantially above or below the 6.97 reported by the LLM panel.
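A minimal sketch of such a comparison follows, assuming per-model average scores are available from both a human panel and the LLM panel. All numbers are made-up placeholders, not data from the paper, and the tau-a variant (no tie correction) is an assumption chosen for brevity; the paper does not say which variant of Kendall's τ it uses.

```python
from itertools import combinations
from statistics import mean


def kendall_tau_a(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs, ignoring ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-model averages (placeholders, NOT results from the paper).
llm_panel_scores = [6.9, 6.5, 6.2, 5.0]
human_scores = [6.0, 6.3, 5.1, 4.2]

tau = kendall_tau_a(llm_panel_scores, human_scores)
mean_gap = mean(l - h for l, h in zip(llm_panel_scores, human_scores))

print(f"rank agreement (tau-a): {tau:.3f}")
print(f"mean LLM-minus-human gap: {mean_gap:.2f} points")
```

Rank agreement and mean gap answer different questions: the LLM panel could order models the same way magistrates would while still being systematically lenient or harsh in absolute terms, which is what matters for the 70 percent reading.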

Figures

Figures reproduced from arXiv: 2605.08437 by Celio Larcher Junior, Giovana Bonás, Hugo Abonizio, Marcos Piau, Ramon Pires, Rodrigo Nogueira, Roseval Malaquias Junior, Thales Sales Almeida, Thiago Laitz.

**Figure 2.** Example of a practical sentence-drafting question.

Original abstract

Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to judge such arguments (weighing competing claims, applying doctrine to facts, and rendering reasoned decisions) is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's W = 0.984; pairwise Kendall's τ ≥ 0.897), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Magis-Bench gives a new set of magistrate-level tasks from recent Brazilian exams that require full sentence writing, but the LLM-judge scores have no human anchor.

Magis-Bench pulls together questions from recent Brazilian judicial exams that require LLMs to compose full civil and criminal sentences plus multi-turn legal analyses. The main takeaway is that current models still fall short on these magistrate-level tasks according to the reported scores, but the scoring method itself needs more checking.

The paper does a clean job of assembling the benchmark from public 2023-2025 exams and releasing the questions, outputs, and code. That transparency is useful. It also shows that the four frontier LLM judges agree closely on the rankings. The weak point is the absence of any human magistrate or expert ratings of the model outputs. Without them, we cannot tell how well the LLM scores track real standards of legal quality. The high agreement among the judges is reassuring for consistency but does not address possible shared blind spots in a domain as precise as judicial writing.

This work is aimed at people building legal AI systems or creating evaluation benchmarks, particularly those interested in civil-law jurisdictions. Anyone testing models on complex reasoning and writing tasks could use the dataset. It is solid enough to go to peer review. Reviewers will likely ask for human validation, but the core contribution of the benchmark stands on its own.

Referee Report

1 major / 2 minor

Summary. The paper introduces Magis-Bench, a benchmark of 74 questions drawn from eight recent Brazilian judicial examinations (2023-2025). It covers discursive legal analysis tasks with multi-turn structure and practical exercises requiring full civil and criminal judicial sentences. The authors evaluate 23 state-of-the-art LLMs via an LLM-as-a-judge protocol using four independent frontier models, report strong inter-judge agreement (Kendall's W = 0.984; pairwise τ ≥ 0.897), and find that the best model (Gemini-3-Pro-Preview) scores 6.97/10 while all models remain below 70% of the maximum. They conclude that magistrate-level legal reasoning and writing remain challenging for current LLMs and release the benchmark, model outputs, and evaluation code.
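For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Kendall's W (coefficient of concordance) for m judges ranking n models. It uses the standard formula W = 12S / (m²(n³ − n)) without tie correction, which is an assumption made for brevity; the score matrix is a made-up placeholder, not data from the paper, and the function names are illustrative.

```python
from typing import Sequence


def average_ranks(scores: Sequence[float]) -> list[float]:
    """Rank scores ascending (1 = lowest); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kendalls_w(score_matrix: Sequence[Sequence[float]]) -> float:
    """Kendall's W for m judges (rows) scoring n models (columns).

    No tie correction is applied, so this is exact only when each judge's
    scores are free of ties.
    """
    m = len(score_matrix)
    n = len(score_matrix[0])
    rank_rows = [average_ranks(row) for row in score_matrix]
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores: 4 judges x 4 models (placeholders, NOT from the paper).
# The fourth judge swaps the order of the two lowest-scoring models.
scores = [
    [5.0, 5.5, 6.2, 6.9],
    [4.8, 5.6, 6.0, 7.1],
    [5.1, 5.4, 6.3, 6.8],
    [5.7, 5.2, 6.1, 7.0],
]
print(f"Kendall's W = {kendalls_w(scores):.3f}")
```

A value near 1 means the judges rank the models almost identically. It says nothing about whether that shared ranking matches what human graders would produce, which is the point the major comment below raises.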

Significance. If the LLM-as-a-judge scores are shown to track human magistrate standards, the benchmark would provide a valuable, reproducible resource for measuring progress on high-stakes legal decision-making tasks that go beyond argument generation. The explicit release of the full benchmark, all model outputs, and evaluation code is a clear strength that enables direct replication and extension. The headline result that even frontier models fall short of 70% would, if anchored, usefully quantify remaining gaps in doctrine application, fact-weighing, and formal judicial writing.

major comments (1)

[Evaluation protocol] Evaluation protocol (results and methodology sections): The central interpretive claim, that scores below 70% demonstrate that 'judicial-level legal reasoning and writing remain challenging', rests on the untested assumption that the four frontier LLM judges' ratings align with the standards actual Brazilian magistrates or exam graders would apply. The paper reports only agreement among the LLM judges (Kendall's W = 0.984) and provides no human-expert scoring of model outputs, no calibration set, and no correlation statistic with human ratings. In a domain where correctness depends on doctrine-specific reasoning and stylistic constraints, high LLM agreement alone does not rule out systematic shared bias.

minor comments (2)

[Abstract and Results] The abstract and results paragraphs should explicitly state the scoring rubric (e.g., what constitutes a 7/10 versus 9/10 on a judicial sentence) and whether the same rubric is applied uniformly across discursive analysis and sentence-writing tasks.
[Benchmark description] Clarify the exact number of questions per examination and per task type (discursive vs. sentence composition) so readers can assess coverage and potential imbalance.
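The breakdown requested in the second minor comment could be tabulated along these lines. This is a sketch under stated assumptions: the record fields `exam` and `task_type`, the exam labels, and the sample records are all hypothetical and do not reflect the released benchmark's actual schema or counts.

```python
from collections import Counter

# Hypothetical question records; field names and values are illustrative only.
questions = [
    {"exam": "exam_A", "task_type": "discursive"},
    {"exam": "exam_A", "task_type": "discursive"},
    {"exam": "exam_A", "task_type": "sentence"},
    {"exam": "exam_B", "task_type": "discursive"},
    {"exam": "exam_B", "task_type": "sentence"},
    {"exam": "exam_B", "task_type": "sentence"},
]


def coverage_table(items: list[dict[str, str]]) -> Counter:
    """Count questions per (exam, task_type) pair."""
    return Counter((q["exam"], q["task_type"]) for q in items)


table = coverage_table(questions)
for (exam, task_type), count in sorted(table.items()):
    print(f"{exam:8s} {task_type:12s} {count}")
print("total:", sum(table.values()))
```

Reporting the table alongside the 74-question total would let readers see at a glance whether any single examination or task type dominates the averages.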

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which identifies a key methodological consideration in our evaluation approach. We respond to the major comment below.

Referee: [Evaluation protocol] Evaluation protocol (results and methodology sections): The central interpretive claim, that scores below 70% demonstrate that 'judicial-level legal reasoning and writing remain challenging', rests on the untested assumption that the four frontier LLM judges' ratings align with the standards actual Brazilian magistrates or exam graders would apply. The paper reports only agreement among the LLM judges (Kendall's W = 0.984) and provides no human-expert scoring of model outputs, no calibration set, and no correlation statistic with human ratings. In a domain where correctness depends on doctrine-specific reasoning and stylistic constraints, high LLM agreement alone does not rule out systematic shared bias.

Authors: We agree that the absence of any measured correlation between the LLM-as-a-judge scores and ratings from actual Brazilian magistrates or exam graders is a genuine limitation for the absolute interpretation of the 70% threshold. Recruiting qualified legal experts to re-score the outputs of all 23 models on 74 complex, multi-turn tasks would require resources and time that exceed the scope of this benchmark introduction. The high inter-judge agreement (Kendall's W = 0.984) across four diverse frontier models offers evidence of evaluation consistency, and the tasks are taken verbatim from official examinations whose grading standards are public. We will revise the manuscript to (1) explicitly state this limitation in the discussion, (2) frame the headline result as evidence that current LLMs fall short of the high bar set by the benchmark rather than as a definitive human-aligned percentage, and (3) emphasize that the full release of model outputs and evaluation code is intended to enable precisely the human calibration studies the referee recommends.

revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark uses external exam questions and independent LLM judges

The paper sources its 74 questions directly from public Brazilian judicial examinations (2023-2025) and computes model scores via four separate frontier LLMs as judges, reporting only inter-judge agreement statistics. No equations, parameters, or derivations are present that reduce by construction to the paper's own inputs or self-citations. The central claim (best model at 6.97/10, all below 70% of maximum) is an empirical measurement against externally defined tasks rather than a self-referential loop, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, no new theoretical entities, and relies only on the domain assumption that LLM judges can proxy human magistrate evaluation quality.

axioms (1)

domain assumption: LLM-as-a-judge with frontier models provides a reliable proxy for human expert evaluation of legal writing quality. Invoked in the evaluation methodology section to justify using four independent LLMs for scoring.
