pith. machine review for the scientific record.

arxiv: 2605.08437 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Pith reviewed 2026-05-12 02:44 UTC · model grok-4.3

keywords: legal AI · LLM benchmark · judicial reasoning · sentence writing · Brazilian law · magistrate tasks · LLM evaluation

The pith

Even the strongest LLMs score below 70 percent on tasks that require writing full judicial sentences and magistrate-level legal analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds Magis-Bench from 74 real questions taken from recent Brazilian competitive exams for judicial positions. These questions demand both multi-turn legal reasoning and the composition of complete civil or criminal court decisions. Twenty-three current LLMs were scored by four separate frontier models acting as judges, and the judges agreed closely with one another. The best result reached 6.97 out of 10, which is 69.7 percent of the maximum possible score and therefore just under the 70 percent mark. On this evidence, advanced legal decision writing still lies outside the reliable reach of today's models.

Core claim

Magis-Bench consists of discursive legal analysis items and practical exercises that require LLMs to produce full judicial sentences grounded in Brazilian law. When 23 state-of-the-art models were evaluated by four independent frontier LLM judges, the highest average score was 6.97 out of 10, achieved by Gemini-3-Pro-Preview. All models remained below 70 percent of the maximum, and inter-judge agreement was strong, with Kendall's W at 0.984.
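
For readers who want the agreement statistic spelled out: this page does not reproduce the paper's formula, so the following is the standard untied form of Kendall's coefficient of concordance, not necessarily the exact variant the authors computed. It assumes m = 4 judges each ranking n = 23 models, which is our reading of the setup; R_i denotes the sum of the ranks that model i receives across judges.

```latex
\[
W = \frac{12\,S}{m^{2}\,(n^{3}-n)},
\qquad
S = \sum_{i=1}^{n}\bigl(R_i - \bar{R}\bigr)^{2},
\qquad
\bar{R} = \frac{m\,(n+1)}{2}
\]
% Headline arithmetic: 6.97 / 10 = 0.697, i.e. 69.7% of the maximum, just under 0.70.
```

W ranges from 0 (no agreement on ordering) to 1 (identical rankings), so 0.984 means the four judges order the models almost identically. It says nothing by itself about whether the absolute scores are well calibrated.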

What carries the argument

Magis-Bench, a set of 74 exam-derived tasks that test the production of complete civil and criminal judicial sentences plus multi-turn doctrinal analysis.

If this is right

  • LLMs cannot yet be trusted to draft court decisions without close human review.
  • Training objectives for legal AI should emphasize weighing competing claims and applying doctrine to specific facts.
  • Multi-turn legal analysis remains a distinct weakness even in the strongest models.
  • Public release of the benchmark and model outputs allows direct comparison of future systems on the same magistrate tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If performance on this benchmark improves, LLMs could begin assisting with routine judicial drafting in lower-stakes settings.
  • Similar exam-derived benchmarks in other legal traditions could reveal whether the current performance ceiling is language- or jurisdiction-specific.
  • Persistent gaps on sentence-writing tasks suggest that scaling alone may not close the distance to human magistrate competence.

Load-bearing premise

That scores assigned by four frontier LLMs accurately reflect the quality of legal reasoning and writing that human magistrates would accept.

What would settle it

A direct comparison in which practicing magistrates independently score the same model outputs and assign average scores substantially above or below the 6.97 reported by the LLM panel.
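
A minimal sketch of how such a comparison could be summarized, assuming one average score per model from the human magistrates and one from the LLM panel. All numbers below are hypothetical placeholders, not data from the paper, and the function names are ours. Kendall's tau-a checks whether humans and the panel order the models the same way; the mean gap checks whether the panel is systematically more generous or harsher.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_gap(human, panel):
    """Average of (human score - panel score); negative means the panel is more generous."""
    return sum(h - p for h, p in zip(human, panel)) / len(human)


# Hypothetical per-model averages on a 0-10 scale (illustration only).
panel_scores = [6.9, 6.5, 6.1, 5.2, 4.0]
human_scores = [5.8, 5.9, 5.0, 4.1, 3.2]

tau = kendall_tau_a(human_scores, panel_scores)
gap = mean_gap(human_scores, panel_scores)
print(f"rank agreement (tau-a): {tau:.2f}, mean human-minus-panel gap: {gap:.2f}")
```

In this made-up case the two groups nearly agree on the ordering while the humans grade almost a full point lower, which is exactly the pattern that rank agreement among LLM judges could not reveal.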

Figures

Figures reproduced from arXiv: 2605.08437 by Celio Larcher Junior, Giovana Bonás, Hugo Abonizio, Marcos Piau, Ramon Pires, Rodrigo Nogueira, Roseval Malaquias Junior, Thales Sales Almeida, Thiago Laitz.

Figure 1: Overview of Magis-Bench: dataset construction and multi-LLM judging pipeline.
Figure 2: Example of a practical sentence-drafting question.

Original abstract

Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to judge such arguments (weighing competing claims, applying doctrine to facts, and rendering reasoned decisions) is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators. Our results show strong inter-judge agreement (Kendall's W = 0.984; pairwise Kendall's τ ≥ 0.897), with Google's Gemini-3-Pro-Preview achieving the highest average score (6.97/10), followed by Gemini-3-Flash-Preview (6.67) and Claude-4.5-Opus (6.46). Even the best-performing models score below 70% of the maximum, indicating that judicial-level legal reasoning and writing remain challenging for current LLMs. We release the complete benchmark, model outputs, and evaluation code to support further research on legal AI capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Magis-Bench, a benchmark of 74 questions drawn from eight recent Brazilian judicial examinations (2023-2025). It covers discursive legal analysis tasks with multi-turn structure and practical exercises requiring full civil and criminal judicial sentences. The authors evaluate 23 state-of-the-art LLMs via an LLM-as-a-judge protocol using four independent frontier models, report strong inter-judge agreement (Kendall's W = 0.984; pairwise τ ≥ 0.897), and find that the best model (Gemini-3-Pro-Preview) scores 6.97/10 while all models remain below 70% of the maximum. They conclude that magistrate-level legal reasoning and writing remain challenging for current LLMs and release the benchmark, model outputs, and evaluation code.

Significance. If the LLM-as-a-judge scores are shown to track human magistrate standards, the benchmark would provide a valuable, reproducible resource for measuring progress on high-stakes legal decision-making tasks that go beyond argument generation. The explicit release of the full benchmark, all model outputs, and evaluation code is a clear strength that enables direct replication and extension. The headline result that even frontier models fall short of 70% would, if anchored, usefully quantify remaining gaps in doctrine application, fact-weighing, and formal judicial writing.

major comments (1)
  1. [Evaluation protocol] (results and methodology sections): The central interpretive claim, that scores below 70% demonstrate that 'judicial-level legal reasoning and writing remain challenging', rests on the untested assumption that the four frontier LLM judges' ratings align with the standards that actual Brazilian magistrates or exam graders would apply. The paper reports only agreement among the LLM judges (Kendall's W = 0.984) and provides no human-expert scoring of model outputs, no calibration set, and no correlation statistic with human ratings. In a domain where correctness depends on doctrine-specific reasoning and stylistic constraints, high agreement among LLM judges alone does not rule out systematic shared bias.
minor comments (2)
  1. [Abstract and Results] The abstract and results paragraphs should explicitly state the scoring rubric (e.g., what constitutes a 7/10 versus 9/10 on a judicial sentence) and whether the same rubric is applied uniformly across discursive analysis and sentence-writing tasks.
  2. [Benchmark description] Clarify the exact number of questions per examination and per task type (discursive vs. sentence composition) so readers can assess coverage and potential imbalance.
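
To make the major comment concrete, here is a small sketch of why a rank-based agreement statistic cannot detect a bias that all judges share. It uses the standard untied form of Kendall's W; the score matrix is hypothetical, the function names are ours, and nothing here is taken from the paper's released evaluation code.

```python
def ranks(scores):
    """Rank scores from 1 (lowest) to n (highest); assumes no ties within a judge."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def kendalls_w(score_matrix):
    """Kendall's W for m judges (rows) scoring n items (columns), without tie correction."""
    m = len(score_matrix)
    n = len(score_matrix[0])
    rank_rows = [ranks(row) for row in score_matrix]
    rank_totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in rank_totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical scores: 4 judges x 5 models on a 0-10 scale (illustration only).
panel = [
    [6.9, 6.5, 6.1, 5.2, 4.0],
    [7.1, 6.4, 6.2, 5.0, 4.3],
    [6.5, 6.6, 5.9, 5.3, 3.9],  # this judge swaps the top two models
    [7.0, 6.3, 6.0, 5.1, 4.1],
]

# Simulate a shared leniency bias: every judge inflates every score by 1.5 points.
inflated = [[score + 1.5 for score in row] for row in panel]

w_original = kendalls_w(panel)
w_inflated = kendalls_w(inflated)
print(f"W original: {w_original:.4f}, W with shared bias: {w_inflated:.4f}")
```

Both values come out the same because W depends only on each judge's ordering of the models. A uniform inflation or deflation of scores, or any other bias that preserves the ordering, leaves W untouched, which is why the referee asks for an external human anchor.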

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which identifies a key methodological consideration in our evaluation approach. We respond to the major comment below.

Point-by-point responses
  1. Referee: [Evaluation protocol] (results and methodology sections): The central interpretive claim, that scores below 70% demonstrate that 'judicial-level legal reasoning and writing remain challenging', rests on the untested assumption that the four frontier LLM judges' ratings align with the standards that actual Brazilian magistrates or exam graders would apply. The paper reports only agreement among the LLM judges (Kendall's W = 0.984) and provides no human-expert scoring of model outputs, no calibration set, and no correlation statistic with human ratings. In a domain where correctness depends on doctrine-specific reasoning and stylistic constraints, high agreement among LLM judges alone does not rule out systematic shared bias.

    Authors: We agree that the absence of any measured correlation between the LLM-as-a-judge scores and ratings from actual Brazilian magistrates or exam graders is a genuine limitation for the absolute interpretation of the 70% threshold. Recruiting qualified legal experts to re-score the outputs of all 23 models on 74 complex tasks would require resources and time beyond the scope of this benchmark introduction. The high inter-judge agreement (Kendall's W = 0.984) across four diverse frontier models offers evidence of evaluation consistency, and the tasks are taken verbatim from official examinations whose grading standards are public. We will revise the manuscript to (1) explicitly state this limitation in the discussion, (2) frame the headline result as evidence that current LLMs fall short of the high bar set by the benchmark rather than as a definitive human-aligned percentage, and (3) emphasize that the full release of model outputs and evaluation code is intended to enable precisely the human calibration studies the referee recommends. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark uses external exam questions and independent LLM judges

Full rationale

The paper sources its 74 questions directly from public Brazilian judicial examinations (2023-2025) and computes model scores via four separate frontier LLMs as judges, reporting only inter-judge agreement statistics. No equations, parameters, or derivations are present that reduce by construction to the paper's own inputs or self-citations. The central claim (best model at 6.97/10, all below 70% of maximum) is an empirical measurement against externally defined tasks rather than a self-referential loop, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, no new theoretical entities, and relies only on the domain assumption that LLM judges can proxy human magistrate evaluation quality.

axioms (1)
  • domain assumption: LLM-as-a-judge with frontier models provides a reliable proxy for human expert evaluation of legal writing quality.
    Invoked in the evaluation methodology section to justify using four independent LLMs for scoring.

pith-pipeline@v0.9.0 · 5596 in / 1197 out tokens · 46936 ms · 2026-05-12T02:44:58.375457+00:00 · methodology
