pith. sign in

arxiv: 2605.29738 · v1 · pith:UOAKAFOQnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Pith reviewed 2026-06-29 07:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords legal NLP benchmarkcross-jurisdictional evaluationLLM few-shot promptingcross-lingual transferlabel-set alignmenttokenizer fertilitycourt registry data
0
0 comments X

The pith

Multi-Legal-Bench tests five identical legal tasks across six countries and shows model rankings shift by task and jurisdiction while cross-lingual transfer tracks label-set alignment more than language family.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Multi-Legal-Bench to make direct comparison of LLMs possible on legal reasoning by defining the same five tasks in six jurisdictions that use different languages and court systems. It evaluates frontier models in zero-shot and three-shot settings on data drawn from national registries covering 134 million decisions. Results indicate that few-shot gains observed in one language appear consistently elsewhere, no model leads in every setting, and transfer quality from Ukrainian examples aligns better with how closely the output labels match than with whether the target language belongs to the same family. Tokenizer efficiency across models also fails to explain accuracy gaps.

Core claim

Multi-Legal-Bench defines five tasks—court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction—mapped to structured metadata from six national court registries and evaluates seven frontier LLMs plus smaller models under zero- and three-shot prompting. The evaluation reveals that task-dependent few-shot effects replicate across jurisdictions, model rankings vary with both task and country, cross-lingual transfer from Ukrainian to French outperforms transfer to Polish despite language-family differences, label-set alignment predicts transfer success better than language proximity, and tokenizer fertility shows no sign

What carries the argument

The sparse 5x6 task-jurisdiction matrix that maps five identical tasks to structured metadata from six national court registries.

If this is right

  • Task-dependent few-shot effects hold across all tested jurisdictions.
  • No single model leads in every task-jurisdiction combination.
  • Cross-lingual few-shot transfer quality correlates with label-set alignment rather than language family membership.
  • Tokenizer fertility across a 2.3x range does not predict cross-lingual accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If label-set alignment drives transfer, standardizing output schemas across languages before prompting could improve few-shot results more than choosing linguistically close source languages.
  • The absence of a tokenizer effect points to pretraining corpus content as the dominant factor for legal-domain cross-lingual performance.
  • Jurisdiction-specific model rankings imply that general-purpose legal LLMs may underperform compared with models tuned on the target registry's metadata format.

Load-bearing premise

The five tasks remain meaningfully comparable across jurisdictions that use different legal traditions and court registries, allowing performance gaps to be attributed to model capabilities rather than differences in what the tasks actually require.

What would settle it

Re-annotating the same task in two jurisdictions so that the legal elements to be extracted or predicted become structurally equivalent and then observing that relative model accuracies do not change would indicate the original tasks were not equivalent.

Figures

Figures reproduced from arXiv: 2605.29738 by Volodymyr Ovcharov.

Figure 1
Figure 1. Figure 1: Best zero-shot accuracy per task–jurisdiction cell. CTC is near-ceiling everywhere; [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Zero-shot accuracy for all 7 models across task–jurisdiction combinations. No single [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Few-shot delta (percentage points) by task and jurisdiction. Boxes show the range [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Tokenizer fertility by language and model. French is most efficient (1.70–2.37); [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Fertility vs. zero-shot accuracy on classification tasks (CTC, JFC, CCP). The correlation [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
read the original abstract

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Multi-Legal-Bench, the first cross-jurisdictional legal benchmark evaluating five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, cause category prediction) across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania) via mappings to national court registry metadata, forming a sparse 5x6 matrix. It evaluates 11 LLMs (7 frontier via Bedrock, 4 smaller for scaling) in zero-shot and 3-shot settings and reports that (1) task-dependent few-shot effects replicate across jurisdictions, (2) no model dominates and rankings shift by task/jurisdiction, (3) cross-lingual transfer (e.g., UA->FR better than UA->PL) is predicted by label-set alignment rather than language family, and (4) tokenizer fertility (2.3x spread) does not predict accuracy (r=-0.27, p=0.14). All data, prompts, and predictions are released.

Significance. If the task mappings ensure meaningful comparability, the benchmark fills a gap in legal NLP by enabling direct cross-jurisdictional and cross-lingual comparisons on real court data from 134M decisions. The empirical findings on few-shot replication, non-language-proximity transfer, and tokenizer irrelevance are falsifiable and could inform model selection and prompting strategies in multilingual legal applications. The public release of data, prompts, and predictions is a clear strength supporting reproducibility.

major comments (1)
  1. [Abstract, §3] Abstract and §3 (task definition): The replication claim (1), transfer patterns (3), and tokenizer analysis (4) all treat the five tasks as comparable across jurisdictions with distinct legal traditions and registries. The manuscript states the tasks are 'mapped to structured metadata from national court registries' but provides no evidence (e.g., label cardinalities, input length distributions, or inter-registry alignment metrics) that 'cause category' or 'judgment form' operationalizations have equivalent difficulty or label-set overlap in UA vs. NL vs. FR; this assumption is load-bearing for attributing performance deltas to model capabilities rather than task inequivalence.
minor comments (1)
  1. [Abstract] The abstract reports specific deltas (UA->FR -2.1 pp, UA->PL -13.7 pp) and correlation (r=-0.27) without stating the exact number of examples or statistical controls used; adding these details would improve clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The major comment raises a valid point about the need for explicit evidence of task comparability across jurisdictions, which we address below by committing to revisions that add the requested metrics.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (task definition): The replication claim (1), transfer patterns (3), and tokenizer analysis (4) all treat the five tasks as comparable across jurisdictions with distinct legal traditions and registries. The manuscript states the tasks are 'mapped to structured metadata from national court registries' but provides no evidence (e.g., label cardinalities, input length distributions, or inter-registry alignment metrics) that 'cause category' or 'judgment form' operationalizations have equivalent difficulty or label-set overlap in UA vs. NL vs. FR; this assumption is load-bearing for attributing performance deltas to model capabilities rather than task inequivalence.

    Authors: We agree that the manuscript would be strengthened by providing quantitative evidence of task equivalence to support cross-jurisdictional comparisons. While each task is defined via direct mapping to official national court registry fields (ensuring they reflect real legal metadata rather than synthetic constructs), the current version does not report label cardinalities, input length statistics, or alignment metrics. In the revised manuscript we will add a dedicated subsection (and accompanying table) in §3 that reports, for every filled cell in the 5×6 matrix: (i) number of unique labels, (ii) mean and variance of input token lengths, and (iii) any available label-set overlap statistics. This will allow readers to evaluate the degree of operational equivalence and to interpret performance differences with appropriate caution. We view this addition as directly responsive to the concern and will not alter the core empirical claims. revision: yes

Circularity Check

0 steps flagged

Pure empirical benchmark evaluation with no derivations or self-referential structure

full rationale

The paper defines a new benchmark (5 tasks mapped to court-registry metadata across 6 jurisdictions) and reports direct empirical results from LLM evaluations under zero- and few-shot prompting. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear in the abstract or described claims. All reported deltas, correlations, and replication statements are computed from model outputs on the released data rather than quantities defined in terms of those outputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on one domain assumption about task comparability across legal systems; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The five tasks can be meaningfully mapped to structured metadata from different national court registries despite varying legal traditions.
    Invoked when defining the 5x6 task-jurisdiction matrix and claiming replication of effects across jurisdictions.

pith-pipeline@v0.9.1-grok · 5827 in / 1663 out tokens · 44544 ms · 2026-06-29T07:22:06.635091+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Beatriz Canaverde, Telmo Pessoa Pires, Leonor Melo Ribeiro, and André F

    URLhttps://arxiv.org/abs/2309.07445. Beatriz Canaverde, Telmo Pessoa Pires, Leonor Melo Ribeiro, and André F. T. Martins. Legal- Bench.PT: A benchmark for Portuguese law.arXiv preprint arXiv:2502.16357,

  2. [2]

    13 Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Mar- tin Katz, and Nikolaos Aletras

    URL https://arxiv.org/abs/2502.16357. 13 Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Mar- tin Katz, and Nikolaos Aletras. Lexglue: A benchmark dataset for legal language understanding in english. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 4310–4330,

  3. [3]

    Nguyen Tien Dong et al

    URLhttps://arxiv.org/abs/2110.00976. Nguyen Tien Dong et al. VLegal-Bench: Cognitively grounded benchmark for Vietnamese legal reasoning of large language models.arXiv preprint arXiv:2512.14554,

  4. [4]

    VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

    URL https://arxiv.org/abs/2512.14554. Yu Fan, Jingwei Ni, Jakob Merane, and Joel Niklaus. LEXam: Benchmarking legal reasoning on 340 law exams.arXiv preprint arXiv:2505.12864,

  5. [6]

    Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Ré, Adam Chilton, Aditya Narang, Alex Choi, Claudia Gruber, et al

    URL https://arxiv.org/abs/2512.11297. Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Ré, Adam Chilton, Aditya Narang, Alex Choi, Claudia Gruber, et al. Legalbench: A collaboratively built benchmark for measuring legal reasoning in large language models. InAdvances in Neural Information Processing Systems, volume 36,

  6. [7]

    Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball

    URLhttps://arxiv.org/abs/2308.11462. Dan Hendrycks, Collin Burns, Aryeh Chen, and Spencer Ball. Cuad: An expert-annotated NLP dataset for legal contract review. InProceedings of the 35th International Conference on Neural Information Processing Systems,

  7. [8]

    Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo

    URLhttps://arxiv.org/abs/2103.06268. Faris Hijazi, Somayah AlHarbi, Abdulaziz AlHussein, Harethah Abu Shairah, Reem AlZahrani, Hebah AlShamlan, Omar Knio, and George Turkiyyah. ArabLegalEval: A multitask benchmark for assessing Arabic legal knowledge in large language models. InProceedings of the Second Arabic Natural Language Processing Conference (Arabi...

  8. [9]

    Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, and Nezihe Merve Gürel

    URLhttps:// arxiv.org/abs/2408.07983. Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, and Nezihe Merve Gürel. Evaluating the limits of large language models in multilingual legal reasoning.arXiv preprint arXiv:2509.22472,

  9. [10]

    Joel Niklaus, Matthias Stürmer, and Ilias Chalkidis

    URLhttps://arxiv.org/abs/2509.22472. Joel Niklaus, Matthias Stürmer, and Ilias Chalkidis. An empirical study on cross-x transfer for legal judgment prediction. InProceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics,

  10. [11]

    URLhttps://arxiv.org/abs/2209. 12325. Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Galassi, Matthias Stürmer, and Ilias Chalkidis. Lextreme: A multi-lingual and multi-task benchmark for the legal domain. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 4973–5006,

  11. [12]

    Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Mark Stevenson

    URL https://arxiv.org/abs/2301.13126. Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Mark Stevenson. MultiLe- galPile: A 689GB multilingual legal corpus. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics,

  12. [13]

    Volodymyr Ovcharov

    URLhttps://arxiv.org/abs/2306.02069. Volodymyr Ovcharov. The tokenizer tax across 25 European languages: Domain invariance, cross-lingual few-shot effects, and the Ukrainian penalty.arXiv preprint arXiv:2605.24718, 2025a. URLhttps://arxiv.org/abs/2605.24718. Volodymyr Ovcharov. Tokenizer fertility and zero-shot performance on Ukrainian legal text. arXiv p...

  13. [14]

    URLhttps://arxiv.org/abs/2306.09237. 15