
arXiv: 2606.18699 · v1 · pith:D4UNK3MZnew · submitted 2026-06-17 · 💻 cs.CL · cs.AI · cs.IR

TW-LegalBench: Measuring Taiwanese Legal Understanding

Pith reviewed 2026-06-26 20:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords Taiwanese law · legal benchmark · large language models · legal judgment prediction · exam evaluation · civil law · LLM performance

The pith

Top LLMs exceed the passing threshold for Taiwanese lawyers on official exams but struggle to cite exact legal articles in judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TW-LegalBench to test LLMs on Taiwanese law with real exam questions and court cases drawn from public official sources. It finds that leading models clear the passing score for qualified lawyers on multiple-choice and essay tasks yet fall short of the thresholds for judges and prosecutors. On legal judgment prediction the models handle verdict types and sentence lengths at usable rates but rarely name the precise statutes that the judgments require. This matters because it shows where current models reach near-human exam performance while still lacking the precision needed for reliable legal text output.

Core claim

TW-LegalBench comprises over 16,000 multiple-choice questions from five years of examinations across 18 domains, 117 open-ended essay questions scored against official rubrics, and more than 14,000 legal judgment prediction instances spanning hundreds of crime categories. Evaluation of 13 LLMs shows that top models clear the passing threshold for lawyers (passing rate: 11 percent) but fall short of the thresholds for judges and prosecutors (passing rates: 1-2 percent). Verdict-type accuracy and sentence prediction are reasonable, while exact legal-article citation remains weak.

What carries the argument

TW-LegalBench benchmark of MCQs, rubric-scored OEQs, and LJP instances from Taiwan's public legal corpus, measured by accuracy, decomposed LLM-as-Judge scores, and statute-citation metrics.
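
The statute-citation metric named above is not defined in this summary. As a minimal sketch, assuming citations are compared as sets of normalized identifier strings, precision, recall, F1, and exact match could be computed as follows. The function name `citation_prf` and the identifier strings are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def citation_prf(predicted: Iterable[str], gold: Iterable[str]) -> dict[str, float]:
    """Set-based precision, recall, F1, and exact match over cited articles.

    Assumption: each citation is already normalized to one string per article.
    """
    pred_set = {p.strip() for p in predicted if p.strip()}
    gold_set = {g.strip() for g in gold if g.strip()}
    hits = len(pred_set & gold_set)

    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": float(pred_set == gold_set),
    }


# Illustrative identifiers only; they do not come from the benchmark.
scores = citation_prf(
    predicted=["Law A Art. 1", "Law A Art. 2"],
    gold=["Law A Art. 1", "Law A Art. 3"],
)
```

A set-based comparison ignores citation order and duplicates. Whether the paper scores citations this way is an assumption of this sketch.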

If this is right

  • LLMs can reach lawyer-level performance on jurisdiction-specific qualification exams.
  • Exact statute citation in judgment tasks requires targeted improvement beyond current capabilities.
  • Reliable legal text generation stays difficult even when exam scores approach human thresholds.
  • The benchmark supplies a public yardstick for tracking progress on civil-law legal reasoning.
  • Rubric-based LLM judging offers a way to score open-ended legal responses at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other civil-law jurisdictions could construct comparable benchmarks from their own public examination and judgment records.
  • Models that succeed on exams may still need retrieval augmentation to handle precise citation demands.
  • The gap between multiple-choice success and open-ended citation accuracy points to distinct sub-skills in legal AI.
  • Such benchmarks could guide the design of tools that support legal education without claiming full professional competence.

Load-bearing premise

The tasks taken from official examinations and judgments sufficiently represent the legal understanding required in actual practice.

What would settle it

A follow-up study in which practicing Taiwanese lawyers rate the benchmark items as unrepresentative of daily work, or in which new models achieve both high exam scores and accurate statute citation on the LJP set.

Figures

Figures reproduced from arXiv: 2606.18699 by Chan Wei Hsu, Chun Huang Lin, Fei-Yueh Chen, Kuan Hsuan Yeh, Kuan-Ming Chen, Patrick Chung-Chia Huang, Zih-Ching Chen.

Figure 1: Main Framework for TW-LegalBench
Figure 2: An error analysis example from the Civil Code on …
Figure 3: Model performance on OEQs compared to human examinees averaged over 2021-2024. (a) shows results for the …

Original abstract

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench that utilizes Taiwanese legal system's rich official corpus open to the public to fill the gap in evaluating LLMs on Taiwanese law, among common-law benchmarks that focus on English sources and civil-law benchmarks focusing on sources of Simplified Chinese. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1~2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.
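
The abstract describes OEQ scoring as a decomposed LLM-as-Judge framework built on rubric points. One way such a scheme could work, offered only as a sketch: judge each rubric point separately, then sum the awarded credit. The `RubricPoint` type, the `score_essay` function, and the keyword-matching stand-in for the judge model are all hypothetical; the paper's actual prompts and aggregation rule are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RubricPoint:
    description: str   # what the answer must address
    max_points: float  # credit available for this point


# A judge maps (answer, rubric point) to a fraction of credit in [0, 1].
Judge = Callable[[str, RubricPoint], float]


def score_essay(answer: str, rubric: list[RubricPoint], judge: Judge) -> float:
    """Sum per-point credit; each rubric point is judged independently."""
    total = 0.0
    for point in rubric:
        fraction = min(1.0, max(0.0, judge(answer, point)))  # clamp to [0, 1]
        total += fraction * point.max_points
    return total


def keyword_judge(answer: str, point: RubricPoint) -> float:
    """Stand-in for an LLM call: full credit if the point's text appears."""
    return 1.0 if point.description.lower() in answer.lower() else 0.0


rubric = [RubricPoint("unjust enrichment", 10.0), RubricPoint("limitation period", 5.0)]
total = score_essay("The claim rests on unjust enrichment.", rubric, keyword_judge)
```

Passing the judge in as a callable keeps the aggregation logic testable without a model call. A real judge would replace `keyword_judge` with a prompted model.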

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TW-LegalBench, a benchmark for LLMs on Taiwanese law drawn from public official sources. It includes >16k MCQs from five years of exams across 18 domains, 117 OEQs with official rubrics, and >14k LJP instances across hundreds of crime categories. Thirteen LLMs are evaluated via accuracy (MCQs), a decomposed LLM-as-Judge rubric scorer (OEQs), and verdict/sentence/citation metrics (LJP). The central claims are that top models clear the passing threshold for qualified lawyers (passing rate: 11%) but fall short of the thresholds for judges and prosecutors (passing rates: 1-2%), while LJP shows reasonable verdict-type and sentence accuracy but poor exact statute citation.

Significance. If the tasks validly measure the targeted capabilities, the work supplies a much-needed jurisdiction-specific benchmark for a civil-law system using Traditional Chinese sources, complementing existing English/common-law and Simplified-Chinese resources. The reliance on official public corpora is a clear strength for reproducibility and future extension.

major comments (3)
  1. [Abstract and §4 (results)] The claim that top models 'exceed the passing threshold for qualified lawyers (passing rate: 11%)' treats MCQ/OEQ accuracy as directly comparable to human qualification thresholds. No section demonstrates that benchmark performance correlates with on-the-job legal reasoning metrics (e.g., case outcome quality or expert ratings), leaving the threshold comparison load-bearing for the 'legal understanding' interpretation.
  2. [§3.2 (OEQ evaluation)] The decomposed LLM-as-Judge framework inherits any gaps in the official rubrics and potential biases of the judge model. The manuscript reports no human validation, inter-rater agreement, or calibration against expert scorers, which directly affects the reliability of the OEQ results used to support the passing-rate claims.
  3. [§3.3 and §4.3 (LJP)] The observation that models 'struggle to cite exact legal articles' is presented without accompanying error analysis (e.g., distinguishing hallucination, retrieval failure, or knowledge cutoff). This omission weakens the downstream claim that 'reliable legal text generation remains challenging.'
minor comments (2)
  1. [Table 1 and §2] Ensure consistent reporting of the exact number of instances per domain and any filtering steps applied to the official exam corpus.
  2. [§5 (discussion)] Add a brief limitations paragraph addressing the gap between exam-style tasks and open-ended practical legal work (e.g., multi-party argumentation or novel fact patterns).
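
Major comment 2 asks for agreement statistics between the LLM judge and expert scorers. A minimal sketch of one such statistic, Cohen's kappa over per-rubric-point decisions, is given here. It assumes both raters label each rubric point with a categorical decision such as "met" or "unmet"; the labels and data are illustrative and are not results from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


llm_judge = ["met", "met", "unmet", "unmet"]
human = ["met", "unmet", "unmet", "unmet"]
kappa = cohens_kappa(llm_judge, human)
```

Kappa corrects raw agreement for chance, which matters when most rubric points share one label. For graded rather than binary credit, a weighted kappa or a correlation measure may fit better.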

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment point by point below, indicating planned changes to the manuscript where appropriate.

  1. Referee: [Abstract and §4 (results)] The claim that top models 'exceed the passing threshold for qualified lawyers (passing rate: 11%)' treats MCQ/OEQ accuracy as directly comparable to human qualification thresholds. No section demonstrates that benchmark performance correlates with on-the-job legal reasoning metrics (e.g., case outcome quality or expert ratings), leaving the threshold comparison load-bearing for the 'legal understanding' interpretation.

    Authors: The tasks and thresholds are drawn directly from official qualification examinations, providing a standardized basis for comparison to the reported passing rates. We agree, however, that the manuscript does not establish correlations with on-the-job performance metrics. In revision we will add an explicit limitations paragraph distinguishing exam-based evaluation from real-world legal practice and noting the absence of such validation data.
    Revision: partial

  2. Referee: [§3.2 (OEQ evaluation)] The decomposed LLM-as-Judge framework inherits any gaps in the official rubrics and potential biases of the judge model. The manuscript reports no human validation, inter-rater agreement, or calibration against expert scorers, which directly affects the reliability of the OEQ results used to support the passing-rate claims.

    Authors: The framework applies the official rubrics verbatim. We acknowledge that no human validation or inter-rater statistics are reported. The revised manuscript will include a limitations discussion of this point and a recommendation for future expert calibration studies.
    Revision: yes

  3. Referee: [§3.3 and §4.3 (LJP)] The observation that models 'struggle to cite exact legal articles' is presented without accompanying error analysis (e.g., distinguishing hallucination, retrieval failure, or knowledge cutoff). This omission weakens the downstream claim that 'reliable legal text generation remains challenging.'

    Authors: We agree that an error analysis would strengthen the interpretation. The revision will add a short error-analysis subsection that categorizes citation failures with illustrative examples to better support the claim regarding challenges in legal text generation.
    Revision: yes
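
The promised error analysis would need some scheme for sorting citation failures. As a rough sketch, each predicted citation could be bucketed against the gold citations and a list of articles known to exist. The category names, the `"<law> Art. <n>"` identifier format, and the `known_articles` lookup are assumptions of this example, not the authors' taxonomy.

```python
from collections import Counter


def law_of(citation: str) -> str:
    """Return the law name, assuming the format '<law> Art. <n>'."""
    return citation.rsplit(" Art. ", 1)[0]


def categorize_citation(predicted: str, gold: set[str], known_articles: set[str]) -> str:
    """Assign one predicted citation to a hypothetical failure category."""
    if predicted in gold:
        return "correct"
    if predicted not in known_articles:
        return "nonexistent_article"  # a candidate hallucination
    if any(law_of(g) == law_of(predicted) for g in gold):
        return "wrong_article_same_law"
    return "wrong_law"


def tally_errors(predictions: list[str], gold: set[str], known_articles: set[str]) -> Counter:
    """Count categories over all citations predicted for one judgment."""
    return Counter(categorize_citation(p, gold, known_articles) for p in predictions)


# Illustrative identifiers only; they do not come from the benchmark.
known = {"Law A Art. 1", "Law A Art. 2", "Law B Art. 1"}
counts = tally_errors(
    ["Law A Art. 1", "Law A Art. 2", "Law B Art. 1", "Law A Art. 999"],
    gold={"Law A Art. 1"},
    known_articles=known,
)
```

This bucketing separates citations to articles that do not exist from citations to real but inapplicable articles. It cannot by itself distinguish a knowledge-cutoff error from a reasoning error; that would need statute version dates.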

Circularity Check

0 steps flagged

No circularity: empirical benchmark from public official sources

The paper presents TW-LegalBench as a collection of MCQs, OEQs, and LJP instances drawn directly from public Taiwanese examination corpora and judgments. Evaluation uses standard accuracy, rubric-based LLM-as-Judge, and citation metrics with no equations, fitted parameters, or predictions that reduce to the inputs by construction. Threshold comparisons (e.g., lawyer passing rate 11%) are external benchmarks applied to observed scores, not derived internally. No self-citation load-bearing steps or ansatz smuggling appear. The work is self-contained against external data sources.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because this is an empirical benchmark paper, no free parameters, axioms, or invented entities underlie a central claim; the work focuses on dataset creation and model evaluation.

pith-pipeline@v0.9.1-grok · 5821 in / 1175 out tokens · 32854 ms · 2026-06-26T20:57:24.579095+00:00 · methodology

