Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task

Hiroaki Yamada; Jungmin Choi; Keisuke Sakaguchi

arxiv: 2604.23730 · v1 · submitted 2026-04-26 · 💻 cs.AI

Pith reviewed 2026-05-08 05:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords: large language models, legal reasoning, Japanese bar exam, open-ended tasks, hallucinations, expert evaluation

The pith

Legal experts identify clear weaknesses in LLMs' open-ended reasoning on Japanese bar exam writing tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds the first dataset drawn from the writing section of the Japanese bar exam to test LLMs on open-ended legal reasoning, where models must extract multiple issues from long fact patterns and produce structured free-text arguments. Legal experts manually score the model outputs to measure performance on realistic legal analysis. This matters because multiple-choice legal benchmarks already show strong LLM results, yet actual legal work requires generating complete arguments without external support. The study further examines when and how models insert unsupported facts or legal claims into their responses.

Core claim

Through real exam questions, model-generated answers, and expert evaluations, the work demonstrates that current LLMs still fall short on producing complete and accurate legal arguments in the Japanese jurisdiction, with notable shortfalls in identifying all relevant issues and in avoiding content not supported by the given facts or law.

What carries the argument

A dataset of Japanese bar exam writing questions paired with LLM responses and scored by legal experts.

Load-bearing premise

The chosen exam questions and the legal experts' scoring judgments give a representative and reliable picture of open-ended legal reasoning ability.

What would settle it

A new panel of legal experts independently scoring the same set of LLM responses and arriving at substantially different overall quality ratings.
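
The settling test above amounts to an inter-rater agreement check between the original panel and a new one. A minimal sketch, assuming each panel assigns every response a single categorical grade; the `cohens_kappa` helper, the pass/fail labels, and the six example grades are hypothetical illustrations, not the paper's rubric or data:

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters (or panels) grading the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same grade.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement, from each rater's marginal grade frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical grade throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades for the same six LLM answers (not data from the paper).
original_panel = ["fail", "fail", "pass", "fail", "pass", "fail"]
new_panel = ["fail", "pass", "pass", "fail", "fail", "fail"]

kappa = cohens_kappa(original_panel, new_panel)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.25"
```

A kappa near 1 would mean the two panels largely agree; a value near 0 would mean agreement is no better than chance, which is the outcome that would undermine the load-bearing premise. If the rubric uses ordinal scores rather than categories, a weighted kappa would likely be the better fit.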

Figures

Figures reproduced from arXiv: 2604.23730 by Hiroaki Yamada, Jungmin Choi, Keisuke Sakaguchi.

**Figure 1.** Excerpt from a writing-test question.

Original abstract

Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to our best knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free text format. Our key contribution is the manual evaluation of LLMs' generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the first dataset for evaluating LLMs on open-ended legal reasoning using the writing component of the Japanese bar exam. LLMs generate responses to these tasks requiring issue identification from narratives and structured arguments, which are then manually evaluated by legal experts to identify limitations and analyze hallucinations.

Significance. If the evaluations prove reliable, this provides a novel benchmark for realistic legal reasoning in Japanese law, extending beyond multiple-choice tests and offering insights into hallucination patterns that could inform safer deployment of LLMs in legal applications.

major comments (2)

[Abstract] The abstract claims that the expert evaluation reveals limitations but gives no sample size, scoring rubric, inter-rater agreement, or quantitative results, so readers cannot assess how well the stated limitations are supported.
[Evaluation section] The central claim depends on the manual evaluation protocol, but without details on how experts were selected, how disagreements were resolved, or the exact criteria for 'limitations in legal reasoning', the findings risk being subjective and non-reproducible.

minor comments (1)

[Title] The title contains 'LLM's' which should be corrected to 'LLMs'' for proper plural possessive form.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our manuscript. We address each of the major comments below and have revised the manuscript accordingly to improve clarity and reproducibility.

Referee: [Abstract] The abstract claims that the expert evaluation reveals limitations but gives no sample size, scoring rubric, inter-rater agreement, or quantitative results, so readers cannot assess how well the stated limitations are supported.

Authors: We agree that the abstract would be strengthened by including key quantitative details. In the revised manuscript, we have updated the abstract to specify the sample size (including the number of bar exam questions and LLMs evaluated), a high-level description of the scoring rubric, inter-rater agreement measures, and summary quantitative results. The detailed findings remain in the body of the paper, but this change allows readers to more readily assess the support for our claims regarding limitations in legal reasoning.

Revision: yes

Referee: [Evaluation section] The central claim depends on the manual evaluation protocol, but without details on how experts were selected, how disagreements were resolved, or the exact criteria for 'limitations in legal reasoning', the findings risk being subjective and non-reproducible.

Authors: We concur that additional details on the evaluation protocol are necessary to ensure the findings are reproducible and less subjective. We have substantially expanded the Evaluation section to describe: the qualifications and selection process for the legal experts (all qualified Japanese attorneys with expertise in the relevant legal areas); the procedure for resolving inter-expert disagreements (through discussion leading to consensus); and the precise criteria used to identify limitations in legal reasoning (such as failure to spot issues, misapplication of statutes or precedents, and logical inconsistencies). We also now report quantitative inter-rater reliability metrics. These revisions directly address the concerns about subjectivity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper conducts an empirical study by applying LLMs to real Japanese bar exam writing tasks and obtaining manual evaluations from independent legal experts. No derivation chain, equations, fitted parameters, or predictions are present. The central contribution rests on external exam questions and expert judgments rather than any self-referential definitions, self-citation load-bearing premises, or renamings of known results. The analysis of hallucinations is likewise a direct manual characterization of model outputs against precedents and law. This setup is self-contained against external benchmarks with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that expert manual judgments constitute valid evidence of reasoning quality and that the selected exam items are representative.

axioms (1)

Domain assumption: Expert legal judgments are reliable measures of reasoning quality.
The study relies on manual expert evaluation without reported details on criteria, training, or agreement statistics.
