Recognition: no theorem link
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
Pith reviewed 2026-05-16 21:33 UTC · model grok-4.3
The pith
VLegal-Bench introduces the first comprehensive benchmark for evaluating LLMs on Vietnamese legal reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLegal-Bench is the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, it spans multiple levels of legal understanding and comprises 10,450 samples generated through a rigorous annotation pipeline in which legal experts label and cross-validate each instance, ensuring that every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows.
What carries the argument
VLegal-Bench dataset with tasks organized by Bloom's cognitive taxonomy and validated by expert cross-validation on Vietnamese legal documents.
Load-bearing premise
The chosen tasks and expert cross-validation accurately reflect real-world Vietnamese legal assistant workflows and cover practical reasoning needs.
What would settle it
A follow-up study where independent legal experts create an alternative set of samples from the same documents and find major differences in coverage or difficulty would challenge the benchmark's validity.
Figures
original abstract
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, the Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems. To facilitate access and reproducibility, we provide a public landing page for this benchmark at https://vilegalbench.cmcai.vn/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLegal-Bench, the first comprehensive benchmark for evaluating LLMs on Vietnamese legal reasoning. It comprises 10,450 samples spanning multiple cognitive levels per Bloom's taxonomy, including general legal Q&A, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving. Samples are produced via a described expert annotation and cross-validation pipeline intended to ground each instance in authoritative Vietnamese legal documents while mirroring real-world legal assistant workflows.
Significance. If the annotation pipeline's reliability can be substantiated, VLegal-Bench would provide a valuable, cognitively informed resource for a language and jurisdiction where legal text is complex, hierarchical, and frequently revised. The public release and focus on practical tasks represent clear contributions to multilingual legal AI evaluation.
major comments (1)
- [Annotation pipeline description] The abstract and the section describing the annotation pipeline claim that the 10,450 samples were produced through 'rigorous' expert labeling and cross-validation to ensure grounding in authoritative documents and fidelity to real-world workflows, yet no quantitative quality metrics (inter-annotator agreement, disagreement resolution rates, or validation error statistics) are reported. This absence directly undermines verification of the central claim that the benchmark accurately reflects practical Vietnamese legal reasoning needs.
minor comments (2)
- [Abstract and conclusion] The landing page URL is provided but no details are given on dataset format, licensing, or exact task distribution statistics (e.g., sample counts per cognitive level or task type).
- [Introduction] The manuscript would benefit from an explicit comparison table positioning VLegal-Bench against existing legal benchmarks (e.g., LegalBench, Vietnamese-specific resources) to clarify its incremental contribution.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. The feedback highlights an important gap in substantiating the annotation pipeline's reliability. We address the major comment point-by-point below and commit to revisions that will strengthen the manuscript without altering its core contributions.
point-by-point responses
- Referee: The abstract and the section describing the annotation pipeline claim that the 10,450 samples were produced through 'rigorous' expert labeling and cross-validation to ensure grounding in authoritative documents and fidelity to real-world workflows, yet no quantitative quality metrics (inter-annotator agreement, disagreement resolution rates, or validation error statistics) are reported. This absence directly undermines verification of the central claim that the benchmark accurately reflects practical Vietnamese legal reasoning needs.
Authors: We agree that the absence of quantitative metrics limits the ability to independently verify the pipeline's rigor. The current manuscript describes the expert annotation and cross-validation process qualitatively (including use of authoritative Vietnamese legal sources and alignment with real-world workflows), but does not report numerical statistics. In the revised version, we will add a new subsection (likely in Section 3) that includes: (1) inter-annotator agreement scores computed via Cohen's kappa and Fleiss' kappa across the expert annotators; (2) details on disagreement resolution procedures and rates; and (3) validation error statistics from the cross-validation stage. These additions will directly support the claims of rigor while preserving the existing description of the pipeline. revision: yes
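The agreement statistics the authors commit to reporting are straightforward to compute. As a minimal sketch (the expert labels below are hypothetical, not data from the paper), Cohen's kappa for two annotators corrects observed agreement for the agreement expected by chance:

```python
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann_a) == len(ann_b) and len(ann_a) > 0
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement: probability of coincidental agreement if each
    # annotator labeled independently with their own marginal distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(ann_a) | set(ann_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical validity judgments from two legal experts on ten samples.
expert_1 = ["valid", "valid", "invalid", "valid", "invalid",
            "valid", "valid", "invalid", "valid", "valid"]
expert_2 = ["valid", "valid", "invalid", "invalid", "invalid",
            "valid", "valid", "valid", "valid", "valid"]
print(round(cohen_kappa(expert_1, expert_2), 3))  # → 0.524
```

Fleiss' kappa generalizes the same correction to more than two annotators; reporting either alongside disagreement-resolution rates would substantiate the "rigorous" claim.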
Circularity Check
No circularity: benchmark construction rests on external documents and expert input
full rationale
The paper introduces VLegal-Bench through a described annotation pipeline that labels and cross-validates 10,450 samples against authoritative Vietnamese legal documents. No equations, fitted parameters, self-citations, or derivations appear in the provided text. All central claims (task coverage, cognitive grounding via Bloom's taxonomy, mirroring of real-world workflows) are presented as outcomes of the external grounding process rather than reductions to internal definitions or prior self-referential results. This is a standard benchmark-construction paper whose content is independent of any load-bearing internal loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Bloom's cognitive taxonomy supplies appropriate levels for structuring legal understanding tasks.
Reference graph
Works this paper leans on
- [1] Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4310–4330.
- [2]
- [3]
- [4] Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, and Hao Wang. 2025. LAiW: A Chinese Legal Large Language Models Benchmark. In Proceedings of the 31st International Conference on Computational Linguistics. 10738–10766.
- [5] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130 (2024).
- [6] Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. 2024. LawBench: Benchmarking Legal Knowledge of Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7933–7962.
- [7] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2023).
- [8] Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, et al. 2023. LegalBench: A Collaboratively Built Benchmark for Legal Reasoning. In Advances in Neural Information Processing Systems, Vol. 36.
- [9] Péter Homoki and Zsolt Ződi. 2024. Large Language Models and Their Possible Uses in Law. Hungarian Journal of Legal Studies 64, 3 (2024), 435–455.
- [10]
- [11] Haitao Li, You Chen, Qingyao Ai, Yueyue Wu, Ruizhe Zhang, and Yiqun Liu. 2024. LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models. Advances in Neural Information Processing Systems 37 (2024), 25061–25094.
- [12]
- [13] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 2511–2522. doi:10.18653/v1/2023.emnlp-main.153.
- [14] Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained Language Models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1037–1042.
- [15]
- [16] Minh-Tien Nguyen, Truong-Son Nguyen, Dac-Viet Lai, et al. 2021. VLSP 2021 Challenge: Vietnamese Legal Text Processing. In Proceedings of the 8th International Workshop on Vietnamese Language and Speech Processing.
- [17] Joel Niklaus, Danielle Matache, Theodor Würmli, Peter Hofer, and Ilias Chalkidis. 2023. MultiLegalPile: A 689GB Multilingual Legal Corpus for Training Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1–14.
- [18]
- [19] Long Phan, Hieu Tran, Hieu Nguyen, and Trieu H. Trinh. 2022. ViT5: Pre-trained Text-to-Text Transformer for Vietnamese Language Generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 136–142.
- [20]
- [21]
- [22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Vol. 35. 24824–24837.
- [23] Xiaoxian Yang, Zhifeng Wang, Qi Wang, Ke Wei, Kaiqi Zhang, and Jiangang Shi. 2024. Large Language Models for Automated Q&A Involving Legal Documents: A Survey on Algorithms, Frameworks and Applications. International Journal of Web Information Systems 20, 4 (2024), 413–435.
- [24]
- [25] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations.
- [26]
- [27] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems. arXiv:2306.05685 [cs.CL]. https://openreview.net/forum?id=uccHP...
- [28] Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law. 159–168.
- [29]
- [30] Wanjun Zhong, Ruixiang Cui, Yidong Guo, Yaobo Wang, et al. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv preprint arXiv:2304.06364 (2023).