Recognition: no theorem link
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models
Pith reviewed 2026-05-16 21:33 UTC · model grok-4.3
The pith
VLegal-Bench introduces the first comprehensive benchmark for evaluating LLMs on Vietnamese legal reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLegal-Bench is the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, it spans multiple levels of legal understanding and comprises 10,450 samples generated through a rigorous annotation pipeline in which legal experts label and cross-validate each instance, ensuring that every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows.
What carries the argument
VLegal-Bench dataset with tasks organized by Bloom's cognitive taxonomy and validated by expert cross-validation on Vietnamese legal documents.
Load-bearing premise
The chosen tasks and expert cross-validation accurately reflect real-world Vietnamese legal assistant workflows and cover practical reasoning needs.
What would settle it
A follow-up study where independent legal experts create an alternative set of samples from the same documents and find major differences in coverage or difficulty would challenge the benchmark's validity.
Figures
original abstract
The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, the Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems. To facilitate access and reproducibility, we provide a public landing page for this benchmark at https://vilegalbench.cmcai.vn/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLegal-Bench, the first comprehensive benchmark for evaluating LLMs on Vietnamese legal reasoning. It comprises 10,450 samples spanning multiple cognitive levels per Bloom's taxonomy, including general legal Q&A, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving. Samples are produced via a described expert annotation and cross-validation pipeline intended to ground each instance in authoritative Vietnamese legal documents while mirroring real-world legal assistant workflows.
Significance. If the annotation pipeline's reliability can be substantiated, VLegal-Bench would provide a valuable, cognitively informed resource for a language and jurisdiction where legal text is complex, hierarchical, and frequently revised. The public release and focus on practical tasks represent clear contributions to multilingual legal AI evaluation.
major comments (1)
- [Annotation pipeline description] The abstract and the section describing the annotation pipeline claim that the 10,450 samples were produced through 'rigorous' expert labeling and cross-validation to ensure grounding in authoritative documents and fidelity to real-world workflows, yet no quantitative quality metrics (inter-annotator agreement, disagreement resolution rates, or validation error statistics) are reported. This absence directly undermines verification of the central claim that the benchmark accurately reflects practical Vietnamese legal reasoning needs.
minor comments (2)
- [Abstract and conclusion] The landing page URL is provided but no details are given on dataset format, licensing, or exact task distribution statistics (e.g., sample counts per cognitive level or task type).
- [Introduction] The manuscript would benefit from an explicit comparison table positioning VLegal-Bench against existing legal benchmarks (e.g., LegalBench, Vietnamese-specific resources) to clarify its incremental contribution.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. The feedback highlights an important gap in substantiating the annotation pipeline's reliability. We address the major comment point-by-point below and commit to revisions that will strengthen the manuscript without altering its core contributions.
point-by-point responses
- Referee: The abstract and the section describing the annotation pipeline claim that the 10,450 samples were produced through 'rigorous' expert labeling and cross-validation to ensure grounding in authoritative documents and fidelity to real-world workflows, yet no quantitative quality metrics (inter-annotator agreement, disagreement resolution rates, or validation error statistics) are reported. This absence directly undermines verification of the central claim that the benchmark accurately reflects practical Vietnamese legal reasoning needs.
Authors: We agree that the absence of quantitative metrics limits the ability to independently verify the pipeline's rigor. The current manuscript describes the expert annotation and cross-validation process qualitatively (including use of authoritative Vietnamese legal sources and alignment with real-world workflows), but does not report numerical statistics. In the revised version, we will add a new subsection (likely in Section 3) that includes: (1) inter-annotator agreement scores computed via Cohen's kappa and Fleiss' kappa across the expert annotators; (2) details on disagreement resolution procedures and rates; and (3) validation error statistics from the cross-validation stage. These additions will directly support the claims of rigor while preserving the existing description of the pipeline. revision: yes
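The agreement statistics the authors commit to reporting are straightforward to compute. As a minimal sketch (the expert labels below are hypothetical, not data from the paper), Cohen's kappa for two annotators corrects observed agreement for the agreement expected by chance:

```python
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann_a) == len(ann_b) and len(ann_a) > 0
    n = len(ann_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement: probability of coincidental agreement if each
    # annotator labeled independently with their own marginal distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(ann_a) | set(ann_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical validity judgments from two legal experts on ten samples.
expert_1 = ["valid", "valid", "invalid", "valid", "invalid",
            "valid", "valid", "invalid", "valid", "valid"]
expert_2 = ["valid", "valid", "invalid", "invalid", "invalid",
            "valid", "valid", "valid", "valid", "valid"]
print(round(cohen_kappa(expert_1, expert_2), 3))  # → 0.524
```

Fleiss' kappa generalizes the same correction to more than two annotators; reporting either alongside disagreement-resolution rates would substantiate the "rigorous" claim.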
Circularity Check
No circularity: benchmark construction rests on external documents and expert input
full rationale
The paper introduces VLegal-Bench through a described annotation pipeline that labels and cross-validates 10,450 samples against authoritative Vietnamese legal documents. No equations, fitted parameters, self-citations, or derivations appear in the provided text. All central claims (task coverage, cognitive grounding via Bloom's taxonomy, mirroring of real-world workflows) are presented as outcomes of the external grounding process rather than reductions to internal definitions or prior self-referential results. This is a standard benchmark-construction paper whose content is independent of any load-bearing internal loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Bloom's cognitive taxonomy supplies appropriate levels for structuring legal understanding tasks.
Reference graph
Works this paper leans on
- [1] Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4310–4330.
- [2]
- [3]
- [4] Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, and Hao Wang. 2025. LAiW: A Chinese Legal Large Language Models Benchmark. In Proceedings of the 31st International Conference on Computational Linguistics. 10738–10766.
- [5] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130 (2024).
- [6] Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. 2024. LawBench: Benchmarking Legal Knowledge of Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7933–7962.
- [7] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2023).
- [8] Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, et al. 2023. LegalBench: A Collaboratively Built Benchmark for Legal Reasoning. In Advances in Neural Information Processing Systems, Vol. 36.
- [9] Péter Homoki and Zsolt Ződi. 2024. Large Language Models and Their Possible Uses in Law. Hungarian Journal of Legal Studies 64, 3 (2024), 435–455.
- [10]
- [11] Haitao Li, You Chen, Qingyao Ai, Yueyue Wu, Ruizhe Zhang, and Yiqun Liu. 2024. LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models. Advances in Neural Information Processing Systems 37 (2024), 25061–25094.
- [12]
- [13] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 2511–2522. doi:10.18653/v1/2023.emnlp-main.153.
- [14] Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained Language Models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1037–1042.
- [15]
- [16] Minh-Tien Nguyen, Truong-Son Nguyen, Dac-Viet Lai, et al. 2021. VLSP 2021 Challenge: Vietnamese Legal Text Processing. In Proceedings of the 8th International Workshop on Vietnamese Language and Speech Processing.
- [17] Joel Niklaus, Danielle Matache, Theodor Würmli, Peter Hofer, and Ilias Chalkidis. 2023. MultiLegalPile: A 689GB Multilingual Legal Corpus for Training Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1–14.
- [18]
- [19] Long Phan, Hieu Tran, Hieu Nguyen, and Trieu H. Trinh. 2022. ViT5: Pre-trained Text-to-Text Transformer for Vietnamese Language Generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 136–142.
- [20]
- [21]
- [22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Vol. 35. 24824–24837.
- [23] Xiaoxian Yang, Zhifeng Wang, Qi Wang, Ke Wei, Kaiqi Zhang, and Jiangang Shi. 2024. Large Language Models for Automated Q&A Involving Legal Documents: A Survey on Algorithms, Frameworks and Applications. International Journal of Web Information Systems 20, 4 (2024), 413–435.
- [24]
- [25] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations.
- [26]
- [27] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems. arXiv:2306.05685 [cs.CL]. https://openreview.net/forum?id=uccHP...
- [28] Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law. 159–168.
- [29]
- [30] Wanjun Zhong, Ruixiang Cui, Yidong Guo, Yaobo Wang, et al. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv preprint arXiv:2304.06364 (2023).