pith. machine review for the scientific record.

arxiv: 2512.14554 · v5 · submitted 2025-12-16 · 💻 cs.CL · cs.AI

Recognition: no theorem link

VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Vietnamese legal benchmark · LLM evaluation · legal reasoning · Bloom taxonomy · AI legal systems · benchmark dataset · Vietnamese law · cognitive taxonomy

The pith

VLegal-Bench introduces the first comprehensive benchmark for evaluating LLMs on Vietnamese legal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VLegal-Bench to address the lack of tools for evaluating how well large language models handle Vietnamese law. It draws on Bloom's cognitive taxonomy to structure tasks that range from basic knowledge recall to advanced problem solving in legal scenarios. With 10,450 samples created and validated by legal experts from authoritative documents, the benchmark aims to mirror real workflows such as answering questions, using retrieval, reasoning in steps, and solving cases. This matters because Vietnamese legislation is complex and changes often, so accurate assessment helps build trustworthy AI legal assistants. The work provides a public resource to standardize testing in this area.
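
To make the benchmark's shape concrete, the sketch below shows one way such samples could be represented and tallied per cognitive level. It is a minimal Python illustration: the field names, the JSON Lines layout, and the file name are assumptions made for this page, not the paper's published schema, which is documented on the benchmark's landing page.

```python
# A minimal sketch of a benchmark sample record and a per-level tally.
# Field names, the JSON Lines layout, and the file name are assumptions made
# for illustration; the actual release format is on the paper's landing page.
import json
from collections import Counter
from dataclasses import dataclass

@dataclass
class LegalSample:
    sample_id: str
    task_type: str              # e.g. "qa", "rag", "multi_step_reasoning", "scenario"
    bloom_level: str            # one of the benchmark's five cognitive levels
    question: str
    reference_answer: str
    source_articles: list[str]  # pointers to the authoritative legal documents

def load_samples(path: str) -> list[LegalSample]:
    """Read samples from a JSON Lines file (hypothetical format)."""
    with open(path, encoding="utf-8") as f:
        return [LegalSample(**json.loads(line)) for line in f]

def level_distribution(samples: list[LegalSample]) -> Counter:
    """Count samples per cognitive level, e.g. to inspect coverage balance."""
    return Counter(s.bloom_level for s in samples)

if __name__ == "__main__":
    samples = load_samples("vlegal_bench.jsonl")  # hypothetical file name
    print(level_distribution(samples))
```

A per-level tally of this kind is also what the referee's minor comment below asks the authors to report explicitly.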

Core claim

VLegal-Bench is the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, it spans multiple levels of legal understanding and comprises 10,450 samples produced through a rigorous annotation pipeline in which legal experts label and cross-validate each instance using a purpose-built annotation system, so that every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows.
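
As a reading aid only, the snippet below sketches what a cross-validation step of this kind might look like: two experts annotate the same instance, and any disagreement on the answer or on the cited articles is escalated to a senior reviewer. The number of annotators, the disagreement criteria, and the escalation rule are assumptions; the excerpt does not spell out the protocol.

```python
# An illustrative double-annotation / escalation step, one plausible reading of
# "label and cross-validate each instance". The number of annotators, the
# disagreement criteria, and the escalation rule are assumptions, not the
# paper's documented protocol.
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    annotator_id: str
    answer: str
    cited_articles: frozenset[str]  # legal articles the expert grounded the answer in

def needs_adjudication(a: Annotation, b: Annotation) -> bool:
    """Flag a sample for senior review when two experts disagree on the answer
    or on which articles ground it."""
    return a.answer.strip() != b.answer.strip() or a.cited_articles != b.cited_articles

def samples_to_escalate(pairs: list[tuple[Annotation, Annotation]]) -> list[int]:
    """Return indices of double-annotated samples that a senior expert should resolve."""
    return [i for i, (a, b) in enumerate(pairs) if needs_adjudication(a, b)]
```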

What carries the argument

VLegal-Bench dataset with tasks organized by Bloom's cognitive taxonomy and validated by expert cross-validation on Vietnamese legal documents.

Load-bearing premise

The chosen tasks and expert cross-validation accurately reflect real-world Vietnamese legal assistant workflows and cover practical reasoning needs.

What would settle it

A follow-up study where independent legal experts create an alternative set of samples from the same documents and find major differences in coverage or difficulty would challenge the benchmark's validity.
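
One concrete way to run that comparison is sketched below under stated assumptions: measure how much the legal articles covered by the two independently built sets overlap, with the article identifiers here being invented placeholders. Difficulty could be compared analogously, for example by the gap in mean model accuracy on the two sets.

```python
# A rough sketch of the proposed robustness check: compare which legal articles
# two independently constructed sample sets cover. The article identifiers below
# are invented placeholders; a difficulty comparison (e.g. gap in mean model
# accuracy on the two sets) would follow the same pattern.
def article_coverage_overlap(set_a: set[str], set_b: set[str]) -> float:
    """Jaccard overlap between the articles covered by two sample sets."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

overlap = article_coverage_overlap(
    {"Penal Code Art. 51", "Labor Code Art. 8"},
    {"Penal Code Art. 51", "Civil Code Art. 12"},
)
print(f"coverage overlap: {overlap:.2f}")  # a low value would challenge coverage claims
```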

Figures

Figures reproduced from arXiv: 2512.14554 by Binh Vu, Dang Van Tu, Dao Xuan Quang Minh, Minh-Anh Nguyen, Nguyen Thi Ngoc Anh, Nguyen Tien Dong, Nguyen Tuan Ngoc, Phan Phi Hai, Thanh Dat Hoang.

Figure 1: The five-level cognitive framework of VLegal-Bench.
Figure 2: Legal benchmark data pipeline: data are collected from Vietnamese legal sources, preprocessed, stored in a database, …
Figure 3: Annotation tool interface: a custom-built tool that supports junior annotators by attaching senior-selected Articles, …
Figure 4: Legal document retrieval tool: we provide a search interface that enables legal experts to efficiently locate relevant …
original abstract

The rapid advancement of large language models (LLMs) has enabled new possibilities for applying artificial intelligence within the legal domain. Nonetheless, the complexity, hierarchical organization, and frequent revisions of Vietnamese legislation pose considerable challenges for evaluating how well these models interpret and utilize legal knowledge. To address this gap, the Vietnamese Legal Benchmark (VLegal-Bench) is introduced, the first comprehensive benchmark designed to systematically assess LLMs on Vietnamese legal tasks. Informed by Bloom's cognitive taxonomy, VLegal-Bench encompasses multiple levels of legal understanding through tasks designed to reflect practical usage scenarios. The benchmark comprises 10,450 samples generated through a rigorous annotation pipeline, where legal experts label and cross-validate each instance using our annotation system to ensure every sample is grounded in authoritative legal documents and mirrors real-world legal assistant workflows, including general legal questions and answers, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving tailored to Vietnamese law. By providing a standardized, transparent, and cognitively informed evaluation framework, VLegal-Bench establishes a solid foundation for assessing LLM performance in Vietnamese legal contexts and supports the development of more reliable, interpretable, and ethically aligned AI-assisted legal systems. To facilitate access and reproducibility, we provide a public landing page for this benchmark at https://vilegalbench.cmcai.vn/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces VLegal-Bench, the first comprehensive benchmark for evaluating LLMs on Vietnamese legal reasoning. It comprises 10,450 samples spanning multiple cognitive levels per Bloom's taxonomy, including general legal Q&A, retrieval-augmented generation, multi-step reasoning, and scenario-based problem solving. Samples are produced via a described expert annotation and cross-validation pipeline intended to ground each instance in authoritative Vietnamese legal documents while mirroring real-world legal assistant workflows.

Significance. If the annotation pipeline's reliability can be substantiated, VLegal-Bench would provide a valuable, cognitively informed resource for a language and jurisdiction where legal text is complex, hierarchical, and frequently revised. The public release and focus on practical tasks represent clear contributions to multilingual legal AI evaluation.

major comments (1)
  1. [Annotation pipeline description] The abstract and the section describing the annotation pipeline claim that the 10,450 samples were produced through 'rigorous' expert labeling and cross-validation to ensure grounding in authoritative documents and fidelity to real-world workflows, yet no quantitative quality metrics (inter-annotator agreement, disagreement resolution rates, or validation error statistics) are reported. This absence directly undermines verification of the central claim that the benchmark accurately reflects practical Vietnamese legal reasoning needs.
minor comments (2)
  1. [Abstract and conclusion] The landing page URL is provided but no details are given on dataset format, licensing, or exact task distribution statistics (e.g., sample counts per cognitive level or task type).
  2. [Introduction] The manuscript would benefit from an explicit comparison table positioning VLegal-Bench against existing legal benchmarks (e.g., LegalBench, Vietnamese-specific resources) to clarify its incremental contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review. The feedback highlights an important gap in substantiating the annotation pipeline's reliability. We address the major comment point-by-point below and commit to revisions that will strengthen the manuscript without altering its core contributions.

point-by-point responses
  1. Referee: The abstract and the section describing the annotation pipeline claim that the 10,450 samples were produced through 'rigorous' expert labeling and cross-validation to ensure grounding in authoritative documents and fidelity to real-world workflows, yet no quantitative quality metrics (inter-annotator agreement, disagreement resolution rates, or validation error statistics) are reported. This absence directly undermines verification of the central claim that the benchmark accurately reflects practical Vietnamese legal reasoning needs.

    Authors: We agree that the absence of quantitative metrics limits the ability to independently verify the pipeline's rigor. The current manuscript describes the expert annotation and cross-validation process qualitatively (including use of authoritative Vietnamese legal sources and alignment with real-world workflows), but does not report numerical statistics. In the revised version, we will add a new subsection (likely in Section 3) that includes: (1) inter-annotator agreement scores computed via Cohen's kappa and Fleiss' kappa across the expert annotators; (2) details on disagreement resolution procedures and rates; and (3) validation error statistics from the cross-validation stage. These additions will directly support the claims of rigor while preserving the existing description of the pipeline. revision: yes
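
The promised agreement statistics could be computed along the lines of the sketch below: Cohen's kappa via scikit-learn for the two-rater case, and Fleiss' kappa via statsmodels when three or more experts label each sample. The label categories and rater counts shown are illustrative placeholders, not numbers from the paper.

```python
# Illustrative computation of the agreement metrics the rebuttal promises to report.
# Labels and rater counts are placeholders, not data from the paper.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Two experts labelling the same five samples; categories are illustrative.
expert_a = ["accept", "revise", "accept", "reject", "accept"]
expert_b = ["accept", "accept", "accept", "reject", "revise"]
print("Cohen's kappa:", cohen_kappa_score(expert_a, expert_b))

# Three experts per sample: rows are samples, columns are raters, values are
# category codes. aggregate_raters converts this to per-category counts.
ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [2, 2, 1],
    [0, 1, 0],
])
counts, _categories = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(counts, method="fleiss"))
```

Reporting these alongside disagreement-resolution rates would directly address the referee's major comment.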

Circularity Check

0 steps flagged

No circularity: benchmark construction rests on external documents and expert input

full rationale

The paper introduces VLegal-Bench through a described annotation pipeline that labels and cross-validates 10,450 samples against authoritative Vietnamese legal documents. No equations, fitted parameters, self-citations, or derivations appear in the provided text. All central claims (task coverage, cognitive grounding via Bloom's taxonomy, mirroring of real-world workflows) are presented as outcomes of the external grounding process rather than reductions to internal definitions or prior self-referential results. This is a standard benchmark-construction paper whose content is independent of any load-bearing internal loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that Bloom's taxonomy maps usefully to legal reasoning levels and that expert annotation produces representative real-world samples. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Bloom's cognitive taxonomy supplies appropriate levels for structuring legal understanding tasks.
    The benchmark is explicitly informed by Bloom's taxonomy as stated in the abstract.

pith-pipeline@v0.9.0 · 5568 in / 1119 out tokens · 35213 ms · 2026-05-16T21:33:59.487253+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 4 internal anchors

  1. [1]

    Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4310–4330.

  2. [2]

    Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. 2023. Benchmarking large language models in retrieval-augmented generation. arXiv preprint arXiv:2309.01431 (2023).

  3. [3]

    Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. arXiv preprint arXiv:2306.16092 (2023).

  4. [4]

    Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, and Hao Wang. 2025. LAiW: A Chinese legal large language models benchmark. In Proceedings of the 31st International Conference on Computational Linguistics. 10738–10766.

  5. [5]

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From local to global: A Graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130 (2024).

  6. [6]

    Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, et al. 2024. LawBench: Benchmarking legal knowledge of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7933–7962.

  7. [7]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 2, 1 (2023).

  8. [8]

    Neel Guha, Julian Nyarko, Daniel E Ho, Christopher Re, et al. 2023. LegalBench: A Collaboratively Built Benchmark for Legal Reasoning. In Advances in Neural Information Processing Systems, Vol. 36.

  9. [9]

    Péter Homoki and Zsolt Ződi. 2024. Large language models and their possible uses in law. Hungarian Journal of Legal Studies 64, 3 (2024), 435–455.

  10. [10]

    Quzhe Huang, Mingxu Tao, Chen Zhang, Zhenwei An, Cong Jiang, Zhibin Chen, Zirui Wu, and Yansong Feng. 2023. Lawyer LLaMA Technical Report. arXiv:2305.15062 [cs.CL]

  11. [11]

    Haitao Li, You Chen, Qingyao Ai, Yueyue Wu, Ruizhe Zhang, and Yiqun Liu. 2024. LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models. Advances in Neural Information Processing Systems 37 (2024), 25061–25094.

  13. [13]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 2511–2522. doi:10.18653/v1/2023.emnlp-main.153.

  14. [14]

    Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1037–1042.

  15. [15]

    Minh Nguyen, Viet Tran, and Huong Le. 2024. Vietnamese Legal Information Retrieval in Question-Answering System. arXiv preprint arXiv:2409.13699 (2024).

  16. [16]

    Minh-Tien Nguyen, Truong-Son Nguyen, Dac-Viet Lai, et al. 2021. VLSP 2021 Challenge: Vietnamese Legal Text Processing. In Proceedings of the 8th International Workshop on Vietnamese Language and Speech Processing.

  17. [17]

    Joel Niklaus, Danielle Matache, Theodor Würmli, Peter Hofer, and Ilias Chalkidis. MultiLegalPile: A 689GB Multilingual Legal Corpus for Training Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1–14.

  19. [19]

    Long Phan, Hieu Tran, Hieu Nguyen, and Trieu H Trinh. 2022. ViT5: Pre-trained Text-to-Text Transformer for Vietnamese Language Generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop. 136–142.

  20. [20]

    Nicholas Pipitone and Ghita Houir Alami. 2024. LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain. arXiv preprint arXiv:2408.09543 (2024).

  21. [21]

    Zhongxiang Sun. 2023. A short survey of viewing large language models in legal aspect. arXiv preprint arXiv:2303.09136 (2023).

  22. [22]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, Vol. 35. 24824–24837.

  23. [23]

    Xiaoxian Yang, Zhifeng Wang, Qi Wang, Ke Wei, Kaiqi Zhang, and Jiangang Shi. 2024. Large language models for automated Q&A involving legal documents: a survey on algorithms, frameworks and applications. International Journal of Web Information Systems 20, 4 (2024), 413–435.

  25. [25]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.

  26. [26]

    Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, et al. 2023. Disc-LawLLM: Fine-tuning Large Language Models for Effective Legal Reasoning. arXiv preprint arXiv:2309.11325 (2023).

  27. [27]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems. arXiv:2306.05685 [cs.CL] https://openreview.net/forum?id=uccHP...

  28. [28]

    Lucia Zheng, Neel Guha, Brandon R Anderson, Peter Henderson, and Daniel E Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law. 159–168.

  30. [30]

    Wanjun Zhong, Ruixiang Cui, Yidong Guo, Yaobo Wang, et al. 2023. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv preprint arXiv:2304.06364 (2023).