arxiv: 2512.04578 · v3 · submitted 2025-12-04 · 💻 cs.CL

LexGenius: An Expert-Level Benchmark for Large Language Models in Legal General Intelligence

Wenjin Liu, Haoran Luo, Xin Feng, Xiang Ji, Lijuan Zhou, Rui Mao, Jiapu Wang, Shirui Pan, Erik Cambria


Pith reviewed 2026-05-17 02:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords: legal general intelligence · LLM benchmark · Chinese legal AI · expert-level evaluation · legal reasoning · multiple-choice legal questions · model performance gaps


The pith

LexGenius benchmark shows even the strongest LLMs fall well short of human legal professionals across 20 core abilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LexGenius as a new expert-level Chinese legal benchmark built on a Dimension-Task-Ability structure that spans seven dimensions, eleven tasks, and twenty abilities. Questions are drawn from recent cases and exams and filtered through repeated manual and LLM checks to limit leakage. When twelve leading models are tested, clear performance gaps appear between abilities and between every model and human experts. The authors argue this benchmark can track progress toward reliable legal general intelligence. Such a tool matters because legal work demands precise reasoning where current AI shortcuts create real risks of error.

Core claim

LexGenius evaluates legal general intelligence through a Dimension-Task-Ability framework covering seven dimensions, eleven tasks, and twenty abilities. Multiple-choice questions are generated from recent legal cases and exam items, then verified in multiple rounds of manual and LLM review to reduce leakage. Testing twelve state-of-the-art LLMs reveals substantial disparities across the twenty abilities, with even the strongest models performing below human legal professionals.

What carries the argument

The Dimension-Task-Ability framework that organizes evaluation into seven dimensions, eleven tasks, and twenty specific legal abilities for systematic testing of understanding, reasoning, and decision-making.
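One way to picture the three-level hierarchy is as nested records whose scores roll up from abilities to tasks to dimensions. The sketch below is illustrative only: the class names, the placeholder labels such as `dimension_A`, and the choice of an unweighted (macro) average at each level are assumptions, not the paper's published aggregation rule.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Ability:
    name: str
    correct: int = 0  # MCQs answered correctly
    total: int = 0    # MCQs attempted

    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


@dataclass
class Task:
    name: str
    abilities: list[Ability] = field(default_factory=list)

    def accuracy(self) -> float:
        # Assumption: unweighted mean over the task's abilities.
        return mean(a.accuracy() for a in self.abilities)


@dataclass
class Dimension:
    name: str
    tasks: list[Task] = field(default_factory=list)

    def accuracy(self) -> float:
        # Assumption: unweighted mean over the dimension's tasks.
        return mean(t.accuracy() for t in self.tasks)


# Hypothetical toy instance: one dimension, two tasks, three abilities.
dim = Dimension("dimension_A", [
    Task("task_1", [Ability("ability_1", 8, 10), Ability("ability_2", 6, 10)]),
    Task("task_2", [Ability("ability_3", 5, 10)]),
])
dim_score = dim.accuracy()
```

A weighted average by question count would be an equally plausible design; the macro average simply keeps small abilities from being drowned out by large ones.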

If this is right

Models will require targeted improvement on the specific abilities where gaps are widest rather than uniform scaling.
Legal AI systems will continue to need human oversight until models close the measured performance difference.
Future benchmark updates using newer cases can track whether legal intelligence in LLMs is advancing over time.
Developers can use the twenty-ability breakdown to prioritize training data and techniques for weaker areas.
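The prioritization idea in the points above can be sketched as a ranking of abilities by their gap to the human baseline. The function name, the ability labels, and all accuracy values here are hypothetical placeholders, not results from the paper.

```python
def rank_ability_gaps(
    model_acc: dict[str, float], human_acc: dict[str, float]
) -> list[tuple[str, float]]:
    """Return (ability, human - model gap) pairs, widest gap first."""
    gaps = {name: human_acc[name] - model_acc[name] for name in model_acc}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-ability accuracies, for illustration only.
model_acc = {"ability_1": 0.82, "ability_10": 0.55, "ability_17": 0.61}
human_acc = {"ability_1": 0.90, "ability_10": 0.88, "ability_17": 0.85}

ranked = rank_ability_gaps(model_acc, human_acc)
weakest_ability = ranked[0][0]
```

Abilities at the top of `ranked` would be the first candidates for targeted training data under this reading.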

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework might transfer to other high-stakes professional domains to expose similar intelligence gaps.
Persistent shortfalls on recent cases suggest that legal reasoning may depend on structured knowledge that general pre-training does not fully capture.
If later models improve on LexGenius without corresponding gains in real-world legal outcomes, the benchmark's predictive validity would need re-examination.

Load-bearing premise

The multiple-choice questions drawn from recent cases and exams, after manual and LLM review, accurately and reliably measure expert-level legal general intelligence without meaningful data leakage or cultural bias.

What would settle it

A new LLM that reaches or exceeds average human legal professional scores on the full LexGenius set, with independent verification that none of the test items appeared in its training data, would directly contradict the claim that current models lag behind experts.
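A comparison like this is only convincing if it accounts for sampling noise in the model's score. As a minimal sketch, assuming per-question 0/1 outcomes are available, a percentile bootstrap gives a rough confidence interval for model accuracy that can be set against the human average. The function name, the outcome vector, and the human mean below are all hypothetical.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: a model that answers 70 of 100 questions correctly.
outcomes = [1] * 70 + [0] * 30
lo, hi = bootstrap_accuracy_ci(outcomes)

human_mean = 0.85  # hypothetical human expert average
model_reaches_human = lo >= human_mean
```

Requiring the lower bound to clear the human average is a deliberately conservative criterion; it does not address the separate contamination check the paragraph above calls for.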

Figures

Figures reproduced from arXiv: 2512.04578 by Erik Cambria, Haoran Luo, Jiapu Wang, Lijuan Zhou, Rui Mao, Shirui Pan, Wenjin Liu, Xiang Ji, Xin Feng.

**Figure 2.** LexGenius can be divided into 3 levels: The first level includes …

**Figure 3.** The MCQ construction workflow of the LexGenius, which is a process where LLM and manual work are …

**Figure 4.** Data distribution of LexGenius. Left: the MCQ proportions across different laws and the dimensions, …

**Figure 5.** Comparison of the 12 SOTA LLMs with human experts on 7 core dimensions of legal intelligence. … experts in tasks that require dynamic reasoning and institutional understanding (e.g., legal application analysis and case reasoning and judgment). Particularly in tasks involving value trade-offs (e.g., legal and ethical judgment), LLMs tend to avoid complex judgments and lack critical thinking and contextual…

**Figure 6.** Average ranking and average score ranking of …

**Figure 7.** Performance of 12 LLMs across 6 legal language indicators, showing gaps compared to the human …

**Figure 8.** We utilize an MCQ sample case to evaluate …

**Figure 9.** The MCQ sample of ability 1. The left is the …

**Figure 10.** The MCQ sample of ability 2. The left is the …

**Figure 11.** The MCQ sample of ability 3. The left is the …

**Figure 12.** The MCQ sample of ability 4. The left is the …

**Figure 13.** The MCQ sample of ability 5. The left is the …

**Figure 17.** The MCQ sample of ability 9. The left is the …

**Figure 18.** The MCQ sample of ability 10. The left is …

**Figure 19.** The MCQ sample of ability 11. The left is …

**Figure 21.** The MCQ sample of ability 13. The left is …

**Figure 23.** The MCQ sample of ability 15. The left is …

**Figure 25.** The MCQ sample of ability 17. The left is …

**Figure 26.** The MCQ sample of ability 18. The left is …

**Figure 27.** The MCQ sample of ability 19. The left is …

**Figure 29.** The two utilized prompt methods for LLMs.

**Figure 30.** The correlation analysis of legal intelligence …

Original abstract

Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We use the recent legal cases and exam questions to create multiple-choice questions with a combination of manual and LLM reviews to reduce data leakage risks, ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find significant disparities across legal intelligence abilities for LLMs, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LexGenius adds a Chinese-focused legal benchmark with a Dimension-Task-Ability structure and shows models trail experts, though the leakage controls need more concrete checks to back the gaps.


The main point of this paper is that it introduces LexGenius, a benchmark for legal general intelligence in LLMs built around recent Chinese legal cases and exams, and finds that current models show clear gaps relative to human professionals across abilities. The Dimension-Task-Ability framework is the new organizing idea here.

The authors do a decent job pulling together questions from fresh sources and running them through manual and LLM-assisted reviews to limit data leakage. Evaluating a dozen leading models and breaking down performance by task gives a practical picture of where legal AI stands in Chinese contexts. That's useful for anyone tracking progress in applied domains.

The softer part is the validation. The abstract mentions multiple rounds of checks, but it gives no agreement rates between reviewers and no tests such as checking whether models do worse on reworded versions of the same questions. Without those, it's hard to be sure the results reflect reasoning rather than pattern matching or uneven exposure in training data. In legal Chinese, that could matter a lot because coverage in pretraining varies. If some of the disparity comes from that, the gap to humans might look different.

This is the kind of work that matters for groups focused on domain-specific AI evaluation, especially outside English. Readers building legal tools or running their own benchmarks would find the dataset and the ability breakdowns worth looking at. I'd put it through peer review. The core idea is solid enough, and referees could help tighten the methodology around quality assurance and leakage controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces LexGenius, an expert-level Chinese legal benchmark for LLMs' legal general intelligence. It uses a Dimension-Task-Ability framework (7 dimensions, 11 tasks, 20 abilities) and constructs multiple-choice questions from recent legal cases and exam questions. A combination of manual and LLM reviews across multiple rounds is applied to reduce data leakage risks. Evaluation of 12 state-of-the-art LLMs reveals significant disparities across legal intelligence abilities, with even the strongest models lagging behind human legal professionals.

Significance. A specialized benchmark focused on Chinese legal GI fills a gap left by result-oriented existing benchmarks and could guide targeted improvements in legal reasoning for LLMs. The open release of the project supports further research. The headline empirical finding on ability disparities is potentially useful if the benchmark's validity holds, but its impact depends on demonstrated robustness against leakage and bias.

major comments (2)

[Benchmark construction and abstract] The abstract and benchmark construction description state that manual plus LLM reviews reduce data leakage risks and ensure reliability through multiple rounds of checks, yet no quantitative metrics are reported (e.g., inter-annotator agreement, fraction of items rejected for contamination, or performance drop on paraphrased/synonym-substituted versions). This is load-bearing for the central claim that LexGenius faithfully measures expert-level legal GI and that observed disparities reflect genuine reasoning gaps rather than residual leakage or surface patterns, especially given varying training corpus coverage for Chinese legal text.
[Evaluation and analysis] The evaluation section reports disparities and human comparisons but does not include controlled experiments or statistical controls for model-specific data coverage in Chinese legal domains. Without such checks, it is difficult to confirm that lower scores indicate deficits in legal reasoning rather than differences in pretraining exposure.
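Two of the metrics requested in the major comments are simple to compute once the raw review labels and per-item results exist. The following is a minimal sketch, assuming two reviewers each assigned one label per candidate item and that per-item 0/1 correctness is available for original and paraphrased questions. The function names and all sample data are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def paraphrase_drop(original_correct, paraphrased_correct):
    """Accuracy on original items minus accuracy on their paraphrased versions."""
    original_acc = sum(original_correct) / len(original_correct)
    paraphrased_acc = sum(paraphrased_correct) / len(paraphrased_correct)
    return original_acc - paraphrased_acc


# Hypothetical review labels from two annotators on four candidate items.
reviewer_a = ["keep", "keep", "reject", "keep"]
reviewer_b = ["keep", "reject", "reject", "keep"]
kappa = cohens_kappa(reviewer_a, reviewer_b)

# Hypothetical per-item correctness before and after paraphrasing.
drop = paraphrase_drop([1, 1, 1, 0], [1, 0, 1, 0])
```

A large positive `drop` would suggest sensitivity to surface wording, which is the pattern the referee asks the authors to rule out.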

minor comments (2)

[Benchmark construction] Clarify the exact criteria and number of review rounds applied by the LLM component, and provide example rejected/approved items to illustrate the process.
[Dimension-Task-Ability framework] Ensure all 20 abilities are explicitly defined with at least one sample question per ability in the main text or a dedicated appendix table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript introducing LexGenius. The comments help clarify areas where additional evidence can strengthen the presentation of our benchmark construction and evaluation results. We address each major comment below and indicate planned revisions to the manuscript.


Referee: [Benchmark construction and abstract] The abstract and benchmark construction description state that manual plus LLM reviews reduce data leakage risks and ensure reliability through multiple rounds of checks, yet no quantitative metrics are reported (e.g., inter-annotator agreement, fraction of items rejected for contamination, or performance drop on paraphrased/synonym-substituted versions). This is load-bearing for the central claim that LexGenius faithfully measures expert-level legal GI and that observed disparities reflect genuine reasoning gaps rather than residual leakage or surface patterns, especially given varying training corpus coverage for Chinese legal text.

Authors: We agree that quantitative metrics would provide stronger support for the reliability of our multi-round review process. The original manuscript described the combination of manual and LLM reviews but omitted specific statistics. In the revised version, we will add inter-annotator agreement scores from the manual annotation rounds, the fraction of candidate items rejected or modified for potential contamination or quality issues, and results from a post-construction validation experiment measuring performance drops on paraphrased and synonym-substituted subsets of the benchmark. These details will be incorporated into the benchmark construction section to better substantiate the claim that observed disparities reflect genuine legal intelligence gaps.

Revision: yes

Referee: [Evaluation and analysis] The evaluation section reports disparities and human comparisons but does not include controlled experiments or statistical controls for model-specific data coverage in Chinese legal domains. Without such checks, it is difficult to confirm that lower scores indicate deficits in legal reasoning rather than differences in pretraining exposure.

Authors: We acknowledge the importance of addressing potential confounds from pretraining data exposure. Our evaluation demonstrates consistent gaps relative to human experts across a diverse set of recent legal cases and exam questions, but we did not include explicit controlled experiments isolating data coverage. In the revision, we will add a dedicated limitations subsection discussing known characteristics of the evaluated models' training data where publicly available, and note that full statistical controls are not feasible without proprietary corpus details. We will also emphasize how the Dimension-Task-Ability framework and use of post-2023 legal materials help focus on reasoning rather than memorization. This will be presented as a partial revision with expanded discussion rather than new experiments.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity: benchmark construction and empirical evaluation only


The paper constructs LexGenius by sourcing multiple-choice questions from recent Chinese legal cases and exams, applies manual plus LLM review rounds to mitigate leakage, and then reports empirical scores for 12 external LLMs against human baselines. No equations, fitted parameters, predictions, or derivations are present. The headline finding (LLMs lag experts with ability disparities) is a direct observation from the evaluation results rather than a quantity derived from or equivalent to the benchmark inputs by construction. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify core claims. The work is self-contained against external benchmarks and contains no load-bearing steps that reduce to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes an evaluation framework and dataset rather than relying on new theoretical axioms or fitted parameters.

axioms (1)

domain assumption Recent legal cases and exam questions can form the basis for valid multiple-choice items that test expert-level legal intelligence.
Invoked in the benchmark construction process described in the abstract.

pith-pipeline@v0.9.0 · 5526 in / 1123 out tokens · 55595 ms · 2026-05-17T02:07:49.133903+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] A survey on large language models for critical societal domains: Finance, healthcare, and law. arXiv preprint arXiv:2405.01769. Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. 2025. Physbench: Benchmarking and enhancing vision-language models for physical world understanding. arXiv preprint arXiv:2501.16411. Maelenn Corfm...
[2] Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model. arXiv preprint arXiv:2306.16092. Yongfu Dai, Duanyu Feng, Jimin Huang, Haochen Jia, Qianqian Xie, Yifang Zhang, Weiguang Han, Wei Tian, and Hao Wang. 2025. Laiw: A chinese legal large language models benchmark. In COLING. Yi Dong, Ron...
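[3] Precise understanding of legal provisions. Ability to accurately interpret key terms, conditions, and structural logic of legal clauses, including scope and applicability. An MCQ sample of this ability in the LexGenius is shown in Figure 9. { "question": "After Company A and Company B brought a supply and sales contract dispute before an arbitration commission and obtained an arbitral award, Company A applied to the court to refuse enforcement of the award on the ground that the matters decided exceeded the scope of the arbitration agreement. During its review, the court found that the part exceeding that scope was severable from the other matters decided. Which of the following...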
[4] Contextual understanding of legal provisions. Ability to interpret legal text within the correct legal and social context, avoiding misinterpretation based on literal reading alone. An MCQ sample of this ability in the LexGenius is shown in Figure 10. { "question": "Regarding the basic principles of the Criminal Procedure Law, which of the following statements is incorrect?", "options": [ "A. People's courts shall, as a rule, hear cases in public unless the Criminal Procedure Law provides otherwise", "B. People's courts, in accordance with...
[5] Understanding of legal provisions and social phenomena. Ability to relate legal provisions to real-world events, social needs, and historical developments. An MCQ sample of this ability in the LexGenius is shown in Figure 11. Ability 3 { "question": "In recent years, the pace of legislation in China has accelerated, and a relatively complete legal system...
[6] Logical ability to reason toward legal conclusions. Ability to construct sound legal arguments based on facts and rules, forming consistent and well-structured conclusions. An MCQ sample of this ability in the LexGenius is in Figure 12. Ability 4 { "question": "Which of the following statements about result-aggravated offenses is correct?", "options": ...
[7] Making reasonable inferences from unclear legal texts. Ability to infer appropriate meanings from vague, ambiguous, or abstract legal language using legal logic and principles. An MCQ sample of this ability in the LexGenius is in Figure 13. Ability 5 { "question": "Freedom of speech is an important means by which citizens supervise and restrain public powe...
[8] Adjusting legal reasoning based on different legal contexts. Ability to adapt reasoning strategies when applying different branches of law, such as civil, criminal, or administrative. An MCQ sample of this ability in the LexGenius is shown in Figure 14. Ability 6 { "question": "A commercial company owns a 7-story office building. In July 2022, the compan...
[9] Analyze legal cases. Ability to identify relevant facts and legal issues in a case and link them with the applicable legal norms or precedents. An MCQ sample of this ability in the LexGenius is shown in Figure 15. Ability 7 { "question": "Delivery rider Zhang saw Li jumping into the river. He handed his phone to a passerby, Wang, and jumped from a height...
[10] Choosing and correctly citing the relevant laws. Ability to select the most appropriate legal provisions for a given scenario and cite them accurately in reasoning. An MCQ sample of this ability in the LexGenius is shown in Figure 16. Ability 8 { "question": "During a routine inspection, the Market Supervision Administration of a certain district discove...
[11] Integrate laws across different fields. Ability to synthesize norms from multiple legal domains and resolve inter-norm conflicts through comprehensive analysis. An MCQ sample of this ability in the LexGenius is shown in Figure 17. Ability 9 { "question": "A Chinese company signed a trade contract with a French company to import a certain technical prod...
[12] Judging the boundary between law and morality and resolving ethical conflicts. Ability to identify and evaluate tensions between legal obligations and moral principles, and propose ethically aware legal judgments. An MCQ sample of this ability in the LexGenius is shown in Figure 18. Ability 10 { "question": "The Population and Family Planning Commission ...
[13] Critically interpreting legal texts and understanding the lawmakers’ intent. Ability to interpret laws beyond their literal wording by uncovering legislative purpose, background, and systemic coherence. An MCQ sample of this ability in the LexGenius is shown in Figure 19. Ability 11 { "question": "In 2016, Chinese citizen Nan filed an application wit...
[14] Interpreting legal terms across fields and adapting to different situations. Ability to understand legal terminology in varied legal contexts and appropriately adapt interpretations to specific domains. An MCQ sample of this ability in the LexGenius is shown in Figure 20. Ability 12 { "question": "Country A assigned Said to its embassy in Country B as an...
[15] Understanding the exact meaning of legal terms. Ability to grasp the technical definitions, scope, and usage boundaries of domain-specific legal terms. An MCQ sample of this ability in the LexGenius is shown in Figure 21. Ability 13 { "question": "In 1994, Zhao (female, 38), a resident of mainland China, and Chen (male, 71), a retired soldier from Taiwan, ...
[16] Analyzing the social impact and stability of legal enforcement. Ability to assess the potential impact of legal implementation on public order, institutional trust, and long-term societal effects. An MCQ sample of this ability in the LexGenius is shown in Figure 22. Ability 14 { "question": "On August 29, 1994, Guo Cuiyun’s husband, Wang Shoumei, signed ...
[17] Social change, culture, and legal coordination. Ability to understand how law responds to social transformation and interacts with culture, economy, and values. An MCQ sample of this ability in the LexGenius is shown in Figure 23. Ability 15 { "question": "In 2008, the Changsha Intermediate People’s Court of Hunan Province accepted the bankruptcy case of...
[18] Understanding and managing conflicts between law and morality. Ability to propose socially responsible legal judgments in situations where legal and moral norms collide. An MCQ sample of this ability in the LexGenius is shown in Figure 24. Ability 16 { "question": "This case involves the applicant for reconsideration, Liu Yuanwu, who, as the legal represen...
[19] Reasonable legal reasoning and judgment prediction under uncertainty. Ability to make legally sound decisions when faced with ambiguous facts or normative gaps, using analogical reasoning and proportionality. An MCQ sample of this ability in the LexGenius is shown in Figure 25. Ability 17 { "question": "Xu and his wife Lin agreed to divorce, with the a...
[20] Case-based reasoning and judgment. Ability to construct judgments through analogical reasoning with relevant precedents and case-specific facts. An MCQ sample of this ability in the LexGenius is shown in Figure 26. Ability 18 { "question": "Starting in July 1997, Liu Chengming illegally operated a gas station without approval at the fishing port dock...
[21] Analysis of the application of judicial procedures in different jurisdictions. Ability to identify jurisdictional differences in judicial procedures and adjust legal reasoning accordingly. An MCQ sample of this ability in the LexGenius is shown in Figure 27. Ability 19 { "question": "On the evening of March 15, 1999, around 8:30 PM, defendant Zhao Jian...
[22] Understanding of judicial procedures and the ability to grasp details. Ability to accurately apply procedural rules throughout litigation or non-litigation processes, ensuring procedural compliance. An MCQ sample of this ability in the LexGenius is shown in Figure 28. Ability 20 { "question": "In the course of using ordinary procedures to hear cases, ...
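
The MCQ samples above are JSON objects with `question` and `options` fields. A minimal sketch of how such an item might be scored follows. It assumes a single-answer format with options labelled A to D and an `answer` field holding the gold letter; both are assumptions, since the truncated samples show neither, and the item text is a made-up placeholder.

```python
import json
import re


def extract_choice(model_output: str):
    """Return the first standalone option letter A-D in the output, or None."""
    match = re.search(r"\b([A-D])\b", model_output)
    return match.group(1) if match else None


def score_item(item: dict, model_output: str) -> bool:
    """True if the extracted letter matches the item's gold answer."""
    return extract_choice(model_output) == item["answer"]


# Hypothetical item; the "answer" field is an assumed part of the schema.
item = json.loads(
    '{"question": "Hypothetical question?",'
    ' "options": ["A. first", "B. second", "C. third", "D. fourth"],'
    ' "answer": "B"}'
)
is_correct = score_item(item, "Answer: B")
```

The regular expression is deliberately naive: a capitalized article such as "A" at the start of a sentence would be read as a choice, so a real harness would need a stricter output format.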