Exploring Lightweight Large Language Models for Court View Generation

Kun Zeng; Nanli Zeng; Tianyong Hao; Zhitian Hou; Zhixiong Chao

arxiv: 2605.16770 · v1 · pith:XBEJZENZnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-19 21:30 UTC · model grok-4.3


keywords: court view generation · lightweight large language models · legal AI · charge prediction · model size trade-offs · judicial applications · evaluation framework


The pith

Lightweight LLMs under 2 billion parameters generate court views from case facts and support charge prediction with competitive results against DNNs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests small large language models for the task of producing court views based on case facts in criminal cases and for using those views to predict charges. It examines how different model designs and sizes influence output quality on this legal generation problem. The work also compares these models to conventional neural networks and checks whether generating the view first changes charge prediction accuracy. A reader would care because smaller models could make AI support more feasible in court systems that lack access to large computing clusters.

Core claim

Experiments on three public datasets show that LLM architecture and size create measurable differences in court view generation quality and in charge prediction accuracy, that lightweight models under 2B parameters achieve results comparable to DNN baselines, and that generating the court view before predicting the charge produces different outcomes than predicting the charge directly.

What carries the argument

Mixed-source training followed by separate evaluation on each dataset's test set, applied to compare lightweight LLM architectures and sizes on court view generation and downstream charge prediction.
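The protocol described above can be sketched in a few lines. This is a minimal illustration, not the paper's or CVGEvalKit's implementation: the dataset names, the `echo_model` stand-in, and the exact-match scorer are all hypothetical placeholders chosen so the example runs on its own.

```python
import random

def build_mixed_train(train_sets, seed=0):
    """Pool every source's training examples into one shuffled list,
    keeping a source tag on each example."""
    mixed = [(name, ex) for name, examples in train_sets.items() for ex in examples]
    random.Random(seed).shuffle(mixed)
    return mixed

def evaluate_per_dataset(predict, test_sets, score):
    """Score one model separately on each dataset's own test set."""
    results = {}
    for name, examples in test_sets.items():
        scores = [score(predict(facts), reference) for facts, reference in examples]
        results[name] = sum(scores) / len(scores)
    return results

# Toy (facts, court view) pairs; real data would come from the three public datasets.
train_sets = {
    "dataset_a": [("facts a1", "view a1"), ("facts a2", "view a2")],
    "dataset_b": [("facts b1", "view b1")],
}
test_sets = {
    "dataset_a": [("facts a3", "view a3")],
    "dataset_b": [("facts b2", "view b2"), ("facts b3", "wrong")],
}

def echo_model(facts):
    # Hypothetical stand-in for a fine-tuned LLM.
    return facts.replace("facts", "view")

def exact_match(prediction, reference):
    # Placeholder metric; the paper reports ROUGE-L among others.
    return 1.0 if prediction == reference else 0.0

mixed = build_mixed_train(train_sets)
per_dataset = evaluate_per_dataset(echo_model, test_sets, exact_match)
print(per_dataset)
```

Keeping the source tag on each pooled example makes it straightforward to report the mixing proportions later, which the referee report asks for.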

If this is right

Architecture choice affects court view quality and charge prediction accuracy at least as much as raw model size.
Models under 2B parameters can reach performance levels close to larger DNNs on both generation and prediction.
Generating the court view first produces charge predictions that differ from those obtained by direct prediction.
Lightweight models open a path to more efficient judicial AI applications.
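The contrast between direct charge prediction and prediction after generating the court view can be illustrated with two small pipelines. The prompt wording and the `fake_generate` stub below are assumptions made for this sketch; the paper's actual prompts and models are not reproduced here.

```python
from typing import Callable, Tuple

def predict_charge_direct(generate: Callable[[str], str], facts: str) -> str:
    """Ask the model for the charge straight from the case facts."""
    return generate(f"Case facts: {facts}\nCharge:").strip()

def predict_charge_cvg_first(generate: Callable[[str], str], facts: str) -> Tuple[str, str]:
    """Generate the court view first, then condition the charge on it."""
    view = generate(f"Case facts: {facts}\nCourt view:").strip()
    charge = generate(f"Case facts: {facts}\nCourt view: {view}\nCharge:").strip()
    return view, charge

def fake_generate(prompt: str) -> str:
    # Hypothetical stub standing in for an LLM call; it only mimics the idea
    # that the intermediate court view can change the final answer.
    if prompt.endswith("Court view:"):
        return "The defendant took property by stealth."
    if "Court view: The defendant" in prompt:
        return "theft"
    return "fraud"

facts = "X took a wallet from a bag."
print(predict_charge_direct(fake_generate, facts))
print(predict_charge_cvg_first(fake_generate, facts))
```

With a real model, `generate` would wrap the inference call, and the two functions would be run over the same test cases so their charge accuracy can be compared.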

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These models could run on modest hardware, allowing legal AI tools in settings with limited servers or internet.
The observed size and architecture patterns may help select models for other specialized legal text tasks such as judgment summarization.

Load-bearing premise

Training on a mixed dataset drawn from multiple sources and evaluating on each individual test set yields unbiased comparisons of architecture and size effects.

What would settle it

Re-running the same models after training and testing entirely within one dataset instead of using mixed training would reveal whether domain differences are inflating or masking the reported architecture and size effects.
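One way to read the outcome of such a re-run is to compare, per dataset, the score gap between the two training regimes and whether the model ranking changes. The sketch below assumes scores are already available as nested dictionaries; the model names and numbers are made-up placeholders, not results from the paper.

```python
def rank_models(scores):
    """Model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

def compare_regimes(mixed, in_domain):
    """For each dataset, report the per-model score gap (mixed minus in-domain)
    and whether the ranking of models differs between the two regimes."""
    report = {}
    for dataset in mixed:
        gaps = {m: mixed[dataset][m] - in_domain[dataset][m] for m in mixed[dataset]}
        report[dataset] = {
            "gap": gaps,
            "ranking_changed": rank_models(mixed[dataset]) != rank_models(in_domain[dataset]),
        }
    return report

# Placeholder scores only, chosen to show a ranking flip.
mixed_scores = {"dataset_a": {"model_x": 0.50, "model_y": 0.25}}
in_domain_scores = {"dataset_a": {"model_x": 0.25, "model_y": 0.50}}

report = compare_regimes(mixed_scores, in_domain_scores)
print(report)
```

A ranking that flips between regimes would suggest the mixed-training comparison is sensitive to domain effects; a stable ranking with small gaps would support the reported architecture and size effects.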

Figures

Figures reproduced from arXiv: 2605.16770 by Kun Zeng, Nanli Zeng, Tianyong Hao, Zhitian Hou, Zhixiong Chao.

**Figure 1.** The average ROUGE-L of CVG of each test dataset of each model after fine-tuning.

**Figure 2.** Overview of the research methodology and the four research questions investigated in this study.

**Figure 3.** The examples of Qwen3-1.7B, Llama-3.2-1B-Instruct and InternLM2.5-1.8B-Chat in zero-shot setting.

**Figure 4.** The performance of Qwen3-0.6B and Qwen3-…

**Figure 6.** The example of Gemma3-1B-IT and Llama3.2-1B-Instruct in charge prediction with CVG first and directly.

**Figure 7.** The example of InternLM2.5-1.8B-Chat in charge prediction with CVG first and directly.
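Figure 1 reports ROUGE-L, which scores a generated text by the longest common subsequence (LCS) it shares with the reference. The sketch below is a plain LCS-based F-measure over token lists, written for illustration; which ROUGE-L variant, tokenizer, and weighting the paper uses is not stated here, and for Chinese court views character-level tokens would be a more plausible assumption than the whitespace split shown.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-measure; beta=1 weights precision and recall equally."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

candidate = "the defendant committed theft".split()
reference = "the defendant is guilty of theft".split()
score = rouge_l(candidate, reference)
print(round(score, 3))
```

Here the LCS is three tokens long ("the", "defendant", "theft"), giving precision 3/4 and recall 3/6.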

Original abstract

Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of a court view based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B parameters) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how do different LLM architectures affect CVG quality and charge prediction, (2) how does LLM size contribute to performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting the charge after first generating the court view compare with predicting it directly. Additionally, we develop CVGEvalKit, an evaluation framework that includes three publicly available datasets for CVG and for charge prediction. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture and model size and into the influence between tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at https://github.com/ZhitianHou/CVGEvalKit

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This is a straightforward empirical comparison of small LLMs on court view generation with a released toolkit, but the mixed training and separate evaluation setup makes it hard to isolate clean effects from architecture or size.


Colleague, the main thing here is that the paper runs a practical set of comparisons on lightweight models under 2B parameters for generating court views from facts and then using that output for charge prediction. They ask four clear questions about architecture, size, DNN baselines, and the value of generating the view first versus predicting directly. They also put out CVGEvalKit with three public datasets and the code, which is the most concrete contribution for anyone who wants to rerun or extend the work. Training everything on one mixed set and testing per dataset is a reasonable way to stretch limited data, and the results do point to some usable trade-offs for settings where full-scale models are too heavy. That part is honest applied work and worth noting for legal NLP folks who care about efficiency.

The soft spot is exactly the one the stress test flags. Mixing sources for training while keeping test sets separate opens the door for domain differences in phrasing, jurisdiction, or label balance to drive some of the observed gaps between models. Without per-dataset training runs or any shift measurement, it is difficult to say the differences are cleanly about model properties rather than adaptation. The abstract claims new insights into task influence and trade-offs, but the design leaves that attribution weaker than it could be. Details on exact metrics, statistical checks, and error analysis are also thin based on what is described.

This paper is for people working on resource-limited legal AI tools or anyone benchmarking small models on narrow generation tasks. A reader who needs concrete numbers on what sub-2B models can do in this domain will get something from the comparisons and the open kit. It is not a methods paper and does not claim to be.

I would send it to peer review. The code release and focused questions give it enough substance that referees can usefully push on the experimental controls and tighten the claims.

Referee Report

1 major / 2 minor

Summary. The paper systematically explores the capabilities of lightweight large language models (smaller than 2B parameters) for Criminal Court View Generation (CVG) and their impact on charge prediction in Legal AI. It investigates four key questions regarding the effects of model architecture, model size, comparisons with Deep Neural Networks, and the benefits of generating court views prior to charge prediction. The authors develop CVGEvalKit, an evaluation framework with three public datasets, train models on a mixed training set, and evaluate on individual dataset test sets. Experimental results are claimed to provide insights into trade-offs and highlight the potential of lightweight LLMs.

Significance. Should the experimental findings prove robust to the noted methodological concerns, the work could deliver actionable insights for deploying efficient AI tools in judicial settings, emphasizing resource-efficient models. The open-sourcing of CVGEvalKit and the associated code represents a strength, promoting reproducibility and extension by the community.

major comments (1)

[Experimental protocol (Abstract and Methods)] The protocol of training every model on a single mixed training set drawn from multiple sources and then evaluating on each dataset's held-out test set (as described in the Abstract) risks domain shift confounding the reported architecture and size effects. Observed differences in CVG quality or charge prediction may arise from uneven adaptation to jurisdiction-specific phrasing, case distributions, or label imbalances rather than intrinsic model properties. Without per-dataset training baselines or explicit shift quantification (e.g., via embedding similarity or cross-dataset performance gaps), the central claims on trade-offs cannot be cleanly attributed to the factors under study.
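The embedding-similarity check suggested in this comment could look roughly like the following. To stay self-contained, the sketch uses bag-of-words count vectors as a crude stand-in for real sentence embeddings; that substitution, the dataset names, and the toy texts are all assumptions for illustration. Whitespace tokenization would also need replacing for Chinese text.

```python
import math
from collections import Counter
from itertools import combinations

def centroid(texts):
    """Average bag-of-words vector of a corpus (stand-in for a mean embedding)."""
    total = Counter()
    for text in texts:
        total.update(text.split())
    return {token: count / len(texts) for token, count in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(token, 0.0) for token, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def pairwise_similarity(corpora):
    """Centroid cosine similarity for every pair of datasets."""
    centroids = {name: centroid(texts) for name, texts in corpora.items()}
    return {
        (a, b): cosine(centroids[a], centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    }

# Toy fact descriptions; real input would be the three source datasets.
corpora = {
    "dataset_a": ["the defendant stole a wallet", "the defendant stole a phone"],
    "dataset_b": ["the accused embezzled company funds"],
}
similarities = pairwise_similarity(corpora)
print(similarities)
```

Low similarity between two sources would flag a pair where mixed training is more likely to confound the architecture and size comparisons.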

minor comments (2)

[Abstract] The Abstract states that comprehensive experiments were conducted but provides no detail on the exact metrics for CVG quality, the statistical tests used, baseline DNN implementation details, or error analysis.
[Evaluation framework] Clarify the exact composition and sampling proportions of the mixed training set across the three source datasets.

Simulated Author's Rebuttal

1 response · 0 unresolved

We appreciate the referee's detailed review of our work on lightweight LLMs for court view generation. We respond to the major comment as follows.


Referee: [Experimental protocol (Abstract and Methods)] The protocol of training every model on a single mixed training set drawn from multiple sources and then evaluating on each dataset's held-out test set (as described in the Abstract) risks domain shift confounding the reported architecture and size effects. Observed differences in CVG quality or charge prediction may arise from uneven adaptation to jurisdiction-specific phrasing, case distributions, or label imbalances rather than intrinsic model properties. Without per-dataset training baselines or explicit shift quantification (e.g., via embedding similarity or cross-dataset performance gaps), the central claims on trade-offs cannot be cleanly attributed to the factors under study.

Authors: We thank the referee for pointing out this potential issue with our experimental design. The mixed training set was chosen to create a unified training distribution that allows for fair and consistent comparison across different model architectures and sizes, reflecting a practical scenario in which Legal AI models are trained on aggregated data from various sources. This setup helps highlight the intrinsic capabilities of the models rather than variations due to different training subsets. That said, we agree that domain shift could be a confounding factor. To address this, we will incorporate per-dataset training baselines for representative models and provide quantitative analysis of cross-dataset shifts, such as performance gaps and embedding similarities, in the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons on external datasets


The paper reports direct experimental runs of lightweight LLMs on three public CVG datasets. Models are trained once on a mixed training set and evaluated on each dataset's held-out test set; performance differences are measured via standard metrics. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked to derive the reported trade-offs. The central claims are therefore falsifiable observations from external data rather than tautological restatements of inputs, satisfying the self-contained criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking paper with no new mathematical axioms, free parameters, or invented entities; relies on standard supervised training assumptions and public data splits.

pith-pipeline@v0.9.0 · 5768 in / 975 out tokens · 36288 ms · 2026-05-19T21:30:30.632644+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 5 internal anchors

[1] Large Language Models Meet Legal Artificial Intelligence: A Survey. arXiv preprint arXiv:2509.09969.

[2] Legal Document-Based, Domain-Driven Q&A System: LLMs in Perspective. 2024 International Joint Conference on Neural Networks (IJCNN), 2024.

[3] LeGalFormer: A Graph Representation Learning and Transformer-based Approach for Legal Similar Case Retrieval. 2024 International Joint Conference on Neural Networks (IJCNN), 2024.

[4] Legal judgment prediction via fine-grained element graphs and external knowledge. 2024 International Joint Conference on Neural Networks (IJCNN), 2024.

[5] Interpretable Charge Predictions for Criminal Cases: Learning to Generate Court Views from Fact Descriptions. arXiv preprint arXiv:1802.08504.

[6] Circumstances enhanced criminal court view generation. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[7] Event Grounded Criminal Court View Generation with Cooperative (Large) Language Models. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.

[8] Divide and Conquer: Legal Concept-guided Criminal Court View Generation. Findings of the Association for Computational Linguistics: EMNLP 2024.

[9] Court opinion generation from case fact description with legal basis. Proceedings of the AAAI Conference on Artificial Intelligence.

[10] Enhancing court view generation with knowledge injection and guidance. arXiv preprint arXiv:2403.04366.

[11] Improving causality explanation of judge-view generation based on counterfactual. International Conference on Intelligent Computing, 2023.

[12] Generating reasonable legal text through the combination of language modeling and question answering. Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence.

[13] Unleashing the power of LLMs in court view generation by stimulating internal knowledge and incorporating external knowledge. Findings of the Association for Computational Linguistics: NAACL 2024.

[14] De-biased court's view generation with causality. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[15] LoRA: Low-Rank Adaptation of Large Language Models.

[16] ShiZhi: A Chinese Lightweight Large Language Model for Court View Generation. arXiv preprint arXiv:2510.09297.

[17] Qwen Team. Qwen2.5: A Party of Foundation Models.

[18] Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

[19] InternLM2 Technical Report. arXiv preprint arXiv:2403.17297.

[20] Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786.

[21] Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

[22] Attention is all you need. Advances in Neural Information Processing Systems.
