Exploring Lightweight Large Language Models for Court View Generation
Pith reviewed 2026-05-19 21:30 UTC · model grok-4.3
The pith
Lightweight LLMs under 2 billion parameters generate court views from case facts and support charge prediction, with results competitive with DNN baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on three public datasets show that LLM architecture and size create measurable differences in court view generation quality and in charge prediction accuracy, that lightweight models under 2B parameters achieve results comparable to DNN baselines, and that generating the court view before predicting the charge produces different outcomes than predicting the charge directly.
What carries the argument
Mixed-source training followed by separate evaluation on each dataset's test set, applied to compare lightweight LLM architectures and sizes on court view generation and downstream charge prediction.
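The protocol described above can be sketched in a few lines. This is a minimal illustration rather than CVGEvalKit's actual code: the function names, the `(facts, court view)` tuple layout, and the `generate` and `metric` callables are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (case facts, reference court view); layout is assumed


def build_mixed_train(train_sets: Dict[str, List[Example]]) -> List[Example]:
    """Concatenate every source's training split into one pool."""
    mixed: List[Example] = []
    for name in sorted(train_sets):  # sorted for a reproducible order
        mixed.extend(train_sets[name])
    return mixed


def evaluate_per_dataset(
    generate: Callable[[str], str],
    test_sets: Dict[str, List[Example]],
    metric: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score one trained model separately on each source's test split."""
    scores: Dict[str, float] = {}
    for name, examples in test_sets.items():
        values = [metric(generate(facts), reference) for facts, reference in examples]
        # An empty test split yields NaN instead of a division error.
        scores[name] = sum(values) / len(values) if values else float("nan")
    return scores
```

Keeping one score per source dataset, rather than one pooled score, is what lets a reader see whether a model's advantage holds across all sources or only on some.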
If this is right
- Architecture choice affects court view quality and charge prediction accuracy at least as much as raw model size.
- Models under 2B parameters can reach performance levels close to DNN baselines on both generation and prediction.
- Generating the court view first produces charge predictions that differ from those obtained by direct prediction.
- Lightweight models open a path to more efficient judicial AI applications.
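The comparison between predicting the charge directly and generating the court view first could be measured along the following lines. This is a hedged sketch: `predict_direct`, `generate_view`, and `predict_from_view` are hypothetical callables standing in for whatever models are used, and the paper's own pipeline may differ.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (case facts, gold charge label); layout is assumed


def compare_charge_pipelines(
    cases: List[Case],
    predict_direct: Callable[[str], str],
    generate_view: Callable[[str], str],
    predict_from_view: Callable[[str], str],
) -> Dict[str, float]:
    """Compare direct charge prediction with view-then-charge prediction."""
    if not cases:
        raise ValueError("cases must not be empty")
    direct_hits = 0
    two_stage_hits = 0
    disagreements = 0
    for facts, gold in cases:
        direct = predict_direct(facts)
        # Two-stage route: generate the court view, then derive the charge from it.
        two_stage = predict_from_view(generate_view(facts))
        direct_hits += int(direct == gold)
        two_stage_hits += int(two_stage == gold)
        disagreements += int(direct != two_stage)
    n = len(cases)
    return {
        "direct_accuracy": direct_hits / n,
        "two_stage_accuracy": two_stage_hits / n,
        "disagreement_rate": disagreements / n,
    }
```

Reporting the disagreement rate alongside the two accuracies shows how often the routes actually diverge, which accuracy alone can hide when errors cancel out.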
Where Pith is reading between the lines
- These models could run on modest hardware, allowing legal AI tools in settings with limited servers or internet.
- The observed size and architecture patterns may help select models for other specialized legal text tasks such as judgment summarization.
Load-bearing premise
Training on a mixed dataset drawn from multiple sources and evaluating on each individual test set yields unbiased comparisons of architecture and size effects.
What would settle it
Re-running the same models after training and testing entirely within one dataset instead of using mixed training would reveal whether domain differences are inflating or masking the reported architecture and size effects.
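One way to read the outcome of such a re-run is to check whether the model ranking on each dataset survives the change of training regime. The sketch below assumes scores are stored as `dataset -> model -> score`; the helper names are hypothetical and no particular metric is implied.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]  # dataset -> model -> score (assumed layout)


def model_ranking(scores_by_model: Dict[str, float]) -> List[str]:
    """Order models from best to worst; ties break alphabetically."""
    return sorted(scores_by_model, key=lambda m: (-scores_by_model[m], m))


def ranking_preserved(mixed: Scores, in_domain: Scores) -> Dict[str, bool]:
    """For each shared dataset, report whether both regimes rank models identically."""
    shared = sorted(set(mixed) & set(in_domain))
    return {ds: model_ranking(mixed[ds]) == model_ranking(in_domain[ds]) for ds in shared}


def regime_gap(mixed: Scores, in_domain: Scores) -> Dict[str, Dict[str, float]]:
    """Per dataset and model: in-domain score minus mixed-training score."""
    gaps: Dict[str, Dict[str, float]] = {}
    for ds in sorted(set(mixed) & set(in_domain)):
        models = sorted(set(mixed[ds]) & set(in_domain[ds]))
        gaps[ds] = {m: in_domain[ds][m] - mixed[ds][m] for m in models}
    return gaps
```

If the ranking flips on some dataset, or the gap differs sharply between models, that would suggest the mixed-training comparison is partly measuring adaptation to the data mixture rather than architecture or size.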
Original abstract
Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of court views based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how do different LLM architectures affect CVG quality and charge prediction, (2) how does LLM size contribute to performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting the charge by generating the court view first compare with predicting it directly. Additionally, we develop CVGEvalKit, an evaluation framework that includes three publicly available datasets for CVG and for charge prediction. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at https://github.com/ZhitianHou/CVGEvalKit
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper systematically explores the capabilities of lightweight large language models (smaller than 2B parameters) for Criminal Court View Generation (CVG) and their impact on charge prediction in Legal AI. It investigates four key questions regarding the effects of model architecture, model size, comparisons with Deep Neural Networks, and the benefits of generating court views prior to charge prediction. The authors develop CVGEvalKit, an evaluation framework with three public datasets, train models on a mixed training set, and evaluate on individual dataset test sets. Experimental results are claimed to provide insights into trade-offs and highlight the potential of lightweight LLMs.
Significance. Should the experimental findings prove robust to the noted methodological concerns, the work could deliver actionable insights for deploying efficient AI tools in judicial settings, emphasizing resource-efficient models. The open-sourcing of CVGEvalKit and the associated code represents a strength, promoting reproducibility and extension by the community.
major comments (1)
- [Experimental protocol (Abstract and Methods)] The protocol of training every model on a single mixed training set drawn from multiple sources and then evaluating on each dataset's held-out test set (as described in the Abstract) risks domain shift confounding the reported architecture and size effects. Observed differences in CVG quality or charge prediction may arise from uneven adaptation to jurisdiction-specific phrasing, case distributions, or label imbalances rather than intrinsic model properties. Without per-dataset training baselines or explicit shift quantification (e.g., via embedding similarity or cross-dataset performance gaps), the central claims on trade-offs cannot be cleanly attributed to the factors under study.
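The shift quantification via embedding similarity mentioned in this comment could look like the sketch below, which compares the centroids of two datasets' text embeddings by cosine similarity. It assumes the embeddings have already been produced by some sentence encoder (not specified in the source); centroid cosine is only one coarse proxy for domain shift, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one embedding")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all embeddings must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def centroid_similarity(
    emb_a: Sequence[Sequence[float]], emb_b: Sequence[Sequence[float]]
) -> float:
    """Cosine similarity between the mean embeddings of two datasets."""
    return cosine_similarity(mean_vector(emb_a), mean_vector(emb_b))
```

A value near 1 for every dataset pair would suggest the sources sit close together in embedding space; markedly lower values for one pair would flag where domain shift is most likely to confound the comparison.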
minor comments (2)
- [Abstract] The Abstract states that comprehensive experiments were conducted but provides no detail on the exact metrics for CVG quality, the statistical tests used, baseline DNN implementation details, or error analysis.
- [Evaluation framework] Clarify the exact composition and sampling proportions of the mixed training set across the three source datasets.
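The composition asked for in the last comment is straightforward to report once per-source training-set sizes are known. A minimal sketch, assuming the sizes are available as a mapping from a dataset name to an example count (the names and counts are whatever the authors supply; none are given in the source):

```python
from typing import Dict


def mixture_proportions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each source's share of the mixed training set."""
    if any(count < 0 for count in sizes.values()):
        raise ValueError("example counts must be non-negative")
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("the mixed training set must contain at least one example")
    return {name: count / total for name, count in sizes.items()}
```

Publishing these shares alongside the per-dataset results makes it possible to tell whether a weak score on one test set coincides with that source being under-represented in training.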
Simulated Author's Rebuttal
We appreciate the referee's detailed review of our work on lightweight LLMs for court view generation. We respond to the major comment as follows.
- Referee: [Experimental protocol (Abstract and Methods)] The protocol of training every model on a single mixed training set drawn from multiple sources and then evaluating on each dataset's held-out test set (as described in the Abstract) risks domain shift confounding the reported architecture and size effects. Observed differences in CVG quality or charge prediction may arise from uneven adaptation to jurisdiction-specific phrasing, case distributions, or label imbalances rather than intrinsic model properties. Without per-dataset training baselines or explicit shift quantification (e.g., via embedding similarity or cross-dataset performance gaps), the central claims on trade-offs cannot be cleanly attributed to the factors under study.
  Authors: We thank the referee for pointing out this potential issue with our experimental design. The mixed training set was chosen to create a unified training distribution that allows for fair and consistent comparison across different model architectures and sizes, reflecting a practical scenario in which Legal AI models are trained on aggregated data from various sources. This setup helps highlight the intrinsic capabilities of the models rather than variations due to different training subsets. That said, we agree that domain shift could be a confounding factor. To address this, we will incorporate per-dataset training baselines for representative models and provide quantitative analysis of cross-dataset shifts, such as performance gaps and embedding similarities, in the revised manuscript.
  Revision: yes
Circularity Check
No circularity: purely empirical comparisons on external datasets
The paper reports direct experimental runs of lightweight LLMs on three public CVG datasets. Models are trained once on a mixed training set and evaluated on each dataset's held-out test set; performance differences are measured via standard metrics. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked to derive the reported trade-offs. The central claims are therefore falsifiable observations from external data rather than tautological restatements of inputs, satisfying the self-contained criterion for score 0.