Enhancing Judgment Document Generation via Agentic Legal Information Collection and Rubric-Guided Optimization
Pith reviewed 2026-05-09 16:58 UTC · model grok-4.3
The pith
Judge-R1 improves automated judgment document generation by combining agentic information collection with rubric-guided reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Judge-R1 is a unified framework for LLM-based judgment document generation. It jointly improves legal information collection, via a dynamic planning agent that retrieves statutes and precedents, and optimizes the generation process via Rubric-Guided Optimization, which applies Group Relative Policy Optimization (GRPO) with a legal reward function to enforce judicial standards and logical reasoning. Extensive experiments on the JuDGE benchmark show significant improvements in legal accuracy and generation quality over state-of-the-art baselines.
What carries the argument
Agentic Legal Information Collection, which employs a dynamic planning agent for precise retrieval, paired with Rubric-Guided Optimization that uses GRPO and a comprehensive legal reward function to align outputs with judicial requirements.
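The group-relative update at the heart of GRPO can be sketched minimally: rewards for a group of sampled drafts are normalized against the group's own mean and standard deviation, so no learned value model is needed. The rubric scores below are illustrative, not values from the paper.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sampled response's reward is normalized
    by the mean and standard deviation of its own sampling group, so no
    separate value network is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Hypothetical rubric scores for four sampled judgment drafts of one case:
advantages = group_relative_advantages([0.9, 0.6, 0.6, 0.3])
```

Drafts scoring above the group mean get positive advantage (their tokens are reinforced); drafts below get negative advantage, which is what lets a rubric reward shape the policy without pairwise preference labels.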
Load-bearing premise
The JuDGE benchmark and the legal reward function used in the optimization sufficiently capture the key aspects of real-world judicial standards and logical reasoning without significant biases or omissions.
What would settle it
A direct comparison by legal professionals of documents generated by Judge-R1 and by baseline systems on a set of new, real court cases, assessing the accuracy of cited laws, the soundness of reasoning, and overall usability; the claim would be refuted if Judge-R1 showed no advantage.
Original abstract
Automating the drafting of judgment documents is pivotal to judicial efficiency, yet it remains challenging due to the dual requirements of comprehensive retrieval of legal information and rigorous logical reasoning. Existing approaches, typically relying on standard Retrieval-Augmented Generation and Supervised Fine-Tuning, often suffer from insufficient evidence recall, hallucinated statutory references, and logically flawed legal reasoning. To bridge this gap, we propose Judge-R1, a unified framework designed to enhance LLM-based judgment document generation by jointly improving legal information collection and judgment document generation. First, we introduce Agentic Legal Information Collection, which employs a dynamic planning agent to retrieve precise statutes and precedents from multiple sources. Second, we implement Rubric-Guided Optimization, a reinforcement learning phase utilizing Group Relative Policy Optimization (GRPO) with a comprehensive legal reward function to enforce adherence to judicial standards and reasoning logic. Extensive experiments on the JuDGE benchmark demonstrate that Judge-R1 significantly outperforms state-of-the-art baselines in both legal accuracy and generation quality.
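The collect-then-generate pipeline the abstract describes can be sketched as a minimal planning loop. Every name here (`plan`, `search`, `enough`) is a hypothetical stand-in for the paper's agent components, not its actual interface.

```python
def collect_legal_information(case_facts, plan, search, enough, max_steps=5):
    """Hypothetical sketch of agentic legal information collection:
    a planner proposes the next query and source (statute database or
    precedent corpus), a retriever executes it, and the loop stops once
    the gathered evidence is judged sufficient for drafting."""
    evidence = []
    for _ in range(max_steps):
        query, source = plan(case_facts, evidence)  # e.g. ("theft statute", "statute_db")
        evidence.extend(search(query, source))
        if enough(case_facts, evidence):
            break
    return evidence
```

The dynamic-planning aspect is that each new query is conditioned on what has already been retrieved, unlike single-shot RAG, which issues one fixed query and cannot recover missed evidence.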
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Judge-R1, a unified framework for LLM-based judgment document generation. It combines Agentic Legal Information Collection (a dynamic planning agent retrieving statutes and precedents from multiple sources) with Rubric-Guided Optimization (GRPO reinforcement learning driven by a comprehensive legal reward function enforcing judicial standards and logical reasoning). The central claim is that this approach significantly outperforms state-of-the-art baselines on the JuDGE benchmark in both legal accuracy and generation quality.
Significance. If the empirical results and evaluation hold, the work could meaningfully advance legal AI by mitigating hallucinated references and flawed reasoning in automated drafting, offering a practical path toward higher judicial efficiency. The combination of agentic retrieval and rubric-based RL is a timely contribution given the domain's demands for precision and traceability.
Major comments (2)
- Abstract: The central claim that Judge-R1 'significantly outperforms' baselines in legal accuracy and generation quality is load-bearing yet unsupported by any metrics, baseline names, effect sizes, statistical tests, or error analysis in the provided text. Without these, the magnitude and reliability of the reported gains cannot be assessed.
- The manuscript provides no description of JuDGE benchmark construction (case selection criteria, annotation protocol, inter-annotator agreement, or coverage of jurisdiction-specific rules and multi-precedent conflicts) or the exact components and weighting of the legal reward function used in GRPO. These omissions directly affect whether the measured improvements reflect genuine advances in statutory adherence and logical reasoning or artifacts of the evaluation design.
Simulated Author's Rebuttal
We sincerely thank the referee for the thorough and constructive review. We have carefully addressed each major comment below, providing clarifications and committing to revisions that improve the manuscript's transparency and reproducibility without altering the core contributions.
Point-by-point responses
- Referee: Abstract: The central claim that Judge-R1 'significantly outperforms' baselines in legal accuracy and generation quality is load-bearing yet unsupported by any metrics, baseline names, effect sizes, statistical tests, or error analysis in the provided text. Without these, the magnitude and reliability of the reported gains cannot be assessed.
  Authors: We agree that the abstract would benefit from greater specificity to allow immediate assessment of the claims. The full manuscript reports quantitative results, baseline comparisons, and evaluation details in the Experiments section; however, we have revised the abstract to explicitly name the primary baselines, report key effect sizes and metrics from the JuDGE benchmark, and reference the statistical significance and error analysis performed. Revision: yes.
- Referee: The manuscript provides no description of JuDGE benchmark construction (case selection criteria, annotation protocol, inter-annotator agreement, or coverage of jurisdiction-specific rules and multi-precedent conflicts) or the exact components and weighting of the legal reward function used in GRPO. These omissions directly affect whether the measured improvements reflect genuine advances in statutory adherence and logical reasoning or artifacts of the evaluation design.
  Authors: We acknowledge that the original submission provided only high-level mentions of the JuDGE benchmark and reward function. In the revised manuscript we have added a dedicated subsection detailing JuDGE construction (case selection from real judicial records, expert annotation protocol, inter-annotator agreement scores, and handling of jurisdiction rules plus multi-precedent conflicts) as well as the precise components and weightings of the legal reward function used within GRPO. Revision: yes.
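A weighted rubric reward of the kind discussed in this exchange could look like the following sketch. The rubric dimensions (statute accuracy, reasoning soundness, format compliance) and the weights are illustrative assumptions, not the paper's reported configuration.

```python
def legal_reward(scores, weights=None):
    """Sketch of a weighted rubric reward for GRPO. The dimensions and
    weights here are illustrative assumptions, not the paper's values."""
    weights = weights or {"statute": 0.4, "reasoning": 0.4, "format": 0.2}
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * scores[k] for k, w in weights.items())

# A draft with perfect statute citations and format but shaky reasoning:
r = legal_reward({"statute": 1.0, "reasoning": 0.5, "format": 1.0})
```

The referee's point maps directly onto this sketch: without knowing the actual dimensions and weights, one cannot tell whether a reported gain comes from better reasoning or from the format term dominating the reward.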
Circularity Check
No significant circularity; empirical claims rest on independent benchmark evaluation
Full rationale
The paper presents Judge-R1 as a framework combining agentic retrieval and GRPO-based RL with a legal reward function, then reports empirical outperformance on the JuDGE benchmark. No equations, fitted parameters, or self-citations are shown to reduce any claimed result to the inputs by construction. The method builds on standard RAG, SFT, and RL techniques without self-definitional loops, and the performance claims are external evaluations rather than renamed fits or uniqueness theorems imported from the authors' prior work. This is a typical empirical ML paper whose central claims remain falsifiable against external benchmarks and do not collapse into self-referential definitions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Standard LLM capabilities and RL optimization techniques transfer effectively to legal reasoning tasks.
Reference graph
Works this paper leans on
- [1]
- [2] Qian Dong, Qingyao Ai, Hongning Wang, Yiding Liu, Haitao Li, Weihang Su, Yiqun Liu, Tat-Seng Chua, and Shaoping Ma. 2025. Decoupling Knowledge and Context: An Efficient and Effective Retrieval Augmented Generation Framework via Cross Attention. In Proceedings of the ACM on Web Conference 2025.
- [3] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv preprint arXiv:2404.16130 (2024).
- [4]
- [5] Randy Goebel, Yoshinobu Kano, Mi-Young Kim, Juliano Rabelo, Ken Satoh, and Masaharu Yoshioka. 2023. Summary of the Competition on Legal Information Extraction/Entailment (COLIEE) 2023. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law. 472–480.
- [6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948 (2025).
- [7]
- [8] Zikun Hu, Xiang Li, Cunchao Tu, Zhiyuan Liu, and Maosong Sun. 2018. Few-Shot Charge Prediction with Discriminative Legal Attributes. In Proceedings of the 27th International Conference on Computational Linguistics. 487–498.
- [9]
- [10] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint arXiv:2503.09516 (2025).
- [11]
- [12] Mi-Young Kim, Juliano Rabelo, Randy Goebel, Masaharu Yoshioka, Yoshinobu Kano, and Ken Satoh. 2022. COLIEE 2022 Summary: Methods for Legal Document Retrieval and Entailment. In JSAI International Symposium on Artificial Intelligence. Springer, 51–67.
- [13] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
- [14] Haitao Li, Qingyao Ai, Jia Chen, Qian Dong, Yueyue Wu, Yiqun Liu, Chong Chen, and Qi Tian. 2023. SAILER: Structure-Aware Pre-trained Language Model for Legal Case Retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1035–1044.
- [15] Shang Li, Hongli Zhang, Lin Ye, Xiaoding Guo, and Binxing Fang. 2019. MANN: A Multichannel Attentive Neural Network for Legal Judgment Prediction. IEEE Access 7 (2019), 151144–151155.
- [16]
- [17]
- [18]
- [19] Juliano Rabelo, Randy Goebel, Mi-Young Kim, Yoshinobu Kano, Masaharu Yoshioka, and Ken Satoh. 2022. Overview and Discussion of the Competition on Legal Information Extraction/Entailment (COLIEE) 2021. The Review of Socionetwork Strategies 16, 1 (2022), 111–133.
- [20] Stephen Robertson, Hugo Zaragoza, et al. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
- [21] Weihang Su, Qingyao Ai, Xiangsheng Li, Jia Chen, Yiqun Liu, Xiaolong Wu, and Shengluan Hou. 2024. Wikiformer: Pre-training with Structured Information of Wikipedia for Ad-hoc Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19026–19034.
- [22]
- [23] Weihang Su, Qingyao Ai, Yueyue Wu, Anzhe Xie, Changyue Wang, Yixiao Ma, Haitao Li, Zhijing Wu, Yiqun Liu, and Min Zhang. 2025. Pre-training for Legal Case Retrieval Based on Inter-Case Distinctions. ACM Transactions on Information Systems 43, 5 (2025), 1–27.
- [24] Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, and Yiqun Liu. 2025. Dynamic and Parametric Retrieval-Augmented Generation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 4118–4121.
- [25] Weihang Su, Qian Dong, Qingyao Ai, and Yiqun Liu. 2025. SIGIR-AP 2025 Tutorial Proposal: Dynamic and Parametric Retrieval-Augmented Generation. In 3rd International ACM SIGIR Conference on Information Retrieval in the Asia Pacific.
- [26] Weihang Su, Yiran Hu, Anzhe Xie, Qingyao Ai, Quezi Bing, Ning Zheng, Yun Liu, Weixing Shen, and Yiqun Liu. 2024. STARD: A Chinese Statute Retrieval Dataset Derived from Real-life Queries by Non-professionals. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for C...
- [27] Weihang Su, Jianming Long, Qingyao Ai, Yichen Tang, Changyue Wang, Yiteng Tu, and Yiqun Liu. 2026. Skill Retrieval Augmentation for Agentic AI. arXiv preprint arXiv:2604.24594 (2026).
- [28]
- [29] Weihang Su, Yichen Tang, Qingyao Ai, Changyue Wang, Zhijing Wu, and Yiqun Liu. 2024. Mitigating Entity-Level Hallucination in Large Language Models. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 23–31.
- [30]
- [31]
- [32]
- [33] Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Xuanyi Chen, Jiaxin Mao, Ziyi Ye, and Yiqun Liu. 2025. SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation. arXiv preprint arXiv:2508.15658 (2025).
- [34] Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, and Yiqun Liu. 2025. JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3573–3583.
- [35]
- [36]
- [37] Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, and Qingyao Ai. 2025. Robust Fine-tuning for Retrieval Augmented Generation against Retrieval Defects. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1272–1282.
- [38] Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Fen Lin, Qin Liu, and Qingyao Ai.
- [39] Generalized Pseudo-Relevance Feedback. In Proceedings of the ACM Web Conference 2026. 1876–1886.
- [40] Changyue Wang, Weihang Su, Qingyao Ai, and Yiqun Liu. 2026. Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 33377–33385.
- [41] Changyue Wang, Weihang Su, Qingyao Ai, Yichen Tang, and Yiqun Liu. 2025. Knowledge Editing through Chain-of-Thought. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 10684–10704.
- [42]
- [43]
- [44]