arxiv: 2512.01020 · v2 · submitted 2025-11-30 · 💻 cs.AI · cs.CL

Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

Pith reviewed 2026-05-17 02:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords: legal reasoning, LLM evaluation, reasoning traces, RAG, reinforcement learning, legal datasets, issue coverage, court judgments

The pith

Legal Issue Trees from court judgments serve as rubrics to evaluate LLM reasoning traces for issue coverage and correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the LEGIT dataset consisting of 24,000 instances where court judgments are converted into hierarchical trees of arguments and conclusions. These trees function as rubrics to assess how completely and accurately large language models cover the key legal issues in their reasoning traces. The authors find that LLM performance in legal reasoning depends significantly on both the breadth of issues addressed and the accuracy of the conclusions drawn. Retrieval-augmented generation enhances the models' overall ability to handle legal problems, while reinforcement learning using the rubrics specifically boosts correctness even as it may limit the range of issues considered.

Core claim

We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. Using the LEGIT dataset, we show that LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that retrieval-augmented generation and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.
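
For the RL side of this claim, the hyperparameter fragment reproduced with Figure 17 lists GRPO as the objective with 8 rollouts per prompt. A minimal sketch of GRPO's group-relative advantage step, with a rubric score standing in as the reward, might look like the following. The function name and the toy rewards are hypothetical, and the policy update, clipping, and KL term are omitted; this is not the authors' training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one prompt's rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores (0-10 scale) for 8 rollouts of the same prompt.
rollout_rewards = [6.0, 4.0, 7.5, 5.0, 6.5, 3.5, 8.0, 5.5]
advantages = group_relative_advantages(rollout_rewards)
```

Rollouts that score above the group mean receive positive advantages and are reinforced. A reward weighted towards correctness could therefore favour traces that state fewer issues but get them right, which would be consistent with the reported drop in coverage.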

What carries the argument

Legal Issue Trees (LEGIT): hierarchical structures extracted from court judgments that represent the opposing parties' arguments and the court's conclusions, and that are used as evaluation rubrics.
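
To make the structure concrete, here is a minimal Python sketch of how such a tree might be represented. The class and field names (`IssueNode`, `issue_id`, `party`, `argument`, `conclusion`, `children`) and the dotted issue labels are assumptions for illustration, not the dataset's actual schema, and the example text is invented.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class IssueNode:
    """One legal issue: a party's argument plus the court's conclusion on it."""
    issue_id: str            # hierarchical label such as "1.2" (assumed convention)
    party: str               # "plaintiff" or "defendant"
    argument: str
    conclusion: str
    children: List["IssueNode"] = field(default_factory=list)

    def walk(self) -> Iterator["IssueNode"]:
        """Yield this node and all descendants in depth-first order."""
        yield self
        for child in self.children:
            yield from child.walk()


# Hypothetical toy tree; the text is illustrative, not taken from the dataset.
tree = IssueNode(
    "1", "plaintiff", "The contract should be cancelled.", "Claim accepted.",
    children=[
        IssueNode("1.1", "defendant", "The filing period has expired.",
                  "Defence rejected."),
        IssueNode("1.2", "plaintiff", "The transfer harmed other creditors.",
                  "Argument accepted."),
    ],
)

issue_ids = [node.issue_id for node in tree.walk()]
print(issue_ids)
```

Flattening the tree with `walk()` gives the list of rubric items against which a reasoning trace can be checked issue by issue.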

If this is right

  • LLM legal reasoning quality can be measured along two distinct dimensions of coverage and correctness.
  • RAG and rubric-based RL offer complementary improvements that could be combined in future legal AI systems.
  • Large-scale expert-level datasets like LEGIT enable more granular evaluation than coarse rubrics.
  • Human expert validation supports the use of automatically extracted trees for reliable assessment.
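
The two dimensions in the first bullet can be made concrete with a small scoring sketch. It assumes each rubric issue has already been labelled, by a human or an LLM judge, as covered or not and, if covered, as correct or not. The component maxima (final order correctness out of 5, issue coverage out of 2, issue correctness out of 3, total out of 10) follow the axis labels of the paper's RAG plots. The aggregation by simple proportions and the names `IssueLabel` and `legit_style_score` are assumptions and may differ from the paper's exact formula.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IssueLabel:
    issue_id: str
    covered: bool   # did the reasoning trace address this issue?
    correct: bool   # did it match the court's conclusion? (only meaningful if covered)


def legit_style_score(labels: List[IssueLabel], final_order_correct: bool) -> Dict[str, float]:
    """Combine per-issue labels into three components and a total out of 10.

    Assumption: coverage is the fraction of rubric issues covered, and
    correctness is the fraction of rubric issues both covered and correct.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("rubric must contain at least one issue")
    coverage = sum(l.covered for l in labels) / n
    correctness = sum(l.covered and l.correct for l in labels) / n
    components = {
        "final_order": 5.0 if final_order_correct else 0.0,
        "coverage": 2.0 * coverage,
        "correctness": 3.0 * correctness,
    }
    components["total"] = sum(components.values())
    return components


# Hypothetical labels for a four-issue rubric.
labels = [
    IssueLabel("1", covered=True, correct=True),
    IssueLabel("1.1", covered=True, correct=False),
    IssueLabel("1.2", covered=False, correct=False),
    IssueLabel("2", covered=True, correct=True),
]
score = legit_style_score(labels, final_order_correct=True)
```

Keeping the components separate, rather than reporting only the total, is what lets coverage and correctness move in opposite directions and still be visible.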

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such tree-based rubrics might generalize to evaluate reasoning in other complex domains like medical diagnosis or scientific argumentation.
  • Integrating both RAG and RL could lead to models with high coverage and high correctness simultaneously.
  • Automated methods for tree extraction could scale the approach beyond manual conversion of judgments.

Load-bearing premise

The hierarchical trees automatically extracted from court judgments faithfully capture the essential legal issues and human expert agreement on samples validates them as reliable rubrics.

What would settle it

A large-scale study where independent legal experts create their own issue trees for the same judgments and compare agreement rates with the extracted ones, or where high-scoring LLM traces are tested in real legal scenarios and found to underperform.

Figures

Figures reproduced from arXiv: 2512.01020 by Arman Cohan, Jinu Lee, Julia Hockenmaier, Kyoung-Woon On, Simeng Han.

Figure 1. Overview of the LEGIT dataset and task. Facts and issue trees are extracted from real-world court …
Figure 3. Confusion matrices of individual issue labels …
Figure 4. Comparison between LLM-evaluated scores between LEGIT score and Likert scale. Even though the …
Figure 5. LEGIT score of 12 generator LLMs, evaluated …
Figure 6. Three LLM responses obtained from the example LEGIT problem in Figure …
Figure 7. Correctness rate of covered parent issue, de…
Figure 8. Comparison of LEGIT scores of Gemma-3-4B with RAG and RL. While RAG improves all components of LEGIT scores, RL significantly improves the final order/issue correctness while reducing issue coverage.
  Text following the caption in the paper: …sourced LLMs show consistent results; see Appendix E. RL prioritizes correctness at the cost of coverage. In contrast to RAG, reinforcement learning with the LEGIT reward significantly increases final orde…
Figure 9. A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, as well as two LLM outputs and their LEGIT scores (bottom), translated into English. Refer to Figure 10 for the original version of the data and LLM responses.
Figure 10. A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, as well as two LLM outputs and their LEGIT scores (bottom), in Korean. Refer to Figure 9 for the English-translated version of the data and LLM responses.
Figure 11. Histogram showing the number of issues for each LEGIT instance. The dataset is divided into easy/medium/hard difficulty subsets based on the number of issues.
  Text following the caption in the paper: …(issue 2.2), which pinpoints the deduction error that led to the wrong final order prediction. Appendix B, LEGIT dataset details. B.1, Filtering out non-deterministic cases: We use rules to filter out non-deterministic final orders. First, we only maintain civi…
Figure 12. Distribution of case types in LEGIT. Case types that have more than 200 instances are shown with …
Figure 13. Annotation guide presented to the lawyers during expert annotation process in Section …
Figure 15. (1) Issue coverage and (2) issue correctness …
Figure 16. Parent issue correctness binned by child issue coverage and correctness, extending Figure …
Figure 17. Results of RAG for LEGIT dataset, with three different generators and five retrieval settings (No RAG, BM25, two Contrievers, and Ground-truth citations). RAG improves LEGIT score for around 0.1-0.4 for all (generator, retriever) pairs, with a gain in all three components.
  Hyperparameter table following the caption in the paper (truncated in the source):
    Hyperparameter     Value
    Objective          GRPO
    KL div. coef.      1e-3
    Max prompt len.    2048
    Max output len.    4096
    Batch size         32
    Rollouts           8
    Optimizer          Adam…
Figure 18. Prompts for dataset construction and generating LLM responses.
Figure 19. Prompts for evaluating the reasoning traces, either with LEGIT rubrics (left) or Likert scale (right).
Original abstract

Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LEGIT, a dataset of 24K hierarchical legal issue trees automatically extracted from court judgments. These trees are positioned as rubrics to evaluate LLM reasoning traces along two axes: legal issue coverage and correctness. Experiments on held-out LLM outputs are used to argue that LLM legal reasoning is substantially degraded by deficiencies in either dimension, and that RAG and rubric-based RL yield complementary gains (RAG boosting overall capability while RL improves correctness at some cost to coverage). Reliability of the trees is asserted via human expert annotation on an unspecified sample plus comparison against coarser rubrics.

Significance. If the extracted trees function as faithful, generalizable rubrics, the work supplies a scalable, expert-level framework for evaluating complex reasoning traces in a high-stakes domain. The complementary RAG/RL finding, if robustly measured, would offer concrete guidance for training legal-reasoning models. The scale (24K instances) and structured output format are clear strengths that could support future benchmark development.

major comments (2)
  1. [Human verification / rubric reliability section] Human verification subsection (and associated results): the manuscript reports that human experts annotated a sample and compared the trees against coarse rubrics, yet supplies no quantitative agreement statistics (Cohen’s κ, percentage agreement, or error typology), no sample size, and no inter-annotator details. Because the headline claims rest on the trees serving as reliable rubrics for arbitrary LLM traces, the absence of these metrics leaves open the possibility that systematic extraction artifacts (omitted implicit issues, misaligned party positions) are being measured rather than genuine reasoning quality.
  2. [LLM evaluation experiments] Experimental results section: LLM performance claims (coverage/correctness degradation and RAG vs. RL complementarity) are presented at a high level without reported numerical metrics, confidence intervals, prompt-variation controls, or ablation tables. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains.
minor comments (2)
  1. [Dataset construction] Clarify the exact procedure and any heuristics used for automatic conversion of judgments into hierarchical trees; a short pseudocode or decision tree would aid reproducibility.
  2. [Discussion / Limitations] Add a limitations paragraph discussing potential domain shift between the court judgments used for tree extraction and the distribution of LLM-generated traces being evaluated.
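
Major comment 1 asks for agreement statistics. As a reference point, percentage agreement and Cohen's κ for two annotators' categorical labels (for example, whether an issue is covered) can be computed as sketched below. The function name and the toy labels are hypothetical; the paper's actual annotation protocol is not described on this page.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def agreement_stats(a: Sequence[Hashable], b: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (percentage agreement, Cohen's kappa) for two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both annotators used one identical label throughout
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical labels from two annotators for six rubric issues.
annotator_1 = ["covered", "covered", "missed", "covered", "missed", "covered"]
annotator_2 = ["covered", "missed", "missed", "covered", "missed", "covered"]
p_o, kappa = agreement_stats(annotator_1, annotator_2)
```

Reporting κ alongside raw agreement matters here because coverage labels are likely to be imbalanced, and raw agreement alone can look high by chance.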

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have planned revisions to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Human verification / rubric reliability section] Human verification subsection (and associated results): the manuscript reports that human experts annotated a sample and compared the trees against coarse rubrics, yet supplies no quantitative agreement statistics (Cohen’s κ, percentage agreement, or error typology), no sample size, and no inter-annotator details. Because the headline claims rest on the trees serving as reliable rubrics for arbitrary LLM traces, the absence of these metrics leaves open the possibility that systematic extraction artifacts (omitted implicit issues, misaligned party positions) are being measured rather than genuine reasoning quality.

    Authors: We agree that quantitative reliability metrics are important for substantiating the rubric claims. The revised manuscript will expand the human verification subsection to report the sample size, inter-annotator agreement statistics including Cohen’s κ and percentage agreement, and an error typology. We will also clarify how the annotations help identify and mitigate potential extraction artifacts such as omitted implicit issues or misaligned party positions, thereby supporting that the trees measure genuine reasoning quality. revision: yes

  2. Referee: [LLM evaluation experiments] Experimental results section: LLM performance claims (coverage/correctness degradation and RAG vs. RL complementarity) are presented at a high level without reported numerical metrics, confidence intervals, prompt-variation controls, or ablation tables. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains.

    Authors: We acknowledge that the experimental claims require more granular reporting to allow proper assessment. In the revised manuscript we will add specific numerical metrics for coverage and correctness scores, 95% confidence intervals, details on prompt-variation controls, and ablation tables that isolate the contributions of RAG and rubric-based RL. These additions will quantify the magnitude of the complementary gains and the impact of deficiencies in either dimension. revision: yes
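
One common way to obtain the 95% confidence intervals mentioned in this response is a percentile bootstrap over per-instance scores. The sketch below is a generic illustration, not the authors' procedure; the function name, the number of resamples, and the toy scores are assumptions.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-instance scores on a 0-10 scale.
toy_scores = [5.5, 6.0, 4.5, 7.0, 5.0, 6.5, 5.8, 4.9, 6.2, 5.1]
low, high = bootstrap_ci(toy_scores)
```

When two systems are scored on the same instances, resampling the per-instance score differences instead gives a paired interval, which is usually tighter.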

Circularity Check

0 steps flagged

No circularity: empirical results are direct measurements against externally constructed and sampled-validated rubrics

full rationale

The paper constructs LEGIT issue trees from court judgments to serve as rubrics, validates them via human expert annotations on a sample plus comparison to coarse rubrics, then reports empirical measurements of LLM reasoning traces (coverage and correctness) on held-out outputs. No equations, fitted parameters, or self-citations reduce the claimed effects of RAG/RL or the impact of coverage/correctness to quantities defined by the paper's own inputs. The derivation chain consists of independent data construction followed by direct evaluation, with no self-definitional, fitted-prediction, or load-bearing self-citation steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that court judgments contain extractable hierarchical issue structures that serve as objective gold-standard rubrics; no free parameters or invented physical entities are introduced.

axioms (2)
  • domain assumption: Court judgments can be reliably decomposed into hierarchical trees of opposing arguments and conclusions that represent the essential legal issues.
    Invoked when converting judgments into rubrics for LLM evaluation.
  • domain assumption: Human expert annotations on a sample suffice to establish the reliability of the automatically constructed trees.
    Used to verify the rubrics before applying them to LLM traces.

pith-pipeline@v0.9.0 · 5483 in / 1386 out tokens · 43861 ms · 2026-05-17T02:24:13.506150+00:00 · methodology

