Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

Arman Cohan; Jinu Lee; Julia Hockenmaier; Kyoung-Woon On; Simeng Han

arxiv: 2512.01020 · v2 · submitted 2025-11-30 · 💻 cs.AI · cs.CL

Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

Jinu Lee , Kyoung-Woon On , Simeng Han , Arman Cohan , Julia Hockenmaier This is my paper

Pith reviewed 2026-05-17 02:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords legal reasoningLLM evaluationreasoning tracesRAGreinforcement learninglegal datasetsissue coveragecourt judgments

0 comments

The pith

Legal Issue Trees from court judgments serve as rubrics to evaluate LLM reasoning traces for issue coverage and correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the LEGIT dataset consisting of 24,000 instances where court judgments are converted into hierarchical trees of arguments and conclusions. These trees function as rubrics to assess how completely and accurately large language models cover the key legal issues in their reasoning traces. The authors find that LLM performance in legal reasoning depends significantly on both the breadth of issues addressed and the accuracy of the conclusions drawn. Retrieval-augmented generation enhances the models' overall ability to handle legal problems, while reinforcement learning using the rubrics specifically boosts correctness even as it may limit the range of issues considered.

Core claim

We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. Using the LEGIT dataset, we show that LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that retrieval-augmented generation and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.

What carries the argument

Legal Issue Trees (LEGIT), which are hierarchical structures extracted from court judgments representing opposing arguments and court conclusions used as evaluation rubrics.

If this is right

LLM legal reasoning quality can be measured along two distinct dimensions of coverage and correctness.
RAG and rubric-based RL offer complementary improvements that could be combined in future legal AI systems.
Large-scale expert-level datasets like LEGIT enable more granular evaluation than coarse rubrics.
Human expert validation supports the use of automatically extracted trees for reliable assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such tree-based rubrics might generalize to evaluate reasoning in other complex domains like medical diagnosis or scientific argumentation.
Integrating both RAG and RL could lead to models with high coverage and high correctness simultaneously.
Automated methods for tree extraction could scale the approach beyond manual conversion of judgments.

Load-bearing premise

The hierarchical trees automatically extracted from court judgments faithfully capture the essential legal issues and human expert agreement on samples validates them as reliable rubrics.

What would settle it

A large-scale study where independent legal experts create their own issue trees for the same judgments and compare agreement rates with the extracted ones, or where high-scoring LLM traces are tested in real legal scenarios and found to underperform.

Figures

Figures reproduced from arXiv: 2512.01020 by Arman Cohan, Jinu Lee, Julia Hockenmaier, Kyoung-Woon On, Simeng Han.

**Figure 3.** Figure 3: Confusion matrices of individual issue labels [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison between LLM-evaluated scores between LEGIT score and Likert scale. Even though the [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: LEGIT score of 12 generator LLMs, evaluated [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Three LLM responses obtained from the example LEGIT problem in Figure [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Correctness rate of covered parent issue, de [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of LEGIT scores of Gemma-3- 4B with RAG and RL. While RAG improves all components of LEGIT scores, RL significantly improves the final order/issue correctness while reducing issue coverage. sourced LLMs show consistent results; see Appendix E. RL prioritizes correctness at the cost of coverage. In contrast to RAG, reinforcement learning with the LEGIT reward significantly increases final orde… view at source ↗

**Figure 9.** Figure 9: A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗

**Figure 11.** Figure 11: Histogram showing the number of issues for each LEGIT instance. The dataset is divided into easy/medium/hard difficulty subsets based on the number of issues. (issue 2.2), which pinpoints the deduction error that led to the wrong final order prediction. B LEGIT dataset details B.1 Filtering out non-deterministic cases We use rules to filter out non-deterministic final orders. First, we only maintain civi… view at source ↗

**Figure 12.** Figure 12: Distribution of case types in LEGIT. Case types that have more than 200 instances are shown with [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗

**Figure 13.** Figure 13: Annotation guide presented to the lawyers during expert annotation process in Section [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 15.** Figure 15: (1) Issue coverage and (2) issue correctness [PITH_FULL_IMAGE:figures/full_fig_p019_15.png] view at source ↗

**Figure 16.** Figure 16: Parent issue correctness binned by child issue coverage and correctness, extending Figure [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗

**Figure 17.** Figure 17: Results of RAG for LEGIT dataset, with three different generators and five retrieval settings (No RAG, BM25, two Contrievers, and Ground-truth citations). RAG improves LEGIT score for around 0.1-0.4 for all (generator, retriever) pairs, with a gain in all three components. Hyperparam. Value Objective GRPO KL Div. Coef. 1e-3 Max prompt len. 2048 Max output len. 4096 Batch size 32 Rollouts 8 Optimizer Adam… view at source ↗

**Figure 18.** Figure 18: Prompts for dataset construction and generating LLM responses. [PITH_FULL_IMAGE:figures/full_fig_p023_18.png] view at source ↗

**Figure 19.** Figure 19: Prompts for evaluating the reasoning traces, either with LEGIT rubrics (left) or Likert scale (right). [PITH_FULL_IMAGE:figures/full_fig_p024_19.png] view at source ↗

read the original abstract

Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LEGIT, a dataset of 24K hierarchical legal issue trees automatically extracted from court judgments. These trees are positioned as rubrics to evaluate LLM reasoning traces along two axes: legal issue coverage and correctness. Experiments on held-out LLM outputs are used to argue that LLM legal reasoning is substantially degraded by deficiencies in either dimension, and that RAG and rubric-based RL yield complementary gains (RAG boosting overall capability while RL improves correctness at some cost to coverage). Reliability of the trees is asserted via human expert annotation on an unspecified sample plus comparison against coarser rubrics.

Significance. If the extracted trees function as faithful, generalizable rubrics, the work supplies a scalable, expert-level framework for evaluating complex reasoning traces in a high-stakes domain. The complementary RAG/RL finding, if robustly measured, would offer concrete guidance for training legal-reasoning models. The scale (24K instances) and structured output format are clear strengths that could support future benchmark development.

major comments (2)

[Human verification / rubric reliability section] Human verification subsection (and associated results): the manuscript reports that human experts annotated a sample and compared the trees against coarse rubrics, yet supplies no quantitative agreement statistics (Cohen’s κ, percentage agreement, or error typology), no sample size, and no inter-annotator details. Because the headline claims rest on the trees serving as reliable rubrics for arbitrary LLM traces, the absence of these metrics leaves open the possibility that systematic extraction artifacts (omitted implicit issues, misaligned party positions) are being measured rather than genuine reasoning quality.
[LLM evaluation experiments] Experimental results section: LLM performance claims (coverage/correctness degradation and RAG vs. RL complementarity) are presented at a high level without reported numerical metrics, confidence intervals, prompt-variation controls, or ablation tables. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains.

minor comments (2)

[Dataset construction] Clarify the exact procedure and any heuristics used for automatic conversion of judgments into hierarchical trees; a short pseudocode or decision tree would aid reproducibility.
[Discussion / Limitations] Add a limitations paragraph discussing potential domain shift between the court judgments used for tree extraction and the distribution of LLM-generated traces being evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have planned revisions to strengthen the presentation of our results.

read point-by-point responses

Referee: [Human verification / rubric reliability section] Human verification subsection (and associated results): the manuscript reports that human experts annotated a sample and compared the trees against coarse rubrics, yet supplies no quantitative agreement statistics (Cohen’s κ, percentage agreement, or error typology), no sample size, and no inter-annotator details. Because the headline claims rest on the trees serving as reliable rubrics for arbitrary LLM traces, the absence of these metrics leaves open the possibility that systematic extraction artifacts (omitted implicit issues, misaligned party positions) are being measured rather than genuine reasoning quality.

Authors: We agree that quantitative reliability metrics are important for substantiating the rubric claims. The revised manuscript will expand the human verification subsection to report the sample size, inter-annotator agreement statistics including Cohen’s κ and percentage agreement, and an error typology. We will also clarify how the annotations help identify and mitigate potential extraction artifacts such as omitted implicit issues or misaligned party positions, thereby supporting that the trees measure genuine reasoning quality. revision: yes
Referee: [LLM evaluation experiments] Experimental results section: LLM performance claims (coverage/correctness degradation and RAG vs. RL complementarity) are presented at a high level without reported numerical metrics, confidence intervals, prompt-variation controls, or ablation tables. Without these, it is impossible to assess the magnitude or statistical reliability of the reported gains.

Authors: We acknowledge that the experimental claims require more granular reporting to allow proper assessment. In the revised manuscript we will add specific numerical metrics for coverage and correctness scores, 95% confidence intervals, details on prompt-variation controls, and ablation tables that isolate the contributions of RAG and rubric-based RL. These additions will quantify the magnitude of the complementary gains and the impact of deficiencies in either dimension. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results are direct measurements against externally constructed and sampled-validated rubrics

full rationale

The paper constructs LEGIT issue trees from court judgments to serve as rubrics, validates them via human expert annotations on a sample plus comparison to coarse rubrics, then reports empirical measurements of LLM reasoning traces (coverage and correctness) on held-out outputs. No equations, fitted parameters, or self-citations reduce the claimed effects of RAG/RL or the impact of coverage/correctness to quantities defined by the paper's own inputs. The derivation chain consists of independent data construction followed by direct evaluation, with no self-definitional, fitted-prediction, or load-bearing self-citation steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that court judgments contain extractable hierarchical issue structures that serve as objective gold-standard rubrics; no free parameters or invented physical entities are introduced.

axioms (2)

domain assumption Court judgments can be reliably decomposed into hierarchical trees of opposing arguments and conclusions that represent the essential legal issues.
Invoked when converting judgments into rubrics for LLM evaluation.
domain assumption Human expert annotations on a sample suffice to establish the reliability of the automatically constructed trees.
Used to verify the rubrics before applying them to LLM traces.

pith-pipeline@v0.9.0 · 5483 in / 1386 out tokens · 43861 ms · 2026-05-17T02:24:13.506150+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 1 internal anchor

[1]

Exaone 3.5: Series of large lan- guage models for real-world use cases,

Curran Associates, Inc. LG AI Research. 2024. EXAONE 3.5: Series of Large Language Models for Real-world Use Cases.arXiv preprint. ArXiv:2412.04862 [cs]. Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yu- jia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024a. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods.arXiv preprint. ArXiv:2412.05...

work page arXiv 2024
[2]

GPT-4 Technical Report

Let’s Verify Step by Step. InThe Twelfth International Conference on Learning Representa- tions. 11 Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettle- moyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. InProceedings of the 2023 ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

E Building

Structured Evaluation of Legal Reasoning in LLMs: Chain-of-Thought Prompting and Human Scoring for Retrieval Robustness. Kepu Zhang, Weijie Yu, Sunhao Dai, and Jun Xu. 2025. CitaLaw: Enhancing LLM with Citations in Legal Domain. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 11183–11196, Vienna, Austria. Association for Com...

work page 2025
[4]

Ar gument (Def endant): Plaintiff’ s period of filing lawsuit has e xpir ed

Def endant shall pa y t he plaintiff t he sum of KRW 7 8 , 000 , 000 . Ar gument (Def endant): Plaintiff’ s period of filing lawsuit has e xpir ed. Conclusion: Not sufficient ly pr o v ed, pleading dismissed. Conclusion: Claim accept ed. Ar gument (Plaintiff): The set -off contract is a fraudulent con v e y ance. It shall be canceled and t he def endant s...

work page 2017
[5]

1 . 1 2 . 1 . 1 Co v er ed 2 . 1 . 1 Co v er ed 2 Co v er ed 2 . 1 Co v er ed 2 . 1 Corr ect 2 . 1 . 1 Corr ect 2 . 1 . 1 Corr ect 2 . 1 . 2 Corr ect 2 . 1 . 3 Corr ect 2 . 2 Incorr ect 2 . 1 Corr ect 2 . 2 Incorr ect 2 Incorr ect R oot Incorr ect 2 Incorr ect R oot Incorr ect 2 . 1 . 2 Co v er ed 2 . 1 . 3 Co v er ed 2 . 1 Co v er ed 2 Co v er ed 2 . 2 C...

work page
[6]

3 2.2 Ex a m p l e case - T r a n s l at ed (E n g li sh) ( Judgment ID : 서울남부지방법원- 2020 가단 27 38 96) E x am p le LL M o ut p ut s

1 . 3 2.2 Ex a m p l e case - T r a n s l at ed (E n g li sh) ( Judgment ID : 서울남부지방법원- 2020 가단 27 38 96) E x am p le LL M o ut p ut s

work page 2020
[7]

Refer to Figure 10 for the original version of the data and LLM responses

1 .2 Figure 9: A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, as well as two LLM outputs and their LEGIT scores (bottom), translated into English. Refer to Figure 10 for the original version of the data and LLM responses. 14 ... ### 주요 법적 고려사항 사해행위 (민법 406조) ... 원고의 채권 성립 및 이행 공사 완료, 일부지급을 고려하면, ...

work page
[8]

Ar gumen t (Def end a n t ): 원고의 제척기간이 도과하였다

피고는 원고에게 7 , 800만원[및 지연이자]을 지급하라. Ar gumen t (Def end a n t ): 원고의 제척기간이 도과하였다. Co n c lu s i o n : 피고의 본안전 항변은 받아들 이지 않 는다. Co n c lu s i o n : 원고의 피 보전채권 을 인정할 수 있 다. Ar gument (Plaintiff) : 이 사 건 상계계약 은 다 른 일반채권 자의 이 익 을 행 하는 사 해행위로 , 상계계약의 취소와 그 원상 회복으로 그 상 당액 의 반환 을 구 한다. Co n c lu s i o n 원고의 사 해행위 취소 청구 는 이 유 있어 인용 하고, 나머 지 청구 는 이 유 없어 기각 한다. Ar gu...

work page 2017
[9]

1 . 1 2 . 1 . 1 C o v er ed 2 . 1 . 1 C o v er ed 2 C o v er ed 2 . 1 C o v er ed 2 . 1 C orr ect 2 . 1 . 1 C orr ect 2 . 1 . 1 C orr ect 2 . 1 . 2 C orr ect 2 . 1 . 3 C orr ect 2 . 2 Incorr ect 2 . 1 C orr ect 2 . 2 Incorr ect 2 Incorr ect R oot Incorr ect 2 Incorr ect R oot Incorr ect 2 . 1 . 2 C o v er ed 2 . 1 . 3 C o v er ed 2 . 1 C o v er ed 2 C o v...

work page
[10]

3 2.2 Ex ample case - Original (K or ean) (Judgment ID: 서 울남부 지 방법 원 - 2020 가 단 27 3 8 96) E x a m p le LL M o u t p u t s

1 . 3 2.2 Ex ample case - Original (K or ean) (Judgment ID: 서 울남부 지 방법 원 - 2020 가 단 27 3 8 96) E x a m p le LL M o u t p u t s

work page 2020
[11]

com- pensation for damages ( 손해배상)

1 .2 Figure 10: A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, as well as two LLM outputs and their LEGIT scores (bottom), in Korean. Refer to Figure 9 for the English-translated version of the data and LLM responses. 15 0 5 10 15 20 25 >25 Number of Issues 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . ...

work page
[12]

80 1 . 61 1 . 7 4 6 . 15 2.45 1 .43 1 .45 5 . 32

work page
[13]

30 1 .47 1 .27 5 . 05

work page
[14]

60 1 .44 1 . 56 5 . 59

work page
[15]

65 1 .47 1 . 62 5 . 7 4 2.45 1 . 38 1 .25 5 . 08

work page
[16]

55 1 . 30 1 . 37 5 .22

work page
[17]

35 1 .26 1 . 34 4 . 95 1 . 7 0 1 .26 1 . 31 4 .26 2.20 1 . 09 1 . 05 4 . 34 1 . 60 1 . 10 0 . 99 3 . 69 LEGIT scor e per difficulty le v els Issue Corr ectness Issue Co v erage Final or der Corr ectness E M H Figure 14: Component-wise LEGIT score of four LLMs, divided by difficulty subsets (E: Easy, M: Medium, H: Hard). Individual score components (final ...

work page 1998
[18]

00 0 .25 0

69 2.45 0 . 00 0 .25 0 . 50 0 . 7 5 1 . 00 1 .25 1 . 50 1 . 7 5 2. 00 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemini- 2. 5-Flash 1 . 71 1 .48 1 .47 1 . 50 1 .45 0 . 0 0 . 5 1 . 0 1 . 5 2. 0 2. 5 3 . 0 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemini- 2. 5-Flash

work page
[19]

03 1 . 61 1 . 57 1 . 65 1 . 55 0 2 4 6 8 10 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemini- 2. 5-Flash 6 . 69 5 . 80 5 . 58 5 . 84 5 .45 Final or der corr ectness (/5) Issue co v er age (/2) Issue corr ectness (/3) LEGIT scor e (/10) LEGIT scor es wit h RA G (Gemini- 2. 5-Flash) 0 1 2 3 4 5 +Gr ound trut h +Fine-tuned Contrie v er +Co...

work page
[20]

0 7 1 . 99 1 . 82 0 . 00 0 .25 0 . 50 0 . 7 5 1 . 00 1 .25 1 . 50 1 . 7 5 2. 00 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemma- 3-4B 1 .45 1 . 31 1 .20 1 .27 1 . 19 0 . 0 0 . 5 1 . 0 1 . 5 2. 0 2. 5 3 . 0 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemma- 3-4B 1 .2 4 1 . 13 1 . 12 1 . 16 1 . 00 0 2 4 6 8 10 +Gr ound tr...

work page
[21]

00 0 .25 0

7 4 0 . 00 0 .25 0 . 50 0 . 7 5 1 . 00 1 .25 1 . 50 1 . 7 5 2. 00 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 GPT -4 . 1 1 . 68 1 .46 1 .47 1 . 50 1 .42 0 . 0 0 . 5 1 . 0 1 . 5 2. 0 2. 5 3 . 0 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 GPT -4 . 1

work page
[22]

e v ent s

02 1 . 60 1 . 57 1 . 68 1 . 55 0 2 4 6 8 10 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 GPT -4 . 1 6 . 98 5 . 7 9 5 . 7 9 6 . 11 5 . 71 Final or der corr ectness (/5) Issue co v er age (/2) Issue corr ectness (/3) LEGIT scor e (/10) LEGIT scor es wit h RA G (GPT -4 . 1) Figure 17: Results of RAG for LEGIT dataset, with three different gen...

work page 2048

[1] [1]

Exaone 3.5: Series of large lan- guage models for real-world use cases,

Curran Associates, Inc. LG AI Research. 2024. EXAONE 3.5: Series of Large Language Models for Real-world Use Cases.arXiv preprint. ArXiv:2412.04862 [cs]. Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yu- jia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024a. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods.arXiv preprint. ArXiv:2412.05...

work page arXiv 2024

[2] [2]

GPT-4 Technical Report

Let’s Verify Step by Step. InThe Twelfth International Conference on Learning Representa- tions. 11 Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettle- moyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. InProceedings of the 2023 ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

E Building

Structured Evaluation of Legal Reasoning in LLMs: Chain-of-Thought Prompting and Human Scoring for Retrieval Robustness. Kepu Zhang, Weijie Yu, Sunhao Dai, and Jun Xu. 2025. CitaLaw: Enhancing LLM with Citations in Legal Domain. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 11183–11196, Vienna, Austria. Association for Com...

work page 2025

[4] [4]

Ar gument (Def endant): Plaintiff’ s period of filing lawsuit has e xpir ed

Def endant shall pa y t he plaintiff t he sum of KRW 7 8 , 000 , 000 . Ar gument (Def endant): Plaintiff’ s period of filing lawsuit has e xpir ed. Conclusion: Not sufficient ly pr o v ed, pleading dismissed. Conclusion: Claim accept ed. Ar gument (Plaintiff): The set -off contract is a fraudulent con v e y ance. It shall be canceled and t he def endant s...

work page 2017

[5] [5]

1 . 1 2 . 1 . 1 Co v er ed 2 . 1 . 1 Co v er ed 2 Co v er ed 2 . 1 Co v er ed 2 . 1 Corr ect 2 . 1 . 1 Corr ect 2 . 1 . 1 Corr ect 2 . 1 . 2 Corr ect 2 . 1 . 3 Corr ect 2 . 2 Incorr ect 2 . 1 Corr ect 2 . 2 Incorr ect 2 Incorr ect R oot Incorr ect 2 Incorr ect R oot Incorr ect 2 . 1 . 2 Co v er ed 2 . 1 . 3 Co v er ed 2 . 1 Co v er ed 2 Co v er ed 2 . 2 C...

work page

[6] [6]

3 2.2 Ex a m p l e case - T r a n s l at ed (E n g li sh) ( Judgment ID : 서울남부지방법원- 2020 가단 27 38 96) E x am p le LL M o ut p ut s

1 . 3 2.2 Ex a m p l e case - T r a n s l at ed (E n g li sh) ( Judgment ID : 서울남부지방법원- 2020 가단 27 38 96) E x am p le LL M o ut p ut s

work page 2020

[7] [7]

Refer to Figure 10 for the original version of the data and LLM responses

1 .2 Figure 9: A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, as well as two LLM outputs and their LEGIT scores (bottom), translated into English. Refer to Figure 10 for the original version of the data and LLM responses. 14 ... ### 주요 법적 고려사항 사해행위 (민법 406조) ... 원고의 채권 성립 및 이행 공사 완료, 일부지급을 고려하면, ...

work page

[8] [8]

Ar gumen t (Def end a n t ): 원고의 제척기간이 도과하였다

피고는 원고에게 7 , 800만원[및 지연이자]을 지급하라. Ar gumen t (Def end a n t ): 원고의 제척기간이 도과하였다. Co n c lu s i o n : 피고의 본안전 항변은 받아들 이지 않 는다. Co n c lu s i o n : 원고의 피 보전채권 을 인정할 수 있 다. Ar gument (Plaintiff) : 이 사 건 상계계약 은 다 른 일반채권 자의 이 익 을 행 하는 사 해행위로 , 상계계약의 취소와 그 원상 회복으로 그 상 당액 의 반환 을 구 한다. Co n c lu s i o n 원고의 사 해행위 취소 청구 는 이 유 있어 인용 하고, 나머 지 청구 는 이 유 없어 기각 한다. Ar gu...

work page 2017

[9] [9]

1 . 1 2 . 1 . 1 C o v er ed 2 . 1 . 1 C o v er ed 2 C o v er ed 2 . 1 C o v er ed 2 . 1 C orr ect 2 . 1 . 1 C orr ect 2 . 1 . 1 C orr ect 2 . 1 . 2 C orr ect 2 . 1 . 3 C orr ect 2 . 2 Incorr ect 2 . 1 C orr ect 2 . 2 Incorr ect 2 Incorr ect R oot Incorr ect 2 Incorr ect R oot Incorr ect 2 . 1 . 2 C o v er ed 2 . 1 . 3 C o v er ed 2 . 1 C o v er ed 2 C o v...

work page

[10] [10]

3 2.2 Ex ample case - Original (K or ean) (Judgment ID: 서 울남부 지 방법 원 - 2020 가 단 27 3 8 96) E x a m p le LL M o u t p u t s

1 . 3 2.2 Ex ample case - Original (K or ean) (Judgment ID: 서 울남부 지 방법 원 - 2020 가 단 27 3 8 96) E x a m p le LL M o u t p u t s

work page 2020

[11] [11]

com- pensation for damages ( 손해배상)

1 .2 Figure 10: A detailed example of a LEGIT case (fraudulent conveyance, top), including facts and the legal issue tree, as well as two LLM outputs and their LEGIT scores (bottom), in Korean. Refer to Figure 9 for the English-translated version of the data and LLM responses. 15 0 5 10 15 20 25 >25 Number of Issues 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . ...

work page

[12] [12]

80 1 . 61 1 . 7 4 6 . 15 2.45 1 .43 1 .45 5 . 32

work page

[13] [13]

30 1 .47 1 .27 5 . 05

work page

[14] [14]

60 1 .44 1 . 56 5 . 59

work page

[15] [15]

65 1 .47 1 . 62 5 . 7 4 2.45 1 . 38 1 .25 5 . 08

work page

[16] [16]

55 1 . 30 1 . 37 5 .22

work page

[17] [17]

35 1 .26 1 . 34 4 . 95 1 . 7 0 1 .26 1 . 31 4 .26 2.20 1 . 09 1 . 05 4 . 34 1 . 60 1 . 10 0 . 99 3 . 69 LEGIT scor e per difficulty le v els Issue Corr ectness Issue Co v erage Final or der Corr ectness E M H Figure 14: Component-wise LEGIT score of four LLMs, divided by difficulty subsets (E: Easy, M: Medium, H: Hard). Individual score components (final ...

work page 1998

[18] [18]

00 0 .25 0

69 2.45 0 . 00 0 .25 0 . 50 0 . 7 5 1 . 00 1 .25 1 . 50 1 . 7 5 2. 00 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemini- 2. 5-Flash 1 . 71 1 .48 1 .47 1 . 50 1 .45 0 . 0 0 . 5 1 . 0 1 . 5 2. 0 2. 5 3 . 0 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemini- 2. 5-Flash

work page

[19] [19]

03 1 . 61 1 . 57 1 . 65 1 . 55 0 2 4 6 8 10 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemini- 2. 5-Flash 6 . 69 5 . 80 5 . 58 5 . 84 5 .45 Final or der corr ectness (/5) Issue co v er age (/2) Issue corr ectness (/3) LEGIT scor e (/10) LEGIT scor es wit h RA G (Gemini- 2. 5-Flash) 0 1 2 3 4 5 +Gr ound trut h +Fine-tuned Contrie v er +Co...

work page

[20] [20]

0 7 1 . 99 1 . 82 0 . 00 0 .25 0 . 50 0 . 7 5 1 . 00 1 .25 1 . 50 1 . 7 5 2. 00 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemma- 3-4B 1 .45 1 . 31 1 .20 1 .27 1 . 19 0 . 0 0 . 5 1 . 0 1 . 5 2. 0 2. 5 3 . 0 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 Gemma- 3-4B 1 .2 4 1 . 13 1 . 12 1 . 16 1 . 00 0 2 4 6 8 10 +Gr ound tr...

work page

[21] [21]

00 0 .25 0

7 4 0 . 00 0 .25 0 . 50 0 . 7 5 1 . 00 1 .25 1 . 50 1 . 7 5 2. 00 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 GPT -4 . 1 1 . 68 1 .46 1 .47 1 . 50 1 .42 0 . 0 0 . 5 1 . 0 1 . 5 2. 0 2. 5 3 . 0 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 GPT -4 . 1

work page

[22] [22]

e v ent s

02 1 . 60 1 . 57 1 . 68 1 . 55 0 2 4 6 8 10 +Gr ound trut h +Fine-tuned Contrie v er +Contrie v er +BM25 GPT -4 . 1 6 . 98 5 . 7 9 5 . 7 9 6 . 11 5 . 71 Final or der corr ectness (/5) Issue co v er age (/2) Issue corr ectness (/3) LEGIT scor e (/10) LEGIT scor es wit h RA G (GPT -4 . 1) Figure 17: Results of RAG for LEGIT dataset, with three different gen...

work page 2048