TriBench-Ko: Evaluating LLM Risks in Judicial Workflows

Dogyoon Lim; Eun-Ju Lee; Gyubin Choi; Haesung Lee; So-Min Lee; Sung-Kyoung Jang; Yohan Jo; Youkang Ko

arxiv: 2605.03792 · v1 · submitted 2026-05-05 · 💻 cs.CL

Haesung Lee, Gyubin Choi, Eun-Ju Lee, So-Min Lee, Youkang Ko, Dogyoon Lim, Sung-Kyoung Jang, Yohan Jo

Pith reviewed 2026-05-07 03:11 UTC · model grok-4.3

keywords: judicial, llms, risks, legal, performance, tribench-ko, address, capture

The pith

The TriBench-Ko benchmark reveals that many contemporary LLMs exhibit significant risks in judicial tasks, especially in retrieving precedents and capturing critical legal information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TriBench-Ko is a new public test set built from real Korean court decisions. It asks LLMs to summarize jurisprudence, retrieve relevant precedents, extract legal issues, and analyze evidence. At the same time it scores each answer for problems like making up facts, showing demographic bias, changing answers with small prompt changes, or giving legal advice beyond its role. When the authors ran several current LLMs on the benchmark, most models struggled badly with finding the right past cases and often omitted key facts from the documents they were given. The benchmark therefore flags areas where LLM outputs in legal settings need extra human review.

Core claim

Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information.

Load-bearing premise

The four tasks and risk categories, constructed from verified judicial decisions, accurately represent the performance and deployment risks that arise in day-to-day judicial workflows.

Original abstract

Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko
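The pairing of tasks and risk types described in the abstract can be sketched as a small data structure. This is only an illustrative layout, assuming one task and one risk type per item; the names `Task`, `RISK_TAXONOMY`, `BenchmarkItem`, and the `demo-001` identifier are hypothetical and are not taken from the released dataset or code.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """The four core tasks named in the abstract."""
    JURISPRUDENCE_SUMMARIZATION = "jurisprudence summarization"
    PRECEDENT_RETRIEVAL = "precedent retrieval"
    LEGAL_ISSUE_EXTRACTION = "legal issue extraction"
    EVIDENCE_ANALYSIS = "evidence analysis"


# Risk categories and sub-types as listed in the abstract.
# The dictionary layout itself is an assumption made for illustration.
RISK_TAXONOMY = {
    "inaccuracy": ("hallucination", "omission", "statutory misapplication"),
    "biases": ("demographic", "overcompliance"),
    "inconsistencies": ("prompt sensitivity", "non-determinism"),
    "adjudicative overreach": (),
}


@dataclass(frozen=True)
class BenchmarkItem:
    """Hypothetical item layout: one task paired with one risk type."""
    item_id: str
    task: Task
    risk_category: str
    risk_type: str  # equals risk_category when the category has no sub-types

    def __post_init__(self):
        subtypes = RISK_TAXONOMY.get(self.risk_category)
        if subtypes is None:
            raise ValueError(f"unknown risk category: {self.risk_category}")
        allowed = subtypes or (self.risk_category,)
        if self.risk_type not in allowed:
            raise ValueError(
                f"{self.risk_type!r} is not a sub-type of {self.risk_category!r}"
            )


# Made-up example item; the identifier does not exist in the dataset.
item = BenchmarkItem(
    "demo-001", Task.PRECEDENT_RETRIEVAL, "inaccuracy", "hallucination"
)
```

Validating the risk type against the taxonomy at construction time keeps mislabelled items from entering a per-risk breakdown silently.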

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TriBench-Ko adds a needed Korean judicial risk benchmark but the abstract leaves key construction details thin.

TriBench-Ko is a practical new benchmark for LLM risks in Korean judicial work, though its methods need more detail. The paper releases a dataset built around four tasks drawn from real decisions: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It ties each to specific risk types such as inaccuracy, biases, inconsistencies, and overreach. That pairing is a clear step past generic legal proxies like bar exams. The authors test several current models, flag consistent problems with precedent retrieval and missing critical information, and make the data and code public. Those elements give the work concrete value for anyone testing LLMs in legal settings, especially non-English ones. The release itself is the strongest part.

The main gap is in the construction details. The abstract does not report item counts, case sampling rules, inter-annotator agreement, or exact scoring rubrics. Without those, it is hard to judge whether the observed failure rates would hold for routine judicial work or whether the selected decisions over-represent difficult cases. The stress-test point on ecological validity lands: the tasks may not fully proxy day-to-day workflows if the risk labels were assigned without sitting judges.

This paper is aimed at researchers and developers working on legal AI applications. It deserves peer review because the benchmark release is a tangible contribution that others can use and extend, even if the current write-up needs a fuller methods section and some external validation. I would send it to referees and request the missing numbers and any judge feedback on the items.

Referee Report

3 major / 3 minor

Summary. The paper introduces TriBench-Ko, a publicly released Korean benchmark for evaluating LLM deployment risks in judicial workflows. It defines four tasks—jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis—each constructed from verified real-world judicial decisions and jointly scored for task performance plus risk categories (inaccuracy including hallucination/omission/statutory misapplication, biases, inconsistencies, and adjudicative overreach). Evaluation of multiple contemporary LLMs shows frequent significant risks, especially failures in precedent retrieval and capture of critical legal information, with a diagnosis of failure modes and a call for caution in judicial use.

Significance. If the benchmark items are representative of routine judicial work, the release provides a needed non-English, task-grounded alternative to proxy evaluations such as bar-exam or classification benchmarks. Public dataset and code release supports reproducibility. The central empirical claim—that LLMs exhibit systematic, high-stakes failures in precedent retrieval and information completeness—would be a useful signal for legal-AI safety research if the construction and scoring procedures are shown to be reliable and ecologically valid.

major comments (3)

§3 (Benchmark Construction): The description of how the four tasks and risk labels were derived from verified judicial decisions omits the sampling frame, total number of cases, inclusion/exclusion criteria, expert annotation protocol, and any measure of inter-annotator agreement. These details are load-bearing for the claim that the benchmark “accurately represent[s] the performance and deployment risks that arise in day-to-day judicial workflows.”
§4 (Evaluation): Results are reported without per-task sample sizes, statistical significance tests, confidence intervals, or variance estimates across prompts or runs. The abstract’s assertion that “many models frequently manifest significant risks” therefore cannot be assessed for robustness or generalizability.
§3.2–3.3 (Risk Taxonomy and Scoring): No explicit scoring rubric, decision criteria, or example annotations are provided for mapping model outputs to the four risk categories (e.g., how omission vs. hallucination is distinguished in jurisprudence summarization or how overreach is quantified in evidence analysis). This prevents independent verification of the joint task-plus-risk evaluation.
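Two of the statistics requested in the major comments, inter-annotator agreement and a confidence interval on a risk rate, can be sketched in a few lines. This is a generic illustration that assumes two annotators labelling the same items and binary per-item risk flags; the function names `cohens_kappa` and `bootstrap_rate_ci` are hypothetical, and nothing here reflects the authors' actual annotation or scoring procedure.

```python
import random
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def bootstrap_rate_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 risk flags.

    Returns (point_estimate, lower, upper).
    """
    if not flags:
        raise ValueError("flags must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(flags)
    # Resample items with replacement and record each resampled rate.
    means = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(flags) / n, lower, upper


# Made-up demo data, not results from the paper.
demo_flags = [1] * 30 + [0] * 70
rate, lower, upper = bootstrap_rate_ci(demo_flags)
```

Reporting an interval such as `(lower, upper)` next to each per-task rate would let readers judge whether differences between models exceed resampling noise; the percentile bootstrap is one simple choice among several.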

minor comments (3)

Abstract: The phrase “a range of contemporary LLMs” is vague; naming the models and giving headline accuracy or risk rates would improve immediate readability.
Related Work: Prior Korean or multilingual legal benchmarks are cited only lightly; a short comparison table would clarify TriBench-Ko’s incremental contribution.
§2 (Notation): The risk categories are introduced with overlapping examples (e.g., “inaccuracy” includes both hallucination and statutory misapplication); a concise taxonomy table would reduce ambiguity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work is an empirical benchmark construction and evaluation with no mathematical modeling or new theoretical constructs.

pith-pipeline@v0.9.0 · 5544 in / 1077 out tokens · 42310 ms · 2026-05-07T03:11:27.772956+00:00 · methodology
