TriBench-Ko: Evaluating LLM Risks in Judicial Workflows
Pith reviewed 2026-05-07 03:11 UTC · model grok-4.3
The pith
TriBench-Ko benchmark reveals that many contemporary LLMs exhibit significant risks in judicial tasks, especially precedent retrieval and capturing critical legal information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information.
Load-bearing premise
The four tasks and risk categories, constructed from verified judicial decisions, accurately represent the performance and deployment risks that arise in day-to-day judicial workflows.
Original abstract
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this, we publicly release TriBench-Ko, a Korean benchmark designed to evaluate potential deployment risks of LLMs within the context of verified judicial task requirements. It covers four core tasks: jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis. It jointly assesses model behavior across multiple deployment risk categories, including inaccuracy (hallucination, omission, statutory misapplication), biases (demographic, overcompliance), inconsistencies (prompt sensitivity, non-determinism), and adjudicative overreach. Each item is structured to systematically assess both task performance and a specific risk type based on real judicial decisions. Our evaluation of a range of contemporary LLMs reveals that many models frequently manifest significant risks, most notably struggling with precedent retrieval and failing to capture critical legal information. We provide a comprehensive diagnosis of these LLMs and pinpoint critical areas where LLM-generated outputs in judicial contexts necessitate rigorous inspection and caution. Our dataset and code are available at https://github.com/holi-lab/TriBench-Ko
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TriBench-Ko, a publicly released Korean benchmark for evaluating LLM deployment risks in judicial workflows. It defines four tasks (jurisprudence summarization, precedent retrieval, legal issue extraction, and evidence analysis), each constructed from verified real-world judicial decisions and jointly scored for task performance and for risk categories: inaccuracy (hallucination, omission, statutory misapplication), biases, inconsistencies, and adjudicative overreach. Evaluation of multiple contemporary LLMs shows frequent significant risks, especially failures in precedent retrieval and in capturing critical legal information. The paper also diagnoses failure modes and calls for caution in judicial use.
Significance. If the benchmark items are representative of routine judicial work, the release provides a needed non-English, task-grounded alternative to proxy evaluations such as bar-exam or classification benchmarks. Public dataset and code release supports reproducibility. The central empirical claim—that LLMs exhibit systematic, high-stakes failures in precedent retrieval and information completeness—would be a useful signal for legal-AI safety research if the construction and scoring procedures are shown to be reliable and ecologically valid.
major comments (3)
- [§3] §3 (Benchmark Construction): The description of how the four tasks and risk labels were derived from verified judicial decisions omits the sampling frame, total number of cases, inclusion/exclusion criteria, expert annotation protocol, and any measure of inter-annotator agreement. These details are load-bearing for the claim that the benchmark “accurately represent[s] the performance and deployment risks that arise in day-to-day judicial workflows.”
- [§4] §4 (Evaluation): Results are reported without per-task sample sizes, statistical significance tests, confidence intervals, or variance estimates across prompts or runs. The abstract’s assertion that “many models frequently manifest significant risks” therefore cannot be assessed for robustness or generalizability.
- [§3.2–3.3] §3.2–3.3 (Risk Taxonomy and Scoring): No explicit scoring rubric, decision criteria, or example annotations are provided for mapping model outputs to the four risk categories (e.g., how omission vs. hallucination is distinguished in jurisprudence summarization or how overreach is quantified in evidence analysis). This prevents independent verification of the joint task-plus-risk evaluation.
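The reliability statistics requested in the major comments need little machinery. The following is a minimal sketch, not the authors' procedure: it assumes two annotators' risk labels arrive as parallel lists and that each model output is scored as a binary risk flag. The function names `cohen_kappa` and `bootstrap_ci`, the label strings, and the sample data are all hypothetical and do not come from the paper or its repository.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample items with replacement and record each resampled rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical annotations for four summarization outputs.
    annotator_1 = ["omission", "hallucination", "omission", "none"]
    annotator_2 = ["omission", "omission", "omission", "none"]
    print("kappa:", cohen_kappa(annotator_1, annotator_2))

    # Hypothetical risk flags: 1 = the output exhibited the risk.
    flags = [1] * 30 + [0] * 70
    print("95% CI for risk rate:", bootstrap_ci(flags))
```

Reporting an agreement coefficient for the annotation step and an interval per task and model would let readers judge whether differences between models exceed annotation and sampling noise. A percentile bootstrap is only one option; with small per-task sample sizes, an exact binomial interval may be preferable.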
minor comments (3)
- [Abstract] Abstract: The phrase “a range of contemporary LLMs” is vague; naming the models and giving headline accuracy or risk rates would improve immediate readability.
- [Related Work] Related Work: Prior Korean or multilingual legal benchmarks are cited only lightly; a short comparison table would clarify TriBench-Ko’s incremental contribution.
- [§2] Notation: The risk categories are introduced with overlapping examples (e.g., “inaccuracy” includes both hallucination and statutory misapplication); a concise taxonomy table would reduce ambiguity.
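As an illustration of the taxonomy table suggested in the last minor comment, the categories and subtypes named in the abstract can be written as one mapping and checked mechanically for overlap. This is a sketch under stated assumptions: the dictionary layout and the helper names `taxonomy_rows` and `find_overlaps` are hypothetical, and only the category and subtype names come from the abstract.

```python
# Categories and subtypes as listed in the abstract. The dict layout is an
# illustrative assumption, not the paper's data format.
RISK_TAXONOMY = {
    "inaccuracy": ["hallucination", "omission", "statutory misapplication"],
    "biases": ["demographic", "overcompliance"],
    "inconsistencies": ["prompt sensitivity", "non-determinism"],
    "adjudicative overreach": [],  # the abstract lists no subtypes here
}


def taxonomy_rows(taxonomy):
    """Flatten the taxonomy into (category, subtype) rows for a table."""
    rows = []
    for category, subtypes in taxonomy.items():
        if not subtypes:
            rows.append((category, "(none listed)"))
        for subtype in subtypes:
            rows.append((category, subtype))
    return rows


def find_overlaps(taxonomy):
    """Return subtypes that are filed under more than one category."""
    seen = {}
    for category, subtypes in taxonomy.items():
        for subtype in subtypes:
            seen.setdefault(subtype, []).append(category)
    return {s: cats for s, cats in seen.items() if len(cats) > 1}


if __name__ == "__main__":
    for category, subtype in taxonomy_rows(RISK_TAXONOMY):
        print(f"{category:<24}{subtype}")
    print("overlapping subtypes:", find_overlaps(RISK_TAXONOMY))
```

Written this way, each subtype sits under exactly one category, so a scoring rubric can refer to it unambiguously.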