CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval
Pith reviewed 2026-05-16 11:19 UTC · model grok-4.3
The pith
The CaseFacts benchmark shows that LLMs struggle to verify colloquial legal claims against Supreme Court precedents and that open web search makes accuracy worse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CaseFacts supplies 6,294 colloquial claims synthesized from Supreme Court case summaries and labeled Supported, Refuted, or Overruled by an LLM pipeline that uses semantic similarity to detect overrulings. State-of-the-art LLMs find the verification task difficult, and augmenting them with open web retrieval degrades performance relative to closed-book baselines due to retrieval of noisy, non-authoritative precedents.
What carries the argument
The multi-stage LLM pipeline that synthesizes colloquial claims from expert summaries and applies a semantic similarity heuristic to identify and label complex legal overrulings.
If this is right
- Legal verification systems perform better when they stay within authoritative closed sources rather than relying on open web retrieval.
- Any effective legal fact-checker must explicitly track the temporal validity of precedents because later rulings can overrule earlier ones.
- Benchmarks that force models to bridge everyday language and technical jurisprudence are needed to advance reliable legal AI.
Where Pith is reading between the lines
- The same synthesis-plus-heuristic approach could be adapted to create benchmarks in other domains where facts evolve, such as medical guidelines.
- Specialized retrieval limited to official legal databases might overcome the noise problem that open web search introduces here.
- Hybrid systems that combine LLM reasoning with structured legal databases could be tested directly on this benchmark to measure gains.
Load-bearing premise
The LLM-generated claims from expert summaries closely match the way ordinary people would actually state legal assertions, and the semantic similarity heuristic correctly produces Supported, Refuted, or Overruled labels with few errors.
What would settle it
A legal expert review of a random sample of several hundred labeled claims that measures agreement between the dataset labels and the actual current status of the cited precedents.
Original abstract
Automated Fact-Checking has largely focused on verifying general knowledge against static corpora, overlooking high-stakes domains like law where truth is evolving and technically complex. We introduce CaseFacts, a benchmark for verifying colloquial legal claims against U.S. Supreme Court precedents. Unlike existing resources that map formal texts to formal texts, CaseFacts challenges systems to bridge the semantic gap between layperson assertions and technical jurisprudence while accounting for temporal validity. The dataset consists of 6,294 claims categorized as Supported, Refuted, or Overruled. We construct this benchmark using a multi-stage pipeline that leverages Large Language Models (LLMs) to synthesize claims from expert case summaries, employing a novel semantic similarity heuristic to efficiently identify and verify complex legal overrulings. Experiments with state-of-the-art LLMs reveal that the task remains challenging; notably, augmenting models with unrestricted web search degrades performance compared to closed-book baselines due to the retrieval of noisy, non-authoritative precedents. We release CaseFacts to spur research into legal fact verification systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CaseFacts, a benchmark of 6,294 colloquial legal claims about U.S. Supreme Court precedents labeled Supported, Refuted, or Overruled. Claims are synthesized from expert summaries via a multi-stage LLM pipeline, with a novel semantic similarity heuristic used to detect and label overrulings. Experiments on state-of-the-art LLMs show the task is challenging and that unrestricted web search augmentation degrades performance relative to closed-book baselines, attributed to retrieval of noisy, non-authoritative precedents.
Significance. If the labels prove reliable, CaseFacts would fill a gap in legal-domain fact-checking benchmarks by emphasizing both the semantic gap between lay claims and technical jurisprudence and the temporal validity of precedents. The empirical finding that web retrieval harms performance could inform retrieval strategies in high-stakes domains, and releasing the dataset supports further research.
major comments (3)
- [Dataset Construction] Dataset construction pipeline (multi-stage LLM synthesis + semantic similarity heuristic): no human validation, inter-annotator agreement, or error analysis is reported for the Overruled labels or the overall Supported/Refuted/Overruled distribution. This is load-bearing because the benchmark's utility and all downstream experimental claims rest on label correctness.
- [Experiments] Experiments section: performance measurement details are limited (e.g., exact prompting, handling of temporal validity, and whether classification is strict three-way or allows partial credit). Without these, it is hard to interpret the reported degradation from web search or to reproduce the closed-book vs. augmented comparison.
- [§4.2] Web-augmentation results: the claim that unrestricted search degrades performance is central, yet the manuscript provides no details on query formulation, result ranking, or filtering of non-authoritative sources. This leaves open whether the observed drop is due to noise or to an uncontrolled experimental variable.
minor comments (2)
- [Abstract] The abstract states the dataset size but omits the breakdown across Supported, Refuted, and Overruled classes; adding this would help readers assess class balance.
- [Introduction] Notation for the semantic similarity threshold and how Overruled is distinguished from Refuted could be introduced earlier and used consistently.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on CaseFacts. We address each major comment below and commit to revisions that improve reproducibility and label transparency without altering the core findings.
Point-by-point responses
Referee: [Dataset Construction] Dataset construction pipeline (multi-stage LLM synthesis + semantic similarity heuristic): no human validation, inter-annotator agreement, or error analysis is reported for the Overruled labels or the overall Supported/Refuted/Overruled distribution. This is load-bearing because the benchmark's utility and all downstream experimental claims rest on label correctness.
Authors: We acknowledge that full human validation was not performed at scale due to the dataset size and resource limits. The pipeline begins with expert case summaries and uses the semantic similarity heuristic to detect overrulings via citation overlap and embedding similarity thresholds calibrated on known overruling pairs. In the revised version we will add a manual error analysis on a stratified sample of 300 claims (100 per label), with two legal experts providing independent annotations and reporting Cohen's kappa for the Overruled category. This will quantify label reliability while preserving the automated construction approach. revision: yes
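To make the described heuristic concrete, here is a minimal sketch (not the authors' code) of the kind of overruling-candidate filter the response outlines: pair claims from different cases, keep pairs with high embedding similarity and a citation link between the cases, and send only those pairs to the LLM for the overruling judgment. The embedding source, the 0.85 threshold, and all field names are placeholders.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def overruling_candidates(claims, embeddings, citations, sim_threshold=0.85):
    """Return claim pairs that are plausible overruling candidates.

    claims:        list of (claim_id, case_id, text) tuples
    embeddings:    dict claim_id -> np.ndarray (precomputed; source unspecified)
    citations:     dict case_id -> set of cited case_ids
    sim_threshold: illustrative value; the paper treats this as a free parameter
    """
    candidates = []
    for i, (id_a, case_a, _) in enumerate(claims):
        for id_b, case_b, _ in claims[i + 1:]:
            if case_a == case_b:
                continue  # overrulings only arise across different cases
            linked = (case_b in citations.get(case_a, set())
                      or case_a in citations.get(case_b, set()))
            if not linked:
                continue
            sim = cosine(embeddings[id_a], embeddings[id_b])
            if sim >= sim_threshold:
                candidates.append((id_a, id_b, sim))
    # Only these candidate pairs would reach the LLM prompt that decides
    # case1_overruled / case2_overruled / consistent, using the ruling dates.
    return candidates
```

The point of such a filter is efficiency: only a small fraction of claim pairs ever reaches the expensive LLM overruling prompt, which is what the paper means by identifying overrulings "efficiently".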
Referee: [Experiments] Experiments section: performance measurement details are limited (e.g., exact prompting, handling of temporal validity, and whether classification is strict three-way or allows partial credit). Without these, it is hard to interpret the reported degradation from web search or to reproduce the closed-book vs. augmented comparison.
Authors: We agree that these details are essential. Classification is performed as a strict three-way choice with no partial credit; temporal validity is enforced by masking any precedent decided after the claim's reference date. In the revision we will include the exact system and user prompts for each model, the temperature and decoding settings, and a new subsection on temporal handling. We will also release the full evaluation scripts to enable exact reproduction of the closed-book versus web-augmented comparisons. revision: yes
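A short sketch of the evaluation protocol as the response describes it: strict three-way scoring with no partial credit, and masking of any precedent decided after the claim's reference date. This is an illustration under stated assumptions, not the released evaluation script; field names such as decision_date are assumed.

```python
from datetime import date

LABELS = {"Supported", "Refuted", "Overruled"}


def mask_future_precedents(precedents, reference_date: date):
    """Drop any precedent decided after the claim's reference date, so a model
    is never credited for citing a ruling that post-dates the claim."""
    return [p for p in precedents if p["decision_date"] <= reference_date]


def strict_accuracy(predictions, gold):
    """Strict three-way scoring: a prediction counts only on an exact label
    match; malformed or out-of-vocabulary outputs count as wrong."""
    correct = sum(1 for p, g in zip(predictions, gold) if p in LABELS and p == g)
    return correct / len(gold)
```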
Referee: [§4.2] Web-augmentation results: the claim that unrestricted search degrades performance is central, yet the manuscript provides no details on query formulation, result ranking, or filtering of non-authoritative sources. This leaves open whether the observed drop is due to noise or to an uncontrolled experimental variable.
Authors: The web-augmented setting used the raw claim text as the search query against a standard web API, retrieving the top-5 results ranked by the API's relevance score and concatenating them verbatim to the prompt with no source filtering. This design intentionally tests unrestricted retrieval. The revision will add an explicit paragraph in §4.2 describing the query template, ranking method, and absence of authority filters, together with an ablation that substitutes only official court documents to isolate the effect of noisy precedents. revision: yes
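The unrestricted web-augmented setting, as described, fits in a few lines: the raw claim is the query, the top-5 results are concatenated verbatim into the prompt, and no authority filter is applied. The search_api callable and the prompt wording below are placeholders, not a real search client or the authors' prompt.

```python
def build_augmented_prompt(claim: str, search_api, k: int = 5) -> str:
    """Compose a web-augmented prompt: raw claim as the query, top-k results
    (ranked by the API's own relevance score) concatenated verbatim, with no
    filtering for authoritative sources."""
    results = search_api(query=claim)[:k]
    evidence = "\n\n".join(r["snippet"] for r in results)  # no authority filter
    return (
        "You are a legal expert. Using the web results below, decide whether the "
        "claim is Supported, Refuted, or Overruled by U.S. Supreme Court precedent.\n\n"
        f"Web results:\n{evidence}\n\n"
        f"Claim: {claim}\n"
        "Verdict:"
    )
```

Restricting results to official court documents, as the planned ablation does, would isolate whether the observed degradation really comes from non-authoritative sources rather than from the retrieval setup itself.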
Circularity Check
No significant circularity detected
full rationale
The paper constructs CaseFacts via a multi-stage LLM pipeline that synthesizes claims from external expert case summaries and applies a semantic similarity heuristic for labeling Supported/Refuted/Overruled. No equations, fitted parameters, or derivations are present that reduce by construction to the inputs. Experimental claims about LLM performance (including web-search degradation) are presented as empirical observations rather than derived results. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The benchmark is grounded in external case data and evaluated with standard protocols.
Axiom & Free-Parameter Ledger
free parameters (1)
- semantic similarity threshold
axioms (2)
- domain assumption: Large language models can synthesize accurate colloquial claims from expert case summaries without introducing significant factual distortions.
- ad hoc to paper: The semantic similarity heuristic reliably identifies and verifies complex legal overrulings.
Forward citations
Cited by 1 Pith paper
- Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge. CAR is a new retrieval objective that targets the currently active authority set rather than most-similar documents, with theorems on coverage conditions and evaluations showing two-stage methods outperform dense retr...
discussion (0)