CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval
Pith reviewed 2026-05-16 11:19 UTC · model grok-4.3
The pith
The CaseFacts benchmark shows that LLMs struggle to verify colloquial legal claims against Supreme Court precedents and that open web search makes accuracy worse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CaseFacts supplies 6,294 colloquial claims synthesized from Supreme Court case summaries and labeled Supported, Refuted, or Overruled by an LLM pipeline that uses semantic similarity to detect overrulings. State-of-the-art LLMs find the verification task difficult, and augmenting them with open web retrieval degrades performance relative to closed-book baselines due to retrieval of noisy, non-authoritative precedents.
What carries the argument
The multi-stage LLM pipeline that synthesizes colloquial claims from expert summaries and applies a semantic similarity heuristic to identify and label complex legal overrulings.
If this is right
- Legal verification systems perform better when they stay within authoritative closed sources rather than relying on open web retrieval.
- Any effective legal fact-checker must explicitly track the temporal validity of precedents because later rulings can overrule earlier ones.
- Benchmarks that force models to bridge everyday language and technical jurisprudence are needed to advance reliable legal AI.
Where Pith is reading between the lines
- The same synthesis-plus-heuristic approach could be adapted to create benchmarks in other domains where facts evolve, such as medical guidelines.
- Specialized retrieval limited to official legal databases might overcome the noise problem that open web search introduces here.
- Hybrid systems that combine LLM reasoning with structured legal databases could be tested directly on this benchmark to measure gains.
Load-bearing premise
The LLM-generated claims from expert summaries closely match the way ordinary people would actually state legal assertions, and the semantic similarity heuristic correctly produces Supported, Refuted, or Overruled labels with few errors.
What would settle it
A legal expert review of a random sample of several hundred labeled claims that measures agreement between the dataset labels and the actual current status of the cited precedents.
Original abstract
Automated Fact-Checking has largely focused on verifying general knowledge against static corpora, overlooking high-stakes domains like law where truth is evolving and technically complex. We introduce CaseFacts, a benchmark for verifying colloquial legal claims against U.S. Supreme Court precedents. Unlike existing resources that map formal texts to formal texts, CaseFacts challenges systems to bridge the semantic gap between layperson assertions and technical jurisprudence while accounting for temporal validity. The dataset consists of 6,294 claims categorized as Supported, Refuted, or Overruled. We construct this benchmark using a multi-stage pipeline that leverages Large Language Models (LLMs) to synthesize claims from expert case summaries, employing a novel semantic similarity heuristic to efficiently identify and verify complex legal overrulings. Experiments with state-of-the-art LLMs reveal that the task remains challenging; notably, augmenting models with unrestricted web search degrades performance compared to closed-book baselines due to the retrieval of noisy, non-authoritative precedents. We release CaseFacts to spur research into legal fact verification systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CaseFacts, a benchmark of 6,294 colloquial legal claims about U.S. Supreme Court precedents labeled Supported, Refuted, or Overruled. Claims are synthesized from expert summaries via a multi-stage LLM pipeline, with a novel semantic similarity heuristic used to detect and label overrulings. Experiments on state-of-the-art LLMs show the task is challenging and that unrestricted web search augmentation degrades performance relative to closed-book baselines, attributed to retrieval of noisy, non-authoritative precedents.
Significance. If the labels prove reliable, CaseFacts would fill a gap in legal-domain fact-checking benchmarks by emphasizing both the semantic gap between lay claims and technical jurisprudence and the temporal validity of precedents. The empirical finding that web retrieval harms performance could inform retrieval strategies in high-stakes domains, and releasing the dataset supports further research.
major comments (3)
- [Dataset Construction] Dataset construction pipeline (multi-stage LLM synthesis + semantic similarity heuristic): no human validation, inter-annotator agreement, or error analysis is reported for the Overruled labels or the overall Supported/Refuted/Overruled distribution. This is load-bearing because the benchmark's utility and all downstream experimental claims rest on label correctness.
- [Experiments] Experiments section: performance measurement details are limited (e.g., exact prompting, handling of temporal validity, and whether classification is strict three-way or allows partial credit). Without these, it is hard to interpret the reported degradation from web search or to reproduce the closed-book vs. augmented comparison.
- [§4.2] Web-augmentation results: the claim that unrestricted search degrades performance is central, yet the manuscript provides no details on query formulation, result ranking, or filtering of non-authoritative sources. This leaves open whether the observed drop is due to noise or to an uncontrolled experimental variable.
minor comments (2)
- [Abstract] The abstract states the dataset size but omits the breakdown across Supported, Refuted, and Overruled classes; adding this would help readers assess class balance.
- [Introduction] Notation for the semantic similarity threshold and how Overruled is distinguished from Refuted could be introduced earlier and used consistently.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on CaseFacts. We address each major comment below and commit to revisions that improve reproducibility and label transparency without altering the core findings.
Point-by-point responses
Referee: [Dataset Construction] Dataset construction pipeline (multi-stage LLM synthesis + semantic similarity heuristic): no human validation, inter-annotator agreement, or error analysis is reported for the Overruled labels or the overall Supported/Refuted/Overruled distribution. This is load-bearing because the benchmark's utility and all downstream experimental claims rest on label correctness.
Authors: We acknowledge that full human validation was not performed at scale due to the dataset size and resource limits. The pipeline begins with expert case summaries and uses the semantic similarity heuristic to detect overrulings via citation overlap and embedding similarity thresholds calibrated on known overruling pairs. In the revised version we will add a manual error analysis on a stratified sample of 300 claims (100 per label), with two legal experts providing independent annotations and reporting Cohen's kappa for the Overruled category. This will quantify label reliability while preserving the automated construction approach. revision: yes
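To make the described heuristic concrete, here is a minimal sketch (not the authors' code) of the kind of overruling-candidate filter the response outlines: pair claims from different cases, keep pairs with high embedding similarity and a citation link between the cases, and send only those pairs to the LLM for the overruling judgment. The embedding source, the 0.85 threshold, and all field names are placeholders.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def overruling_candidates(claims, embeddings, citations, sim_threshold=0.85):
    """Return claim pairs that are plausible overruling candidates.

    claims:        list of (claim_id, case_id, text) tuples
    embeddings:    dict claim_id -> np.ndarray (precomputed; source unspecified)
    citations:     dict case_id -> set of cited case_ids
    sim_threshold: illustrative value; the paper treats this as a free parameter
    """
    candidates = []
    for i, (id_a, case_a, _) in enumerate(claims):
        for id_b, case_b, _ in claims[i + 1:]:
            if case_a == case_b:
                continue  # overrulings only arise across different cases
            linked = (case_b in citations.get(case_a, set())
                      or case_a in citations.get(case_b, set()))
            if not linked:
                continue
            sim = cosine(embeddings[id_a], embeddings[id_b])
            if sim >= sim_threshold:
                candidates.append((id_a, id_b, sim))
    # Only these candidate pairs would reach the LLM prompt that decides
    # case1_overruled / case2_overruled / consistent, using the ruling dates.
    return candidates
```

The point of such a filter is efficiency: only a small fraction of claim pairs ever reaches the expensive LLM overruling prompt, which is what the paper means by identifying overrulings "efficiently".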
Referee: [Experiments] Experiments section: performance measurement details are limited (e.g., exact prompting, handling of temporal validity, and whether classification is strict three-way or allows partial credit). Without these, it is hard to interpret the reported degradation from web search or to reproduce the closed-book vs. augmented comparison.
Authors: We agree that these details are essential. Classification is performed as a strict three-way choice with no partial credit; temporal validity is enforced by masking any precedent decided after the claim's reference date. In the revision we will include the exact system and user prompts for each model, the temperature and decoding settings, and a new subsection on temporal handling. We will also release the full evaluation scripts to enable exact reproduction of the closed-book versus web-augmented comparisons. revision: yes
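A short sketch of the evaluation protocol as the response describes it: strict three-way scoring with no partial credit, and masking of any precedent decided after the claim's reference date. This is an illustration under stated assumptions, not the released evaluation script; field names such as decision_date are assumed.

```python
from datetime import date

LABELS = {"Supported", "Refuted", "Overruled"}


def mask_future_precedents(precedents, reference_date: date):
    """Drop any precedent decided after the claim's reference date, so a model
    is never credited for citing a ruling that post-dates the claim."""
    return [p for p in precedents if p["decision_date"] <= reference_date]


def strict_accuracy(predictions, gold):
    """Strict three-way scoring: a prediction counts only on an exact label
    match; malformed or out-of-vocabulary outputs count as wrong."""
    correct = sum(1 for p, g in zip(predictions, gold) if p in LABELS and p == g)
    return correct / len(gold)
```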
Referee: [§4.2] Web-augmentation results: the claim that unrestricted search degrades performance is central, yet the manuscript provides no details on query formulation, result ranking, or filtering of non-authoritative sources. This leaves open whether the observed drop is due to noise or to an uncontrolled experimental variable.
Authors: The web-augmented setting used the raw claim text as the search query against a standard web API, retrieving the top-5 results ranked by the API's relevance score and concatenating them verbatim to the prompt with no source filtering. This design intentionally tests unrestricted retrieval. The revision will add an explicit paragraph in §4.2 describing the query template, ranking method, and absence of authority filters, together with an ablation that substitutes only official court documents to isolate the effect of noisy precedents. revision: yes
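The unrestricted web-augmented setting, as described, fits in a few lines: the raw claim is the query, the top-5 results are concatenated verbatim into the prompt, and no authority filter is applied. The search_api callable and the prompt wording below are placeholders, not a real search client or the authors' prompt.

```python
def build_augmented_prompt(claim: str, search_api, k: int = 5) -> str:
    """Compose a web-augmented prompt: raw claim as the query, top-k results
    (ranked by the API's own relevance score) concatenated verbatim, with no
    filtering for authoritative sources."""
    results = search_api(query=claim)[:k]
    evidence = "\n\n".join(r["snippet"] for r in results)  # no authority filter
    return (
        "You are a legal expert. Using the web results below, decide whether the "
        "claim is Supported, Refuted, or Overruled by U.S. Supreme Court precedent.\n\n"
        f"Web results:\n{evidence}\n\n"
        f"Claim: {claim}\n"
        "Verdict:"
    )
```

Restricting results to official court documents, as the planned ablation does, would isolate whether the observed degradation really comes from non-authoritative sources rather than from the retrieval setup itself.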
Circularity Check
No significant circularity detected
full rationale
The paper constructs CaseFacts via a multi-stage LLM pipeline that synthesizes claims from external expert case summaries and applies a semantic similarity heuristic for labeling Supported/Refuted/Overruled. No equations, fitted parameters, or derivations are present that reduce by construction to the inputs. Experimental claims about LLM performance (including web-search degradation) are presented as empirical observations rather than derived results. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The benchmark is grounded in external case data and evaluated with standard protocols.
Axiom & Free-Parameter Ledger
free parameters (1)
- semantic similarity threshold
axioms (2)
- domain assumption: Large language models can synthesize accurate colloquial claims from expert case summaries without introducing significant factual distortions.
- ad hoc to paper: The semantic similarity heuristic reliably identifies and verifies complex legal overrulings.
Forward citations
Cited by 1 Pith paper
- Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge. CAR is a new retrieval objective that targets the currently active authority set rather than most-similar documents, with theorems on coverage conditions and evaluations showing two-stage methods outperform dense retr...
discussion (0)