pith. machine review for the scientific record.

arxiv: 2601.17230 · v2 · submitted 2026-01-23 · 💻 cs.CL · cs.LG

Recognition: no theorem link

CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:19 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords legal fact-checking · benchmark dataset · LLM evaluation · Supreme Court precedents · precedent retrieval · overruling detection · colloquial claims

The pith

The CaseFacts benchmark shows that LLMs struggle to verify colloquial legal claims against Supreme Court precedents, and that web search makes accuracy worse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates CaseFacts, a dataset of 6,294 layperson-style claims about U.S. Supreme Court cases, each labeled Supported, Refuted, or Overruled by current precedents. It builds the dataset by having LLMs turn expert case summaries into everyday assertions and uses a semantic similarity step to catch when later rulings have overturned earlier ones. Experiments find that even the best current LLMs perform poorly on this task. Adding unrestricted web search actually lowers accuracy compared with closed-book settings, because the search pulls in noisy or non-authoritative sources. The benchmark is released to encourage work on systems that can handle the gap between everyday language and technical, time-sensitive law.

Core claim

CaseFacts supplies 6,294 colloquial claims synthesized from Supreme Court case summaries and labeled Supported, Refuted, or Overruled by an LLM pipeline that uses semantic similarity to detect overrulings. State-of-the-art LLMs find the verification task difficult, and augmenting them with open web retrieval degrades performance relative to closed-book baselines due to retrieval of noisy, non-authoritative precedents.

What carries the argument

The multi-stage LLM pipeline that synthesizes colloquial claims from expert summaries and applies a semantic similarity heuristic to identify and label complex legal overrulings.
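The paper does not publish the heuristic's exact form or threshold, but the general idea of flagging candidate overruling pairs by embedding similarity can be sketched. Everything here — the cosine measure, the `threshold` value, and the date-ordering convention — is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def candidate_overruling_pairs(claims, embeddings, dates, threshold=0.85):
    """Flag pairs of claims whose embeddings are highly similar but whose
    cases were decided at different times -- candidates for a later
    overruling check by an LLM. `threshold` is hypothetical; the paper
    does not report the value it uses."""
    pairs = []
    n = len(claims)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold and dates[i] != dates[j]:
                # order so the earlier case is the potentially overruled one
                earlier, later = sorted((i, j), key=lambda k: dates[k])
                pairs.append((earlier, later))
    return pairs
```

Only the flagged pairs would then be passed to the more expensive LLM overruling check, which is what makes the heuristic an efficiency device rather than a final labeler.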

If this is right

  • Legal verification systems perform better when they stay within authoritative closed sources rather than relying on open web retrieval.
  • Any effective legal fact-checker must explicitly track the temporal validity of precedents because later rulings can overrule earlier ones.
  • Benchmarks that force models to bridge everyday language and technical jurisprudence are needed to advance reliable legal AI.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthesis-plus-heuristic approach could be adapted to create benchmarks in other domains where facts evolve, such as medical guidelines.
  • Specialized retrieval limited to official legal databases might overcome the noise problem that open web search introduces here.
  • Hybrid systems that combine LLM reasoning with structured legal databases could be tested directly on this benchmark to measure gains.

Load-bearing premise

The LLM-generated claims from expert summaries closely match the way ordinary people would actually state legal assertions, and the semantic similarity heuristic correctly produces Supported, Refuted, or Overruled labels with few errors.

What would settle it

A legal expert review of a random sample of several hundred labeled claims that measures agreement between the dataset labels and the actual current status of the cited precedents.
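Such a review would typically be summarized as chance-corrected agreement between the expert annotations and the dataset labels. A minimal sketch of Cohen's kappa, the standard statistic for two annotators (the label strings are placeholders):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences over the
    same items: (observed - expected) / (1 - expected)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement if the two annotators labeled independently
    # according to their own marginal label frequencies
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)
```

Reporting kappa per class (e.g. for Overruled alone, where labels are rarest and hardest) would be more informative than a single pooled figure.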

Figures

Figures reproduced from arXiv: 2601.17230 by Akshith Reddy Putta, Chengkai Li, Jacob Devasier.

Figure 1. Graph displaying the changes in character length before and after the LLM pass with prompt.
original abstract

Automated Fact-Checking has largely focused on verifying general knowledge against static corpora, overlooking high-stakes domains like law where truth is evolving and technically complex. We introduce CaseFacts, a benchmark for verifying colloquial legal claims against U.S. Supreme Court precedents. Unlike existing resources that map formal texts to formal texts, CaseFacts challenges systems to bridge the semantic gap between layperson assertions and technical jurisprudence while accounting for temporal validity. The dataset consists of 6,294 claims categorized as Supported, Refuted, or Overruled. We construct this benchmark using a multi-stage pipeline that leverages Large Language Models (LLMs) to synthesize claims from expert case summaries, employing a novel semantic similarity heuristic to efficiently identify and verify complex legal overrulings. Experiments with state-of-the-art LLMs reveal that the task remains challenging; notably, augmenting models with unrestricted web search degrades performance compared to closed-book baselines due to the retrieval of noisy, non-authoritative precedents. We release CaseFacts to spur research into legal fact verification systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CaseFacts, a benchmark of 6,294 colloquial legal claims about U.S. Supreme Court precedents labeled Supported, Refuted, or Overruled. Claims are synthesized from expert summaries via a multi-stage LLM pipeline, with a novel semantic similarity heuristic used to detect and label overrulings. Experiments on state-of-the-art LLMs show the task is challenging and that unrestricted web search augmentation degrades performance relative to closed-book baselines, attributed to retrieval of noisy, non-authoritative precedents.

Significance. If the labels prove reliable, CaseFacts would fill a gap in legal-domain fact-checking benchmarks by emphasizing the semantic gap between lay claims and technical jurisprudence plus temporal validity. The empirical finding that web retrieval harms performance could inform retrieval strategies in high-stakes domains, and releasing the dataset supports further research.

major comments (3)
  1. [Dataset Construction] Dataset construction pipeline (multi-stage LLM synthesis + semantic similarity heuristic): no human validation, inter-annotator agreement, or error analysis is reported for the Overruled labels or the overall Supported/Refuted/Overruled distribution. This is load-bearing because the benchmark's utility and all downstream experimental claims rest on label correctness.
  2. [Experiments] Experiments section: performance measurement details are limited (e.g., exact prompting, handling of temporal validity, and whether classification is strict three-way or allows partial credit). Without these, it is hard to interpret the reported degradation from web search or to reproduce the closed-book vs. augmented comparison.
  3. [§4.2] Web-augmentation results: the claim that unrestricted search degrades performance is central, yet the manuscript provides no details on query formulation, result ranking, or filtering of non-authoritative sources. This leaves open whether the observed drop is due to noise or to an uncontrolled experimental variable.
minor comments (2)
  1. [Abstract] The abstract states the dataset size but omits the breakdown across Supported, Refuted, and Overruled classes; adding this would help readers assess class balance.
  2. [Introduction] Notation for the semantic similarity threshold and how Overruled is distinguished from Refuted could be introduced earlier and used consistently.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on CaseFacts. We address each major comment below and commit to revisions that improve reproducibility and label transparency without altering the core findings.

point-by-point responses
  1. Referee: [Dataset Construction] Dataset construction pipeline (multi-stage LLM synthesis + semantic similarity heuristic): no human validation, inter-annotator agreement, or error analysis is reported for the Overruled labels or the overall Supported/Refuted/Overruled distribution. This is load-bearing because the benchmark's utility and all downstream experimental claims rest on label correctness.

    Authors: We acknowledge that full human validation was not performed at scale due to the dataset size and resource limits. The pipeline begins with expert case summaries and uses the semantic similarity heuristic to detect overrulings via citation overlap and embedding similarity thresholds calibrated on known overruling pairs. In the revised version we will add a manual error analysis on a stratified sample of 300 claims (100 per label), with two legal experts providing independent annotations and reporting Cohen's kappa for the Overruled category. This will quantify label reliability while preserving the automated construction approach. revision: yes

  2. Referee: [Experiments] Experiments section: performance measurement details are limited (e.g., exact prompting, handling of temporal validity, and whether classification is strict three-way or allows partial credit). Without these, it is hard to interpret the reported degradation from web search or to reproduce the closed-book vs. augmented comparison.

    Authors: We agree that these details are essential. Classification is performed as a strict three-way choice with no partial credit; temporal validity is enforced by masking any precedent decided after the claim's reference date. In the revision we will include the exact system and user prompts for each model, the temperature and decoding settings, and a new subsection on temporal handling. We will also release the full evaluation scripts to enable exact reproduction of the closed-book versus web-augmented comparisons. revision: yes

  3. Referee: [§4.2] Web-augmentation results: the claim that unrestricted search degrades performance is central, yet the manuscript provides no details on query formulation, result ranking, or filtering of non-authoritative sources. This leaves open whether the observed drop is due to noise or to an uncontrolled experimental variable.

    Authors: The web-augmented setting used the raw claim text as the search query against a standard web API, retrieving the top-5 results ranked by the API's relevance score and concatenating them verbatim to the prompt with no source filtering. This design intentionally tests unrestricted retrieval. The revision will add an explicit paragraph in §4.2 describing the query template, ranking method, and absence of authority filters, together with an ablation that substitutes only official court documents to isolate the effect of noisy precedents. revision: yes
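The setup the rebuttal describes — date-based masking of precedents, and a web-augmented prompt that concatenates the top-5 retrieved snippets verbatim — can be sketched as follows. The prompt wording, field names, and data layout are hypothetical stand-ins, not the paper's actual templates:

```python
from datetime import date

def eligible_precedents(precedents, reference_date):
    """Temporal masking as described in the rebuttal: drop any precedent
    decided after the claim's reference date."""
    return [p for p in precedents if p["decided"] <= reference_date]

def build_prompt(claim, snippets=None):
    """Closed-book prompt, or web-augmented prompt with the top-5 search
    snippets concatenated verbatim and no authority filtering. Wording is
    illustrative; the exact templates are not published."""
    header = ("You are a legal expert. Label the claim as "
              "Supported, Refuted, or Overruled.\n")
    body = f"Claim: {claim}\n"
    if snippets:
        body += "Web results:\n" + "\n".join(snippets[:5]) + "\n"
    return header + body
```

The proposed ablation would keep `build_prompt` fixed and swap only the snippet source (open web vs. official court documents), isolating retrieval noise from every other variable.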

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs CaseFacts via a multi-stage LLM pipeline that synthesizes claims from external expert case summaries and applies a semantic similarity heuristic for labeling Supported/Refuted/Overruled. No equations, fitted parameters, or derivations are present that reduce by construction to the inputs. Experimental claims about LLM performance (including web-search degradation) are presented as empirical observations rather than derived results. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The benchmark is self-contained against external case data and standard evaluation protocols.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on the fidelity of LLM synthesis for claims and the reliability of the semantic similarity heuristic for labeling overrulings, both introduced without external validation in the available description.

free parameters (1)
  • semantic similarity threshold
    Parameter in the novel heuristic used to identify overrulings; specific value not provided in abstract.
axioms (2)
  • domain assumption Large language models can synthesize accurate colloquial claims from expert case summaries without introducing significant factual distortions.
    Core step in the multi-stage pipeline that produces the 6,294 claims.
  • ad hoc to paper The semantic similarity heuristic reliably identifies and verifies complex legal overrulings.
    Novel component used to efficiently label Overruled claims during dataset construction.

pith-pipeline@v0.9.0 · 5480 in / 1411 out tokens · 73185 ms · 2026-05-16T11:19:47.631511+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Controlling Authority Retrieval: A Missing Retrieval Objective for Authority-Governed Knowledge

    cs.IR 2026-04 unverdicted novelty 7.0

CAR is a new retrieval objective that targets the currently active authority set rather than most-similar documents, with theorems on coverage conditions and evaluations showing two-stage methods outperform dense retrieval…

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 1 internal anchor
