pith. machine review for the scientific record.

arxiv: 2601.14698 · v2 · submitted 2026-01-21 · 💻 cs.CL

Recognition: no theorem link

ClaimDB: A Fact Verification Benchmark over Large Structured Data

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 12:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords fact verification · benchmark · structured data · large language models · abstention · claim verification · database reasoning · LLM evaluation

The pith

ClaimDB shows that current LLMs cannot reliably verify claims over large structured databases with millions of records.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ClaimDB is a benchmark built from 80 real-world databases spanning governance, healthcare, media, education, and sciences, where each claim requires composing evidence across multiple tables holding millions of records. At this scale, models cannot simply read evidence and must instead generate executable programs to query and combine data. Tests on 30 proprietary and open-source LLMs find that more than half fall below 55 percent accuracy while nearly all fail to abstain when evidence is absent. The results indicate that existing models remain unsuitable for high-stakes tasks that depend on accurate reasoning over complex structured data.

Core claim

ClaimDB consists of 80 unique real-life databases with claims whose supporting evidence is generated through compositions of millions of records drawn from multiple tables. When 30 state-of-the-art LLMs are evaluated on the benchmark, more than half score below 55 percent accuracy. Both closed-source models and open-source models (below 70B parameters) also fail to abstain reliably when no evidence exists to decide a claim.

What carries the argument

The ClaimDB benchmark, which generates verifiable claims from compositions over large multi-table databases and requires executable program reasoning instead of direct evidence reading.
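
To make that shift concrete, here is a minimal sketch of program-based verification, assuming a toy SQLite schema, an invented claim, and a query of the sort an evaluated model might emit; none of the names or data below come from ClaimDB itself.

  # Minimal sketch of program-based claim verification; the schema, claim, and
  # SQL are invented for illustration, not taken from ClaimDB. In the benchmark
  # the query would be produced by the LLM under evaluation and run against a
  # database with millions of rows, so "reading" the tables directly is not an option.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
  CREATE TABLE hospitals (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
  CREATE TABLE admissions (hospital_id INTEGER, year INTEGER, n_admissions INTEGER);
  INSERT INTO hospitals VALUES (1, 'General', 'CA'), (2, 'Mercy', 'TX');
  INSERT INTO admissions VALUES (1, 2023, 12000), (2, 2023, 9000);
  """)

  claim = "In 2023, hospitals in CA recorded more admissions than hospitals in TX."

  # A hypothetical model-generated program: join, filter, and aggregate instead of reading rows.
  generated_sql = """
  SELECT h.state, SUM(a.n_admissions) AS total
  FROM admissions a JOIN hospitals h ON a.hospital_id = h.id
  WHERE a.year = 2023
  GROUP BY h.state;
  """

  totals = dict(conn.execute(generated_sql).fetchall())

  # Map the program's output to a verdict, abstaining when the evidence is absent.
  if "CA" not in totals or "TX" not in totals:
      verdict = "NOT ENOUGH EVIDENCE"
  elif totals["CA"] > totals["TX"]:
      verdict = "SUPPORTED"
  else:
      verdict = "REFUTED"

  print(claim, "->", verdict)  # SUPPORTED on this toy data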

If this is right

  • Verification at this scale requires shifting from evidence reading to executable program generation.
  • More than half of tested LLMs fall below 55 percent accuracy on the benchmark.
  • Both closed-source and open-source models struggle to recognize when evidence is insufficient.
  • Reliability concerns limit use of current LLMs in high-stakes structured-data analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Direct integration between LLMs and database query engines may reduce the reasoning failures seen on ClaimDB.
  • The same abstention weaknesses are likely to appear in other tasks that aggregate information across large tables.
  • Training objectives focused on abstention could lower overconfident errors in production data-analysis settings.
  • Extending the benchmark with time-varying data would test whether models can handle evolving records.

Load-bearing premise

The 80 chosen databases and the generated claim-evidence pairs are representative of real-world fact-verification demands over large structured data, and the accuracy and abstention metrics measure model capability without hidden biases from claim construction.

What would settle it

A single LLM reaching above 70 percent accuracy on ClaimDB while correctly abstaining on at least 80 percent of claims that have no supporting evidence would show the reported limitations can be overcome at current model scales.
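
A minimal sketch of how that bar could be scored, assuming three verdict labels and the thresholds stated above; the label strings and scoring functions are illustrative assumptions, not the paper's released evaluation code.

  # Hedged sketch of the two headline metrics; label names are assumed.
  def accuracy(preds, gold):
      """Fraction of claims where the predicted verdict matches the gold label."""
      return sum(p == g for p, g in zip(preds, gold)) / len(gold)

  def abstention_rate(preds, gold, nei="NOT ENOUGH EVIDENCE"):
      """Of the claims whose gold label is 'no evidence', how often the model abstains.
      Assumes at least one such claim exists in the gold labels."""
      pairs = [(p, g) for p, g in zip(preds, gold) if g == nei]
      return sum(p == nei for p, _ in pairs) / len(pairs)

  # The criterion above would then read:
  #   accuracy(preds, gold) > 0.70 and abstention_rate(preds, gold) >= 0.80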

read the original abstract

Real-world fact-checking often involves verifying claims grounded in structured data at scale. Despite substantial progress in fact-verification benchmarks, this setting remains largely underexplored. In this work, we introduce ClaimDB, a fact-verification benchmark where the evidence for claims is derived from compositions of millions of records and multiple tables. ClaimDB consists of 80 unique real-life databases covering a wide range of domains, from governance and healthcare to media, education and the natural sciences. At this scale, verification approaches that rely on "reading" the evidence break down, forcing a timely shift toward reasoning in executable programs. We conduct extensive experiments with 30 state-of-the-art proprietary and open-source (below 70B) LLMs and find that more than half score below 55% accuracy. Our analysis also reveals that both closed- and open-source models struggle with abstention -- the ability to admit that there is no evidence to decide -- raising doubts about their reliability in high-stakes data analysis tasks. We release the benchmark, code, and the LLM leaderboard at https://claimdb.github.io .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ClaimDB, a fact-verification benchmark consisting of 80 real-life databases spanning domains such as governance, healthcare, media, education, and natural sciences. Claims are constructed so that evidence requires compositions over millions of records and multiple tables, rendering direct reading approaches infeasible and necessitating executable program-based reasoning. Experiments evaluate 30 state-of-the-art proprietary and open-source (below 70B parameters) LLMs, reporting that more than half achieve below 55% accuracy, while both closed- and open-source models struggle with abstention (admitting insufficient evidence). The benchmark, code, and leaderboard are released publicly.

Significance. If the benchmark construction and evaluation protocol are sound, the work is significant for exposing limitations of current LLMs in scalable, executable reasoning over structured data at real-world sizes. It fills a gap in fact-verification benchmarks by emphasizing program synthesis over retrieval, supplies a reusable resource with broad domain coverage, and provides empirical evidence that could motivate targeted improvements in data-analysis reliability for high-stakes applications.

major comments (3)
  1. [Dataset construction] Dataset construction section: the claim-generation process is described only at a high level, with no enumeration of the templates, rules, join cardinalities, aggregation functions, or negation structures used. Without these details it is impossible to assess whether low accuracies and abstention failures arise from inherent model limitations or from construction artifacts that over-represent particular query patterns (an illustrative template is sketched after this report).
  2. [Experiments] Experimental results section: the reported accuracies for the 30 LLMs lack error bars, statistical significance tests, confidence intervals, or an explicit evaluation protocol (e.g., exact prompting format, abstention threshold, and handling of executable program outputs). This absence renders the central claim that “more than half score below 55% accuracy” unverifiable from the given information.
  3. [Results and Analysis] §4 (or equivalent results analysis): the representativeness argument for the 80 databases and generated claims is asserted without quantitative comparison to real-world fact-verification workloads (e.g., distribution of join depths, aggregation types, or evidence sizes). This weakens the inference that observed failures generalize to high-stakes data-analysis tasks.
minor comments (2)
  1. [Abstract and Experiments] The abstract states “open-source (below 70B)” but the main text should list exact model sizes and families for reproducibility.
  2. [Figures and Tables] Figure captions and table headers should explicitly define the abstention metric and the executable-program success criterion.
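
For concreteness, the following is a purely hypothetical sketch of the kind of claim template the first major comment asks to see enumerated; the pattern, join cardinality, aggregation choices, and labeling rule are all invented here and are not drawn from ClaimDB's construction.

  # Hypothetical claim template; every field is an assumption, not ClaimDB's actual rules.
  import random

  TEMPLATE = {
      "pattern": "The {agg} of {column} for {entity} in {group} {comparison} {value}.",
      "join_cardinality": 2,                 # fact table joined with two dimension tables
      "aggregations": ["sum", "average", "count"],
      "negation": [False, True],             # optionally negate the comparison
  }

  def instantiate(true_value):
      """Fill the template and derive the gold label from the database-computed value."""
      agg = random.choice(TEMPLATE["aggregations"])
      negated = random.choice(TEMPLATE["negation"])
      claimed = true_value if random.random() < 0.5 else round(true_value * 1.2)
      supported = (claimed == true_value) != negated   # XOR of equality and negation
      text = TEMPLATE["pattern"].format(
          agg=agg, column="admissions", entity="hospitals", group="California",
          comparison="is not" if negated else "is", value=claimed,
      )
      return {"claim": text, "label": "SUPPORTED" if supported else "REFUTED"}

  print(instantiate(true_value=12000))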

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the paper accordingly to improve clarity, rigor, and reproducibility.

read point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the claim-generation process is described only at a high level with no enumeration of templates, rules, join cardinalities, aggregation functions, or negation structures used. Without these details it is impossible to assess whether low accuracies and abstention failures arise from inherent model limitations or from construction artifacts that over-represent particular query patterns.

    Authors: We agree that the claim-generation process requires more granular details for proper assessment. In the revised manuscript, we will add a dedicated appendix that enumerates the templates, rules, join cardinalities, aggregation functions, and negation structures used. This will demonstrate the diversity of patterns covered and confirm that observed model limitations stem from inherent challenges rather than construction artifacts. revision: yes

  2. Referee: [Experiments] Experimental results section: the reported accuracies for the 30 LLMs lack error bars, statistical significance tests, confidence intervals, or an explicit evaluation protocol (e.g., exact prompting format, abstention threshold, and handling of executable program outputs). This absence renders the central claim that “more than half score below 55% accuracy” unverifiable from the given information.

    Authors: We acknowledge the importance of statistical rigor and explicit protocols. We will revise the experiments section to include error bars and confidence intervals for the reported accuracies (one possible interval computation is sketched after these responses). We will also provide a detailed description of the evaluation protocol, including the exact prompting format, abstention handling (e.g., thresholds for insufficient evidence), and processing of executable program outputs. This will ensure the results are fully verifiable. revision: yes

  3. Referee: [Results and Analysis] §4 (or equivalent results analysis): the representativeness argument for the 80 databases and generated claims is asserted without quantitative comparison to real-world fact-verification workloads (e.g., distribution of join depths, aggregation types, or evidence sizes). This weakens the inference that observed failures generalize to high-stakes data-analysis tasks.

    Authors: We will strengthen the analysis by adding quantitative statistics on the distributions of join depths, aggregation types, and evidence sizes across the 80 databases. This will provide concrete support for the benchmark's coverage. Direct comparisons to external real-world workload distributions are not possible without access to proprietary data, but the real-life origin and domain diversity of the databases offer a robust basis for generalizability to high-stakes tasks. revision: partial
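
As a rough illustration of the interval reporting promised in response 2, a percentile bootstrap over per-claim correctness is one way such error bars could be computed; the resampling scheme and 95% level are assumptions, not details taken from the paper.

  # Assumed percentile-bootstrap confidence interval for one model's accuracy.
  import random

  def bootstrap_accuracy_ci(correct, n_boot=1000, alpha=0.05):
      """correct: list of booleans, one per claim (True if the verdict was right)."""
      n = len(correct)
      means = sorted(
          sum(random.choice(correct) for _ in range(n)) / n
          for _ in range(n_boot)
      )
      point = sum(correct) / n
      return point, (means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1])

  # e.g. acc, (lo, hi) = bootstrap_accuracy_ci(per_claim_correct_flags)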

Circularity Check

0 steps flagged

No significant circularity in benchmark introduction and empirical evaluation

full rationale

The paper introduces ClaimDB, a new fact-verification benchmark built from 80 real-life databases and claims derived from compositions of millions of records across multiple tables. It reports standard empirical metrics (accuracy and abstention) on 30 existing LLMs without any derivations, equations, fitted parameters, or predictions that reduce to inputs defined by the authors. No self-citations are load-bearing for a central theoretical claim, and the benchmark is released externally for independent use. The work is self-contained as an empirical contribution; representativeness concerns fall under correctness risk rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen databases and claims capture the essential difficulties of real-world structured-data verification; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 80 databases and associated claims are representative of real-world fact verification over large structured data.
    This assumption underpins the claim that poor LLM performance on ClaimDB indicates broader reliability issues in high-stakes data analysis.

pith-pipeline@v0.9.0 · 5500 in / 1293 out tokens · 34486 ms · 2026-05-16T12:58:35.083493+00:00 · methodology

discussion (0)
