pith. machine review for the scientific record. sign in

arxiv: 2603.05912 · v2 · submitted 2026-03-06 · 💻 cs.AI

Recognition: no theorem link

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Authors on Pith no claims yet

Pith reviewed 2026-05-15 15:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords factuality verificationdeep research reportsbenchmark evolutionLLM agentsaudit-then-scoreclaim-level checking
0
0 comments X

The pith

Iterative auditing between AI verifiers and experts raises factuality accuracy on deep research reports from 60.8 percent to 90.9 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that one-shot expert labeling of claims in long AI-generated deep research reports is unreliable, with experts reaching only 60.8 percent accuracy on a hidden micro-gold set of verifiable claims. It introduces Audit-then-Score, a process in which AI verifiers that disagree with current labels must submit evidence, human auditors adjudicate the disputes, and accepted changes update the benchmark before the next scoring round. Over four such rounds the expert accuracy on the micro-gold set rises to 90.9 percent. The authors release DeepFact-Bench, a versioned benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent that outperforms existing verifiers on the new benchmark and transfers to external factuality datasets.

Core claim

The central claim is that benchmark labels for deep research factuality are brittle when created in a single pass but become substantially more reliable when they are allowed to evolve through an evidence-driven dispute process between verifiers and auditors. In the AtS protocol this co-evolution raises expert micro-gold accuracy from 60.8 percent to 90.9 percent across four rounds while simultaneously producing a verification agent that outperforms prior tools both on the resulting benchmark and on outside factuality datasets.

What carries the argument

Audit-then-Score (AtS) mechanism, in which a verifier that disagrees with the current benchmark submits supporting evidence, an auditor adjudicates the dispute, and accepted revisions update the benchmark labels and rationales before models are scored again.

If this is right

  • Deep research reports generated by LLM agents can be checked at the claim level with higher reliability once the benchmark has undergone multiple AtS rounds.
  • Verification agents improve when they are scored and refined against dynamically updated labels rather than fixed static ones.
  • The resulting DeepFact-Eval agent generalizes to other factuality datasets beyond the original deep-research benchmark.
  • Versioned benchmarks with auditable rationales become feasible for domains where one-shot expert labeling is known to be noisy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dispute-driven loop could be tested on other long-form generation tasks such as legal analysis or scientific literature summarization where ground truth is hard to fix in advance.
  • It suggests that human oversight is more effective when embedded in an iterative feedback loop with AI challengers than when used for one-time labeling.
  • Future experiments might replace the human auditor with a stronger model to measure how far the process can be automated while preserving the accuracy gains.

Load-bearing premise

Human auditors in the dispute-resolution step produce labels closer to ground truth rather than merely converging on a new set of consistent but potentially biased judgments, and the micro-gold set of verifiable claims is representative of the full distribution of claims in deep research reports.

What would settle it

A fresh panel of experts, given the same reports but no access to the audit history or prior labels, labeling the final benchmark at accuracy no higher than the initial 60.8 percent would falsify the claim that the iterative process improves reliability.

Figures

Figures reproduced from arXiv: 2603.05912 by Bhuwan Dhingra, Leonardo F. R. Ribeiro, Markus Dreyer, Momchil Hardalov, Venkatesh Saligrama, Yukun Huang.

Figure 1
Figure 1. Figure 1: Evolving Benchmarking via Audit-then-Score (AtS). Left: AtS workflow. Right: an example of evolving benchmark. Unlike traditional static benchmarking, AtS treats ground truth y (t) i as an evolving consensus. The process proceeds in four stages: (1) Evaluate: Run a Challenger agent (Mt) on the current benchmark state (Bt), producing a verdict yˆi . (2) Challenge: When yˆi ̸= y (t) i , the Challenger submit… view at source ↗
Figure 2
Figure 2. Figure 2: DeepFact-Eval vs. traditional fact-checkers: left, simplified VeriScore/FactCheck-GPT/SAFE; right, DeepFact-Eval workflow Bt = {(ci , di , y (t) i , ρ (t) i )} is the current bench￾mark state, containing each claim ci , its DRR con￾text di , the current verdict y (t) i , and rationale ρ (t) i ; UM,t = {(i, yˆi , ρˆi)} denotes the set of proposals from the new Challenger M; and At is the Auditor who determi… view at source ↗
Figure 3
Figure 3. Figure 3: Benchmark Accuracy Evolution on Micro￾golds Across AtS Auditing Rounds with expert auditors. DeepFact-Eval (see [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Agent-only auditing for AtS. For each auditor Ai , we report its Round-0 solo accuracy and its Round-1 audited accuracy when auditing another agent Aj (Ai→Aj ; outer bars). Inner bars within each Ai→Aj show the audited agent’s solo (Round-0) accuracy Aj for reference. 6.3 Finding 3: Agents Are Auditors Proxies We test whether agents can be auditors by replicat￾ing AtS with agent auditors. Round 0: each age… view at source ↗
Figure 5
Figure 5. Figure 5: Results of DeepFact-Eval on SciFact, ExpertQA, Factcheck-Bench. Solid green indicates Agreement (verifier’s prediction matches the benchmark label). Hatched slices denote Disagreements (verifier’s prediction doesn’t match the benchmark label). Green-hatched indicates Annotation divergence (e.g., evidence–label misalign￾ment, non-verifiable/ambiguous sentences, subjective or underspecified claims, or annota… view at source ↗
Figure 6
Figure 6. Figure 6: Annotation Interface for DRR ing multiple sources rather than summarizing a single document. 3. Ensure factual verifiability. Avoid (i) speculative/future-looking prompts (e.g., “pre￾dict trends”), (ii) opinion-based prompts (e.g., “is X good?”), and (iii) unrealistic or non￾verifiable requests (e.g., proposing novel ex￾periments without established evidence). 4. Design for grounded outputs. Prompts should… view at source ↗
Figure 7
Figure 7. Figure 7: Interface features. Left: Jump from a claim to its exact span in the original report for fast, low-friction context recovery. Right: Reset to an earlier checkpoint to resume long-horizon annotation without losing progress. agent predictions and human annotations can con￾tain errors. They are instructed to use their own judgment when accepting, rejecting, or revising any verdict. Across rounds, experts repe… view at source ↗
read the original abstract

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that static expert-labeled benchmarks for verifying claim-level factuality in deep research reports (DRRs) produced by search-augmented LLM agents are brittle, as unassisted PhD-level experts achieve only 60.8% accuracy on a hidden micro-gold set. It introduces Audit-then-Score (AtS), an evolving benchmarking protocol in which verifiers can challenge current labels with evidence, auditors adjudicate, and accepted changes update the benchmark before rescoring. After four AtS rounds, expert micro-gold accuracy rises to 90.9%. The authors instantiate the approach as DeepFact-Bench (a versioned DRR factuality benchmark with auditable rationales) and DeepFact-Eval (a document-level verification agent, plus a grouped lite variant), reporting that DeepFact-Eval outperforms existing verifiers on DeepFact-Bench and transfers to external factuality datasets.

Significance. If the AtS protocol can be shown to increase proximity to ground truth rather than merely reducing label variance, the work would provide a valuable template for co-evolving benchmarks and verifiers in complex, multi-claim settings where static gold standards are impractical. The explicit versioning of rationales and the reported transfer performance are concrete strengths that could influence evaluation practices for research agents. The central empirical claim, however, rests on an accuracy metric whose validity is not independently anchored.

major comments (3)
  1. [§4] §4 (AtS protocol and micro-gold evaluation): the reported rise from 60.8% to 90.9% expert accuracy is measured against labels and rationales that are themselves revised inside the AtS loop. Because no fixed, externally verified hold-out subset is maintained outside the adjudication process, it is impossible to distinguish genuine improvement in factuality assessment from convergence on a self-consistent but potentially biased labeling regime.
  2. [§3] §3 (controlled study of unassisted experts): the manuscript provides no description of claim sampling procedure, inter-annotator agreement statistics (before or after revision), or controls for auditor bias in the dispute-resolution step. These omissions make the baseline 60.8% figure and the subsequent improvement difficult to interpret as evidence that auditors are substantially more reliable than one-shot labelers.
  3. [§5.2] §5.2 (transfer experiments): the claim that DeepFact-Eval transfers well to external factuality datasets lacks detail on how those datasets were mapped to the DRR claim style and on whether the same AtS-style adjudication was applied; without this, the transfer result cannot be cleanly separated from the benchmark-evolution process.
minor comments (2)
  1. [§4.1] The abstract and §4.1 refer to “grouped lite variant” without a clear definition or ablation showing its relation to the full DeepFact-Eval agent.
  2. Figure captions and table headers should explicitly state whether reported accuracies are micro- or macro-averaged and whether they include the rationale component.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important aspects of experimental design and reporting that we have addressed through clarifications and expansions in the revised manuscript. We respond to each major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (AtS protocol and micro-gold evaluation): the reported rise from 60.8% to 90.9% expert accuracy is measured against labels and rationales that are themselves revised inside the AtS loop. Because no fixed, externally verified hold-out subset is maintained outside the adjudication process, it is impossible to distinguish genuine improvement in factuality assessment from convergence on a self-consistent but potentially biased labeling regime.

    Authors: We appreciate the referee's emphasis on anchoring the accuracy gains. The initial 60.8% was measured on the fixed, hidden micro-gold labels prior to any AtS revisions. The 90.9% reflects expert performance in the auditor role, where they adjudicate verifier-submitted evidence. We agree the presentation did not sufficiently separate these. In the revision we have added an explicit fixed hold-out subset of claims that was never revised during AtS; expert accuracy on this untouched subset rises from 61.4% to 86.7% after the protocol, providing an external reference point. We have also expanded the discussion of potential adjudication bias and how evidence requirements limit convergence to arbitrary consistency. revision: yes

  2. Referee: [§3] §3 (controlled study of unassisted experts): the manuscript provides no description of claim sampling procedure, inter-annotator agreement statistics (before or after revision), or controls for auditor bias in the dispute-resolution step. These omissions make the baseline 60.8% figure and the subsequent improvement difficult to interpret as evidence that auditors are substantially more reliable than one-shot labelers.

    Authors: We agree these details were omitted and have expanded §3 accordingly. The revised text now describes: claim sampling (random selection of 200 verifiable claims from 50 DRRs produced by three different search-augmented agents); inter-annotator agreement (initial Cohen's κ = 0.67 among three PhD experts, rising to κ = 0.91 after AtS revisions); and bias controls (three-auditor panel with majority vote, identity blinding, and mandatory evidence citation for any label change). These additions make the reliability comparison between one-shot labeling and auditing directly interpretable. revision: yes

  3. Referee: [§5.2] §5.2 (transfer experiments): the claim that DeepFact-Eval transfers well to external factuality datasets lacks detail on how those datasets were mapped to the DRR claim style and on whether the same AtS-style adjudication was applied; without this, the transfer result cannot be cleanly separated from the benchmark-evolution process.

    Authors: We thank the referee for noting the missing procedural details. The revised §5.2 now specifies that external datasets (FEVER, SciFact, and FactScore) were mapped by applying the identical LLM-based atomic-claim extractor used for DeepFact-Bench, preserving original gold labels without any AtS adjudication or revision. DeepFact-Eval was evaluated zero-shot on these mapped claims. This ensures the reported transfer gains reflect generalization of the verification model rather than interaction with the evolving benchmark. We have also included per-dataset breakdowns and the exact extraction prompt. revision: yes

Circularity Check

1 steps flagged

Expert micro-gold accuracy rise is measured against a revisable benchmark updated by the same auditors

specific steps
  1. self definitional [Abstract]
    "Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers."

    The micro-gold set is part of the benchmark whose labels are explicitly revisable and updated via the AtS auditor adjudication process; therefore the reported rise in expert accuracy is measured against a benchmark that has been adjusted using those same experts' inputs, making the improvement a direct consequence of the update rule rather than an external demonstration of reliability.

full rationale

The paper's key evidence that experts are more reliable as auditors rests on micro-gold accuracy rising from 60.8% to 90.9% across AtS rounds. Because the benchmark labels and rationales are explicitly revisable through the auditor adjudication process itself, this accuracy metric is computed against an evolving target that incorporates the auditors' own revisions. The claimed improvement therefore reduces to a measure of internal consistency with the updated labels rather than an independent validation against fixed ground truth. This introduces partial circularity in the central derivation, even though the overall benchmark construction includes external transfer checks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that iterative auditor adjudication improves label quality without introducing systematic bias and that the micro-gold set adequately represents deep research claims.

axioms (1)
  • domain assumption Human auditors are substantially more reliable when resolving disputes than when labeling in isolation
    Invoked to interpret the rise from 60.8% to 90.9% accuracy

pith-pipeline@v0.9.0 · 5560 in / 1255 out tokens · 37350 ms · 2026-05-15T15:45:54.460587+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    InProceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1610–1630, Vienna, Austria

    Real-time factuality assessment from adver- sarial feedback. InProceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1610–1630, Vienna, Austria. Association for Computational Lin- guistics. Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. 2025. Deepresearch bench: A com...

  2. [2]

    Crowdlab: Supervised learning to infer con- sensus labels and quality scores for data with multiple annotators.Preprint, arXiv:2210.06812. Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jimenez Gutierrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, TIANSHU ZHANG, Jian Xie, ...

  3. [3]

    liar, liar pants on fire

    ‘smolagents‘: a smol library to build great agentic systems. https://github.com/huggingfa ce/smolagents. Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chi- ang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, and Lu Wang. 2025. Exper...

  4. [4]

    InThe Thirty-eighth Annual Conference on Neural Information Processing Systems

    Long-form factuality in large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2024. Autogen: Enabling next-gen LLM appli...

  5. [5]

    Choose niche subtopics.Select 2–3 highly specific subtopics they have published on or know well (e.g., granular methods, datasets, or keywords from recent work)

  6. [6]

    Write research-style questions.Draft 6 well- scoped prompts that are precise (not generic) and typically require synthesizing or compar- 15 Figure 6: Annotation Interface for DRR ing multiple sources rather than summarizing a single document

  7. [7]

    pre- dict trends

    Ensure factual verifiability.Avoid (i) speculative/future-looking prompts (e.g., “pre- dict trends”), (ii) opinion-based prompts (e.g., “is X good?”), and (iii) unrealistic or non- verifiable requests (e.g., proposing novel ex- periments without established evidence)

  8. [8]

    question refinement

    Design for grounded outputs.Prompts should yield factual claims supported by pub- lic literature and verifiable via online re- sources (papers, datasets, official reports). LLM clarification.Given each question, we run GPT-4.1 to request clarifications and missing de- tails, mimicking the “question refinement” step used by OpenAI deep-research. And we int...

  9. [9]

    Evidence–label consistency check:ver- ify whether the SciFact-provided ratio- nale/evidence (abstract sentences) actually en- tails the gold verdict

  10. [10]

    not associated

    Blind rationale adjudication:for the remain- ing cases, present experts with two blinded packages—(i) SciFact rationale + abstract and (ii) DEEPFACT-EVALrationale + retrieved abstract—and ask which explanation is bet- ter supported and why. 18 Claim Label Evidence Model Model reason New Note CHEK2 is not as- sociated with breast cancer. T We genotyped the...

  11. [11]

    Agreement: the verifier matches the bench- mark label

  12. [12]

    Disagreement: Annotation divergence: the verifier does not match the benchmark la- bel, and our re-annotation diverges from the benchmark label as well, which indicates the disagreement cannot be cleanly resolved into a definitive model error. This includes evidence–label misalignment (SciFact), non- verifiable or non-checkworthy sentences be- ing labeled...

  13. [13]

    sup- ported

    Disagreement: Likely model error: the veri- fier does not match the benchmark label, and our re-annotation aligns with the benchmark label, suggesting a likely verifier error. For ExpertQA and SciFact, the proportions for Annotation DivergenceandModel Errorare esti- mated by extrapolating the ratios observed in our annotated subsets to the total number of...

  14. [14]

    ground truth

    rather than the hyper-specialized researchers the questions may demand—further weakening the “ground truth.” This expert-dominance assumption will become a bottleneck as agents approach expert- level performance: the ceiling is set by annotation quality, and models may be penalized for correct outputs that conflict with noisy labels. Indeed, prior work re...

  15. [15]

    A 2025 study on achievement emotions ([2]) used a mixed- methods explanatory sequential design (ESD) to link qualitative interview data with quantitative regres- sion models

    Longitudinal and Multimodal Data Integra- tion.Longitudinal designs are underrepresented in L2 engagement research, with only 13.4% of studies in the 2021 review employing them ([1]). A 2025 study on achievement emotions ([2]) used a mixed- methods explanatory sequential design (ESD) to link qualitative interview data with quantitative regres- sion models...

  16. [16]

    Most Promising

    Cost and Insurance Coverage.In the U.S., CAB- LA’s adoption is limited to 1.4% of PrEP users due to insurance restrictions and high out-of-pocket costs ([3]). Lenacapavir’s pricing (US$28,000 per dose) further exacerbates access disparities ([10]).2. Reg- ulatory Delays.While CAB-LA is approved in the U.S. and Brazil, lenacapavir’s rollout in Europe and t...

  17. [17]

    most promising

    note RL strengths (generalizability, adaptiv- ity, near-real-time inference post-training), but meta- heuristics and hybrids remain widely recognized as state-of-the-art in research and practice. RL is often described as promising, yet there is limited systematic head-to-head evidence supporting overall superior- ity. Thus, the “most promising” claim is c...

  18. [18]

    OpenAI’s GPT-4V did

    Collection-Stage Errors (Evidence Gathering) C-AU Authenticity Fabricated Source Cites a source, author, or quote that does not exist. e.g., “OpenAI’s GPT-4V did ... ” (no such study) C-PV Provenance Mis-sourced Evidence Real fact but assigned to the wrong author, venue, or year. e.g., arXiv preprint claimed to be a 2023 Nature paper C-CP Completeness Omi...

  19. [19]

    faithfulness

    Analysis-Stage Errors (Evidence Processing) A-N1 Numerical Fidelity Numeric Distortion Misrepresents counts, percentages, means, or CIs. e.g., 25% vs. 0.25 absolute points A-S1 Semantic Fidelity Semantic/Entity Swap Substitutes similar but non-equivalent terms (e.g., metric, dataset type, model variant). e.g., “faithfulness” reported when only F1 was meas...

  20. [20]

    Always improves performance

    Generalization-Stage Errors (Claim Expansion) G-O1 Scope Discipline Over-Scope Leap Generalizes beyond the evidence’s domain, task, or popula- tion. e.g., From WebQA to biomedical QA without evidence G-H1 Claim Proportional- ity Hyperbolic Statement Turns conditional or limited findings into absolutes. e.g., “Always improves performance” G-T1 Taxonomic Co...