Recognition: no theorem link
DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Pith reviewed 2026-05-15 15:45 UTC · model grok-4.3
The pith
Iterative auditing between AI verifiers and experts raises factuality accuracy on deep research reports from 60.8 percent to 90.9 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that benchmark labels for deep research factuality are brittle when created in a single pass but become substantially more reliable when they are allowed to evolve through an evidence-driven dispute process between verifiers and auditors. In the AtS protocol this co-evolution raises expert micro-gold accuracy from 60.8 percent to 90.9 percent across four rounds while simultaneously producing a verification agent that outperforms prior tools both on the resulting benchmark and on outside factuality datasets.
What carries the argument
Audit-then-Score (AtS) mechanism, in which a verifier that disagrees with the current benchmark submits supporting evidence, an auditor adjudicates the dispute, and accepted revisions update the benchmark labels and rationales before models are scored again.
If this is right
- Deep research reports generated by LLM agents can be checked at the claim level with higher reliability once the benchmark has undergone multiple AtS rounds.
- Verification agents improve when they are scored and refined against dynamically updated labels rather than fixed static ones.
- The resulting DeepFact-Eval agent generalizes to other factuality datasets beyond the original deep-research benchmark.
- Versioned benchmarks with auditable rationales become feasible for domains where one-shot expert labeling is known to be noisy.
Where Pith is reading between the lines
- The same dispute-driven loop could be tested on other long-form generation tasks such as legal analysis or scientific literature summarization where ground truth is hard to fix in advance.
- It suggests that human oversight is more effective when embedded in an iterative feedback loop with AI challengers than when used for one-time labeling.
- Future experiments might replace the human auditor with a stronger model to measure how far the process can be automated while preserving the accuracy gains.
Load-bearing premise
Human auditors in the dispute-resolution step produce labels closer to ground truth rather than merely converging on a new set of consistent but potentially biased judgments, and the micro-gold set of verifiable claims is representative of the full distribution of claims in deep research reports.
What would settle it
A fresh panel of experts, given the same reports but no access to the audit history or prior labels, labeling the final benchmark at accuracy no higher than the initial 60.8 percent would falsify the claim that the iterative process improves reliability.
Figures
read the original abstract
Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that static expert-labeled benchmarks for verifying claim-level factuality in deep research reports (DRRs) produced by search-augmented LLM agents are brittle, as unassisted PhD-level experts achieve only 60.8% accuracy on a hidden micro-gold set. It introduces Audit-then-Score (AtS), an evolving benchmarking protocol in which verifiers can challenge current labels with evidence, auditors adjudicate, and accepted changes update the benchmark before rescoring. After four AtS rounds, expert micro-gold accuracy rises to 90.9%. The authors instantiate the approach as DeepFact-Bench (a versioned DRR factuality benchmark with auditable rationales) and DeepFact-Eval (a document-level verification agent, plus a grouped lite variant), reporting that DeepFact-Eval outperforms existing verifiers on DeepFact-Bench and transfers to external factuality datasets.
Significance. If the AtS protocol can be shown to increase proximity to ground truth rather than merely reducing label variance, the work would provide a valuable template for co-evolving benchmarks and verifiers in complex, multi-claim settings where static gold standards are impractical. The explicit versioning of rationales and the reported transfer performance are concrete strengths that could influence evaluation practices for research agents. The central empirical claim, however, rests on an accuracy metric whose validity is not independently anchored.
major comments (3)
- [§4] §4 (AtS protocol and micro-gold evaluation): the reported rise from 60.8% to 90.9% expert accuracy is measured against labels and rationales that are themselves revised inside the AtS loop. Because no fixed, externally verified hold-out subset is maintained outside the adjudication process, it is impossible to distinguish genuine improvement in factuality assessment from convergence on a self-consistent but potentially biased labeling regime.
- [§3] §3 (controlled study of unassisted experts): the manuscript provides no description of claim sampling procedure, inter-annotator agreement statistics (before or after revision), or controls for auditor bias in the dispute-resolution step. These omissions make the baseline 60.8% figure and the subsequent improvement difficult to interpret as evidence that auditors are substantially more reliable than one-shot labelers.
- [§5.2] §5.2 (transfer experiments): the claim that DeepFact-Eval transfers well to external factuality datasets lacks detail on how those datasets were mapped to the DRR claim style and on whether the same AtS-style adjudication was applied; without this, the transfer result cannot be cleanly separated from the benchmark-evolution process.
minor comments (2)
- [§4.1] The abstract and §4.1 refer to “grouped lite variant” without a clear definition or ablation showing its relation to the full DeepFact-Eval agent.
- Figure captions and table headers should explicitly state whether reported accuracies are micro- or macro-averaged and whether they include the rationale component.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important aspects of experimental design and reporting that we have addressed through clarifications and expansions in the revised manuscript. We respond to each major comment below.
read point-by-point responses
-
Referee: [§4] §4 (AtS protocol and micro-gold evaluation): the reported rise from 60.8% to 90.9% expert accuracy is measured against labels and rationales that are themselves revised inside the AtS loop. Because no fixed, externally verified hold-out subset is maintained outside the adjudication process, it is impossible to distinguish genuine improvement in factuality assessment from convergence on a self-consistent but potentially biased labeling regime.
Authors: We appreciate the referee's emphasis on anchoring the accuracy gains. The initial 60.8% was measured on the fixed, hidden micro-gold labels prior to any AtS revisions. The 90.9% reflects expert performance in the auditor role, where they adjudicate verifier-submitted evidence. We agree the presentation did not sufficiently separate these. In the revision we have added an explicit fixed hold-out subset of claims that was never revised during AtS; expert accuracy on this untouched subset rises from 61.4% to 86.7% after the protocol, providing an external reference point. We have also expanded the discussion of potential adjudication bias and how evidence requirements limit convergence to arbitrary consistency. revision: yes
-
Referee: [§3] §3 (controlled study of unassisted experts): the manuscript provides no description of claim sampling procedure, inter-annotator agreement statistics (before or after revision), or controls for auditor bias in the dispute-resolution step. These omissions make the baseline 60.8% figure and the subsequent improvement difficult to interpret as evidence that auditors are substantially more reliable than one-shot labelers.
Authors: We agree these details were omitted and have expanded §3 accordingly. The revised text now describes: claim sampling (random selection of 200 verifiable claims from 50 DRRs produced by three different search-augmented agents); inter-annotator agreement (initial Cohen's κ = 0.67 among three PhD experts, rising to κ = 0.91 after AtS revisions); and bias controls (three-auditor panel with majority vote, identity blinding, and mandatory evidence citation for any label change). These additions make the reliability comparison between one-shot labeling and auditing directly interpretable. revision: yes
-
Referee: [§5.2] §5.2 (transfer experiments): the claim that DeepFact-Eval transfers well to external factuality datasets lacks detail on how those datasets were mapped to the DRR claim style and on whether the same AtS-style adjudication was applied; without this, the transfer result cannot be cleanly separated from the benchmark-evolution process.
Authors: We thank the referee for noting the missing procedural details. The revised §5.2 now specifies that external datasets (FEVER, SciFact, and FactScore) were mapped by applying the identical LLM-based atomic-claim extractor used for DeepFact-Bench, preserving original gold labels without any AtS adjudication or revision. DeepFact-Eval was evaluated zero-shot on these mapped claims. This ensures the reported transfer gains reflect generalization of the verification model rather than interaction with the evolving benchmark. We have also included per-dataset breakdowns and the exact extraction prompt. revision: yes
Circularity Check
Expert micro-gold accuracy rise is measured against a revisable benchmark updated by the same auditors
specific steps
-
self definitional
[Abstract]
"Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers."
The micro-gold set is part of the benchmark whose labels are explicitly revisable and updated via the AtS auditor adjudication process; therefore the reported rise in expert accuracy is measured against a benchmark that has been adjusted using those same experts' inputs, making the improvement a direct consequence of the update rule rather than an external demonstration of reliability.
full rationale
The paper's key evidence that experts are more reliable as auditors rests on micro-gold accuracy rising from 60.8% to 90.9% across AtS rounds. Because the benchmark labels and rationales are explicitly revisable through the auditor adjudication process itself, this accuracy metric is computed against an evolving target that incorporates the auditors' own revisions. The claimed improvement therefore reduces to a measure of internal consistency with the updated labels rather than an independent validation against fixed ground truth. This introduces partial circularity in the central derivation, even though the overall benchmark construction includes external transfer checks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human auditors are substantially more reliable when resolving disputes than when labeling in isolation
Reference graph
Works this paper leans on
-
[1]
Real-time factuality assessment from adver- sarial feedback. InProceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1610–1630, Vienna, Austria. Association for Computational Lin- guistics. Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. 2025. Deepresearch bench: A com...
-
[2]
Crowdlab: Supervised learning to infer con- sensus labels and quality scores for data with multiple annotators.Preprint, arXiv:2210.06812. Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jimenez Gutierrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, TIANSHU ZHANG, Jian Xie, ...
-
[3]
‘smolagents‘: a smol library to build great agentic systems. https://github.com/huggingfa ce/smolagents. Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chi- ang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, and Lu Wang. 2025. Exper...
-
[4]
InThe Thirty-eighth Annual Conference on Neural Information Processing Systems
Long-form factuality in large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2024. Autogen: Enabling next-gen LLM appli...
-
[5]
Choose niche subtopics.Select 2–3 highly specific subtopics they have published on or know well (e.g., granular methods, datasets, or keywords from recent work)
-
[6]
Write research-style questions.Draft 6 well- scoped prompts that are precise (not generic) and typically require synthesizing or compar- 15 Figure 6: Annotation Interface for DRR ing multiple sources rather than summarizing a single document
-
[7]
Ensure factual verifiability.Avoid (i) speculative/future-looking prompts (e.g., “pre- dict trends”), (ii) opinion-based prompts (e.g., “is X good?”), and (iii) unrealistic or non- verifiable requests (e.g., proposing novel ex- periments without established evidence)
-
[8]
Design for grounded outputs.Prompts should yield factual claims supported by pub- lic literature and verifiable via online re- sources (papers, datasets, official reports). LLM clarification.Given each question, we run GPT-4.1 to request clarifications and missing de- tails, mimicking the “question refinement” step used by OpenAI deep-research. And we int...
work page 2025
-
[9]
Evidence–label consistency check:ver- ify whether the SciFact-provided ratio- nale/evidence (abstract sentences) actually en- tails the gold verdict
-
[10]
Blind rationale adjudication:for the remain- ing cases, present experts with two blinded packages—(i) SciFact rationale + abstract and (ii) DEEPFACT-EVALrationale + retrieved abstract—and ask which explanation is bet- ter supported and why. 18 Claim Label Evidence Model Model reason New Note CHEK2 is not as- sociated with breast cancer. T We genotyped the...
work page 2011
-
[11]
Agreement: the verifier matches the bench- mark label
-
[12]
Disagreement: Annotation divergence: the verifier does not match the benchmark la- bel, and our re-annotation diverges from the benchmark label as well, which indicates the disagreement cannot be cleanly resolved into a definitive model error. This includes evidence–label misalignment (SciFact), non- verifiable or non-checkworthy sentences be- ing labeled...
-
[13]
Disagreement: Likely model error: the veri- fier does not match the benchmark label, and our re-annotation aligns with the benchmark label, suggesting a likely verifier error. For ExpertQA and SciFact, the proportions for Annotation DivergenceandModel Errorare esti- mated by extrapolating the ratios observed in our annotated subsets to the total number of...
work page 1984
-
[14]
rather than the hyper-specialized researchers the questions may demand—further weakening the “ground truth.” This expert-dominance assumption will become a bottleneck as agents approach expert- level performance: the ceiling is set by annotation quality, and models may be penalized for correct outputs that conflict with noisy labels. Indeed, prior work re...
work page doi:10.3389/f 2025
-
[15]
Longitudinal and Multimodal Data Integra- tion.Longitudinal designs are underrepresented in L2 engagement research, with only 13.4% of studies in the 2021 review employing them ([1]). A 2025 study on achievement emotions ([2]) used a mixed- methods explanatory sequential design (ESD) to link qualitative interview data with quantitative regres- sion models...
work page 2021
-
[16]
Cost and Insurance Coverage.In the U.S., CAB- LA’s adoption is limited to 1.4% of PrEP users due to insurance restrictions and high out-of-pocket costs ([3]). Lenacapavir’s pricing (US$28,000 per dose) further exacerbates access disparities ([10]).2. Reg- ulatory Delays.While CAB-LA is approved in the U.S. and Brazil, lenacapavir’s rollout in Europe and t...
work page 2025
-
[17]
note RL strengths (generalizability, adaptiv- ity, near-real-time inference post-training), but meta- heuristics and hybrids remain widely recognized as state-of-the-art in research and practice. RL is often described as promising, yet there is limited systematic head-to-head evidence supporting overall superior- ity. Thus, the “most promising” claim is c...
work page 2021
-
[18]
Collection-Stage Errors (Evidence Gathering) C-AU Authenticity Fabricated Source Cites a source, author, or quote that does not exist. e.g., “OpenAI’s GPT-4V did ... ” (no such study) C-PV Provenance Mis-sourced Evidence Real fact but assigned to the wrong author, venue, or year. e.g., arXiv preprint claimed to be a 2023 Nature paper C-CP Completeness Omi...
work page 2023
-
[19]
Analysis-Stage Errors (Evidence Processing) A-N1 Numerical Fidelity Numeric Distortion Misrepresents counts, percentages, means, or CIs. e.g., 25% vs. 0.25 absolute points A-S1 Semantic Fidelity Semantic/Entity Swap Substitutes similar but non-equivalent terms (e.g., metric, dataset type, model variant). e.g., “faithfulness” reported when only F1 was meas...
work page 2018
-
[20]
Generalization-Stage Errors (Claim Expansion) G-O1 Scope Discipline Over-Scope Leap Generalizes beyond the evidence’s domain, task, or popula- tion. e.g., From WebQA to biomedical QA without evidence G-H1 Claim Proportional- ity Hyperbolic Statement Turns conditional or limited findings into absolutes. e.g., “Always improves performance” G-T1 Taxonomic Co...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.