Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

Cheng Huang; Jingxi Qiu; Zeyu Han

arxiv: 2605.26663 · v1 · pith:ML2ZV5LDnew · submitted 2026-05-26 · 💻 cs.CL · cs.IR· cs.SE

Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

Jingxi Qiu , Zeyu Han , Cheng Huang This is my paper

Pith reviewed 2026-06-29 18:17 UTC · model grok-4.3

classification 💻 cs.CL cs.IRcs.SE

keywords fact verificationNEI labelinsufficient evidenceconstruction artifactsshortcut learningmodel evaluationSciFacttransfer

0 comments

The pith

NEI competence does not transfer reliably across different evidence construction methods in fact verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fact verification benchmarks create the Not Enough Information label through multiple construction families that produce observationally similar but semantically distinct evidence conditions. The paper introduces NEI-CAP to tag each example with its construction family, audit shortcut cues, validate hard cases by human review, and measure whether detection skill transfers across families. Models trained on shortcut-prone constructions fail to recognize insufficient evidence in related cases built differently, while training on mixed constructions narrows the gap without closing it. The construction choice also changes model in the Support or Refute label, so overall NEI scores can conceal which specific problem has been solved.

Core claim

NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.

What carries the argument

NEI-CAP, a construction-aware diagnostic protocol that tags each NEI example with its construction family, audits shortcut cues, validates hard cases through human adjudication, and tests cross-family transfer.

If this is right

Models trained on shortcut-prone NEI constructions fail to recognize insufficient evidence in semantically related cases from other construction families.
Mixed-construction training narrows but does not close the performance gap across families.
The evidence condition used for NEI examples also shifts model confidence on the reference Support and Refute labels.
Aggregate NEI scores can mask whether a model has learned general insufficiency detection or construction-specific patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Different construction families may be probing separate underlying competencies rather than a single insufficiency-detection skill.
Evaluation protocols should include routine cross-construction tests to prevent overestimation of robustness in fact verification systems.
The same construction-tracking approach could expose similar artifacts in other NLP tasks that rely on constructed neutral or negative labels.
Human validation of hard cases assumes adjudicators do not introduce systematic biases beyond those already present in the benchmark data.

Load-bearing premise

The different NEI construction families produce meaningfully distinct evidence conditions that test distinct model competencies, and human adjudication reliably validates hard cases without introducing new biases.

What would settle it

A model trained exclusively on one NEI construction family that then achieves cross-family NEI accuracy comparable to mixed-training baselines on held-out families would falsify the non-transfer result.

Figures

Figures reproduced from arXiv: 2605.26663 by Cheng Huang, Jingxi Qiu, Zeyu Han.

**Figure 2.** Figure 2: shows the SciFact train/test construction matrix. A placeholder-trained verifier obtains perfect matched-placeholder NEI-F1, but falls to 0.000 on BM25 near-miss and cited nonrationale NEI. Position-biased training shows the same hard-construction collapse, while randomirrelevant training is less extreme but still shortcutprone: it is nearly solved under matched evaluation yet transfers poorly to BM2… view at source ↗

read the original abstract

Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar. The Not Enough Information (NEI) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide. We introduce NEI-CAP, a construction-aware diagnostic protocol for insufficient-evidence evaluation. Each NEI example carries the construction family that produced it; NEI-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions. We instantiate the protocol in SciFact-style scientific verification, with FEVER and HoVer as bounded external controls. Across these settings, NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags that NEI construction methods create non-transferable artifacts in fact verification, but the abstract leaves the size of the gaps and the strength of the human validation unclear.

read the letter

Dear Colleague,

The main point here is that NEI examples in fact verification are not interchangeable. Different construction methods produce different shortcuts, and models trained on one do not reliably handle others.

The authors introduce NEI-CAP, which tags each NEI instance with its construction family, audits for shortcuts, applies human adjudication to hard cases, and tests cross-family transfer. They apply this to SciFact-style scientific verification, treating FEVER and HoVer as external controls. The reported finding is that competence fails to transfer, mixed-construction training narrows the gap without closing it, and the evidence condition also shifts confidence on support and refute labels.

What the paper does well is make explicit that an aggregate NEI score can hide which specific problem a model has solved. The protocol itself is a concrete, reusable diagnostic that could be applied to other benchmarks.

The soft spots are in the execution details. The abstract describes the experiments and human adjudication but supplies no quantitative results, effect sizes, or exclusion criteria, so the magnitude of the transfer failures and the reliability of the adjudication remain hard to assess from the given text. The claim that the construction families test meaningfully distinct competencies also needs the full results and definitions to be convincing.

This is for people who build or evaluate fact verification systems, especially in scientific or high-stakes settings. A reader interested in dataset artifacts and evaluation practices would find the framing useful.

It deserves peer review. The diagnostic idea addresses a real issue in current benchmarks, and referees can push for the missing numbers and controls.

Referee Report

2 major / 2 minor

Summary. The paper claims that NEI labels in fact verification are constructed via distinct families that introduce observable artifacts, proposes the NEI-CAP diagnostic protocol (with construction tagging, shortcut auditing, human adjudication of hard cases, and cross-construction transfer tests), and reports that models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence even under mixed training; fixed-claim diagnostics further show that evidence condition affects Support/Refute confidence, so aggregate NEI scores can mask the actual problem solved. Experiments are instantiated on SciFact-style scientific verification with FEVER and HoVer as controls.

Significance. If the empirical transfer gaps and human-adjudication results hold, the work is significant for benchmark design in fact verification and NLI-style tasks: it supplies a reusable protocol for isolating construction artifacts, demonstrates that mixed training is insufficient to close gaps, and shows how evidence conditions can shift non-NEI predictions. The explicit use of external controls and human validation of hard cases strengthens the diagnostic framing and could improve robustness evaluation practices.

major comments (2)

[§4] §4 (Experiments): the central transfer-gap claim requires explicit reporting of per-construction example counts, class balance, and statistical significance tests (e.g., McNemar or bootstrap) on the observed gaps; without these, it is unclear whether the reported failure of mixed training to close the gap is driven by construction differences or by unequal data volumes across families.
[§3.2] §3.2 (Human adjudication): the protocol description must specify exclusion criteria, inter-annotator agreement (e.g., Fleiss' κ), and how adjudicators were instructed to distinguish 'semantically related insufficient evidence' from other NEI subtypes; these details are load-bearing for validating that the hard cases truly test distinct competencies.

minor comments (2)

[§3] Notation for construction families is introduced in §3 but used inconsistently in later tables; a single glossary or legend would improve readability.
[Abstract / §4] The abstract states results across 'these settings' but the main text should include a summary table of all quantitative transfer metrics (accuracy deltas, NEI F1) with confidence intervals to allow direct comparison with the qualitative claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [§4] §4 (Experiments): the central transfer-gap claim requires explicit reporting of per-construction example counts, class balance, and statistical significance tests (e.g., McNemar or bootstrap) on the observed gaps; without these, it is unclear whether the reported failure of mixed training to close the gap is driven by construction differences or by unequal data volumes across families.

Authors: We agree that explicit reporting of per-construction example counts, class balance, and statistical significance is necessary to support the transfer-gap claims. In the revised manuscript, we will add a dedicated subsection or table in §4 reporting the exact number of examples per construction family, their label distributions, and the results of bootstrap or McNemar tests on the performance differences. This will confirm that the gaps are attributable to construction artifacts rather than data volume imbalances. revision: yes
Referee: [§3.2] §3.2 (Human adjudication): the protocol description must specify exclusion criteria, inter-annotator agreement (e.g., Fleiss' κ), and how adjudicators were instructed to distinguish 'semantically related insufficient evidence' from other NEI subtypes; these details are load-bearing for validating that the hard cases truly test distinct competencies.

Authors: We agree that these details are important for validating the human adjudication protocol. In the revision, we will expand §3.2 to explicitly state the exclusion criteria used, report inter-annotator agreement via Fleiss' κ, and include the full adjudication guidelines provided to annotators regarding the distinction between 'semantically related insufficient evidence' and other NEI subtypes. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces an empirical diagnostic protocol (NEI-CAP) for evaluating NEI construction artifacts in fact verification. It relies on dataset comparisons, human adjudication, and transfer tests across construction families rather than any mathematical derivations, fitted parameters, or self-referential definitions. No load-bearing steps reduce to inputs by construction, and external controls (FEVER, HoVer) plus human validation provide independent grounding. This is a standard non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that construction families are distinct and that human validation provides an unbiased ground truth for hard cases; no free parameters or invented entities are introduced.

axioms (1)

domain assumption Human adjudication provides reliable validation for hard NEI cases
Invoked in the description of the NEI-CAP protocol.

pith-pipeline@v0.9.1-grok · 5719 in / 1113 out tokens · 36633 ms · 2026-06-29T18:17:04.809989+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R

Missing counter-evidence renders NLP fact- checking unrealistic for misinformation.arXiv preprint arXiv:2210.13865. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural lan- guage inference data. InProceedings of the 2018 Conference of the North American Chapter of the A...

work page arXiv 2018
[2]

DeBERTaV3: Improving DeBERTa us- ing ELECTRA-style pre-training with gradient- disentangled embedding sharing.arXiv preprint arXiv:2111.09543. Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kuklian- sky, Gaurav Singh Toma...

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal

The FACTS grounding leaderboard: Bench- marking LLMs’ ability to ground responses to long- form input.arXiv preprint arXiv:2501.03200. Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. 2020. HoVer: A dataset for many-hop fact extraction and 8 claim verification. InFindings of the Association for Computational Lingu...

work page arXiv 2020
[4]

Adversarial attacks against Fact Extraction and VERification

Association for Computational Linguistics. Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and be- yond.F oundations and Trends in Information Re- trieval, 3:333–389. Tal Schuster, Adam Fisch, and Regina Barzilay. 2021. Get your vitamin C! robust fact verification with contrastive evidence. InProceedings of the 2021 ...

work page internal anchor Pith review Pith/arXiv arXiv 2009
[5]

Step-by-step fact verification system for med- ical claims with explainable reasoning. InProceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (V olume 2: Short Papers), pages 805–816, Albu- querque, New Mexico. Association for Computa- tional Linguistics...

2025
[6]

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi

HealthFC: Verifying health claims with evidence-based medical fact-checking.arXiv preprint arXiv:2309.08503. David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. InProceedings of the 2020 Con- ference on Empirical Methods in Natural Language Proc...

work page arXiv 2020

[1] [1]

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R

Missing counter-evidence renders NLP fact- checking unrealistic for misinformation.arXiv preprint arXiv:2210.13865. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural lan- guage inference data. InProceedings of the 2018 Conference of the North American Chapter of the A...

work page arXiv 2018

[2] [2]

DeBERTaV3: Improving DeBERTa us- ing ELECTRA-style pre-training with gradient- disentangled embedding sharing.arXiv preprint arXiv:2111.09543. Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kuklian- sky, Gaurav Singh Toma...

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal

The FACTS grounding leaderboard: Bench- marking LLMs’ ability to ground responses to long- form input.arXiv preprint arXiv:2501.03200. Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. 2020. HoVer: A dataset for many-hop fact extraction and 8 claim verification. InFindings of the Association for Computational Lingu...

work page arXiv 2020

[4] [4]

Adversarial attacks against Fact Extraction and VERification

Association for Computational Linguistics. Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and be- yond.F oundations and Trends in Information Re- trieval, 3:333–389. Tal Schuster, Adam Fisch, and Regina Barzilay. 2021. Get your vitamin C! robust fact verification with contrastive evidence. InProceedings of the 2021 ...

work page internal anchor Pith review Pith/arXiv arXiv 2009

[5] [5]

Step-by-step fact verification system for med- ical claims with explainable reasoning. InProceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computa- tional Linguistics: Human Language Technologies (V olume 2: Short Papers), pages 805–816, Albu- querque, New Mexico. Association for Computa- tional Linguistics...

2025

[6] [6]

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi

HealthFC: Verifying health claims with evidence-based medical fact-checking.arXiv preprint arXiv:2309.08503. David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or fiction: Verifying scientific claims. InProceedings of the 2020 Con- ference on Empirical Methods in Natural Language Proc...

work page arXiv 2020