FEVER: a large-scale dataset for Fact Extraction and VERification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Arpit Mittal · 2018 · cs.CL · DOI 10.18653/v1/n18-1074 · arXiv 1803.05355

Canonical reference. 71% of citing Pith papers cite this work as background.

55 Pith papers citing it

514 external citations · Pith

abstract

In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss $\kappa$. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.
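
The gap between the two accuracies in the abstract (31.87% when the correct evidence is required, 50.91% when evidence is ignored) can be illustrated with a minimal sketch. It assumes that, in the evidence-aware setting, a prediction counts as correct only when the label matches and the selected sentences cover at least one complete annotated evidence set, and that NotEnoughInfo claims need no evidence. The function names, data layout, and toy examples are hypothetical and are not taken from the paper's code.

```python
from typing import List, Set, Tuple

# A sentence is identified by (page title, sentence index); this layout is an assumption.
Sentence = Tuple[str, int]
# (gold label, gold evidence sets, predicted label, predicted evidence)
Example = Tuple[str, List[Set[Sentence]], str, Set[Sentence]]


def label_only_correct(gold_label: str, pred_label: str) -> bool:
    """Correct if the predicted label matches, ignoring evidence."""
    return gold_label == pred_label


def strict_correct(
    gold_label: str,
    gold_sets: List[Set[Sentence]],
    pred_label: str,
    pred_evidence: Set[Sentence],
) -> bool:
    """Correct only if the label matches and, for Supported/Refuted claims,
    the predicted evidence contains at least one full gold evidence set."""
    if gold_label != pred_label:
        return False
    if gold_label == "NotEnoughInfo":
        return True  # assumed: no evidence is required for this class
    return any(gold_set <= pred_evidence for gold_set in gold_sets)


def accuracies(examples: List[Example]) -> Tuple[float, float]:
    """Return (label-only accuracy, evidence-aware accuracy)."""
    n = len(examples)
    loose = sum(label_only_correct(g, p) for g, _, p, _ in examples)
    strict = sum(strict_correct(g, gs, p, pe) for g, gs, p, pe in examples)
    return loose / n, strict / n


# Toy data, invented for illustration only.
examples: List[Example] = [
    # Right label, and evidence covers the gold set.
    ("Supported", [{("A", 0)}], "Supported", {("A", 0), ("A", 3)}),
    # Right label, but only half of a two-sentence gold set was retrieved.
    ("Refuted", [{("B", 1), ("B", 2)}], "Refuted", {("B", 1)}),
    # NotEnoughInfo needs no evidence.
    ("NotEnoughInfo", [], "NotEnoughInfo", set()),
    # Wrong label, even though the evidence was found.
    ("Supported", [{("C", 0)}], "Refuted", {("C", 0)}),
]

label_acc, strict_acc = accuracies(examples)
print(label_acc, strict_acc)
```

On this toy data the second claim is counted as correct when evidence is ignored but not when it is required, which is the kind of case that separates the two reported figures.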

citation-role summary

background 6 · dataset 1

citation-polarity summary

background 5 · unclear 1 · use dataset 1

representative citing papers

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

An Empirical Analysis of Factual Errors in Human-Written Text and its Application

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

An empirical study distills a taxonomy of human factual errors from newspaper corrections and shows LLMs achieve only 52% F1 on detection.

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

cs.MA · 2026-06-25 · unverdicted · novelty 7.0

Models delayed verification in multi-agent LLMs as graph consensus, derives stability thresholds (inverse golden ratio for delay two) via grounded Laplacian, and gives a supermodular greedy rule for corrector placement; experiments on five models confirm dose-delay oscillations.

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

AuthorityBench shows citation presence (real or fabricated) increases LLM hallucination rates vs no-citation baseline, strongest for fabricated citations on true claims, with domain variation but negligible venue or author effects.

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

Vector Linking via Cross-Model Local Isometric Consistency

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

A reference-based geometric hashing method recovers cross-model vector correspondences by exploiting local isometric consistency in contrastive embeddings and iteratively bootstrapping from a seed of paired anchors.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence

cs.CR · 2026-05-03 · unverdicted · novelty 7.0

RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.

When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

cs.CR · 2026-05-01 · unverdicted · novelty 7.0

Embedding-based defenses fail against crafted attacks in LLM MAS; confidence scores from logits improve robustness but decay over communication rounds.

HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads

cs.IR · 2026-04-19 · unverdicted · novelty 7.0

HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.

Spectral Tempering for Embedding Compression in Dense Passage Retrieval

cs.IR · 2026-03-19 · unverdicted · novelty 7.0

Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.

TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

cs.CL · 2025-11-02 · unverdicted · novelty 7.0

TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

cs.CL · 2025-04-27 · conditional · novelty 7.0

BrowseComp-ZH is a new benchmark of 289 Chinese web questions where even the strongest LLM agents reach only 42.9% accuracy.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

cs.CL · 2024-01-27 · accept · novelty 7.0

MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

cs.CL · 2021-10-04 · unverdicted · novelty 7.0

Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

cs.CL · 2020-05-22 · accept · novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

Introduces claim-conditioned re-scoring (SIFT) and warranted supports proportion (WSP) metric, reporting accuracy recovery up to 27.6 points and WSP calibration at AUC 0.92 on FEVER, SciFact and other benchmarks.

Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents

cs.AI · 2026-06-19 · unverdicted · novelty 6.0

Misleading tool feedback produces value inversion in LLM agents, with performance dropping below matched no-feedback baselines on HotpotQA and similar tasks.

RSRank: Learning Relevance from Representational Shifts

cs.IR · 2026-06-16 · unverdicted · novelty 6.0

RSRank learns calibrated relevance scores from alignment between representational shifts induced by candidate documents and those from oracle document sets, enabling zero-threshold filtering.

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

CodeCytos is a code-augmented reasoning agent framework for dynamic, programmable exploration of custom spatial cellular features in molecular imaging data across four tissue types.

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

BOUTEF is a publicly available multilingual corpus for fake news research in Algeria and Tunisia, with narratives, comments, and debunkings across multiple languages and dialects, accompanied by thematic and engagement analyses.

citing papers explorer

Showing 16 of 16 citing papers after filters.

An Empirical Analysis of Factual Errors in Human-Written Text and its Application

cs.CL · 2026-06-26 · unverdicted · none · ref 25 · internal anchor

An empirical study distills a taxonomy of human factual errors from newspaper corrections and shows LLMs achieve only 52% F1 on detection.

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

cs.CL · 2026-06-01 · unverdicted · none · ref 13 · internal anchor

Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

cs.CL · 2026-06-01 · unverdicted · none · ref 46 · internal anchor

EvoPool evolves pools of programmatic annotators that outperform LLM annotation by 0.141 average macro-F1 on 7 of 8 specialized tasks while running thousands of times faster.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · none · ref 267 · internal anchor

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

cs.CL · 2026-06-23 · unverdicted · none · ref 126 · internal anchor

Introduces claim-conditioned re-scoring (SIFT) and warranted supports proportion (WSP) metric, reporting accuracy recovery up to 27.6 points and WSP calibration at AUC 0.92 on FEVER, SciFact and other benchmarks.

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

cs.CL · 2026-05-29 · unverdicted · none · ref 51 · internal anchor

BOUTEF is a publicly available multilingual corpus for fake news research in Algeria and Tunisia, with narratives, comments, and debunkings across multiple languages and dialects, accompanied by thematic and engagement analyses.

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

cs.CL · 2026-05-21 · accept · none · ref 38 · internal anchor

Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.

From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence

cs.CL · 2026-05-07 · unverdicted · none · ref 34 · internal anchor

PrimeFacts extracts decontextualized premises from fact-check articles, raising evidence retrieval MRR by up to 30% and verdict prediction Macro-F1 by 10-20 points over baselines.

CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

cs.CL · 2026-05-06 · unverdicted · none · ref 14 · internal anchor

CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 improvements.

Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

cs.CL · 2026-04-09 · unverdicted · none · ref 4 · internal anchor

GuarantRAG improves RAG accuracy up to 12.1% and cuts hallucinations 16.3% by decoupling parametric reasoning from evidence integration via contrastive DPO and joint decoding.

SEEK: Semantic Evidence Extraction via Adaptive ChunKing for Multilingual Fact-Checking

cs.CL · 2026-05-26 · unverdicted · none · ref 23 · internal anchor

SEEK uses adaptive semantic chunking to create complete evidence units and fine-tunes multilingual LLMs with LoRA, achieving up to 20% better macro-F1 on fact-checking datasets compared to baselines.

Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

cs.CL · 2026-04-17 · unverdicted · none · ref 49 · internal anchor

SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations

cs.CL · 2026-04-06 · unverdicted · none · ref 29 · internal anchor

LLM hallucinations arise from task-dependent basins in latent space, with separability varying by task and geometry-aware steering reducing their probability.

Hybrid Adversarial Defence for Natural Language Understanding Tasks

cs.CL · 2026-06-03 · unverdicted · none · ref 41 · internal anchor

Hybrid entropy-uncertainty-geometric defence improves clean accuracy by up to 43% and adversarial robustness by up to 65% on NLU and security benchmarks.

Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

cs.CL · 2026-06-02 · unverdicted · none · ref 35 · internal anchor

Fine-tuned RoBERTa achieves 0.62 macro-F1 on 900 Reddit comments, outperforming best zero-shot LLM at 0.50, with largest gap on detecting belief propagation.

Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

cs.CL · 2026-05-31 · unverdicted · none · ref 36 · internal anchor

InSemRAG combines dynamic intent-aware hybrid retrieval and semantics-preserving chunk repair in an iterative loop, yielding 2.65 F1 gain on HotPotQA and 1.5 accuracy gain on FEVER with 4.32x lower latency than Multi-Hop RAG via SLMs.