ACL-Verbatim: hallucination-free question answering for research

2026 · cs.CL · arXiv 2605.21102

1 Pith paper cites this work.

abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
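The abstract reports results as word-level F1 but does not define the metric. The following is a minimal sketch of one common way to compute such a score, assuming whitespace tokenization and character-offset spans; the function names `words_in_spans` and `word_level_f1` are hypothetical, and the paper's exact definition may differ.

```python
import re


def words_in_spans(text, spans):
    """Return indices of whitespace-delimited words overlapping any (start, end) character span."""
    selected = set()
    for i, match in enumerate(re.finditer(r"\S+", text)):
        # A word counts as selected if it overlaps at least one span.
        if any(match.start() < end and start < match.end() for start, end in spans):
            selected.add(i)
    return selected


def word_level_f1(text, gold_spans, pred_spans):
    """Word-level F1 between gold and predicted spans (assumed definition, not the paper's)."""
    gold = words_in_spans(text, gold_spans)
    pred = words_in_spans(text, pred_spans)
    if not gold and not pred:
        return 1.0  # Assumption: nothing to extract and nothing extracted counts as perfect.
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative chunk and spans, invented for this example.
chunk = "We train a token classifier on silver labels."
gold_start = chunk.index("a token")
pred_start = chunk.index("token")
gold = [(gold_start, gold_start + len("a token classifier"))]
pred = [(pred_start, pred_start + len("token classifier on silver"))]
print(round(word_level_f1(chunk, gold, pred), 3))  # 0.571
```

In this example the gold span covers three words and the prediction covers four, with two in common, so precision is 2/4, recall is 2/3, and F1 is 4/7. A score such as the reported 53.6 would correspond to this kind of value scaled to a percentage and aggregated over the benchmark, though the aggregation method is not stated in the abstract.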

representative citing papers

Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

Introduces a unified span-level hallucination detection benchmark over code, tool output, and documents; a fine-tuned Qwen3.5-2B reaches 0.689 span-F1 and outperforms the baselines, including on code-agent data.
