Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Aditya Pillai; Aleksandr Rubashevskii; Arnav Arora; Iryna Gurevych; Isabelle Augenstein; Jiahui Geng; Liangming Pan; Nadav Borenstein; Osama Mohammed Afzal; Preslav Nakov

arxiv: 2311.09000 · v3 · pith:K6BJG4FOnew · submitted 2023-11-15 · 💻 cs.CL

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, Arnav Arora, Aleksandr Rubashevskii, Jiahui Geng, Osama Mohammed Afzal, Liangming Pan, Nadav Borenstein, Aditya Pillai, Isabelle Augenstein, Iryna Gurevych, Preslav Nakov

classification 💻 cs.CL

keywords annotation · benchmark · automatic · evaluation · factual · factuality · outputs · solution

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark in three-level granularity: claim, sentence and document, aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai are struggling to identify false claims, with the best F1=0.63 by this annotation solution based on GPT-4. Annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs
cs.CL 2026-06 unverdicted novelty 4.0

Fine-tuned compact models achieve strong multilingual performance and large efficiency gains over LLMs on production data from 114 languages for claim detection and 28 for veracity prediction.