Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers
The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present a holistic end-to-end solution for annotating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels concerning the verifiability and factual inconsistencies found in LLM outputs. We further construct an open-domain document-level factuality benchmark at three levels of granularity (claim, sentence and document), aiming to facilitate the evaluation of automatic fact-checking systems. Preliminary experiments show that FacTool, FactScore and Perplexity.ai struggle to identify false claims; the best F1 of 0.63 is achieved by this annotation solution based on GPT-4. The annotation tool, benchmark and code are available at https://github.com/yuxiaw/Factcheck-GPT.
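To make the reported metric concrete, here is a minimal sketch of how an F1 score for the "false claim" class can be computed from gold and predicted claim labels. The function name `f1_for_label` and the toy labels are assumptions for illustration only; they are not taken from the benchmark's code or data, and the paper's exact evaluation protocol may differ.

```python
def f1_for_label(gold, pred, label):
    """F1 score treating `label` as the positive class (hypothetical helper)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correctly flagged
    fp = sum(1 for g, p in pairs if g != label and p == label)  # wrongly flagged
    fn = sum(1 for g, p in pairs if g == label and p != label)  # missed
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): True = claim is factually correct, False = claim is false.
gold = [False, False, True, True, True, False]
pred = [False, True, True, True, False, False]

# Score the checker on its ability to identify false claims.
score = f1_for_label(gold, pred, label=False)
print(f"F1 (false claims): {score:.2f}")
```

Scoring the false class separately matters because a checker that labels nearly everything as true can still look accurate overall while missing most false claims.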
Forward citations
Cited by 1 Pith paper
Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs
Fine-tuned compact models achieve strong multilingual performance and large efficiency gains over LLMs on production data from 114 languages for claim detection and 28 for veracity prediction.