arxiv: 2509.25868 · v3 · submitted 2025-09-30 · 💻 cs.CL

ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations

Yindong Wang, Martin Preiß, Margarita Bugueño, Jan Vincent Hoffbauer, Abdullatif Ghajar, Tolga Buz, Gerard de Melo

Pith reviewed 2026-05-18 12:49 UTC · model grok-4.3

classification 💻 cs.CL

keywords: scientific confabulation, error detection, LLM benchmark, factuality evaluation, salient distractor, positional annotations, LLM-as-Judge

The pith

When large language models mislocate a scientific error, 61% of their incorrect span predictions land on text that is semantically unrelated to the actual error, and scaling does not change this pattern.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReFACT, a benchmark of 1,001 expert-annotated question-answer pairs from Reddit's r/AskScience, each marked with exact error spans. Across the nine LLMs tested, 61% of incorrect span predictions land on text semantically unrelated to the real mistake. This distractor tendency appears at every size tested, from 1B to 70B parameters. Presenting answers side by side makes detection worse than checking a single answer, dropping GPT-4o's F1 from 0.67 to 0.53. The results question whether LLMs can reliably judge scientific accuracy on their own.

Core claim

ReFACT shows that LLMs exhibit a dominant salient distractor failure mode where 61% of incorrect span predictions are semantically unrelated to actual errors, a pattern that persists across all tested scales from 1B to 70B parameters and signals a fundamental semantic grounding deficit. Comparative judgment is harder than independent detection, with performance dropping when answers are presented side-by-side, directly challenging the reliability of LLM-as-Judge approaches for scientific factuality.

What carries the argument

ReFACT benchmark of 1,001 expert-annotated pairs with span-level positional error annotations from r/AskScience, used to measure confabulation detection and isolate the salient distractor pattern.
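To make "span-level positional error annotations" concrete, the following is a minimal sketch of what one annotated item and a positional overlap check might look like. The field names (`answer`, `error_start`, `error_end`), the helper `span_overlap`, and the example sentence are illustrative assumptions, not the released ReFACT schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedAnswer:
    """One transformed answer with a character-offset error span (hypothetical schema)."""
    answer: str
    error_start: int  # inclusive character offset
    error_end: int    # exclusive character offset

    @property
    def error_text(self) -> str:
        return self.answer[self.error_start:self.error_end]


def span_overlap(item: AnnotatedAnswer, pred_start: int, pred_end: int) -> int:
    """Number of characters shared by a predicted span and the annotated span."""
    return max(0, min(item.error_end, pred_end) - max(item.error_start, pred_start))


# Made-up sentence echoing the "your DNA" -> "your RNA" replacement in Figure 1.
text = "Nearly every cell carries a copy of your RNA."
start = text.index("your RNA")
item = AnnotatedAnswer(text, start, start + len("your RNA"))

print(item.error_text)                        # your RNA
print(span_overlap(item, start, start + 8))   # 8: the prediction hits the annotated span
print(span_overlap(item, 0, 11))              # 0: the prediction lands elsewhere
```

Character overlap only says whether a prediction touches the annotated span; whether a miss is semantically related to the error is a separate judgment that this sketch does not attempt.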

If this is right

Simply increasing model size will not fix the semantic grounding deficit in error detection.
LLM-as-Judge methods are unreliable for scientific factuality because comparative checks perform worse than single-answer checks.
Independent error detection remains more accurate than side-by-side comparison across current models.
Methods other than scaling are required to improve how models connect predictions to actual meaning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit training on positional error spans could teach models to avoid unrelated distractors more effectively.
The same grounding issue may appear in error detection for medical, legal, or technical domains beyond science.
Hybrid detection systems that add external semantic checks might compensate for the deficit observed here.

Load-bearing premise

Expert annotations correctly mark the true error spans and the evaluation reliably separates semantically unrelated distractors from actual errors.

What would settle it

A new model or method that reduces the share of semantically unrelated incorrect predictions to well below 61% while maintaining overall accuracy would falsify the claim of a scale-invariant grounding deficit.
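The quantity at stake here is a share of incorrect predictions, not of all predictions. A minimal sketch of that computation follows, assuming each span prediction has already been labelled by a human; the three label names and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter

# Hypothetical labels for span predictions; the category names are assumptions.
CORRECT = "correct"
RELATED_MISS = "related_miss"      # wrong span, but conceptually tied to the true error
UNRELATED_MISS = "unrelated_miss"  # wrong span with no connection to the true error


def unrelated_share(labels):
    """Share of incorrect predictions that were labelled as unrelated misses."""
    counts = Counter(labels)
    incorrect = counts[RELATED_MISS] + counts[UNRELATED_MISS]
    if incorrect == 0:
        return 0.0
    return counts[UNRELATED_MISS] / incorrect


# Toy data, not from the paper: 4 correct, 2 related misses, 3 unrelated misses.
toy = [CORRECT] * 4 + [RELATED_MISS] * 2 + [UNRELATED_MISS] * 3
print(unrelated_share(toy))  # 0.6
```

Because correct predictions are excluded from the denominator, a model can lower this share either by making its misses more related or by converting unrelated misses into correct hits, which is why the test above also asks for overall accuracy to be maintained.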

Figures

Figures reproduced from arXiv: 2509.25868 by Abdullatif Ghajar, Gerard de Melo, Jan Vincent Hoffbauer, Margarita Bugueño, Martin Preiß, Tolga Buz, Yindong Wang.

**Figure 1.** Overview of the ReFACT Evaluation Pipeline (Entity Replacement Example). Given a factual Reddit answer and a minimally transformed counterpart containing a subtle instance of scientific confabulation (“your DNA” → “your RNA”), the model is evaluated on: (1) Judgment – detecting confabulation, (2) Span Localization – identifying the corrupted entity span, and (3) Correction – recovering the original entity …

**Figure 2.** Fact Transformation Pipeline (Data Creation Process). Given a filtered Reddit question–answer pair, factual statements are extracted and systematically corrupted through: (1) Fact Extraction – identifying factual claims from the original answer; (2) Selection – selecting the most convincing fact for negation or a keyphrase for entity replacement; (3) Transformation – applying either negation (flipping the fact…

**Figure 3.** Domains of the Dataset

**Figure 4.** Word Count of Dataset Samples

**Figure 5.** Character Count of Dataset Samples

Original abstract

The mechanisms underlying scientific confabulation in Large Language Models (LLMs) remain poorly understood. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs with span-level error annotations derived from Reddit's r/AskScience. Evaluating 9 state-of-the-art LLMs reveals two critical limitations. First, models exhibit a dominant "salient distractor" failure mode: 61% of incorrect span predictions are semantically unrelated to actual errors. Crucially, this pattern persists across all model scales (1B to 70B), indicating a fundamental semantic grounding deficit that scaling alone fails to resolve. Second, we find that comparative judgment is paradoxically harder than independent detection, even GPT-4o's F1 score drops from 0.67 to 0.53 when comparing answers side-by-side. These findings directly challenge the reliability of LLM-as-Judge paradigms for scientific factuality. Code and data are released at https://github.com/ddz5431/ReFACT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReFACT adds a practical new benchmark with span annotations from real science Q&A, but the 61% unrelated-error claim needs tighter validation on how the experts labeled and how unrelatedness was scored.

The main point from this paper is that they built ReFACT, a set of 1,001 expert-labeled Reddit science answers with exact spans marked as errors. When they test LLMs on finding those errors, over half the wrong guesses land on text that has no semantic connection to the real problem, and this holds from small to large models. What stands out as new is the combination of sourcing from actual user questions on r/AskScience, getting expert positional annotations, and running both standalone detection and a head-to-head comparison task. Releasing the full dataset and code at the GitHub link is a plus, as it lets others build on the exact same material.

The work does a good job showing that comparative judgment, where the model sees two answers and picks the faulty one, actually lowers F1 scores compared to judging one at a time. That result challenges some assumptions in LLM-as-Judge setups for factuality checks.

On the softer side, the headline 61% figure for semantically unrelated predictions rests on the quality of the expert annotations and the precise way they separate unrelated from related errors. The abstract does not include inter-annotator agreement numbers or the detailed criteria for semantic relatedness, which leaves room for annotation variability to influence the outcome. Without those, it's harder to rule out that the pattern is partly an artifact of how the ground truth was set.

The paper is aimed at people studying LLM reliability in scientific domains or developing automated fact-checking tools. Anyone working on benchmarks or error analysis in this area will get practical value from the dataset and the reported trends. It deserves a serious referee because the contribution is empirical and the materials are public, making verification straightforward. I would recommend sending it for peer review. The core dataset and experiments provide enough substance to warrant feedback, even if the interpretation of the failure mode could use more supporting details on annotation reliability.

Referee Report

3 major / 2 minor

Summary. The paper introduces ReFACT, a benchmark of 1,001 expert-annotated question-answer pairs from Reddit's r/AskScience with span-level error annotations. It evaluates nine LLMs and reports that 61% of incorrect span predictions are semantically unrelated to actual errors, a pattern persisting across scales from 1B to 70B and indicating a fundamental semantic grounding deficit that scaling does not resolve. It also finds comparative judgment harder than independent detection, with GPT-4o's F1 dropping from 0.67 to 0.53, challenging LLM-as-Judge approaches for scientific factuality. Code and data are released.

Significance. If the expert annotations prove reliable and the 'semantically unrelated' labeling is reproducible, the benchmark offers concrete evidence of a persistent failure mode in LLM confabulation detection that is not mitigated by scale. The side-by-side comparison result and public release of the dataset strengthen its utility for future work on factuality evaluation.

major comments (3)

[Benchmark construction / Evaluation] Benchmark construction and annotation protocol: the 61% salient-distractor statistic and its invariance across 1B–70B models rest on the expert-annotated error spans serving as ground truth, yet no inter-annotator agreement, adjudication procedure, or explicit decision rules for semantic relatedness are reported. This directly affects the load-bearing claim in the abstract and evaluation sections.
[Evaluation metrics and results] Operational definition of 'semantically unrelated': the criteria used to classify a model-predicted span as unrelated to any actual error span are not specified in sufficient detail, making it impossible to assess whether the 61% figure reflects a genuine model deficit or annotation or scoring artifacts.
[Results and discussion] Statistical support for cross-scale and comparative claims: the persistence of the failure mode and the F1 drop (0.67 to 0.53) are presented without reported statistical tests, confidence intervals, or controls for multiple comparisons, weakening the interpretation that scaling alone fails to resolve the deficit.

minor comments (2)

[Experimental setup] Clarify the exact number of models evaluated and their parameter counts in a table for easy reference.
[Data collection] Provide more detail on how the Reddit posts were selected and filtered to ensure they contain verifiable scientific claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. We provide point-by-point responses to the major comments below, indicating where revisions have been made to address the concerns.

Referee: [Benchmark construction / Evaluation] Benchmark construction and annotation protocol: the 61% salient-distractor statistic and its invariance across 1B–70B models rest on the expert-annotated error spans serving as ground truth, yet no inter-annotator agreement, adjudication procedure, or explicit decision rules for semantic relatedness are reported. This directly affects the load-bearing claim in the abstract and evaluation sections.

Authors: We agree that a more detailed account of the annotation protocol is necessary to substantiate our claims. The annotations were carried out by a single expert with relevant scientific background to ensure consistency across the dataset. We have revised the manuscript to include the full annotation guidelines and explicit decision rules for classifying semantic relatedness (a predicted span is deemed unrelated if it pertains to a different scientific fact or entity with no conceptual connection to the actual error). As the annotation was performed by one expert, inter-annotator agreement and adjudication procedures do not apply; we have clarified this in the text and added it to the limitations discussion. These revisions address the concern regarding the reliability of the ground truth.

revision: partial

Referee: [Evaluation metrics and results] Operational definition of 'semantically unrelated': the criteria used to classify a model-predicted span as unrelated to any actual error span are not specified in sufficient detail, making it impossible to assess whether the 61% figure reflects a genuine model deficit or annotation or scoring artifacts.

Authors: We thank the referee for this comment. We have now provided a clear operational definition in the 'Metrics' section of the revised manuscript. Specifically, a predicted span is classified as semantically unrelated if it exhibits no overlap in key terms or concepts with the ground-truth error span, as determined by the expert annotator following the annotation rubric. We have also added concrete examples of related and unrelated predictions to illustrate the distinction. This addition allows for better evaluation of whether the 61% statistic represents a true model behavior.

revision: yes

Referee: [Results and discussion] Statistical support for cross-scale and comparative claims: the persistence of the failure mode and the F1 drop (0.67 to 0.53) are presented without reported statistical tests, confidence intervals, or controls for multiple comparisons, weakening the interpretation that scaling alone fails to resolve the deficit.

Authors: We appreciate the suggestion to include statistical analyses. In the revised manuscript, we have incorporated bootstrap-derived confidence intervals for the salient distractor rate and the F1 scores. We have also added the results of statistical tests comparing performance across model scales and between the independent and comparative judgment conditions, with appropriate corrections for multiple comparisons. These enhancements provide stronger quantitative backing for our conclusions regarding the persistence of the failure mode.

revision: yes
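For readers unfamiliar with the bootstrap intervals mentioned in the last response, a percentile bootstrap for a binary F1 score might be sketched as below. This is a generic illustration under stated assumptions, not the authors' analysis code: the function names, the resample count, the seed, and the toy labels are all hypothetical.

```python
import random


def f1_score(gold, pred):
    """Binary F1 over paired 0/1 labels (1 = confabulated)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(gold, pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1_score([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy labels, not from the paper: 200 items built by repeating a 10-item pattern.
gold = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 20
pred = [1, 0, 0, 1, 1, 1, 0, 0, 1, 0] * 20

point = f1_score(gold, pred)
lo, hi = bootstrap_ci(gold, pred)
print(f"F1 = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

The same resampling loop could be applied to the salient distractor rate by swapping `f1_score` for a function that returns that rate. Comparing the independent and comparative conditions would additionally call for a paired design, since both conditions are scored on the same items.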

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and model evaluation

The paper introduces the ReFACT benchmark of 1,001 expert-annotated Reddit QA pairs with span-level error labels and reports direct empirical results from evaluating 9 LLMs, including the 61% salient-distractor statistic computed from model span predictions versus the new annotations. No mathematical derivations, equations, fitted parameters, or self-citations appear in the provided text. The central claims rest on external model runs against freshly collected and annotated data rather than reducing to prior outputs or self-referential definitions by construction. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the quality of expert annotations as ground truth and on the assumption that the observed error patterns reflect a semantic grounding deficit rather than annotation artifacts or metric choices.

axioms (1)

domain assumption Expert annotations on Reddit r/AskScience answers accurately identify true scientific confabulations and error spans.
The benchmark treats these annotations as the reference standard for measuring model performance.

pith-pipeline@v0.9.0 · 5736 in / 1234 out tokens · 39903 ms · 2026-05-18T12:49:31.966032+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

We introduce ReFACT ... benchmark of 1,001 expert-annotated question–answer pairs ... three-tier evaluation framework: (1) binary confabulation judgment, (2) fine-grained error localization at the span level, and (3) correction generation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
