Recognition: 2 theorem links
VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering
Pith reviewed 2026-05-16 14:00 UTC · model grok-4.3
The pith
VerifAI verifies each atomic claim in a generated biomedical answer against retrieved evidence using a fine-tuned NLI engine to cut hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VerifAI combines a hybrid biomedical information retrieval module, a citation-aware generator fine-tuned on a custom dataset, and a Verification Component that decomposes answers into atomic claims and validates each claim with a fine-tuned NLI engine. The verification step reaches state-of-the-art accuracy on the HealthVer benchmark, outperforms GPT-4, and measurably lowers hallucinated citations relative to zero-shot baselines while exposing the exact evidence lineage for every claim.
What carries the argument
The Verification Component, which breaks generated answers into atomic claims and classifies each claim's support status against the retrieved evidence using a fine-tuned natural language inference engine.
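The decompose-and-verify loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real system uses a fine-tuned NLI model (DeBERTa, per the rebuttal), which is stubbed out here with a naive token-overlap heuristic, and real claim decomposition is more sophisticated than sentence splitting.

```python
import re

def decompose(answer: str) -> list[str]:
    """Split a generated answer into atomic claims (here: one per sentence)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def nli_stub(claim: str, evidence: str) -> str:
    """Stand-in for the fine-tuned NLI engine. Returns SUPPORT or
    NO_EVIDENCE based on naive token overlap; a real engine would also
    emit CONTRADICT."""
    c, e = set(claim.lower().split()), set(evidence.lower().split())
    overlap = len(c & e) / max(len(c), 1)
    return "SUPPORT" if overlap > 0.5 else "NO_EVIDENCE"

def verify(answer: str, evidence: str, nli=nli_stub):
    """Map each atomic claim to its support status, yielding the
    auditable claim-to-evidence lineage the paper describes."""
    return [(claim, nli(claim, evidence)) for claim in decompose(answer)]
```

Swapping `nli_stub` for a trained NLI classifier is the only change needed to turn this sketch into a post-hoc verification filter; the modular boundary is exactly the one the paper exploits.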
If this is right
- Answers carry an explicit, auditable mapping from each claim to the supporting evidence passages.
- Hallucinated citations drop substantially compared with standard retrieval-augmented generation baselines.
- The full pipeline can be deployed in high-stakes biomedical settings where factual traceability is required.
- The modular design allows independent updates to the retrieval, generation, or verification stages.
Where Pith is reading between the lines
- The same claim-decomposition-plus-NLI pattern could be applied to other domains that need traceable answers, such as legal or policy documents.
- Open release of the models and datasets creates a public baseline that other groups can directly compare against or improve.
- If the NLI step proves robust across new biomedical subfields, it could become a standard post-processing filter for any large language model used in evidence-based domains.
Load-bearing premise
The fine-tuned NLI engine correctly decides whether every atomic claim is supported by the retrieved evidence without missing subtle contradictions or adding new errors of its own.
What would settle it
A held-out HealthVer-style test set on which the verification component wrongly accepts unsupported claims or wrongly rejects supported ones often enough to score below GPT-4 on the same benchmark.
read the original abstract
We introduce VerifAI, an open-source expert system for biomedical question answering that integrates retrieval-augmented generation (RAG) with a novel post-hoc claim verification mechanism. Unlike standard RAG systems, VerifAI ensures factual consistency by decomposing generated answers into atomic claims and validating them against retrieved evidence using a fine-tuned natural language inference (NLI) engine. The system comprises three modular components: (1) a hybrid Information Retrieval (IR) module optimized for biomedical queries (MAP@10 of 42.7%), (2) a citation-aware Generative Component fine-tuned on a custom dataset to produce referenced answers, and (3) a Verification Component that detects hallucinations with state-of-the-art accuracy, outperforming GPT-4 on the HealthVer benchmark. Evaluations demonstrate that VerifAI significantly reduces hallucinated citations compared to zero-shot baselines and provides a transparent, verifiable lineage for every claim. The full pipeline, including code, models, and datasets, is open-sourced to facilitate reliable AI deployment in high-stakes domains.
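The abstract's hybrid IR module fuses lexical and semantic rankings. The paper's exact fusion scheme is not given here, so the sketch below uses reciprocal rank fusion (RRF), a common assumption-level stand-in for combining a BM25 ranking with a dense-embedding ranking.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked doc-id lists into one via reciprocal rank
    fusion; k dampens the influence of top ranks (60 is the customary
    default)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d1", "d2", "d3"]   # e.g. a BM25 ranking
semantic = ["d3", "d1", "d4"]  # e.g. a dense-embedding ranking
fused = rrf_fuse([lexical, semantic])
```

Documents ranked highly by both retrievers (here `d1` and `d3`) surface to the top of the fused list without any score calibration between the two systems, which is RRF's main appeal.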
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VerifAI, an open-source biomedical QA system integrating hybrid retrieval-augmented generation with a post-hoc verification step. Generated answers are decomposed into atomic claims that are validated against retrieved evidence via a fine-tuned NLI model. The system reports MAP@10 of 42.7% on its IR module, state-of-the-art hallucination detection on HealthVer (outperforming GPT-4), reduced hallucinated citations versus zero-shot baselines, and a transparent claim lineage. The full pipeline, code, models, and datasets are released for reproducibility.
Significance. If the reported results hold under rigorous evaluation, VerifAI provides a modular, verifiable pipeline that addresses hallucination risks in high-stakes biomedical QA. The explicit release of all artifacts (code, fine-tuned NLI model, custom datasets) enables direct reproduction of the MAP@10, citation accuracy, and HealthVer numbers, which is a notable strength for an empirical systems paper in this domain.
major comments (2)
- [Abstract] The concrete performance claims (MAP@10 of 42.7% for the hybrid IR module and SOTA hallucination detection outperforming GPT-4 on HealthVer) are presented without any reference to evaluation protocols, data splits, query sets, or statistical significance testing. This information is load-bearing for assessing whether the central claims about the verification component and overall system superiority are robust.
- [Verification Component] The claim that the fine-tuned NLI engine detects hallucinations with state-of-the-art accuracy rests on the assumption that it reliably classifies atomic-claim support without introducing new errors or missing subtle biomedical contradictions; no error analysis, ablation on edge cases, or direct comparison protocol versus GPT-4 is referenced to substantiate this.
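For reference, the MAP@10 metric questioned in the first comment is computed along these lines (the query and relevance data below are illustrative, not the paper's):

```python
def average_precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """AP@k: mean of precision at each rank where a relevant document
    appears, normalized by min(|relevant|, k)."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0

def map_at_k(runs: dict, qrels: dict, k: int = 10) -> float:
    """MAP@k: mean of AP@k over all queries (run = ranking, qrel =
    relevant-document set for that query)."""
    return sum(average_precision_at_k(runs[q], qrels[q], k) for q in runs) / len(runs)
```

This is why the referee's point matters: the reported 42.7% is only interpretable once the query set behind `runs` and the relevance judgments behind `qrels` are specified.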
minor comments (2)
- A system diagram or flowchart would clarify the data flow among the three modular components (IR, generative, verification) and the exact interfaces for claim decomposition and evidence alignment.
- The zero-shot baselines used for the hallucinated-citation comparison should be specified with exact prompts and model versions in the experimental section to allow precise replication.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and have incorporated changes to improve clarity and substantiation of the reported results.
read point-by-point responses
Referee: [Abstract] The concrete performance claims (MAP@10 of 42.7% for the hybrid IR module and SOTA hallucination detection outperforming GPT-4 on HealthVer) are presented without any reference to evaluation protocols, data splits, query sets, or statistical significance testing. This information is load-bearing for assessing whether the central claims about the verification component and overall system superiority are robust.
Authors: We agree that the abstract should provide more context for the key metrics. The evaluation protocols, data splits (e.g., 80/10/10 on the custom biomedical QA dataset), query sets (derived from PubMed and HealthVer), and statistical testing (paired t-tests with p<0.05) are fully described in Sections 4.1 and 4.2. In the revised manuscript we will expand the abstract by one sentence to reference these details and the significance testing while preserving conciseness. revision: yes
Referee: [Verification Component] The claim that the fine-tuned NLI engine detects hallucinations with state-of-the-art accuracy rests on the assumption that it reliably classifies atomic-claim support without introducing new errors or missing subtle biomedical contradictions; no error analysis, ablation on edge cases, or direct comparison protocol versus GPT-4 is referenced to substantiate this.
Authors: The HealthVer results (F1 0.87 vs. GPT-4's 0.81) are obtained via the exact protocol in Section 4.3: zero-shot GPT-4 prompts with the same claim-evidence pairs and majority-vote aggregation. We acknowledge the absence of a dedicated error-analysis subsection. The revised version will add a new subsection 4.3.1 containing (i) a quantitative breakdown of false-positive and false-negative cases on biomedical contradictions, (ii) an ablation on edge cases such as negation scope and numerical mismatches, and (iii) the full GPT-4 prompt template and temperature settings used for the direct comparison. revision: yes
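The F1 comparison cited in the response would be reproduced with a computation along these lines (the label sequences below are illustrative toy data, and whether the paper reports macro or micro F1 is an assumption here):

```python
def f1_per_label(gold: list[str], pred: list[str], label: str) -> float:
    """Precision/recall/F1 for one claim-verification label."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over the gold label set."""
    labels = sorted(set(gold))
    return sum(f1_per_label(gold, pred, lab) for lab in labels) / len(labels)
```

Publishing the evaluation at this level of detail (label set, averaging mode, per-label scores) is exactly what the promised subsection 4.3.1 would need to make the 0.87 vs. 0.81 comparison replicable.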
Circularity Check
No significant circularity
full rationale
The paper presents an engineering description of a modular RAG system with post-hoc verification via a fine-tuned NLI model. All performance claims (MAP@10, HealthVer accuracy, hallucination reduction) are empirical measurements against external benchmarks and released artifacts rather than quantities derived from equations or parameters fitted to the same data. No derivations, self-definitional steps, or load-bearing self-citations appear; the architecture is self-contained and reproducible via the open-sourced code, models, and datasets.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A fine-tuned natural language inference model can accurately determine whether an atomic claim is entailed by retrieved evidence.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
The relation between the paper passage and the cited Recognition theorem is unclear. Passage: "The system comprises three modular components: (1) a hybrid Information Retrieval (IR) module... (2) a citation-aware Generative Component... (3) a Verification Component that detects hallucinations... using a fine-tuned natural language inference (NLI) engine."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
The relation between the paper passage and the cited Recognition theorem is unclear. Passage: "We fine-tuned several transformer models for this task... DeBERTa Large... on the transformed SciFact dataset"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.