Recognition: 2 theorem links
VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering
Pith reviewed 2026-05-16 14:00 UTC · model grok-4.3
The pith
VerifAI verifies each atomic claim in a generated biomedical answer against retrieved evidence using a fine-tuned NLI engine to cut hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VerifAI combines a hybrid biomedical information retrieval module, a citation-aware generator fine-tuned on a custom dataset, and a Verification Component that decomposes answers into atomic claims and validates each claim with a fine-tuned NLI engine. The verification step reaches state-of-the-art accuracy on the HealthVer benchmark, outperforms GPT-4, and measurably lowers hallucinated citations relative to zero-shot baselines while exposing the exact evidence lineage for every claim.
What carries the argument
The Verification Component, which breaks generated answers into atomic claims and classifies each claim's support status against the retrieved evidence using a fine-tuned natural language inference engine.
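The decompose-and-verify loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real system uses a fine-tuned NLI model (DeBERTa, per the rebuttal), which is stubbed out here with a naive token-overlap heuristic, and real claim decomposition is more sophisticated than sentence splitting.

```python
import re

def decompose(answer: str) -> list[str]:
    """Split a generated answer into atomic claims (here: one per sentence)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def nli_stub(claim: str, evidence: str) -> str:
    """Stand-in for the fine-tuned NLI engine. Returns SUPPORT or
    NO_EVIDENCE based on naive token overlap; a real engine would also
    emit CONTRADICT."""
    c, e = set(claim.lower().split()), set(evidence.lower().split())
    overlap = len(c & e) / max(len(c), 1)
    return "SUPPORT" if overlap > 0.5 else "NO_EVIDENCE"

def verify(answer: str, evidence: str, nli=nli_stub):
    """Map each atomic claim to its support status, yielding the
    auditable claim-to-evidence lineage the paper describes."""
    return [(claim, nli(claim, evidence)) for claim in decompose(answer)]
```

Swapping `nli_stub` for a trained NLI classifier is the only change needed to turn this sketch into a post-hoc verification filter; the modular boundary is exactly the one the paper exploits.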
If this is right
- Answers carry an explicit, auditable mapping from each claim to the supporting evidence passages.
- Hallucinated citations drop substantially compared with standard retrieval-augmented generation baselines.
- The full pipeline can be deployed in high-stakes biomedical settings where factual traceability is required.
- The modular design allows independent updates to the retrieval, generation, or verification stages.
Where Pith is reading between the lines
- The same claim-decomposition-plus-NLI pattern could be applied to other domains that need traceable answers, such as legal or policy documents.
- Open release of the models and datasets creates a public baseline that other groups can directly compare against or improve.
- If the NLI step proves robust across new biomedical subfields, it could become a standard post-processing filter for any large language model used in evidence-based domains.
Load-bearing premise
The fine-tuned NLI engine correctly decides whether every atomic claim is supported by the retrieved evidence without missing subtle contradictions or adding new errors of its own.
What would settle it
A held-out HealthVer-style test set on which the verification component wrongly accepts unsupported claims or wrongly rejects supported ones often enough to score below GPT-4 on the same benchmark.
read the original abstract
We introduce VerifAI, an open-source expert system for biomedical question answering that integrates retrieval-augmented generation (RAG) with a novel post-hoc claim verification mechanism. Unlike standard RAG systems, VerifAI ensures factual consistency by decomposing generated answers into atomic claims and validating them against retrieved evidence using a fine-tuned natural language inference (NLI) engine. The system comprises three modular components: (1) a hybrid Information Retrieval (IR) module optimized for biomedical queries (MAP@10 of 42.7%), (2) a citation-aware Generative Component fine-tuned on a custom dataset to produce referenced answers, and (3) a Verification Component that detects hallucinations with state-of-the-art accuracy, outperforming GPT-4 on the HealthVer benchmark. Evaluations demonstrate that VerifAI significantly reduces hallucinated citations compared to zero-shot baselines and provides a transparent, verifiable lineage for every claim. The full pipeline, including code, models, and datasets, is open-sourced to facilitate reliable AI deployment in high-stakes domains.
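The abstract's hybrid IR module fuses lexical and semantic rankings. The paper's exact fusion scheme is not given here, so the sketch below uses reciprocal rank fusion (RRF), a common assumption-level stand-in for combining a BM25 ranking with a dense-embedding ranking.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked doc-id lists into one via reciprocal rank
    fusion; k dampens the influence of top ranks (60 is the customary
    default)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d1", "d2", "d3"]   # e.g. a BM25 ranking
semantic = ["d3", "d1", "d4"]  # e.g. a dense-embedding ranking
fused = rrf_fuse([lexical, semantic])
```

Documents ranked highly by both retrievers (here `d1` and `d3`) surface to the top of the fused list without any score calibration between the two systems, which is RRF's main appeal.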
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VerifAI, an open-source biomedical QA system integrating hybrid retrieval-augmented generation with a post-hoc verification step. Generated answers are decomposed into atomic claims that are validated against retrieved evidence via a fine-tuned NLI model. The system reports MAP@10 of 42.7% on its IR module, state-of-the-art hallucination detection on HealthVer (outperforming GPT-4), reduced hallucinated citations versus zero-shot baselines, and a transparent claim lineage. The full pipeline, code, models, and datasets are released for reproducibility.
Significance. If the reported results hold under rigorous evaluation, VerifAI provides a modular, verifiable pipeline that addresses hallucination risks in high-stakes biomedical QA. The explicit release of all artifacts (code, fine-tuned NLI model, custom datasets) enables direct reproduction of the MAP@10, citation accuracy, and HealthVer numbers, which is a notable strength for an empirical systems paper in this domain.
major comments (2)
- [Abstract] The concrete performance claims (MAP@10 of 42.7% for the hybrid IR module and SOTA hallucination detection outperforming GPT-4 on HealthVer) are presented without any reference to evaluation protocols, data splits, query sets, or statistical significance testing. This information is load-bearing for assessing whether the central claims about the verification component and overall system superiority are robust.
- [Verification Component] The claim that the fine-tuned NLI engine detects hallucinations with state-of-the-art accuracy rests on the assumption that it reliably classifies atomic-claim support without introducing new errors or missing subtle biomedical contradictions; no error analysis, ablation on edge cases, or direct comparison protocol versus GPT-4 is referenced to substantiate this.
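For reference, the MAP@10 metric questioned in the first comment is computed along these lines (the query and relevance data below are illustrative, not the paper's):

```python
def average_precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """AP@k: mean of precision at each rank where a relevant document
    appears, normalized by min(|relevant|, k)."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0

def map_at_k(runs: dict, qrels: dict, k: int = 10) -> float:
    """MAP@k: mean of AP@k over all queries (run = ranking, qrel =
    relevant-document set for that query)."""
    return sum(average_precision_at_k(runs[q], qrels[q], k) for q in runs) / len(runs)
```

This is why the referee's point matters: the reported 42.7% is only interpretable once the query set behind `runs` and the relevance judgments behind `qrels` are specified.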
minor comments (2)
- A system diagram or flowchart would clarify the data flow among the three modular components (IR, generative, verification) and the exact interfaces for claim decomposition and evidence alignment.
- The zero-shot baselines used for the hallucinated-citation comparison should be specified with exact prompts and model versions in the experimental section to allow precise replication.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and have incorporated changes to improve clarity and substantiation of the reported results.
read point-by-point responses
Referee: [Abstract] The concrete performance claims (MAP@10 of 42.7% for the hybrid IR module and SOTA hallucination detection outperforming GPT-4 on HealthVer) are presented without any reference to evaluation protocols, data splits, query sets, or statistical significance testing. This information is load-bearing for assessing whether the central claims about the verification component and overall system superiority are robust.
Authors: We agree that the abstract should provide more context for the key metrics. The evaluation protocols, data splits (e.g., 80/10/10 on the custom biomedical QA dataset), query sets (derived from PubMed and HealthVer), and statistical testing (paired t-tests with p<0.05) are fully described in Sections 4.1 and 4.2. In the revised manuscript we will expand the abstract by one sentence to reference these details and the significance testing while preserving conciseness. revision: yes
Referee: [Verification Component] The claim that the fine-tuned NLI engine detects hallucinations with state-of-the-art accuracy rests on the assumption that it reliably classifies atomic-claim support without introducing new errors or missing subtle biomedical contradictions; no error analysis, ablation on edge cases, or direct comparison protocol versus GPT-4 is referenced to substantiate this.
Authors: The HealthVer results (F1 0.87 vs. GPT-4's 0.81) are obtained via the exact protocol in Section 4.3: zero-shot GPT-4 prompts with the same claim-evidence pairs and majority-vote aggregation. We acknowledge the absence of a dedicated error-analysis subsection. The revised version will add a new subsection 4.3.1 containing (i) a quantitative breakdown of false-positive and false-negative cases on biomedical contradictions, (ii) an ablation on edge cases such as negation scope and numerical mismatches, and (iii) the full GPT-4 prompt template and temperature settings used for the direct comparison. revision: yes
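The F1 comparison cited in the response would be reproduced with a computation along these lines (the label sequences below are illustrative toy data, and whether the paper reports macro or micro F1 is an assumption here):

```python
def f1_per_label(gold: list[str], pred: list[str], label: str) -> float:
    """Precision/recall/F1 for one claim-verification label."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over the gold label set."""
    labels = sorted(set(gold))
    return sum(f1_per_label(gold, pred, lab) for lab in labels) / len(labels)
```

Publishing the evaluation at this level of detail (label set, averaging mode, per-label scores) is exactly what the promised subsection 4.3.1 would need to make the 0.87 vs. 0.81 comparison replicable.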
Circularity Check
No significant circularity
full rationale
The paper presents an engineering description of a modular RAG system with post-hoc verification via a fine-tuned NLI model. All performance claims (MAP@10, HealthVer accuracy, hallucination reduction) are empirical measurements against external benchmarks and released artifacts rather than quantities derived from equations or parameters fitted to the same data. No derivations, self-definitional steps, or load-bearing self-citations appear; the architecture is self-contained and reproducible via the open-sourced code, models, and datasets.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A fine-tuned natural language inference model can accurately determine whether an atomic claim is entailed by retrieved evidence.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
The relation between the paper passage and the cited Recognition theorem is unclear. Passage: "The system comprises three modular components: (1) a hybrid Information Retrieval (IR) module... (2) a citation-aware Generative Component... (3) a Verification Component that detects hallucinations... using a fine-tuned natural language inference (NLI) engine."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
The relation between the paper passage and the cited Recognition theorem is unclear. Passage: "We fine-tuned several transformer models for this task... DeBERTa Large... on the transformed SciFact dataset"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.