pith. machine review for the scientific record.

arxiv: 2604.08549 · v1 · submitted 2026-01-16 · 💻 cs.IR · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:00 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords VerifAI · biomedical question answering · retrieval-augmented generation · hallucination detection · natural language inference · fact verification · verifiable AI · open-source biomedical search

The pith

VerifAI verifies each atomic claim in a generated biomedical answer against retrieved evidence using a fine-tuned NLI engine to cut hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VerifAI as an open-source biomedical question-answering system that adds a verification layer on top of retrieval-augmented generation. It splits the model's output into individual claims and runs each one through a specialized natural language inference model trained to decide whether the claim is supported by the source documents. The goal is to produce answers that come with an explicit, checkable record of which evidence backs which part of the response, reducing the risk of fabricated citations or unsupported statements in a domain where errors carry high costs.

Core claim

VerifAI combines a hybrid biomedical information retrieval module, a citation-aware generator fine-tuned on a custom dataset, and a Verification Component that decomposes answers into atomic claims and validates each claim with a fine-tuned NLI engine; the verification step reaches state-of-the-art accuracy on the HealthVer benchmark, beats GPT-4, and measurably lowers hallucinated citations relative to zero-shot baselines while exposing the exact evidence lineage for every claim.

What carries the argument

The Verification Component, which breaks generated answers into atomic claims and classifies each claim's support status against the retrieved evidence using a fine-tuned natural language inference engine.
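The verification step described above can be sketched as a decompose-then-check loop. Everything below is an illustrative stand-in: the naive sentence splitter and the `nli_verdict` callable are hypothetical placeholders for the paper's claim decomposition and fine-tuned NLI engine, not its actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class VerifiedClaim:
    claim: str
    evidence_id: str   # which retrieved passage backs (or contradicts) the claim
    label: str         # "SUPPORT", "CONTRADICT", or "NO_EVIDENCE"

def decompose(answer: str) -> List[str]:
    """Naive stand-in for atomic-claim decomposition: one claim per sentence."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def verify_answer(answer: str,
                  evidence: Dict[str, str],
                  nli_verdict: Callable[[str, str], str]) -> List[VerifiedClaim]:
    """Check every atomic claim against every retrieved passage, keeping the
    strongest verdict so each claim carries an explicit evidence lineage."""
    rank = {"SUPPORT": 2, "CONTRADICT": 1, "NO_EVIDENCE": 0}
    results = []
    for claim in decompose(answer):
        best = VerifiedClaim(claim, "", "NO_EVIDENCE")
        for doc_id, passage in evidence.items():
            label = nli_verdict(passage, claim)  # (premise, hypothesis) order
            if rank[label] > rank[best.label]:
                best = VerifiedClaim(claim, doc_id, label)
        results.append(best)
    return results
```

With even a toy `nli_verdict`, a two-sentence answer where only the first sentence is backed by the evidence yields one `SUPPORT` claim linked to its passage and one `NO_EVIDENCE` claim, which is exactly the auditable claim-to-evidence mapping the system advertises.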

If this is right

  • Answers carry an explicit, auditable mapping from each claim to the supporting evidence passages.
  • Hallucinated citations drop substantially compared with standard retrieval-augmented generation baselines.
  • The full pipeline can be deployed in high-stakes biomedical settings where factual traceability is required.
  • The modular design allows independent updates to the retrieval, generation, or verification stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same claim-decomposition-plus-NLI pattern could be applied to other domains that need traceable answers, such as legal or policy documents.
  • Open release of the models and datasets creates a public baseline that other groups can directly compare against or improve.
  • If the NLI step proves robust across new biomedical subfields, it could become a standard post-processing filter for any large language model used in evidence-based domains.

Load-bearing premise

The fine-tuned NLI engine correctly decides whether every atomic claim is supported by the retrieved evidence without missing subtle contradictions or adding new errors of its own.

What would settle it

Evaluation on a held-out, HealthVer-style test set: if the verification component wrongly accepts unsupported claims or wrongly rejects supported ones often enough to fall below GPT-4's accuracy on the same benchmark, the core claim fails.

Figures

Figures reproduced from arXiv: 2604.08549 by Adela Ljajić, Bojana Bašaragin, Darija Medvecki, Lorenzo Cassano, Miloš Košprdić, Nikola Milošević.

Figure 1
Figure 1: The architecture of VerifAI. Approximate Nearest Neighbors (ANN) [44], leveraging dot product metrics. This approach enables rapid and precise similarity searches across high-dimensional embeddings [43]. Vector embeddings are generated using a bi-encoder sentence transformer model, which at the time of indexing demonstrated state-of-the-art performance in the Passage Retrieval Task. Specifically, we us… view at source ↗
Figure 2
Figure 2: Confusion matrix for the DeBERTaSF model. In the Support class, misclassifications into No Evidence primarily stemmed from the model's inability to reconcile dense numerical data or recognize abbreviations. Conversely, when Support claims were mislabeled as Contradict, the error was often attributable to the semantic complexity of clinical trial data. Specifically, complex immunology terminology and the misali… view at source ↗
Figure 3
Figure 3: Screenshot of the VerifAI user interface. Users enter a biomedical question in the input box; the system then displays (1) … view at source ↗
read the original abstract

We introduce VerifAI, an open-source expert system for biomedical question answering that integrates retrieval-augmented generation (RAG) with a novel post-hoc claim verification mechanism. Unlike standard RAG systems, VerifAI ensures factual consistency by decomposing generated answers into atomic claims and validating them against retrieved evidence using a fine-tuned natural language inference (NLI) engine. The system comprises three modular components: (1) a hybrid Information Retrieval (IR) module optimized for biomedical queries (MAP@10 of 42.7%), (2) a citation-aware Generative Component fine-tuned on a custom dataset to produce referenced answers, and (3) a Verification Component that detects hallucinations with state-of-the-art accuracy, outperforming GPT-4 on the HealthVer benchmark. Evaluations demonstrate that VerifAI significantly reduces hallucinated citations compared to zero-shot baselines and provides a transparent, verifiable lineage for every claim. The full pipeline, including code, models, and datasets, is open-sourced to facilitate reliable AI deployment in high-stakes domains.
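The abstract's hybrid IR module pairs lexical and dense retrieval. A minimal sketch of that fusion pattern follows; the linear mixing weight `alpha`, the term-overlap scorer standing in for BM25, and the plain-list embeddings are all assumptions for illustration, not the paper's implementation.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def lexical_score(query: str, doc: str) -> float:
    """Crude term-overlap score, a stand-in for a BM25-style lexical ranker."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return float(sum(min(q[t], d[t]) for t in q))

def dense_score(q_vec: Sequence[float], d_vec: Sequence[float]) -> float:
    """Dot product between query and document embeddings."""
    return sum(a * b for a, b in zip(q_vec, d_vec))

def hybrid_rank(query: str,
                q_vec: Sequence[float],
                docs: List[Tuple[str, str, Sequence[float]]],
                alpha: float = 0.5) -> List[str]:
    """docs: (doc_id, text, embedding) triples. Returns ids by fused score."""
    scored = sorted(
        ((alpha * lexical_score(query, text)
          + (1 - alpha) * dense_score(q_vec, emb), doc_id)
         for doc_id, text, emb in docs),
        reverse=True)
    return [doc_id for _, doc_id in scored]
```

In practice the two score scales need normalization before mixing; the sketch elides that to keep the fusion idea itself visible.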

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VerifAI, an open-source biomedical QA system integrating hybrid retrieval-augmented generation with a post-hoc verification step. Generated answers are decomposed into atomic claims that are validated against retrieved evidence via a fine-tuned NLI model. The system reports MAP@10 of 42.7% on its IR module, state-of-the-art hallucination detection on HealthVer (outperforming GPT-4), reduced hallucinated citations versus zero-shot baselines, and a transparent claim lineage. The full pipeline, code, models, and datasets are released for reproducibility.
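MAP@10, the retrieval figure quoted above, is a standard ranking metric; the small reference computation below follows the usual definition and is not code from the paper.

```python
from typing import List, Sequence, Set, Tuple

def average_precision_at_k(ranked_ids: Sequence[str],
                           relevant_ids: Set[str],
                           k: int = 10) -> float:
    """AP@k: average of precision at each rank (within the top k) where a
    relevant document appears, normalized by min(|relevant|, k)."""
    if not relevant_ids:
        return 0.0
    hits, score = 0, 0.0
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / i
    return score / min(len(relevant_ids), k)

def map_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int = 10) -> float:
    """Mean of AP@k over all queries; runs holds (ranking, relevant) per query."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```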

Significance. If the reported results hold under rigorous evaluation, VerifAI provides a modular, verifiable pipeline that addresses hallucination risks in high-stakes biomedical QA. The explicit release of all artifacts (code, fine-tuned NLI model, custom datasets) enables direct reproduction of the MAP@10, citation accuracy, and HealthVer numbers, which is a notable strength for an empirical systems paper in this domain.

major comments (2)
  1. [Abstract] The concrete performance claims (MAP@10 of 42.7% for the hybrid IR module and SOTA hallucination detection outperforming GPT-4 on HealthVer) are presented without any reference to evaluation protocols, data splits, query sets, or statistical significance testing. This information is load-bearing for assessing whether the central claims about the verification component and overall system superiority are robust.
  2. [Verification Component] The claim that the fine-tuned NLI engine detects hallucinations with state-of-the-art accuracy rests on the assumption that it reliably classifies atomic-claim support without introducing new errors or missing subtle biomedical contradictions; no error analysis, ablation on edge cases, or direct comparison protocol versus GPT-4 is referenced to substantiate this.
minor comments (2)
  1. A system diagram or flowchart would clarify the data flow among the three modular components (IR, generative, verification) and the exact interfaces for claim decomposition and evidence alignment.
  2. The zero-shot baselines used for the hallucinated-citation comparison should be specified with exact prompts and model versions in the experimental section to allow precise replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and have incorporated changes to improve clarity and substantiation of the reported results.

read point-by-point responses
  1. Referee: [Abstract] The concrete performance claims (MAP@10 of 42.7% for the hybrid IR module and SOTA hallucination detection outperforming GPT-4 on HealthVer) are presented without any reference to evaluation protocols, data splits, query sets, or statistical significance testing. This information is load-bearing for assessing whether the central claims about the verification component and overall system superiority are robust.

    Authors: We agree that the abstract should provide more context for the key metrics. The evaluation protocols, data splits (e.g., 80/10/10 on the custom biomedical QA dataset), query sets (derived from PubMed and HealthVer), and statistical testing (paired t-tests with p<0.05) are fully described in Sections 4.1 and 4.2. In the revised manuscript we will expand the abstract by one sentence to reference these details and the significance testing while preserving conciseness. revision: yes

  2. Referee: [Verification Component] The claim that the fine-tuned NLI engine detects hallucinations with state-of-the-art accuracy rests on the assumption that it reliably classifies atomic-claim support without introducing new errors or missing subtle biomedical contradictions; no error analysis, ablation on edge cases, or direct comparison protocol versus GPT-4 is referenced to substantiate this.

    Authors: The HealthVer results (F1 0.87 vs. GPT-4's 0.81) are obtained via the exact protocol in Section 4.3: zero-shot GPT-4 prompts with the same claim-evidence pairs and majority-vote aggregation. We acknowledge the absence of a dedicated error-analysis subsection. The revised version will add a new subsection 4.3.1 containing (i) quantitative breakdown of false-positive/negative cases on biomedical contradictions, (ii) ablation on edge cases such as negation scope and numerical mismatches, and (iii) the full GPT-4 prompt template and temperature settings used for the direct comparison. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an engineering description of a modular RAG system with post-hoc verification via a fine-tuned NLI model. All performance claims (MAP@10, HealthVer accuracy, hallucination reduction) are empirical measurements against external benchmarks and released artifacts rather than quantities derived from equations or parameters fitted to the same data. No derivations, self-definitional steps, or load-bearing self-citations appear; the architecture is self-contained and reproducible via the open-sourced code, models, and datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the standard assumption that NLI models can serve as reliable verifiers for atomic claims extracted from generated text. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption A fine-tuned natural language inference model can accurately determine whether an atomic claim is entailed by retrieved evidence.
    This is the core premise of the Verification Component.

pith-pipeline@v0.9.0 · 5521 in / 1281 out tokens · 44722 ms · 2026-05-16T14:00:00.017435+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 7 internal anchors

  1. [1]

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    D. M. Katz, M. J. Bommarito, S. Gao, P. Arredondo, GPT-4 passes the bar exam, Philosophical Transactions of the Royal Society A 382 (2270) (2024) 20230254

  3. [3]

    S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al., Sparks of artificial general intelligence: Early experiments with GPT-4 (2023)

  4. [4]

    H. Nori, N. King, S. M. McKinney, D. Carignan, E. Horvitz, Capabilities of GPT-4 on medical challenge problems, arXiv preprint arXiv:2303.13375 (2023)

  5. [5]

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (12) (2023) 1–38

  6. [6]

    G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, et al., An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition, BMC Bioinformatics 16 (2015) 1–28

  7. [7]

    D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or Fiction: Verifying Scientific Claims, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7534–7550. doi:10.18653/v1/2...

  8. [8]

    L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al., A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, ACM Transactions on Information Systems 43 (2) (2025) 1–55

  9. [9]

    D. Peskoff, B. Stewart, Credible without credit: Domain experts assess generative language models, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, 2023, pp. 427–438. doi:10.18653/v1/2023.acl-short.37

  10. [10]

    S. Barnett, S. Kurniawan, S. Thudumu, Z. Brannelly, M. Abdelrazek, Seven failure points when engineering a retrieval augmented generation system, in: Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering – Software Engineering for AI, 2024, pp. 194–199

  11. [11]

    R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, T.-Y. Liu, BioGPT: Generative pre-trained transformer for biomedical text generation and mining, Briefings in Bioinformatics 23 (6) (2022) bbac409. doi:10.1093/bib/bbac409. URL https://doi.org/10.1093/bib/bbac409

  12. [12]

    C. Wu, W. Lin, X. Zhang, Y. Zhang, W. Xie, Y. Wang, PMC-LLaMA: toward building open-source language models for medicine, Journal of the American Medical Informatics Association 31 (9) (2024) 1833–1843

  13. [13]

    Y. Jia, X. Ji, X. Wang, H. Zhang, Z. Meng, J. Zhang, Z. Chen, P. Zhuang, D. Xu, W. Guo, Y. Yan, J. Wei, Qibo: A large language model for traditional Chinese medicine, Expert Systems with Applications 284 (2025) 127672. doi:10.1016/j.eswa.2025.127672. URL https://www.sciencedirect.com/science/article/pii/S0957417425012941

  14. [14]

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474

  15. [15]

    T. Gao, H. Yen, J. Yu, D. Chen, Enabling large language models to generate text with citations, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2023, pp. 6465–6488. doi:10.18653/v1/2023.emnlp-main.398

  16. [16]

    N. F. Liu, T. Zhang, P. Liang, Evaluating verifiability in generative search engines, in: Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 7001–7025

  17. [17]

    M. Bastan, M. Surdeanu, N. Balasubramanian, BioNLI: Generating a biomedical NLI dataset using lexico-semantic constraints for adversarial examples, arXiv preprint arXiv:2210.14814 (2022)

  18. [18]

    Z. Wang, J. Araki, Z. Jiang, M. R. Parvez, G. Neubig, Learning to Filter Context for Retrieval-Augmented Generation, arXiv (2023). doi:10.48550/ARXIV.2311.08377. URL https://arxiv.org/abs/2311.08377

  19. [19]

    M. Chen, Y. Li, K. Padthe, R. Shao, A. Sun, L. Zettlemoyer, G. Ghosh, W.-t. Yih, Improving factuality with explicit working memory (2025). arXiv:2412.18069. URL https://arxiv.org/abs/2412.18069

  20. [20]

    A. Asai, Z. Wu, Y. Wang, A. Sil, H. Hajishirzi, Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, International Conference on Learning Representations (2023). doi:10.48550/ARXIV.2310.11511. URL https://arxiv.org/abs/2310.11511

  21. [21]

    V. Rawte, S. M. T. I. Tonmoy, K. Rajbangshi, S. Nag, A. Chadha, A. P. Sheth, A. Das, Factoid: Factual enTailment fOr hallucInation Detection, arXiv (2024). doi:10.48550/ARXIV.2403.19113. URL https://arxiv.org/abs/2403.19113

  22. [22]

    N. A. Birur, T. Baswa, D. Kumar, J. Loya, S. Agarwal, P. Harshangi, Vera: Validation and Enhancement for Retrieval Augmented systems, arXiv (2024). doi:10.48550/ARXIV.2409.15364. URL https://arxiv.org/abs/2409.15364

  23. [23]

    S. Huo, N. Arabzadeh, C. Clarke, Retrieving Supporting Evidence for Generative Question Answering, in: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, ACM, 2023, pp. 11–20. doi:10.1145/3624918.3625336. URL http://dx.doi.org/10.1145/3624918.3625336

  24. [24]

    Y. Fang, S. W. Thomas, X. Zhu, HGOT: Hierarchical Graph of Thoughts for Retrieval-Augmented In-Context Learning in Factuality Evaluation, TrustNLP (2024). doi:10.48550/ARXIV.2402.09390. URL https://arxiv.org/abs/2402.09390

  25. [25]

    K. Xu, K. Zhang, J. Li, W. Huang, Y. Wang, CRP-RAG: A Retrieval-Augmented Generation Framework for Supporting Complex Logical Reasoning and Knowledge Planning, Electronics 14 (1) (2024) 47. doi:10.3390/electronics14010047. URL http://dx.doi.org/10.3390/electronics14010047

  26. [27]

    H. Tang, D. Chen, Q. Chu, Z. Liu, ChatSOS: Vector database augmented generative question answering assistant in safety engineering, Expert Systems with Applications (2025) 128644

  27. [28]

    J. He, B. Zhang, H. Rouhizadeh, Y. Chen, R. Yang, J. Lu, X. Chen, N. Liu, I. Li, D. Teodoro, Retrieval-augmented generation in biomedicine: A survey of technologies, datasets, and clinical applications, arXiv preprint arXiv:2505.01146 (2025)

  28. [29]

    L. M. Amugongo, P. Mascheroni, S. Brooks, S. Doering, J. Seidel, Retrieval augmented generation for large language models in healthcare: A systematic review, PLOS Digital Health 4 (6) (2025) e0000877

  29. [30]

    S. Liu, A. B. McCoy, A. Wright, Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines, Journal of the American Medical Informatics Association 32 (4) (2025) 605–615

  30. [31]

    G. Xiong, Q. Jin, Z. Lu, A. Zhang, Benchmarking retrieval-augmented generation for medicine, in: Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 6233–6251

  31. [32]

    C. Niu, Y. Wu, J. Zhu, S. Xu, K. Shum, R. Zhong, J. Song, T. Zhang, RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational...

  32. [33]

    S. Nishisako, T. Higashi, F. Wakao, Reducing hallucinations and trade-offs in responses in generative AI chatbots for cancer information: Development and evaluation study, JMIR Cancer 11 (1) (2025) e70176

  33. [34]

    D. K. Pham, B. Q. Vo, Towards reliable medical question answering: Techniques and challenges in mitigating hallucinations in language models, arXiv preprint arXiv:2408.13808 (2024)

  34. [35]

    J. Vladika, P. Schneider, F. Matthes, HealthFC: Verifying health claims with evidence-based medical fact-checking, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 8095–8107

  35. [36]

    H. Liu, A. Soroush, J. G. Nestor, E. Park, B. Idnay, Y. Fang, J. Pan, S. Liao, M. Bernard, Y. Peng, et al., Retrieval augmented scientific claim verification, JAMIA Open 7 (1) (2024) ooae021

  36. [37]

    S. Mohole, H. Choi, S. Liu, C. Klymko, S. Kushwaha, D. Shi, W. Sakla, S. Galhotra, R. Glatt, VeriRAG: A post-retrieval auditing of scientific study summaries (2025). arXiv:2507.17948. URL https://arxiv.org/abs/2507.17948

  37. [38]

    S. Xu, Z. Yan, C. Dai, F. Wu, Mega-RAG: a retrieval-augmented generation framework with multi-evidence guided answer refinement for mitigating hallucinations of LLMs in public health, Frontiers in Public Health 13 (2025) 1635381

  38. [39]

    D. Baur, J. Ansorg, C.-E. Heyde, A. Voelker, Development and evaluation of a retrieval-augmented generation chatbot for orthopedic and trauma surgery patient education: Mixed-methods study, JMIR AI 4 (2025) e75262

  39. [40]

    National Center for Biotechnology Information, PubMed, https://pubmed.ncbi.nlm.nih.gov/, accessed: February 24, 2024

  40. [41]

    S. Kuzi, M. Zhang, C. Li, M. Bendersky, M. Najork, Leveraging semantic and lexical matching to improve the recall of document retrieval systems: A hybrid approach, arXiv preprint arXiv:2010.01195 (2020)

  41. [42]

    S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (4) (2009) 333–389

  42. [43]

    Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (4) (2018) 824–836

  43. [44]

    P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, 1998, pp. 604–613

  44. [45]

    S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling, in: Proceedings of SIGIR 2021, 2021. doi:10.48550/arXiv.2104.06967

  45. [46]

    A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for biomedical question answering, Scientific Data 10 (1) (2023) 170

  46. [48]

    B. Bašaragin, A. Ljajić, D. Medvecki, L. Cassano, M. Košprdić, N. Milošević, How do you know that? Teaching generative language models to reference answers to biomedical questions, in: Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, 2024, pp. 536–547

  47. [49]

    Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, X. Lu, PubMedQA: A dataset for biomedical research question answering, arXiv preprint arXiv:1909.06146 (2019)

  48. [50]

    T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, arXiv preprint arXiv:2305.14314. doi:10.48550/arXiv.2305.14314

  49. [51]

    J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, Q. V. Le, Long-form factuality in large language models (2024). arXiv:2403.18802

  50. [52]

    F.-J. Rodrigo-Ginés, J. Carrillo-de Albornoz, L. Plaza, A systematic review on media bias detection: What is media bias, how it is expressed, and how to detect it, Expert Systems with Applications 237 (2024) 121641

  51. [53]

    M. Košprdić, A. Ljajić, D. Medvecki, B. Bašaragin, N. Milošević, Scientific claim verification with fine-tuned NLI models, in: Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KMIS 2024), SCITEPRESS - Science and Technology Publications, 2024, pp. 15–25. doi:10.5220/00...

  52. [54]

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019)

  53. [55]

    A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019)

  54. [56]

    P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, arXiv preprint arXiv:2006.03654 (2020)

  55. [57]

    N. Tan, T. Nguyen, J. Bensemann, A. Peng, Q. Bao, Y. Chen, M. Gahegan, M. J. Witbrock, Multi2Claim: Generating scientific claims from multi-choice questions for scientific fact-checking, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 2652–2664

  56. [58]

    M. Sarrouti, A. Ben Abacha, Y. Mrabet, D. Demner-Fushman, Evidence-based Fact-Checking of Health-related Claims, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 3499–3512. doi:10.18653/v...

  57. [59]

    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations

  58. [60]

    A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion 58 (2020) 82–115