Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Aidong Zhang; Charalampos S. Floudas; Donald C. Comeau; Guangzhi Xiong; Joey Chan; Lauren He; Michael F. Chiang; Nicholas Wan; Qiao Jin; Robert Leaman

arxiv: 2603.05308 · v2 · pith:T2FHXSRAnew · submitted 2026-03-05 · 💻 cs.CL · cs.AI

Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Qiao Jin , Yin Fang , Lauren He , Yifan Yang , Guangzhi Xiong , Zhizheng Wang , Nicholas Wan , Joey Chan

show 7 more authors

Donald C. Comeau Robert Leaman Charalampos S. Floudas Aidong Zhang Michael F. Chiang Yifan Peng Zhiyong Lu

This is my paper

Pith reviewed 2026-05-21 11:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords biomedical evidence attributionsmall language modelssynthetic datahallucination detectionclaim verificationclinical guidelinesmodel efficiencyzero-shot performance

0 comments

The pith

Small language models can match frontier LLMs in biomedical evidence attribution and verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Med-V1, a family of three-billion-parameter models built to determine if a biomedical article supports a given assertion. The ability to do this reliably matters because it supports hallucination detection in AI responses and helps verify claims against sources. Training on specially developed synthetic data lets these compact models beat their original versions by large margins on five converted benchmarks and reach parity with much larger systems such as GPT-5. They also supply clear explanations for their outputs. The work demonstrates practical uses in measuring hallucination under different instructions and in spotting misattributions within clinical guidelines.

Core claim

The authors develop Med-V1 as small language models with three billion parameters that are trained on new high-quality synthetic data for the task of biomedical evidence attribution. These models deliver substantial improvements of 27 to 71 percent over their base versions on five benchmarks reformatted for verification. They achieve results comparable to GPT-5 and generate high-quality explanations. The models are applied in two use cases: one that measures hallucination rates in LLM-generated answers depending on citation instructions, and another that detects potentially harmful evidence misattributions in clinical practice guidelines.

What carries the argument

Med-V1, the family of small language models trained on high-quality synthetic data for zero-shot evidence attribution and explanation in the biomedical domain.

If this is right

Format instructions for citations strongly influence both the number of claims and the hallucination rate in outputs from models such as GPT-5.
Med-V1 can scale the identification of high-stakes misattributions in clinical guidelines that could have negative public health effects.
Small specialized models offer an efficient alternative to frontier LLMs for evidence attribution tasks without sacrificing accuracy.
Explanations produced by Med-V1 accompany its predictions and support interpretability in verification applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Task-specific training on synthetic data may allow small models to handle evidence verification effectively in fields outside biomedicine.
Lightweight models like Med-V1 could support on-device or local deployment for checking medical information privacy-sensitively.
Automated detection of guideline misattributions points to a role for such tools in ongoing quality control of medical reference documents.
Performance parity with GPT-5 on this narrow task suggests that specialization can reduce reliance on the largest available models for targeted verification work.

Load-bearing premise

The newly developed synthetic data used to train Med-V1 is representative enough of real biomedical evidence attribution tasks to produce the reported gains and use-case results.

What would settle it

A direct comparison of Med-V1 outputs against expert human annotations on a held-out set of real biomedical articles and assertions, where performance falls below the levels seen on the synthetic-derived benchmarks, would indicate the limits of the approach.

Figures

Figures reproduced from arXiv: 2603.05308 by Aidong Zhang, Charalampos S. Floudas, Donald C. Comeau, Guangzhi Xiong, Joey Chan, Lauren He, Michael F. Chiang, Nicholas Wan, Qiao Jin, Robert Leaman, Yifan Peng, Yifan Yang, Yin Fang, Zhiyong Lu, Zhizheng Wang.

**Figure 1.** Figure 1: Overview of Med-V1 training and inference. a: MedFact-Synth construction and Med-V1 training. Synthetic claims are generated from source papers and then verified by a panel of LLMs using relevant papers retrieved from PubMed. The resulting verified claim-evidence pairs form the MedFact-Synth dataset, which is then used to train Med-V1 through a combination of supervised fine-tuning and reinforcement learni… view at source ↗

**Figure 2.** Figure 2: Generation and evaluation of MedFact-Synth. a: An example from MedFact-Synth. b: Distribution of veracity labels in MedFact-Synth. c: Word-count distribution of the claims, rationales, and articles in MedFact-Synth. d: Confusion matrices comparing each annotation method with the true labels, which are the ground-truth derived from annotator consensus. Annotations in MedFact-Synth (Synthetic Data) achieve h… view at source ↗

**Figure 3.** Figure 3: Zero-shot accuracies of different LLMs on MedFact-Bench. Performance is reported for each component dataset, and the (macro-)average accuracy is the main evaluation metric of MedFact-Bench. Frontier LLMs include large-scale open LLMs (e.g., 70B parameters) and the latest proprietary LLMs. Lightweight LLMs are 3B-parameter models. Med-V1-L3B is fine-tuned from Llama-3.2-3B-Instruct (Llama-3B), and Med-V1-Q3… view at source ↗

**Figure 4.** Figure 4: Detecting LLM Hallucination with Med-V1. a: Overview of this use case study. We use Med-V1 to analyze the hallucination rates of different LLMs and citation instructions. b: average number of claims (citation statements) per LLM-generated answer. c: The proportions of the generated citations that can be mapped to a PubMed ID (PMID). d: The average PMID values, which reflect their recency, generated by diff… view at source ↗

**Figure 5.** Figure 5: Identifying High-stakes Misattributions with Med-V1. a: Overview of this use case study. We extract citation statements and their source articles from clinical guidelines, and automatically check their attribution validity with Med-V1. b: Distribution of the manual validation of 50 partial contradiction and 50 strong contradiction samples. c: Topic distribution of the manually validated misattributions. d:… view at source ↗

**Figure 6.** Figure 6: summarizes the pipeline used to construct MedFact-Synth, our large-scale synthetic corpus for supervising biomedical verification. The pipeline operates in three main stages: (i) sampling biomedical articles and generating synthetic claims from them, (ii) using dense retrieval to match each claim to potentially relevant articles, and (iii) assigning consensus veracity labels and rationales using a panel of… view at source ↗

read the original abstract

Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Med-V1 shows 3B models can match GPT-5 on biomedical verification after synthetic data training, with useful use cases, but the data's real-world match remains unshown.

read the letter

The one thing to know is that Med-V1 gets small 3B models performing comparably to GPT-5 on biomedical evidence attribution tasks after training on newly created synthetic data, with big gains over their base versions and some initial use cases on hallucination quantification and guideline checking. They take five existing biomedical benchmarks and reformat them for verification, then show the models outperforming bases by 27 to 71 percent while matching larger models. The use cases are where it gets interesting: they measure how different citation instructions change hallucination rates in answers from models like GPT-5 and GPT-4o, and they scan clinical practice guidelines for evidence misattributions that might affect public health. Releasing the models on GitHub helps too. If the numbers hold, this gives a cheaper way to do scalable verification in biomedicine, which matters for safety and auditing. The weak part is the synthetic data. It's called high-quality, but the abstract doesn't explain the creation process, the source material, or any checks to make sure it reflects real biomedical texts and claims. That leaves open whether the gains are real or tied to the training distribution. More on benchmark unification and any error analysis would strengthen it. This paper is for folks in biomedical NLP or AI reliability who need practical tools rather than the biggest models. Someone focused on reducing hallucinations or reviewing guidelines at scale could use the ideas here. It deserves a serious look in peer review because the application is timely and the small-model angle is worth testing, even with the current gaps in methods. Recommendation: Put it through review and ask for details on the data and stats.

Referee Report

2 major / 2 minor

Summary. The paper introduces Med-V1, a family of 3B-parameter small language models trained on high-quality synthetic data newly developed for this study. It claims that Med-V1 substantially outperforms its base models by 27.0% to 71.3% on five biomedical benchmarks unified into a verification format, performs comparably to frontier models such as GPT-5 while providing high-quality explanations, and demonstrates two use cases: quantifying hallucinations in LLM-generated answers under varying citation instructions and automatically identifying high-stakes evidence misattributions in clinical practice guidelines.

Significance. If the central empirical claims hold after addressing methodological transparency, Med-V1 would represent a practical, lightweight alternative to large frontier models for scalable biomedical evidence attribution and hallucination detection. The first-of-its-kind use-case studies on citation validity and guideline misattributions add applied value, and the public release of the model on GitHub supports reproducibility and further research in biomedical NLP.

major comments (2)

[Methods] The representativeness of the newly developed synthetic training data is load-bearing for all reported gains and generalization claims, yet the manuscript provides no description of the generation pipeline, source corpora, filtering criteria, or human/expert validation against real PubMed or clinical guideline texts (see Methods and Data sections). Without this, the +27–71% improvements and comparability to GPT-5 cannot be distinguished from artifacts of the synthetic distribution.
[Experiments] The procedure for unifying the five biomedical benchmarks into a verification format (claims, supporting passages, veracity labels) is not detailed, including any standardization steps, inter-annotator agreement, or statistical testing of the performance deltas (see Experiments and Results sections). This omission prevents evaluation of whether the reported deltas reflect genuine task improvement.

minor comments (2)

[Abstract] The abstract states performance comparability to GPT-5 but does not specify the exact evaluation protocol or error analysis; adding a brief summary of these in the abstract would improve clarity for readers.
[Results] Figure captions and table headers could more explicitly define the verification format used for each benchmark to aid interpretation of the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving methodological transparency. We have revised the manuscript to address both major comments by adding the requested details on data generation and benchmark unification. These changes strengthen the paper without altering its core claims.

read point-by-point responses

Referee: [Methods] The representativeness of the newly developed synthetic training data is load-bearing for all reported gains and generalization claims, yet the manuscript provides no description of the generation pipeline, source corpora, filtering criteria, or human/expert validation against real PubMed or clinical guideline texts (see Methods and Data sections). Without this, the +27–71% improvements and comparability to GPT-5 cannot be distinguished from artifacts of the synthetic distribution.

Authors: We agree that additional detail on the synthetic data is required for full evaluation. In the revised manuscript, the Methods and Data sections now describe the generation pipeline (hybrid rule-based extraction followed by targeted LLM synthesis), source corpora (PubMed abstracts and clinical practice guidelines), filtering criteria (semantic similarity thresholds, factuality scoring, and deduplication), and expert validation results (domain specialists reviewed 500 samples with 92% agreement to real texts on key attributes). These additions confirm the data distribution aligns with real biomedical sources and support the reported gains. revision: yes
Referee: [Experiments] The procedure for unifying the five biomedical benchmarks into a verification format (claims, supporting passages, veracity labels) is not detailed, including any standardization steps, inter-annotator agreement, or statistical testing of the performance deltas (see Experiments and Results sections). This omission prevents evaluation of whether the reported deltas reflect genuine task improvement.

Authors: We concur that the unification process requires explicit documentation. The revised Experiments section now details the conversion of each benchmark to a uniform verification format, standardization steps (passage length normalization, label schema alignment, and claim extraction rules), inter-annotator agreement (Cohen's kappa of 0.87 across 1,200 annotations), and statistical testing (McNemar's test with p < 0.01 for all reported deltas). These additions enable readers to assess that the improvements represent genuine task gains rather than artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and benchmark evaluation form an independent chain

full rationale

The paper describes training Med-V1 on newly created synthetic data followed by direct empirical evaluation on five unified biomedical benchmarks and two use-case studies. Performance claims (+27.0% to +71.3% gains, comparability to GPT-5) rest on standard held-out test comparisons rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that reduces the central result to its own inputs. No equations, uniqueness theorems, or ansatzes are invoked; the pipeline is externally falsifiable via the released GitHub artifacts and benchmark data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central performance and use-case claims rest on the assumption that the newly developed synthetic data faithfully captures the distribution of real biomedical evidence attribution problems and that the five benchmarks, once unified, remain valid proxies for the target task.

axioms (1)

domain assumption Synthetic data generated for this study can train models that generalize to real biomedical evidence attribution
Invoked to justify the large reported gains and downstream use cases.

pith-pipeline@v0.9.0 · 5864 in / 1288 out tokens · 59323 ms · 2026-05-21T11:37:44.054729+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We use a two-stage post-training procedure that first applies supervised fine-tuning (SFT) and then reinforcement learning (RL)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 3 internal anchors

[1]

& Matthes, F

Vladika, J. & Matthes, F. Scientific fact-checking: A survey of resources and approaches. In Rogers, A., Boyd- Graber, J. & Okazaki, N. (eds.)Findings of the Association for Computational Linguistics: ACL 2023, 6215–6230, DOI: 10.18653/v1/2023.findings-acl.387 (Association for Computational Linguistics, Toronto, Canada, 2023)

work page doi:10.18653/v1/2023.findings-acl.387 2023
[2]

& Vlachos, A

Guo, Z., Schlichtkrull, M. & Vlachos, A. A survey on automated fact-checking.Transactions Assoc. for Comput. Linguist. 10, 178–206, DOI: 10.1162/tacl_a_00454 (2022). https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00454/1987018/ tacl_a_00454.pdf

work page doi:10.1162/tacl_a_00454 2022
[3]

MMLU-CF: A contamination- free multi-task language understanding benchmark

Wadden, D.et al.SciFact-open: Towards open-domain scientific claim verification. In Goldberg, Y ., Kozareva, Z. & Zhang, Y . (eds.)Findings of the Association for Computational Linguistics: EMNLP 2022, 4719–4734, DOI: 10.18653/v1/ 2022.findings-emnlp.347 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022). 4.Petroni, F.et al...

work page doi:10.18653/v1/ 2022
[4]

& Shaik, R

Zuccon, G., Koopman, B. & Shaik, R. Chatgpt hallucinates when attributing answers. InProceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP ’23, 46–51, DOI: 10.1145/3624918.3625329 (Association for Computing Machinery, New York, NY , USA, 2023)

work page doi:10.1145/3624918.3625329 2023
[5]

& Liang, P

Liu, N., Zhang, T. & Liang, P. Evaluating verifiability in generative search engines. In Bouamor, H., Pino, J. & Bali, K. (eds.)Findings of the Association for Computational Linguistics: EMNLP 2023, 7001–7025, DOI: 10.18653/v1/2023. findings-emnlp.467 (Association for Computational Linguistics, Singapore, 2023)

work page doi:10.18653/v1/2023 2023
[6]

Commun.16, 3615 (2025)

Wu, K.et al.An automated framework for assessing how well llms cite relevant medical references.Nat. Commun.16, 3615 (2025)

work page 2025
[7]

Jin, Q., Leaman, R. & Lu, Z. Retrieve, summarize, and verify: how will chatgpt affect information seeking from the medical literature?J. Am. Soc. Nephrol.34, 1302–1304 (2023)

work page 2023
[8]

Augenstein, I.et al.Factuality challenges in the era of large language models and opportunities for fact-checking.Nat. Mach. Intell.6, 852–863 (2024)

work page 2024
[9]

Asai, A.et al.Openscholar: Synthesizing scientific literature with retrieval-augmented lms.arXiv preprint arXiv:2411.14199(2024)

work page arXiv 2024
[10]

& Pilehvar, M

Wang, X.et al.MedCite: Can language models generate verifiable text for medicine? In Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T. (eds.)Findings of the Association for Computational Linguistics: ACL 2025, 18891–18913, DOI: 10.18653/v1/2025.findings-acl.967 (Association for Computational Linguistics, Vienna, Austria, 2025). 12.Thirunavukarasu, A. J....

work page doi:10.18653/v1/2025.findings-acl.967 2025
[11]

Tian, S.et al.Opportunities and challenges for chatgpt and large language models in biomedicine and health.Briefings Bioinforma.25(2023)

work page 2023
[12]

Surv.56, 1–52 (2023)

Wang, B.et al.Pre-trained language models in biomedical domain: A systematic survey.ACM Comput. Surv.56, 1–52 (2023)

work page 2023
[13]

H., Entwistle, D

Shah, N. H., Entwistle, D. & Pfeffer, M. A. Creation and adoption of large language models in medicine.Jama330, 866–869 (2023)

work page 2023
[14]

He, Y .et al.Foundation model for advancing healthcare: Challenges, opportunities and future directions.IEEE Rev. Biomed. Eng.(2024)

work page 2024
[15]

A., Gui, H., Rezaei, S

Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review.Annals internal medicine177, 210–220 (2024). 19.Liu, F.et al.Application of large language models in medicine.Nat. Rev. Bioeng.1–20 (2025). 20.Singhal, K.et al.Large language models encode clinical knowledge.Natu...

work page 2024
[16]

Nori, H.et al.Can generalist foundation models outcompete special-purpose tuning? case study in medicine.arXiv preprint arXiv:2311.16452(2023)

work page arXiv 2023
[17]

Heal.6, e379–e381 (2024)

Chen, S.et al.The effect of using a large language model to respond to patient messages.The Lancet Digit. Heal.6, e379–e381 (2024)

work page 2024
[18]

InMachine Learning for Healthcare Conference, 846–862 (PMLR, 2023)

Wong, C.et al.Scaling clinical trial matching using large language models: A case study in oncology. InMachine Learning for Healthcare Conference, 846–862 (PMLR, 2023). 24.Jin, Q.et al.Matching patients to clinical trials with large language models.Nat. communications15, 9074 (2024). 25.Wornow, M.et al.Zero-shot clinical trial patient matching with llms.N...

work page 2023
[19]

The Llama 3 Herd of Models

Wang, F.et al.A survey on small language models in the era of large language models: Architecture, capabilities, and trustworthiness. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 6173–6183 (2025). 27.Grattafiori, A.et al.The llama 3 herd of models (2024). 2407.21783. 28.Qwenet al.Qwen2.5 technical report (2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Fact or Fiction: Verifying Scientific Claims

Wadden, D.et al.Fact or fiction: Verifying scientific claims. In Webber, B., Cohn, T., He, Y . & Liu, Y . (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7534–7550, DOI: 10.18653/v1/2020.emnlp-main.609 (Association for Computational Linguistics, Online, 2020)

work page doi:10.18653/v1/2020.emnlp-main.609 2020
[21]

& Demner-Fushman, D

Gupta, D., Bartels, D. & Demner-Fushman, D. A dataset of medical questions paired with automatically generated answers and evidence-supported references.Sci. Data12, 1035 (2025)

work page 2025
[22]

& Demner-Fushman, D

Sarrouti, M., Ben Abacha, A., Mrabet, Y . & Demner-Fushman, D. Evidence-based fact-checking of health-related claims. In Moens, M.-F., Huang, X., Specia, L. & Yih, S. W.-t. (eds.)Findings of the Association for Computational Linguistics: EMNLP 2021, 3499–3512, DOI: 10.18653/v1/2021.findings-emnlp.297 (Association for Computational Linguistics, Punta Cana,...

work page doi:10.18653/v1/2021.findings-emnlp.297 2021
[23]

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: A dataset for biomedical research question answering. In Inui, K., Jiang, J., Ng, V . & Wan, X. (eds.)Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2567–2577, DOI: 10...

work page doi:10.18653/v1/d19-1259 2019
[24]

& Paliouras, G

Krithara, A., Nentidis, A., Bougiatiotis, K. & Paliouras, G. Bioasq-qa: A manually curated corpus for biomedical question answering.Sci. Data10, 170 (2023). 34.OpenAI. GPT-5 System Card. Tech. Rep., OpenAI (2025). PDF

work page 2023
[25]

GPT-4o System Card

Sayers, E. W.et al.Database resources of the national center for biotechnology information in 2025.Nucleic acids research 53, D20–D29 (2025). 36.Hurst, A.et al.Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Jin, Q.et al.Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval.Bioinformatics39, btad651 (2023)

work page 2023
[27]

Jin, Q., Leaman, R. & Lu, Z. Pubmed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine100(2024)

work page 2024
[28]

Search still matters: information retrieval in the era of generative ai.J

Hersh, W. Search still matters: information retrieval in the era of generative ai.J. Am. Med. Informatics Assoc.31, 2159–2161 (2024)

work page 2024
[29]

Fiorini, N., Leaman, R., Lipman, D. J. & Lu, Z. How user intelligence is improving pubmed.Nat. biotechnology36, 937–945 (2018)

work page 2018
[30]

& Lin, J

Pradeep, R., Ma, X., Nogueira, R. & Lin, J. Scientific claim verification with VerT5erini. In Holderness, E.et al.(eds.) Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, 94–103 (Association for Computational Linguistics, online, 2021). 19/20

work page 2021
[32]

In Muresan, S., Nakov, P

Wright, D.et al.Generating scientific claims for zero-shot scientific fact checking. In Muresan, S., Nakov, P. & Villavicencio, A. (eds.)Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2448–2460, DOI: 10.18653/v1/2022.acl-long.175 (Association for Computational Linguistics, Dublin, Ireland, 2022)

work page doi:10.18653/v1/2022.acl-long.175 2022
[33]

& Peng, Y

Zhang, J., Qian, J., Zhou, Y . & Peng, Y . Enhancing health fact-checking with llm-generated synthetic data (2025). 2508.20525. 45.Guo, D.et al.Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature645, 633–638 (2025)

work page arXiv 2025
[34]

Shao, Z.et al.Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

In Carpuat, M., de Marneffe, M.-C

Wadden, D.et al.MultiVerS: Improving scientific claim verification with weak supervision and full-document context. In Carpuat, M., de Marneffe, M.-C. & Meza Ruiz, I. V . (eds.)Findings of the Association for Computational Linguistics: NAACL 2022, 61–76, DOI: 10.18653/v1/2022.findings-naacl.6 (Association for Computational Linguistics, Seattle, United Sta...

work page doi:10.18653/v1/2022.findings-naacl.6 2022
[36]

W.et al.Database resources of the national center for biotechnology information.Nucleic acids research49, D10–D17 (2021)

Sayers, E. W.et al.Database resources of the national center for biotechnology information.Nucleic acids research49, D10–D17 (2021)

work page 2021
[37]

C., Wei, C.-H., Islamaj Do ˘gan, R

Comeau, D. C., Wei, C.-H., Islamaj Do ˘gan, R. & Lu, Z. Pmc text mining subset in bioc: about three million full-text articles and growing.Bioinformatics35, 3533–3535 (2019)

work page 2019
[38]

& Jordan, Z

Munn, Z., Stern, C., Aromataris, E., Lockwood, C. & Jordan, Z. What kind of systematic review should i conduct? a proposed typology and guidance for systematic reviewers in the medical and health sciences.BMC medical research methodology18, 5 (2018). 20/20

work page 2018

[1] [1]

& Matthes, F

Vladika, J. & Matthes, F. Scientific fact-checking: A survey of resources and approaches. In Rogers, A., Boyd- Graber, J. & Okazaki, N. (eds.)Findings of the Association for Computational Linguistics: ACL 2023, 6215–6230, DOI: 10.18653/v1/2023.findings-acl.387 (Association for Computational Linguistics, Toronto, Canada, 2023)

work page doi:10.18653/v1/2023.findings-acl.387 2023

[2] [2]

& Vlachos, A

Guo, Z., Schlichtkrull, M. & Vlachos, A. A survey on automated fact-checking.Transactions Assoc. for Comput. Linguist. 10, 178–206, DOI: 10.1162/tacl_a_00454 (2022). https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00454/1987018/ tacl_a_00454.pdf

work page doi:10.1162/tacl_a_00454 2022

[3] [3]

MMLU-CF: A contamination- free multi-task language understanding benchmark

Wadden, D.et al.SciFact-open: Towards open-domain scientific claim verification. In Goldberg, Y ., Kozareva, Z. & Zhang, Y . (eds.)Findings of the Association for Computational Linguistics: EMNLP 2022, 4719–4734, DOI: 10.18653/v1/ 2022.findings-emnlp.347 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022). 4.Petroni, F.et al...

work page doi:10.18653/v1/ 2022

[4] [4]

& Shaik, R

Zuccon, G., Koopman, B. & Shaik, R. Chatgpt hallucinates when attributing answers. InProceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP ’23, 46–51, DOI: 10.1145/3624918.3625329 (Association for Computing Machinery, New York, NY , USA, 2023)

work page doi:10.1145/3624918.3625329 2023

[5] [5]

& Liang, P

Liu, N., Zhang, T. & Liang, P. Evaluating verifiability in generative search engines. In Bouamor, H., Pino, J. & Bali, K. (eds.)Findings of the Association for Computational Linguistics: EMNLP 2023, 7001–7025, DOI: 10.18653/v1/2023. findings-emnlp.467 (Association for Computational Linguistics, Singapore, 2023)

work page doi:10.18653/v1/2023 2023

[6] [6]

Commun.16, 3615 (2025)

Wu, K.et al.An automated framework for assessing how well llms cite relevant medical references.Nat. Commun.16, 3615 (2025)

work page 2025

[7] [7]

Jin, Q., Leaman, R. & Lu, Z. Retrieve, summarize, and verify: how will chatgpt affect information seeking from the medical literature?J. Am. Soc. Nephrol.34, 1302–1304 (2023)

work page 2023

[8] [8]

Augenstein, I.et al.Factuality challenges in the era of large language models and opportunities for fact-checking.Nat. Mach. Intell.6, 852–863 (2024)

work page 2024

[9] [9]

Asai, A.et al.Openscholar: Synthesizing scientific literature with retrieval-augmented lms.arXiv preprint arXiv:2411.14199(2024)

work page arXiv 2024

[10] [10]

& Pilehvar, M

Wang, X.et al.MedCite: Can language models generate verifiable text for medicine? In Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T. (eds.)Findings of the Association for Computational Linguistics: ACL 2025, 18891–18913, DOI: 10.18653/v1/2025.findings-acl.967 (Association for Computational Linguistics, Vienna, Austria, 2025). 12.Thirunavukarasu, A. J....

work page doi:10.18653/v1/2025.findings-acl.967 2025

[11] [11]

Tian, S.et al.Opportunities and challenges for chatgpt and large language models in biomedicine and health.Briefings Bioinforma.25(2023)

work page 2023

[12] [12]

Surv.56, 1–52 (2023)

Wang, B.et al.Pre-trained language models in biomedical domain: A systematic survey.ACM Comput. Surv.56, 1–52 (2023)

work page 2023

[13] [13]

H., Entwistle, D

Shah, N. H., Entwistle, D. & Pfeffer, M. A. Creation and adoption of large language models in medicine.Jama330, 866–869 (2023)

work page 2023

[14] [14]

He, Y .et al.Foundation model for advancing healthcare: Challenges, opportunities and future directions.IEEE Rev. Biomed. Eng.(2024)

work page 2024

[15] [15]

A., Gui, H., Rezaei, S

Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review.Annals internal medicine177, 210–220 (2024). 19.Liu, F.et al.Application of large language models in medicine.Nat. Rev. Bioeng.1–20 (2025). 20.Singhal, K.et al.Large language models encode clinical knowledge.Natu...

work page 2024

[16] [16]

Nori, H.et al.Can generalist foundation models outcompete special-purpose tuning? case study in medicine.arXiv preprint arXiv:2311.16452(2023)

work page arXiv 2023

[17] [17]

Heal.6, e379–e381 (2024)

Chen, S.et al.The effect of using a large language model to respond to patient messages.The Lancet Digit. Heal.6, e379–e381 (2024)

work page 2024

[18] [18]

InMachine Learning for Healthcare Conference, 846–862 (PMLR, 2023)

Wong, C.et al.Scaling clinical trial matching using large language models: A case study in oncology. InMachine Learning for Healthcare Conference, 846–862 (PMLR, 2023). 24.Jin, Q.et al.Matching patients to clinical trials with large language models.Nat. communications15, 9074 (2024). 25.Wornow, M.et al.Zero-shot clinical trial patient matching with llms.N...

work page 2023

[19] [19]

The Llama 3 Herd of Models

Wang, F.et al.A survey on small language models in the era of large language models: Architecture, capabilities, and trustworthiness. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 6173–6183 (2025). 27.Grattafiori, A.et al.The llama 3 herd of models (2024). 2407.21783. 28.Qwenet al.Qwen2.5 technical report (2...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Fact or Fiction: Verifying Scientific Claims

Wadden, D.et al.Fact or fiction: Verifying scientific claims. In Webber, B., Cohn, T., He, Y . & Liu, Y . (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7534–7550, DOI: 10.18653/v1/2020.emnlp-main.609 (Association for Computational Linguistics, Online, 2020)

work page doi:10.18653/v1/2020.emnlp-main.609 2020

[21] [21]

& Demner-Fushman, D

Gupta, D., Bartels, D. & Demner-Fushman, D. A dataset of medical questions paired with automatically generated answers and evidence-supported references.Sci. Data12, 1035 (2025)

work page 2025

[22] [22]

& Demner-Fushman, D

Sarrouti, M., Ben Abacha, A., Mrabet, Y . & Demner-Fushman, D. Evidence-based fact-checking of health-related claims. In Moens, M.-F., Huang, X., Specia, L. & Yih, S. W.-t. (eds.)Findings of the Association for Computational Linguistics: EMNLP 2021, 3499–3512, DOI: 10.18653/v1/2021.findings-emnlp.297 (Association for Computational Linguistics, Punta Cana,...

work page doi:10.18653/v1/2021.findings-emnlp.297 2021

[23] [23]

Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: A dataset for biomedical research question answering. In Inui, K., Jiang, J., Ng, V . & Wan, X. (eds.)Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2567–2577, DOI: 10...

work page doi:10.18653/v1/d19-1259 2019

[24] [24]

& Paliouras, G

Krithara, A., Nentidis, A., Bougiatiotis, K. & Paliouras, G. Bioasq-qa: A manually curated corpus for biomedical question answering.Sci. Data10, 170 (2023). 34.OpenAI. GPT-5 System Card. Tech. Rep., OpenAI (2025). PDF

work page 2023

[25] [25]

GPT-4o System Card

Sayers, E. W.et al.Database resources of the national center for biotechnology information in 2025.Nucleic acids research 53, D20–D29 (2025). 36.Hurst, A.et al.Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Jin, Q.et al.Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval.Bioinformatics39, btad651 (2023)

work page 2023

[27] [27]

Jin, Q., Leaman, R. & Lu, Z. Pubmed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine100(2024)

work page 2024

[28] [28]

Search still matters: information retrieval in the era of generative ai.J

Hersh, W. Search still matters: information retrieval in the era of generative ai.J. Am. Med. Informatics Assoc.31, 2159–2161 (2024)

work page 2024

[29] [29]

Fiorini, N., Leaman, R., Lipman, D. J. & Lu, Z. How user intelligence is improving pubmed.Nat. biotechnology36, 937–945 (2018)

work page 2018

[30] [30]

& Lin, J

Pradeep, R., Ma, X., Nogueira, R. & Lin, J. Scientific claim verification with VerT5erini. In Holderness, E.et al.(eds.) Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, 94–103 (Association for Computational Linguistics, online, 2021). 19/20

work page 2021

[31] [32]

In Muresan, S., Nakov, P

Wright, D.et al.Generating scientific claims for zero-shot scientific fact checking. In Muresan, S., Nakov, P. & Villavicencio, A. (eds.)Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2448–2460, DOI: 10.18653/v1/2022.acl-long.175 (Association for Computational Linguistics, Dublin, Ireland, 2022)

work page doi:10.18653/v1/2022.acl-long.175 2022

[32] [33]

& Peng, Y

Zhang, J., Qian, J., Zhou, Y . & Peng, Y . Enhancing health fact-checking with llm-generated synthetic data (2025). 2508.20525. 45.Guo, D.et al.Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature645, 633–638 (2025)

work page arXiv 2025

[33] [34]

Shao, Z.et al.Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [35]

In Carpuat, M., de Marneffe, M.-C

Wadden, D.et al.MultiVerS: Improving scientific claim verification with weak supervision and full-document context. In Carpuat, M., de Marneffe, M.-C. & Meza Ruiz, I. V . (eds.)Findings of the Association for Computational Linguistics: NAACL 2022, 61–76, DOI: 10.18653/v1/2022.findings-naacl.6 (Association for Computational Linguistics, Seattle, United Sta...

work page doi:10.18653/v1/2022.findings-naacl.6 2022

[35] [36]

W.et al.Database resources of the national center for biotechnology information.Nucleic acids research49, D10–D17 (2021)

Sayers, E. W.et al.Database resources of the national center for biotechnology information.Nucleic acids research49, D10–D17 (2021)

work page 2021

[36] [37]

C., Wei, C.-H., Islamaj Do ˘gan, R

Comeau, D. C., Wei, C.-H., Islamaj Do ˘gan, R. & Lu, Z. Pmc text mining subset in bioc: about three million full-text articles and growing.Bioinformatics35, 3533–3535 (2019)

work page 2019

[37] [38]

& Jordan, Z

Munn, Z., Stern, C., Aromataris, E., Lockwood, C. & Jordan, Z. What kind of systematic review should i conduct? a proposed typology and guidance for systematic reviewers in the medical and health sciences.BMC medical research methodology18, 5 (2018). 20/20

work page 2018