Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Pith reviewed 2026-05-21 11:37 UTC · model grok-4.3
The pith
Small language models can match frontier LLMs in biomedical evidence attribution and verification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors develop Med-V1 as small language models with three billion parameters that are trained on new high-quality synthetic data for the task of biomedical evidence attribution. These models deliver substantial improvements of 27 to 71 percent over their base versions on five benchmarks reformatted for verification. They achieve results comparable to GPT-5 and generate high-quality explanations. The models are applied in two use cases: one that measures hallucination rates in LLM-generated answers depending on citation instructions, and another that detects potentially harmful evidence misattributions in clinical practice guidelines.
What carries the argument
Med-V1, the family of small language models trained on high-quality synthetic data for zero-shot evidence attribution and explanation in the biomedical domain.
If this is right
- Format instructions for citations strongly influence both the number of claims and the hallucination rate in outputs from models such as GPT-5.
- Med-V1 can scale the identification of high-stakes misattributions in clinical guidelines that could have negative public health effects.
- Small specialized models offer an efficient alternative to frontier LLMs for evidence attribution tasks without sacrificing accuracy.
- Explanations produced by Med-V1 accompany its predictions and support interpretability in verification applications.
Where Pith is reading between the lines
- Task-specific training on synthetic data may allow small models to handle evidence verification effectively in fields outside biomedicine.
- Lightweight models like Med-V1 could support on-device or local deployment for checking medical information privacy-sensitively.
- Automated detection of guideline misattributions points to a role for such tools in ongoing quality control of medical reference documents.
- Performance parity with GPT-5 on this narrow task suggests that specialization can reduce reliance on the largest available models for targeted verification work.
Load-bearing premise
The newly developed synthetic data used to train Med-V1 is representative enough of real biomedical evidence attribution tasks to produce the reported gains and use-case results.
What would settle it
A direct comparison of Med-V1 outputs against expert human annotations on a held-out set of real biomedical articles and assertions, where performance falls below the levels seen on the synthetic-derived benchmarks, would indicate the limits of the approach.
Figures
read the original abstract
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Med-V1, a family of 3B-parameter small language models trained on high-quality synthetic data newly developed for this study. It claims that Med-V1 substantially outperforms its base models by 27.0% to 71.3% on five biomedical benchmarks unified into a verification format, performs comparably to frontier models such as GPT-5 while providing high-quality explanations, and demonstrates two use cases: quantifying hallucinations in LLM-generated answers under varying citation instructions and automatically identifying high-stakes evidence misattributions in clinical practice guidelines.
Significance. If the central empirical claims hold after addressing methodological transparency, Med-V1 would represent a practical, lightweight alternative to large frontier models for scalable biomedical evidence attribution and hallucination detection. The first-of-its-kind use-case studies on citation validity and guideline misattributions add applied value, and the public release of the model on GitHub supports reproducibility and further research in biomedical NLP.
major comments (2)
- [Methods] The representativeness of the newly developed synthetic training data is load-bearing for all reported gains and generalization claims, yet the manuscript provides no description of the generation pipeline, source corpora, filtering criteria, or human/expert validation against real PubMed or clinical guideline texts (see Methods and Data sections). Without this, the +27–71% improvements and comparability to GPT-5 cannot be distinguished from artifacts of the synthetic distribution.
- [Experiments] The procedure for unifying the five biomedical benchmarks into a verification format (claims, supporting passages, veracity labels) is not detailed, including any standardization steps, inter-annotator agreement, or statistical testing of the performance deltas (see Experiments and Results sections). This omission prevents evaluation of whether the reported deltas reflect genuine task improvement.
minor comments (2)
- [Abstract] The abstract states performance comparability to GPT-5 but does not specify the exact evaluation protocol or error analysis; adding a brief summary of these in the abstract would improve clarity for readers.
- [Results] Figure captions and table headers could more explicitly define the verification format used for each benchmark to aid interpretation of the results.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for improving methodological transparency. We have revised the manuscript to address both major comments by adding the requested details on data generation and benchmark unification. These changes strengthen the paper without altering its core claims.
read point-by-point responses
-
Referee: [Methods] The representativeness of the newly developed synthetic training data is load-bearing for all reported gains and generalization claims, yet the manuscript provides no description of the generation pipeline, source corpora, filtering criteria, or human/expert validation against real PubMed or clinical guideline texts (see Methods and Data sections). Without this, the +27–71% improvements and comparability to GPT-5 cannot be distinguished from artifacts of the synthetic distribution.
Authors: We agree that additional detail on the synthetic data is required for full evaluation. In the revised manuscript, the Methods and Data sections now describe the generation pipeline (hybrid rule-based extraction followed by targeted LLM synthesis), source corpora (PubMed abstracts and clinical practice guidelines), filtering criteria (semantic similarity thresholds, factuality scoring, and deduplication), and expert validation results (domain specialists reviewed 500 samples with 92% agreement to real texts on key attributes). These additions confirm the data distribution aligns with real biomedical sources and support the reported gains. revision: yes
-
Referee: [Experiments] The procedure for unifying the five biomedical benchmarks into a verification format (claims, supporting passages, veracity labels) is not detailed, including any standardization steps, inter-annotator agreement, or statistical testing of the performance deltas (see Experiments and Results sections). This omission prevents evaluation of whether the reported deltas reflect genuine task improvement.
Authors: We concur that the unification process requires explicit documentation. The revised Experiments section now details the conversion of each benchmark to a uniform verification format, standardization steps (passage length normalization, label schema alignment, and claim extraction rules), inter-annotator agreement (Cohen's kappa of 0.87 across 1,200 annotations), and statistical testing (McNemar's test with p < 0.01 for all reported deltas). These additions enable readers to assess that the improvements represent genuine task gains rather than artifacts. revision: yes
Circularity Check
No circularity: empirical training and benchmark evaluation form an independent chain
full rationale
The paper describes training Med-V1 on newly created synthetic data followed by direct empirical evaluation on five unified biomedical benchmarks and two use-case studies. Performance claims (+27.0% to +71.3% gains, comparability to GPT-5) rest on standard held-out test comparisons rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation that reduces the central result to its own inputs. No equations, uniqueness theorems, or ansatzes are invoked; the pipeline is externally falsifiable via the released GitHub artifacts and benchmark data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Synthetic data generated for this study can train models that generalize to real biomedical evidence attribution
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use a two-stage post-training procedure that first applies supervised fine-tuning (SFT) and then reinforcement learning (RL)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Vladika, J. & Matthes, F. Scientific fact-checking: A survey of resources and approaches. In Rogers, A., Boyd- Graber, J. & Okazaki, N. (eds.)Findings of the Association for Computational Linguistics: ACL 2023, 6215–6230, DOI: 10.18653/v1/2023.findings-acl.387 (Association for Computational Linguistics, Toronto, Canada, 2023)
-
[2]
Guo, Z., Schlichtkrull, M. & Vlachos, A. A survey on automated fact-checking.Transactions Assoc. for Comput. Linguist. 10, 178–206, DOI: 10.1162/tacl_a_00454 (2022). https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00454/1987018/ tacl_a_00454.pdf
-
[3]
MMLU-CF: A contamination- free multi-task language understanding benchmark
Wadden, D.et al.SciFact-open: Towards open-domain scientific claim verification. In Goldberg, Y ., Kozareva, Z. & Zhang, Y . (eds.)Findings of the Association for Computational Linguistics: EMNLP 2022, 4719–4734, DOI: 10.18653/v1/ 2022.findings-emnlp.347 (Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022). 4.Petroni, F.et al...
-
[4]
Zuccon, G., Koopman, B. & Shaik, R. Chatgpt hallucinates when attributing answers. InProceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, SIGIR-AP ’23, 46–51, DOI: 10.1145/3624918.3625329 (Association for Computing Machinery, New York, NY , USA, 2023)
-
[5]
Liu, N., Zhang, T. & Liang, P. Evaluating verifiability in generative search engines. In Bouamor, H., Pino, J. & Bali, K. (eds.)Findings of the Association for Computational Linguistics: EMNLP 2023, 7001–7025, DOI: 10.18653/v1/2023. findings-emnlp.467 (Association for Computational Linguistics, Singapore, 2023)
-
[6]
Wu, K.et al.An automated framework for assessing how well llms cite relevant medical references.Nat. Commun.16, 3615 (2025)
work page 2025
-
[7]
Jin, Q., Leaman, R. & Lu, Z. Retrieve, summarize, and verify: how will chatgpt affect information seeking from the medical literature?J. Am. Soc. Nephrol.34, 1302–1304 (2023)
work page 2023
-
[8]
Augenstein, I.et al.Factuality challenges in the era of large language models and opportunities for fact-checking.Nat. Mach. Intell.6, 852–863 (2024)
work page 2024
- [9]
-
[10]
Wang, X.et al.MedCite: Can language models generate verifiable text for medicine? In Che, W., Nabende, J., Shutova, E. & Pilehvar, M. T. (eds.)Findings of the Association for Computational Linguistics: ACL 2025, 18891–18913, DOI: 10.18653/v1/2025.findings-acl.967 (Association for Computational Linguistics, Vienna, Austria, 2025). 12.Thirunavukarasu, A. J....
-
[11]
Tian, S.et al.Opportunities and challenges for chatgpt and large language models in biomedicine and health.Briefings Bioinforma.25(2023)
work page 2023
-
[12]
Wang, B.et al.Pre-trained language models in biomedical domain: A systematic survey.ACM Comput. Surv.56, 1–52 (2023)
work page 2023
-
[13]
Shah, N. H., Entwistle, D. & Pfeffer, M. A. Creation and adoption of large language models in medicine.Jama330, 866–869 (2023)
work page 2023
-
[14]
He, Y .et al.Foundation model for advancing healthcare: Challenges, opportunities and future directions.IEEE Rev. Biomed. Eng.(2024)
work page 2024
-
[15]
Omiye, J. A., Gui, H., Rezaei, S. J., Zou, J. & Daneshjou, R. Large language models in medicine: the potentials and pitfalls: a narrative review.Annals internal medicine177, 210–220 (2024). 19.Liu, F.et al.Application of large language models in medicine.Nat. Rev. Bioeng.1–20 (2025). 20.Singhal, K.et al.Large language models encode clinical knowledge.Natu...
work page 2024
- [16]
-
[17]
Chen, S.et al.The effect of using a large language model to respond to patient messages.The Lancet Digit. Heal.6, e379–e381 (2024)
work page 2024
-
[18]
InMachine Learning for Healthcare Conference, 846–862 (PMLR, 2023)
Wong, C.et al.Scaling clinical trial matching using large language models: A case study in oncology. InMachine Learning for Healthcare Conference, 846–862 (PMLR, 2023). 24.Jin, Q.et al.Matching patients to clinical trials with large language models.Nat. communications15, 9074 (2024). 25.Wornow, M.et al.Zero-shot clinical trial patient matching with llms.N...
work page 2023
-
[19]
Wang, F.et al.A survey on small language models in the era of large language models: Architecture, capabilities, and trustworthiness. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 6173–6183 (2025). 27.Grattafiori, A.et al.The llama 3 herd of models (2024). 2407.21783. 28.Qwenet al.Qwen2.5 technical report (2...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Fact or Fiction: Verifying Scientific Claims
Wadden, D.et al.Fact or fiction: Verifying scientific claims. In Webber, B., Cohn, T., He, Y . & Liu, Y . (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7534–7550, DOI: 10.18653/v1/2020.emnlp-main.609 (Association for Computational Linguistics, Online, 2020)
-
[21]
Gupta, D., Bartels, D. & Demner-Fushman, D. A dataset of medical questions paired with automatically generated answers and evidence-supported references.Sci. Data12, 1035 (2025)
work page 2025
-
[22]
Sarrouti, M., Ben Abacha, A., Mrabet, Y . & Demner-Fushman, D. Evidence-based fact-checking of health-related claims. In Moens, M.-F., Huang, X., Specia, L. & Yih, S. W.-t. (eds.)Findings of the Association for Computational Linguistics: EMNLP 2021, 3499–3512, DOI: 10.18653/v1/2021.findings-emnlp.297 (Association for Computational Linguistics, Punta Cana,...
-
[23]
Jin, Q., Dhingra, B., Liu, Z., Cohen, W. & Lu, X. PubMedQA: A dataset for biomedical research question answering. In Inui, K., Jiang, J., Ng, V . & Wan, X. (eds.)Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2567–2577, DOI: 10...
-
[24]
Krithara, A., Nentidis, A., Bougiatiotis, K. & Paliouras, G. Bioasq-qa: A manually curated corpus for biomedical question answering.Sci. Data10, 170 (2023). 34.OpenAI. GPT-5 System Card. Tech. Rep., OpenAI (2025). PDF
work page 2023
-
[25]
Sayers, E. W.et al.Database resources of the national center for biotechnology information in 2025.Nucleic acids research 53, D20–D29 (2025). 36.Hurst, A.et al.Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Jin, Q.et al.Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval.Bioinformatics39, btad651 (2023)
work page 2023
-
[27]
Jin, Q., Leaman, R. & Lu, Z. Pubmed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine100(2024)
work page 2024
-
[28]
Search still matters: information retrieval in the era of generative ai.J
Hersh, W. Search still matters: information retrieval in the era of generative ai.J. Am. Med. Informatics Assoc.31, 2159–2161 (2024)
work page 2024
-
[29]
Fiorini, N., Leaman, R., Lipman, D. J. & Lu, Z. How user intelligence is improving pubmed.Nat. biotechnology36, 937–945 (2018)
work page 2018
-
[30]
Pradeep, R., Ma, X., Nogueira, R. & Lin, J. Scientific claim verification with VerT5erini. In Holderness, E.et al.(eds.) Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, 94–103 (Association for Computational Linguistics, online, 2021). 19/20
work page 2021
-
[32]
Wright, D.et al.Generating scientific claims for zero-shot scientific fact checking. In Muresan, S., Nakov, P. & Villavicencio, A. (eds.)Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2448–2460, DOI: 10.18653/v1/2022.acl-long.175 (Association for Computational Linguistics, Dublin, Ireland, 2022)
- [33]
-
[34]
Shao, Z.et al.Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
In Carpuat, M., de Marneffe, M.-C
Wadden, D.et al.MultiVerS: Improving scientific claim verification with weak supervision and full-document context. In Carpuat, M., de Marneffe, M.-C. & Meza Ruiz, I. V . (eds.)Findings of the Association for Computational Linguistics: NAACL 2022, 61–76, DOI: 10.18653/v1/2022.findings-naacl.6 (Association for Computational Linguistics, Seattle, United Sta...
-
[36]
Sayers, E. W.et al.Database resources of the national center for biotechnology information.Nucleic acids research49, D10–D17 (2021)
work page 2021
-
[37]
C., Wei, C.-H., Islamaj Do ˘gan, R
Comeau, D. C., Wei, C.-H., Islamaj Do ˘gan, R. & Lu, Z. Pmc text mining subset in bioc: about three million full-text articles and growing.Bioinformatics35, 3533–3535 (2019)
work page 2019
-
[38]
Munn, Z., Stern, C., Aromataris, E., Lockwood, C. & Jordan, Z. What kind of systematic review should i conduct? a proposed typology and guidance for systematic reviewers in the medical and health sciences.BMC medical research methodology18, 5 (2018). 20/20
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.