arxiv: 2311.16079 · v1 · pith:RJTES52Ynew · submitted 2023-11-27 · 💻 cs.CL · cs.AI · cs.LG

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Pith reviewed 2026-05-21 14:03 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: large language models, medical domain adaptation, open-source LLMs, continued pretraining, medical benchmarks, model scaling

The pith

MEDITRON-70B shows that open-source medical language models can outperform GPT-3.5 and come close to GPT-4 after continued pretraining on domain data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MEDITRON, a pair of open-source large language models with 7 billion and 70 billion parameters that receive additional pretraining on medical text. The authors start from Llama-2 and extend its pretraining on a corpus of selected PubMed articles, abstracts, and internationally recognized medical guidelines. They evaluate the models on four established medical benchmarks, both before and after task-specific finetuning. The reported results show gains over other public models of comparable size; MEDITRON-70B surpasses GPT-3.5 and Med-PaLM and stays within 5 percent of GPT-4 and 10 percent of Med-PaLM-2. A reader would care because the work releases both the model weights and the code used to build the training corpus, lowering the barrier to capable medical language models.

Core claim

MEDITRON-70B achieves a 6 percent absolute performance gain over the best public baseline in its parameter class and 3 percent over the strongest baseline finetuned from the base model. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5 percent of GPT-4 and 10 percent of Med-PaLM-2 on the four major medical benchmarks.
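The difference between an absolute and a relative gain is easy to blur when reading headline numbers like these. The following is a minimal sketch in Python, assuming the gains are differences in accuracy averaged with equal weight over the four benchmarks (this page does not spell out the aggregation). The benchmark names and every score below are hypothetical placeholders, not results from the paper.

```python
def macro_average(scores):
    """Unweighted mean accuracy (in percent) across benchmarks."""
    return sum(scores.values()) / len(scores)


def absolute_gain(model_acc, baseline_acc):
    """Gain in percentage points: a plain difference of accuracies."""
    return model_acc - baseline_acc


def relative_gain(model_acc, baseline_acc):
    """Gain as a percentage of the baseline accuracy."""
    return 100.0 * (model_acc - baseline_acc) / baseline_acc


# HYPOTHETICAL placeholder accuracies (percent); not taken from the paper.
baseline_scores = {"bench_a": 60.0, "bench_b": 70.0, "bench_c": 50.0, "bench_d": 64.0}
model_scores = {"bench_a": 66.0, "bench_b": 76.0, "bench_c": 56.0, "bench_d": 70.0}

baseline_avg = macro_average(baseline_scores)
model_avg = macro_average(model_scores)

print("absolute gain (points):", absolute_gain(model_avg, baseline_avg))
print("relative gain (percent):", round(relative_gain(model_avg, baseline_avg), 2))
```

With these placeholder values the same improvement reads as 6 points in absolute terms but close to 10 percent in relative terms, so it matters which of the two a reported figure uses.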

What carries the argument

Continued pretraining on a curated corpus of PubMed articles, abstracts, and recognized medical guidelines applied to a base large language model through an adapted distributed trainer.

If this is right

  • Medical large language models obtain measurable gains from additional pretraining on domain-specific text at the 70 billion parameter scale.
  • Public release of the model weights and the corpus curation code enables community efforts to build on the same base for medical applications.
  • Task-specific finetuning after the medical pretraining step produces further improvements on targeted medical tasks.
  • The 70 billion parameter version delivers higher medical benchmark scores than the 7 billion parameter version after identical adaptation steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretraining approach on specialized corpora could be tested in other knowledge-heavy domains to create competitive open models.
  • Deployment in actual medical workflows would reveal whether the benchmark gains correspond to useful improvements in real assistance tasks.
  • Larger scales or additional data modalities such as clinical notes could be examined to narrow remaining gaps with the strongest closed models.

Load-bearing premise

The four major medical benchmarks used provide a representative and reliable measure of real-world medical knowledge and reasoning that generalizes beyond the test sets.

What would settle it

A new medical reasoning test or real-world clinical evaluation on which MEDITRON-70B shows substantially lower relative performance than the reported gaps on the original four benchmarks.

Original abstract

Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia's Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a 6% absolute performance gain over the best public baseline in its parameter class and 3% over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MEDITRON, a suite of open-source 7B and 70B LLMs obtained by continued pretraining of Llama-2 on a curated medical corpus of selected PubMed articles, abstracts, and internationally-recognized medical guidelines. It reports 6% absolute gains over the best public baseline and 3% over a finetuned Llama-2 baseline across four major medical benchmarks, with MEDITRON-70B outperforming GPT-3.5 and Med-PaLM while remaining within 5% of GPT-4 and 10% of Med-PaLM-2. The authors release code for corpus curation and the model weights.

Significance. If the benchmark improvements reflect genuine gains from domain-adaptive pretraining rather than contamination, the work supplies the first open-source 70B-scale medical LLM together with reproducible curation code and weights. This directly supports further research on accessible medical AI and provides a concrete baseline for scaling domain pretraining in the 70B regime.

major comments (2)
  1. §3 (Medical Pretraining Corpus): The description of the curated corpus from PubMed articles, abstracts, and medical guidelines contains no mention of decontamination, n-gram overlap checks, or exclusion of items from the four benchmark test sets. Because the central claim that MEDITRON-70B outperforms GPT-3.5/Med-PaLM and approaches GPT-4/Med-PaLM-2 rests on these benchmarks measuring improved reasoning, the absence of leakage controls is load-bearing and must be addressed with explicit overlap statistics.
  2. Evaluation section: The reported 6% and 3% absolute gains and the closed-model comparisons are presented without error bars, ablation studies on the continued-pretraining hyperparameters, or a full description of the evaluation protocol (prompt templates, decoding settings, and whether the same protocol was used for GPT-4/Med-PaLM-2). These omissions prevent assessment of whether the gains are statistically reliable or protocol-dependent.
minor comments (2)
  1. A single table collating exact scores for MEDITRON-70B, all baselines, and the closed models on each of the four benchmarks would improve readability of the main results.
  2. [Abstract] The abstract states performance numbers without citing the specific benchmark names or the exact public baselines; adding these references in the abstract would help readers immediately locate the comparisons.
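
The n-gram overlap check requested in major comment 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' curation code: the function names, the whitespace-and-punctuation tokenization, and the choice of n are all assumptions made for the example, and the two test questions are invented.

```python
import re


def ngrams(text, n):
    """Set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_report(corpus_docs, test_items, n=8):
    """Return (overlap rate, flagged items).

    A test item is flagged when it shares at least one n-gram with any
    document in the pretraining corpus.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & corpus_grams]
    rate = len(flagged) / len(test_items) if test_items else 0.0
    return rate, flagged


# Invented toy data; n is kept small so the overlap is visible.
corpus = ["Aspirin irreversibly inhibits cyclooxygenase in platelets and reduces clotting."]
questions = [
    "Which drug irreversibly inhibits cyclooxygenase in platelets?",
    "What is the first-line treatment for hypertension?",
]
rate, flagged = overlap_report(corpus, questions, n=4)
print(rate, flagged)
```

A real pipeline would likely use the model's own tokenizer, a longer n, and a hashed index rather than an in-memory set; the sketch only shows the shape of the statistic a referee would want reported per benchmark.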

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below and have revised the manuscript to strengthen the presentation of our methods and results.

Point-by-point responses
  1. Referee: The description of the curated corpus from PubMed articles, abstracts, and medical guidelines contains no mention of decontamination, n-gram overlap checks, or exclusion of items from the four benchmark test sets. Because the central claim that MEDITRON-70B outperforms GPT-3.5/Med-PaLM and approaches GPT-4/Med-PaLM-2 rests on these benchmarks measuring improved reasoning, the absence of leakage controls is load-bearing and must be addressed with explicit overlap statistics.

    Authors: We appreciate this observation on ensuring benchmark validity. Our curation pipeline did include n-gram overlap filtering to exclude potential test-set contamination from the pretraining corpus. We will expand §3 to describe the decontamination process in detail and report the overlap statistics (negligible rates across benchmarks). The released curation code already implements these checks.
    revision: yes

  2. Referee: The reported 6% and 3% absolute gains and the closed-model comparisons are presented without error bars, ablation studies on the continued-pretraining hyperparameters, or a full description of the evaluation protocol (prompt templates, decoding settings, and whether the same protocol was used for GPT-4/Med-PaLM-2). These omissions prevent assessment of whether the gains are statistically reliable or protocol-dependent.

    Authors: We agree these details improve rigor. The revised Evaluation section now includes error bars from repeated runs, ablations on key hyperparameters (learning rate, steps), and full protocol descriptions with prompt templates and decoding settings. For closed models we report the published figures, as identical execution is not possible; this is now explicitly stated.
    revision: partial
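
One way to attach error bars to a benchmark accuracy is a percentile bootstrap over the test questions. Note that this is a different technique from the "repeated runs" the simulated response mentions: it estimates uncertainty from the finite size of the test set rather than from run-to-run variance. The sketch below is in Python with hypothetical names and synthetic correctness data; it is not drawn from the paper's evaluation code.

```python
import random


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 flags, one per test question.
    Returns (accuracy, lower bound, upper bound) as fractions in [0, 1].
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the questions with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Synthetic example: 70 correct answers out of 100 questions.
flags = [1] * 70 + [0] * 30
accuracy, lower, upper = bootstrap_ci(flags)
print(accuracy, lower, upper)
```

If the intervals of two models overlap heavily on a benchmark, a gap of a few points between them is weak evidence on its own.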

Circularity Check

0 steps flagged

No circularity: purely empirical model release with external benchmarks

Full rationale

This is an empirical paper reporting pretraining of LLMs on a curated medical corpus followed by evaluation on four standard medical benchmarks. No derivations, equations, fitted predictions, or first-principles claims exist that could reduce to self-defined inputs. Results are presented as direct benchmark scores against external baselines and closed models; no self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core performance claims. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the effectiveness of continued pretraining for domain adaptation and on the assumption that the selected medical corpus improves downstream medical task performance. No new entities are postulated.

free parameters (1)
  • continued_pretraining_hyperparameters
    Learning rate, number of tokens, and other training choices for the medical adaptation phase are selected but not enumerated in the abstract.
axioms (1)
  • domain assumption: The curated selection of PubMed articles, abstracts, and medical guidelines constitutes high-quality, representative data for improving LLM medical capabilities.
    The paper's performance claims depend on this data curation step being effective.

pith-pipeline@v0.9.0 · 5901 in / 1202 out tokens · 79817 ms · 2026-05-21T14:03:45.586376+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines... Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within 5% of GPT-4 and 10% of Med-PaLM-2.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi
    Relation between the paper passage and the cited Recognition theorem: unclear

    extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

    cs.HC 2024-05 conditional novelty 8.0

    AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences acros...

  2. Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

    cs.AI 2026-05 conditional novelty 7.0

    Presents the first fully open pipeline for clinical LLMs that unifies eight public QA datasets with clinician-vetted synthetic data from guidelines and vignettes, achieving improved performance on medical benchmarks w...

  3. MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical...

  4. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.

  5. Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

    cs.AI 2026-04 unverdicted novelty 7.0

    MED-VRAG reaches 78.6% average accuracy on four medical QA benchmarks by iteratively retrieving PMC page images with ColQwen2.5 embeddings and a VLM that refines queries over up to three rounds.

  6. Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus

    cs.CL 2026-04 unverdicted novelty 7.0

    Domain-adaptive pre-training on a new French health corpus yields limited gains and risks general capability loss unless followed by model merging, which can even boost specialized performance.

  7. BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

    cs.CL 2026-04 unverdicted novelty 7.0

    BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

  8. Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

    cs.CL 2026-05 unverdicted novelty 6.0

    Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.

  9. From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...

  10. Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

    cs.CL 2026-05 unverdicted novelty 6.0

    Freezing deep layers and training shallow layers during continued pre-training of LLMs outperforms full fine-tuning and the opposite allocation on C-Eval and CMMLU, guided by a new layer-sensitivity diagnostic.

  11. CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

  12. Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization

    cs.LG 2026-03 unverdicted novelty 6.0

    CAMEL is a scaling law capturing nonlinear model-size and mixture interactions to extrapolate optimal data mixtures for large LLMs from small-model experiments, reducing optimization cost by 50% and improving benchmar...

  13. CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

    cs.AI 2026-01 unverdicted novelty 6.0

    CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.

  14. AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

    cs.AI 2025-09 unverdicted novelty 6.0

    AgenticEval is a multi-agent framework that ingests unstructured policies to generate and self-evolve comprehensive safety benchmarks for LLMs, with experiments showing declining safety rates as tests harden.

  15. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    cs.CL 2024-12 unverdicted novelty 6.0

    HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

  16. RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

    cs.CL 2026-05 unverdicted novelty 5.0

    LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.

  17. Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

    cs.LG 2026-04 unverdicted novelty 5.0

    Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.

  18. Elder-Sim: A Psychometrically Validated Platform for Personality-Stable Elderly Digital Twins

    cs.HC 2026-03 unverdicted novelty 5.0

    ELDER-SIM builds personality-stable elderly digital twins via LLM orchestration with OCEAN traits, Beck CBT diagrams, long-term memory, and LoRA fine-tuning on CHARLS data, validated by Cronbach's alpha 0.70-0.94 and ...

  19. Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning

    cs.CL 2026-05 unverdicted novelty 4.0

    Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.

  20. Language corpora for the Dutch medical domain

    cs.CL 2026-04 unverdicted novelty 4.0

    A 35-billion-token Dutch medical corpus was assembled from translated, mined, and extracted sources and released publicly on Hugging Face as the first large-scale resource of its kind.

  21. ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

    eess.SP 2026-04 unverdicted novelty 3.0

    ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 20 Pith papers · 5 internal anchors

  1. [1]

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv e-prints, page arXiv:2305.13245

  2. [2]

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Co- jocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance

  3. [3]

    Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics

  4. [4]

    Jiang, Jia Deng, Stella Biderman, and Sean Welleck

    Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. 2023. Llemma: An open language model for mathematics

  5. [5]

    Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

  6. [6]

    Berg, David Atkins, and William Tierney

    Alfred O. Berg, David Atkins, and William Tierney. 1997. Clinical practice guidelines in practice and education. Journal of General Internal Medicine, 12(S2)

  7. [7]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...

  8. [8]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  9. [9]

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4

  10. [10]

    Burns, Rod J

    Patricia B. Burns, Rod J. Rohrich, and Kevin C. Chung. 2011. The levels of evidence and their role in evidence-based medicine. Plastic and Reconstructive Surgery, 128(1):305–310

  11. [11]

    Alejandro Hernández Cano, Matteo Pagliardini, Andreas Köpf, Kyle Matoba, Amirkeivan Mo- htashami, Olivia Simin Fan, Axel Marmet, Deniz Bayazit, Igor Krawczuk, Zeming Chen, Francesco Salvi, Antoine Bosselut, and Martin Jaggi. 2023. epfLLM Megatron-LLM. https://github. com/epfLLM/Megatron-LLM

  12. [12]

    Tuhin Chakrabarty, Christopher Hidey, and Kathy McKeown. 2019. IMHO fine-tuning improves claim detection. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 558–563, Minneapolis, Minnesota. Association for Computational L...

  13. [13]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  14. [14]

    Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. Arxiv

  15. [15]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  16. [16]

    Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning

  17. [17]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness

  18. [18]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms

  19. [19]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics

  20. [20]

    Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language gen- eration. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21. ACM. REFERENCES 18

  21. [21]

    Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and C...

  22. [22]

    Leo Gao, Stella Rose Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800gb dataset of diverse text for language modeling. ArXiv, abs/2101.00027

  23. [23]

    Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):1–23

  24. [24]

    Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort

    Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. 2023. Continual pre-training of large language models: How to (re)warm your model?

  25. [25]

    Suchin Gururangan, Ana Marasovi ´c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 8342–8360, Online. Association for Computational Linguistics

  26. [26]

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar

  27. [27]

    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 3309–3326, Dublin, Ireland

    ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguis- tics (Volume 1: Long Papers), pages 3309–3326, Dublin, Ireland. Association for Computational Linguistics

  28. [28]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding

  29. [29]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring massive multitask language understanding

  30. [30]

    Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. 2019. Efficient adaptation of pretrained transformers for abstractive summarization. ArXiv, abs/1906.00138

  31. [31]

    Rae, Oriol Vinyals, and Laurent Sifre

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre...

  32. [32]

    Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classifica- tion. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics

  33. [33]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Re- nard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b

  34. [34]

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2020. What disease does this patient have? a large-scale open domain question answering dataset from medical exams

  35. [35]

    Hongpeng Jin, Wenqi Wei, Xuyu Wang, Wenbin Zhang, and Yanzhao Wu. 2023. Rethinking learning rate tuning in the era of large language models. arXiv preprint arXiv:2309.08859

  36. [36]

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, Hong Kong, China. ...

  37. [37]

    Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models

  38. [38]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large language models are zero-shot reasoners

  39. [39]

    Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing Activation Recomputation in Large Transformer Models. Arxiv

  40. [40]

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240

  41. [41]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge...

  42. [42]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

  43. [43]

    Holistic evaluation of language models

  44. [44]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics

  45. [45]

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics

  46. [46]

    Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, and Daphne Ippolito. 2023. A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, and toxicity

  47. [47]

    M42-Health. Med42 - clinical large language model. https://huggingface.co/m42-health/med42-70b. Accessed: 2023-11-05

  48. [48]

    Yingwei Ma, Yue Liu, Yue Yu, Yuanliang Zhang, Yu Jiang, Changjian Wang, and Shanshan Li. 2023. At which training stage does code data help LLMs reasoning?

  49. [49]

    Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. 2022. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1384–1403, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

  50. [50]

    Sourab Mangrulkar, Sylvain Gugger, Lewis Tunstall, and Philipp Schmid. 2023. Fine-tuning Llama 2 70b using PyTorch FSDP. https://huggingface.co/blog/ram-efficient-pytorch-fsdp. Accessed 2023-11-02

  51. [51]

    Bethesda (MD): National Library of Medicine. 2003–2023. PMC Open Access Subset. https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/. Accessed on 12/10/2023.

  52. [52]

    MosaicML NLP Team. 2023. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. Accessed: 2023-05-05

  53. [53]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for Hig...

  54. [54]

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of GPT-4 on medical challenge problems

  55. [55]

    Jesutofunmi A. Omiye, Jenna C. Lester, Simon Spichak, Veronica Rotemberg, and Roxana Daneshjou

  56. [56]

    Large language models propagate race-based medicine. npj Digital Medicine, 6(1)

  57. [57]

    OpenAI. 2023a. ChatML. https://github.com/openai/openai-python/blob/main/chatml.md. Accessed 2023-11-02

  58. [58]

    OpenAI. 2023b. GPT-4 technical report

  59. [59]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback

  60. [60]

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning, volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR

  61. [61]

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

  62. [62]

    Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, Gloria Lipori, Duane A Mitchell, Naykky S Ospina, Mustafa M Ahmed, William R Hogan, Elizabeth A Shenkman, Yi Guo, Jiang Bian, and Yonghui Wu. 2023. A study of generative large language model for medical research and healthcare

  63. [63]

    Jason Phang, Thibault Févry, and Samuel R. Bowman. 2019. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks

  64. [64]

    Alec Radford and Karthik Narasimhan. 2018. Improving language understanding by generative pre-training

  65. [65]

    Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67

  66. [66]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thoma...

  67. [67]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv e-prints, page arXiv:1909.08053

  68. [68]

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Abubakr Babiker, Nathanael Schärli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Agüera y Arcas, Dale Webster, Greg S. Corrado,...

  69. [69]

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi,...

  70. [70]

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Khyathi Chandu, Jennifer Dumas, Li Lucy, Xinxi Lyu, Ian Magnusson, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Evan Pete Walsh, Hannaneh Hajishirzi, Noah A. Smith, Luke Zet...

  71. [71]

    MosaicML Stanford CRFM. BioMedLM. https://huggingface.co/stanford-crfm/BioMedLM. Accessed: 2023-11-05

  72. [72]

    Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2020a. How to fine-tune BERT for text classification?

  73. [73]

    Jingyuan Sun, Shaonan Wang, Jiajun Zhang, and Chengqing Zong. 2020b. Distill and replay for continual language learning. In International Conference on Computational Linguistics

  74. [74]

    Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science

  75. [75]

    Shubo Tian, Qiao Jin, Lana Yeganova, Po-Ting Lai, Qingqing Zhu, Xiuying Chen, Yifan Yang, Qingyu Chen, Won Kim, Donald C. Comeau, Rezarta Islamaj, Aadit Kapoor, Xin Gao, and Zhiyong Lu. 2023. Opportunities and challenges for ChatGPT and large language models in biomedicine and health

  76. [76]

    Together AI. 2023. RedPajama: An open source recipe to reproduce LLaMA training dataset. https://github.com/togethercomputer/RedPajama-Data

  77. [77]

    Augustin Toma, Patrick R. Lawler, Jimmy Ba, Rahul G. Krishnan, Barry B. Rubin, and Bo Wang

  78. [78]

    Clinical camel: An open expert-level medical language model with dialogue-based knowledge encoding

  79. [79]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and efficient foundation language models

  80. [80]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Har...
