pith. sign in

arxiv: 2305.09617 · v1 · pith:SUKURWIFnew · submitted 2023-05-16 · 💻 cs.CL · cs.AI· cs.LG

Towards Expert-Level Medical Question Answering with Large Language Models

Pith reviewed 2026-05-24 04:29 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords medical question answeringlarge language modelsMed-PaLM 2MedQAUSMLEphysician evaluationensemble refinement
0
0 comments X

The pith

Med-PaLM 2 reaches 86.5% on MedQA and is preferred by physicians over human answers on eight of nine clinical utility axes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Med-PaLM 2, built from the PaLM 2 base model with added medical finetuning and new prompting methods, lifts accuracy on medical question benchmarks far above earlier systems. It records a 19-point gain on the MedQA dataset of USMLE-style questions and matches or exceeds leading results on MedMCQA, PubMedQA, and clinical MMLU topics. In head-to-head rankings of long-form answers to 1066 consumer questions, physicians rated Med-PaLM 2 higher than physician-written answers on most measures of clinical usefulness. The same pattern held on a new set of adversarial questions meant to expose model weaknesses. These results indicate that large language models are moving closer to physician-level performance on medical question answering tasks.

Core claim

Med-PaLM 2 combines PaLM 2 base improvements, medical domain finetuning, and a novel ensemble refinement prompting strategy to score up to 86.5% on MedQA, more than 19 points above Med-PaLM, while also producing answers that physicians prefer to those written by other physicians on eight of nine axes of clinical utility in pairwise tests on over one thousand consumer questions.

What carries the argument

Ensemble refinement, a prompting method that generates and iteratively selects among multiple candidate answers to improve final output quality on medical questions.

If this is right

  • Large language models can now exceed passing thresholds on USMLE-style questions by wide margins.
  • Physician preference for model answers extends across multiple dimensions of clinical utility including accuracy and helpfulness.
  • Performance gains appear on both standard and adversarially designed medical question sets.
  • The same scaling and prompting techniques that lifted Med-PaLM to Med-PaLM 2 can be applied to additional medical QA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the preference results hold in live settings, AI systems could serve as first-line responders for routine medical queries.
  • The gap between benchmark success and safe deployment in hospitals still requires separate safety and outcome studies.
  • Similar refinement methods might transfer to other high-stakes domains where expert preference data can be collected.

Load-bearing premise

Benchmark accuracy on MedQA together with physician preference rankings on selected long-form questions is a sufficient stand-in for expert-level medical question answering ability.

What would settle it

A study in which Med-PaLM 2 answers produce measurably lower accuracy or worse patient outcomes than physicians in unscripted clinical encounters would show the performance gap has not closed.

read the original abstract

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Med-PaLM 2, which combines PaLM 2 base model improvements, medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. It reports up to 86.5% accuracy on MedQA (a >19% improvement over Med-PaLM and new SOTA), near-SOTA results on MedMCQA, PubMedQA, and MMLU clinical topics, and human evaluations where physicians preferred Med-PaLM 2 answers over physician answers on 8 of 9 clinical utility axes for 1066 consumer questions (p<0.001) plus gains on 240 adversarial long-form questions.

Significance. If the performance numbers and preference results hold under scrutiny, the work demonstrates meaningful progress toward physician-comparable medical QA with LLMs, supported by both objective benchmarks and a large-scale human study. Credit is due for the scale of the pairwise human evaluation (1066+240 questions), inclusion of adversarial probing, statistical reporting, and the appropriately cautious framing that further real-world studies are needed.

major comments (2)
  1. [Abstract / Methods] Abstract and methods description: the central accuracy claims (e.g., 86.5% on MedQA) and p-values for human preferences rest on unreported details of the medical finetuning corpus, exact implementation of ensemble refinement, statistical test procedures (including multiple-comparison correction), and any data-leakage audits against the benchmark test sets; these omissions are load-bearing for independent verification of the reported gains.
  2. [Human Evaluation] Human evaluation section: the selection criteria for the 1066 consumer questions and the protocol for generating the physician reference answers are not specified, which directly affects the interpretability of the 8/9-axis preference result and the claim of clinical utility.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it explicitly stated the total number of questions per evaluation axis rather than aggregating the 1066 figure.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback and for recognizing the scale of our human evaluation and the cautious framing of our results. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript's transparency and reproducibility.

read point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and methods description: the central accuracy claims (e.g., 86.5% on MedQA) and p-values for human preferences rest on unreported details of the medical finetuning corpus, exact implementation of ensemble refinement, statistical test procedures (including multiple-comparison correction), and any data-leakage audits against the benchmark test sets; these omissions are load-bearing for independent verification of the reported gains.

    Authors: We agree that greater methodological transparency will facilitate independent verification. In the revised manuscript we will expand the Methods section with: (i) a high-level characterization of the medical domain finetuning data (while noting that the precise corpus composition remains proprietary), (ii) a step-by-step description of the ensemble refinement procedure, (iii) the exact statistical tests employed together with any multiple-comparison corrections, and (iv) the data-leakage audit protocol applied to the benchmark test sets. These additions directly address the referee's concern. revision: partial

  2. Referee: [Human Evaluation] Human evaluation section: the selection criteria for the 1066 consumer questions and the protocol for generating the physician reference answers are not specified, which directly affects the interpretability of the 8/9-axis preference result and the claim of clinical utility.

    Authors: We acknowledge that explicit description of question selection and the physician-answer protocol is necessary for proper interpretation. We will revise the Human Evaluation section to specify the criteria used to curate the 1066 consumer questions (including relevance, diversity, and source distribution) and to detail the instructions and quality-control steps given to the physicians who produced the reference answers. revision: yes

standing simulated objections not resolved
  • Exact composition and provenance of the proprietary medical finetuning corpus

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks and independent evaluations

full rationale

The paper reports empirical performance of Med-PaLM 2 on standard external benchmarks (MedQA at 86.5%, MedMCQA, PubMedQA, MMLU) and physician pairwise preferences on 1066 consumer and 240 adversarial long-form questions. These metrics are not derived from any self-referential definitions, fitted parameters renamed as predictions, or self-citation chains. The work explicitly caveats that further real-world studies are needed and frames results as progress toward expert-level performance rather than attainment. No equations, ansatzes, or uniqueness theorems are invoked in a load-bearing way; the derivation chain is absent because the contribution is empirical reporting against independent references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The work is empirical and likely contains many training hyperparameters and data choices that are not described here.

pith-pipeline@v0.9.0 · 6020 in / 1218 out tokens · 53190 ms · 2026-05-24T04:29:42.543473+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  2. CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

  3. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  4. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

    cs.AI 2026-04 unverdicted novelty 6.0

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  5. The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...

  6. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB 2026-04 unverdicted novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

  7. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  8. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    cs.CL 2024-12 unverdicted novelty 6.0

    HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

  9. Retrieval-Augmented Generation for Natural Language Processing: A Survey

    cs.CL 2024-07 accept novelty 6.0

    The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.

  10. Capabilities of Gemini Models in Medicine

    cs.AI 2024-04 unverdicted novelty 6.0

    Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

  11. NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

    cs.AI 2026-05 unverdicted novelty 5.0

    NeuroAgent uses a hierarchical LLM agent framework with Generate-Execute-Validate loops to automate neuroimaging preprocessing, reaching 84.8% end-to-end correctness and 0.9518 AUC for Alzheimer's classification on 14...

  12. PRISM: Perception Reasoning Interleaved for Sequential Decision Making

    cs.AI 2026-05 unverdicted novelty 5.0

    PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.

  13. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  14. MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

    cs.CV 2025-12 unverdicted novelty 4.0

    Fine-tuned MedGemma outperforms untuned GPT-4 in zero-shot medical image disease classification, achieving 80.37% versus 69.58% mean test accuracy with higher sensitivity for cancer and pneumonia.

  15. QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

    cs.CL 2025-04 unverdicted novelty 4.0

    QM-ToT applies Tree of Thoughts decomposition and evaluator layers to quantized LLMs, reporting accuracy gains from 34% to 50% on MedQAUSMLE for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b, plus an 86.27% im...

  16. Query pipeline optimization for cancer patient question answering systems

    cs.CL 2024-12 unverdicted novelty 4.0

    Three-aspect RAG query pipeline optimization for cancer patient QA introduces HSRDR and SEOS and reports 5.24% accuracy gain on Claude-3-haiku versus chain-of-thought on a custom dataset.

  17. Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

    cs.CL 2024-08 unverdicted novelty 4.0

    GPT-4o and Claude 3.5 Sonnet reach 73.7-74% accuracy on gastroenterology questions; VLMs gain nothing from images and lose accuracy with LLM-generated captions.

  18. ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

    eess.SP 2026-04 unverdicted novelty 3.0

    ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

  19. Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

    cs.CL 2025-02 unverdicted novelty 3.0

    Fine-tuning and data augmentation improve LLM performance on medical jargon extraction and prioritization from EHR notes, with augmented open-source models sometimes outperforming closed-source ones on 106 annotated notes.

  20. Benchmark Data Contamination of Large Language Models: A Survey

    cs.CL 2024-06 unverdicted novelty 3.0

    A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

  21. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  22. Data-Centric Foundation Models in Computational Healthcare: A Survey

    cs.LG 2024-01 unverdicted novelty 3.0

    The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

  23. Entry-level guide to the use of large language models for medical research

    cs.AI 2024-10 unverdicted novelty 2.0

    A tutorial guide outlining phases for integrating LLMs into medical research, including task formulation, model choice, prompt engineering, fine-tuning, and deployment with ethical considerations.

Reference graph

Works this paper leans on

136 extracted references · 136 canonical work pages · cited by 23 Pith papers · 33 internal anchors

  1. [1]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Chain of thought prompting elicits reasoning in large language models , author=. arXiv preprint arXiv:2201.11903 , year=

  2. [2]

    LaMDA: Language Models for Dialog Applications

    Lamda: Language models for dialog applications , author=. arXiv preprint arXiv:2201.08239 , year=

  3. [3]

    medRxiv , pages=

    The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model , author=. medRxiv , pages=. 2023 , publisher=

  4. [4]

    International Conference on Machine Learning , pages=

    Glam: Efficient scaling of language models with mixture-of-experts , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  5. [5]

    arXiv preprint arXiv:2107.13586 , year=

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing , author=. arXiv preprint arXiv:2107.13586 , year=

  6. [6]

    Scaling Laws for Neural Language Models

    Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

  7. [7]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  8. [8]

    Canadian Journal of Philosophy , pages=

    The Algorithmic Leviathan: Arbitrariness, Fairness, and Opportunity in Algorithmic Decision-Making Systems , author=. Canadian Journal of Philosophy , pages=. 2022 , publisher=

  9. [9]

    Medical care , volume=

    Measuring harm in healthcare: optimizing adverse event review , author=. Medical care , volume=. 2017 , publisher=

  10. [10]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Scaling language models: Methods, analysis & insights from training gopher , author=. arXiv preprint arXiv:2112.11446 , year=

  11. [11]

    Training Compute-Optimal Large Language Models

    Training Compute-Optimal Large Language Models , author=. arXiv preprint arXiv:2203.15556 , year=

  12. [12]

    Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and others , journal=

  13. [13]

    Zhang, Susan and Roller, Stephen and Goyal, Naman and Artetxe, Mikel and Chen, Moya and Chen, Shuohui and Dewan, Christopher and Diab, Mona and Li, Xian and Lin, Xi Victoria and others , journal=

  14. [14]

    Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and others , journal=

  15. [15]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

  16. [16]

    Finetuned Language Models Are Zero-Shot Learners

    Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

  17. [17]

    Scaling Instruction-Finetuned Language Models

    Scaling instruction-finetuned language models , author=. arXiv preprint arXiv:2210.11416 , year=

  18. [18]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili. arXiv preprint arXiv:2211.05100 , year=

  19. [19]

    Solving Quantitative Reasoning Problems with Language Models

    Solving quantitative reasoning problems with language models , author=. arXiv preprint arXiv:2206.14858 , year=

  20. [20]

    Applied Sciences , volume=

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams , author=. Applied Sciences , volume=. 2021 , publisher=

  21. [21]

    Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William W and Lu, Xinghua , journal=

  22. [22]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity , author=

  23. [23]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    The power of scale for parameter-efficient prompt tuning , author=. arXiv preprint arXiv:2104.08691 , year=

  24. [24]

    arXiv preprint arXiv:2205.12689 , year=

    Large Language Models are Zero-Shot Clinical Information Extractors , author=. arXiv preprint arXiv:2205.12689 , year=

  25. [25]

    arXiv preprint arXiv:2207.08143 , year=

    Can large language models reason about medical questions? , author=. arXiv preprint arXiv:2207.08143 , year=

  26. [26]

    2015 , journal =

    Tamara Williams and Marilyn Szekendi and Stephen Pavkovic and Wanda Clevenger and Julie Cerese , title =. 2015 , journal =. doi:10.1097/PTS.0b013e3182948ef9 , language =

  27. [27]

    Feng, Steven Y and Khetan, Vivek and Sacaleanu, Bogdan and Gershman, Anatole and Hovy, Eduard , journal=

  28. [28]

    Taylor, Ross and Kardas, Marcin and Cucurull, Guillem and Scialom, Thomas and Hartshorn, Anthony and Saravia, Elvis and Poulton, Andrew and Kerkez, Viktor and Stojnic, Robert , journal=

  29. [29]

    arXiv preprint arXiv:2204.02329 , year=

    Can language models learn from explanations in context? , author=. arXiv preprint arXiv:2204.02329 , year=

  30. [30]

    Large Language Models Still Can't Plan (A Benchmark for

    Valmeekam, Karthik and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao , journal=. Large Language Models Still Can't Plan (A Benchmark for

  31. [31]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

  32. [32]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. arXiv preprint arXiv:2205.10625 , year=

  33. [33]

    arXiv preprint arXiv:2202.01875 , year=

    Rethinking Explainability as a Dialogue: A Practitioner's Perspective , author=. arXiv preprint arXiv:2202.01875 , year=

  34. [34]

    Proceedings of the 2020 CHI conference on human factors in computing systems , pages=

    Expert discussions improve comprehension of difficult cases in medical image assessment , author=. Proceedings of the 2020 CHI conference on human factors in computing systems , pages=

  35. [35]

    NPJ digital medicine , volume=

    Deep learning-enabled medical computer vision , author=. NPJ digital medicine , volume=. 2021 , publisher=

  36. [36]

    Nature Protocols , volume=

    Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records , author=. Nature Protocols , volume=. 2021 , publisher=

  37. [37]

    2022 , organization=

    Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan , booktitle=. 2022 , organization=

  38. [38]

    Overview of the medical question answering task at

    Abacha, Asma Ben and Agichtein, Eugene and Pinter, Yuval and Demner-Fushman, Dina , booktitle=. Overview of the medical question answering task at

  39. [39]

    , author=

    Bridging the Gap Between Consumers' Medication Questions and Trusted Answers. , author=. MedInfo , pages=

  40. [40]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models , author=. arXiv preprint arXiv:2206.04615 , year=

  41. [41]

    Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3

    Korngiebel, Diane M and Mooney, Sean D , journal=. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3. 2021 , publisher=

  42. [42]

    Operationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3

    Sezgin, Emre and Sirrianni, Joseph and Linwood, Simon L and others , journal=. Operationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3. 2022 , publisher=

  43. [43]

    Hong, Zhi and Ajith, Aswathy and Pauloski, Gregory and Duede, Eamon and Malamud, Carl and Magoulas, Roger and Chard, Kyle and Foster, Ian , journal=

  44. [44]

    arXiv preprint arXiv:2010.06060 , year=

    BioMegatron: Larger biomedical domain language model , author=. arXiv preprint arXiv:2010.06060 , year=

  45. [45]

    Proceedings of the 3rd Clinical Natural Language Processing Workshop , pages=

    Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art , author=. Proceedings of the 3rd Clinical Natural Language Processing Workshop , pages=

  46. [46]

    Beltagy, Iz and Lo, Kyle and Cohan, Arman , journal=

  47. [47]

    2022 , publisher=

    Luo, Renqian and Sun, Liai and Xia, Yingce and Qin, Tao and Zhang, Sheng and Poon, Hoifung and Liu, Tie-Yan , journal=. 2022 , publisher=

  48. [48]

    arXiv preprint arXiv:2207.07411 , year=

    Plex: Towards reliability using pretrained large model extensions , author=. arXiv preprint arXiv:2207.07411 , year=

  49. [49]

    ACM Transactions on Computing for Healthcare (HEALTH) , volume=

    Domain-specific language model pretraining for biomedical natural language processing , author=. ACM Transactions on Computing for Healthcare (HEALTH) , volume=. 2021 , publisher=

  50. [50]

    Language Models (Mostly) Know What They Know

    Language models (mostly) know what they know , author=. arXiv preprint arXiv:2207.05221 , year=

  51. [51]

    Teaching Models to Express Their Uncertainty in Words

    Teaching Models to Express Their Uncertainty in Words , author=. arXiv preprint arXiv:2205.14334 , year=

  52. [52]

    Bioinformatics , volume=

    BioBERT: a pre-trained biomedical language representation model for biomedical text mining , author=. Bioinformatics , volume=. 2020 , publisher=

  53. [53]

    Papanikolaou, Yannis and Pierleoni, Andrea , journal=

  54. [54]

    Joshi, Mandar and Choi, Eunsol and Weld, Daniel S and Zettlemoyer, Luke , journal=

  55. [55]

    Training language models to follow instructions with human feedback

    Training language models to follow instructions with human feedback , author=. arXiv preprint arXiv:2203.02155 , year=

  56. [56]

    Proceedings of Machine Learning and Systems , volume=

    Pathways: Asynchronous distributed dataflow for ML , author=. Proceedings of Machine Learning and Systems , volume=

  57. [57]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Multitask prompted training enables zero-shot task generalization , author=. arXiv preprint arXiv:2110.08207 , year=

  58. [58]

    Emergent Abilities of Large Language Models

    Emergent abilities of large language models , author=. arXiv preprint arXiv:2206.07682 , year=

  59. [59]

    Nature Medicine , volume=

    Predicting conversion to wet age-related macular degeneration using deep learning , author=. Nature Medicine , volume=. 2020 , publisher=

  60. [60]

    On the Opportunities and Risks of Foundation Models

    On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

  61. [61]

    Advances in neural information processing systems , volume=

    Attention is all you need , author=. Advances in neural information processing systems , volume=

  62. [62]

    Training Verifiers to Solve Math Word Problems

    Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

  63. [63]

    Large Language Models are Zero-Shot Reasoners

    Large Language Models are Zero-Shot Reasoners , author=. arXiv preprint arXiv:2205.11916 , year=

  64. [64]

    emrQA: A Large Corpus for Question Answering on Electronic Medical Records

    emrqa: A large corpus for question answering on electronic medical records , author=. arXiv preprint arXiv:1809.00732 , year=

  65. [65]

    BMC bioinformatics , volume=

    An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition , author=. BMC bioinformatics , volume=. 2015 , publisher=

  66. [66]

    Transactions of the Association for Computational Linguistics , volume=

    TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages , author=. Transactions of the Association for Computational Linguistics , volume=. 2020 , publisher=

  67. [67]

    Show Your Work: Scratchpads for Intermediate Computation with Language Models

    Show your work: Scratchpads for intermediate computation with language models , author=. arXiv preprint arXiv:2112.00114 , year=

  68. [68]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Prefix-tuning: Optimizing continuous prompts for generation , author=. arXiv preprint arXiv:2101.00190 , year=

  69. [69]

    Decoupled Weight Decay Regularization

    Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

  70. [70]

    Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

    Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

  71. [71]

    Psicologia: Reflexao e Critica , volume=

    Scale development: ten main limitations and recommendations to improve future research practices , author=. Psicologia: Reflexao e Critica , volume=. 2017 , publisher=

  72. [72]

    Journal of patient safety , volume=

    The reliability of AHRQ Common Format Harm Scales in rating patient safety events , author=. Journal of patient safety , volume=. 2015 , publisher=

  73. [73]

    AI Open , year=

    Ptr: Prompt tuning with rules for text classification , author=. AI Open , year=

  74. [74]

    arXiv preprint arXiv:2109.04332 , year=

    Ppt: Pre-trained prompt tuning for few-shot learning , author=. arXiv preprint arXiv:2109.04332 , year=

  75. [75]

    arXiv preprint arXiv:2210.03029 , year=

    Retrieval of Soft Prompt Enhances Zero-Shot Task Generalization , author=. arXiv preprint arXiv:2210.03029 , year=

  76. [76]

    arXiv preprint arXiv:2103.10385 , year=

    GPT understands, too , author=. arXiv preprint arXiv:2103.10385 , year=

  77. [77]

    arXiv preprint arXiv:2210.09338 , year=

    Deep bidirectional language-knowledge graph pretraining , author=. arXiv preprint arXiv:2210.09338 , year=

  78. [78]

    Bolton, Elliot and Hall, David and Yasunaga, Michihiro and Lee, Tony and Manning, Chris and Liang, Percy , title =

  79. [79]

    arXiv preprint arXiv:2203.15827 , year=

    LinkBERT: Pretraining Language Models with Document Links , author=. arXiv preprint arXiv:2203.15827 , year=

  80. [80]

    doi:10.5281/zenodo.5297715 , url =

    Black, Sid and Gao, Leo and Wang, Phil and Leahy, Connor and Biderman, Stella , title =. doi:10.5281/zenodo.5297715 , url =

Showing first 80 references.