Towards Expert-Level Medical Question Answering with Large Language Models

Alan Karthikesalingam; Amy Wang; Blaise Aguera y Arcas; Bradley Green; Christopher Semturs; Dale Webster; Darlene Neal; Ellery Wulczyn; Ewa Dominowska; Greg S. Corrado

arXiv: 2305.09617 · v1 · pith:SUKURWIFnew · submitted 2023-05-16 · 💻 cs.CL · cs.AI · cs.LG


Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl


Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan


Pith reviewed 2026-05-24 04:29 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG

keywords: medical question answering, large language models, Med-PaLM 2, MedQA, USMLE, physician evaluation, ensemble refinement


The pith

Med-PaLM 2 reaches 86.5% on MedQA and is preferred by physicians over human answers on eight of nine clinical utility axes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Med-PaLM 2, built from the PaLM 2 base model with added medical finetuning and new prompting methods, lifts accuracy on medical question benchmarks well above its predecessor, Med-PaLM. It records a gain of more than 19 percentage points on the MedQA dataset of USMLE-style questions and approaches or exceeds leading results on MedMCQA, PubMedQA, and clinical MMLU topics. In pairwise rankings of long-form answers to 1066 consumer questions, physicians preferred Med-PaLM 2 answers to physician-written answers on eight of nine axes of clinical utility. On a newly introduced set of 240 adversarial questions designed to probe model limitations, Med-PaLM 2 also improved significantly over Med-PaLM on every evaluation axis. These results indicate that large language models are moving closer to physician-level performance on medical question answering tasks.

Core claim

Med-PaLM 2 combines PaLM 2 base improvements, medical domain finetuning, and a novel ensemble refinement prompting strategy to score up to 86.5% on MedQA, more than 19 points above Med-PaLM, while also producing answers that physicians prefer to those written by other physicians on eight of nine axes of clinical utility in pairwise tests on over one thousand consumer questions.
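The "19 points" in this claim is an absolute difference in accuracy, not a relative improvement. As a quick check, the sketch below recomputes both quantities from the two MedQA scores quoted in the abstract (67.2% for Med-PaLM, 86.5% for Med-PaLM 2). The function name `improvement` is illustrative only and does not come from the paper.

```python
def improvement(old_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_pct - old_pct
    relative = 100.0 * absolute / old_pct
    return absolute, relative


# MedQA accuracies as quoted in the abstract.
absolute, relative = improvement(67.2, 86.5)
print(f"absolute gain: {absolute:.1f} points")
print(f"relative gain: {relative:.1f}%")
```

The absolute gain is about 19.3 points, which is the figure the abstract rounds to "over 19%"; the relative gain over the Med-PaLM score is about 28.7%.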

What carries the argument

Ensemble refinement, a prompting method that generates and iteratively selects among multiple candidate answers to improve final output quality on medical questions.
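This page does not give the exact ensemble refinement procedure, so the following is only a minimal sketch of one plausible reading: sample several candidate answers, show them back to the model to obtain refined answers, then take a plurality vote. The `Sampler` interface, the prompt wording, the default counts, and `toy_sampler` are all assumptions made for illustration; none of them comes from the paper.

```python
import random
from collections import Counter
from typing import Callable, List

# Hypothetical model interface: takes a prompt, returns one sampled answer.
Sampler = Callable[[str], str]


def ensemble_refine(question: str, sample: Sampler,
                    n_candidates: int = 5, n_refinements: int = 3) -> str:
    # Stage 1: draw several candidate answers for the same question.
    candidates: List[str] = [sample(question) for _ in range(n_candidates)]

    # Stage 2: show the model its own candidates and ask for a refined answer.
    # Repeating this step yields several refined answers.
    context = question + "\nCandidate answers:\n" + "\n".join(candidates)
    refined = [sample(context + "\nRefined answer:") for _ in range(n_refinements)]

    # Final step: plurality vote over the refined answers.
    return Counter(refined).most_common(1)[0][0]


def toy_sampler(prompt: str) -> str:
    """Stand-in for an LLM: noisy on a bare question, majority-following when refining."""
    if "Candidate answers:" in prompt:
        body = prompt.split("Candidate answers:\n")[1].split("\nRefined answer:")[0]
        return Counter(body.splitlines()).most_common(1)[0][0]
    return random.choices(["B", "A", "C"], weights=[0.6, 0.2, 0.2])[0]


random.seed(0)
print(ensemble_refine("Which option is correct? (A) (B) (C)", toy_sampler))
```

In a real system the sampler would be a call to a language model with temperature sampling. The toy version only demonstrates the control flow.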

If this is right

Large language models can now exceed passing thresholds on USMLE-style questions by wide margins.
Physician preference for model answers extends across multiple dimensions of clinical utility including accuracy and helpfulness.
Performance gains appear on both standard and adversarially designed medical question sets.
The same scaling and prompting techniques that lifted Med-PaLM to Med-PaLM 2 can be applied to additional medical QA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the preference results hold in live settings, AI systems could serve as first-line responders for routine medical queries.
The gap between benchmark success and safe deployment in hospitals still requires separate safety and outcome studies.
Similar refinement methods might transfer to other high-stakes domains where expert preference data can be collected.

Load-bearing premise

Benchmark accuracy on MedQA together with physician preference rankings on selected long-form questions is a sufficient stand-in for expert-level medical question answering ability.

What would settle it

A study in which Med-PaLM 2 answers produce measurably lower accuracy or worse patient outcomes than physicians in unscripted clinical encounters would show the performance gap has not closed.

Original abstract

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Med-PaLM 2, which combines PaLM 2 base model improvements, medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. It reports up to 86.5% accuracy on MedQA (more than 19 percentage points above Med-PaLM and a new SOTA), results approaching or exceeding SOTA on MedMCQA, PubMedQA, and MMLU clinical topics, and human evaluations in which physicians preferred Med-PaLM 2 answers over physician answers on 8 of 9 clinical utility axes for 1066 consumer questions (p<0.001), plus significant gains over Med-PaLM on 240 adversarial long-form questions.

Significance. If the performance numbers and preference results hold under scrutiny, the work demonstrates meaningful progress toward physician-comparable medical QA with LLMs, supported by both objective benchmarks and a large-scale human study. Credit is due for the scale of the pairwise human evaluation (1066+240 questions), inclusion of adversarial probing, statistical reporting, and the appropriately cautious framing that further real-world studies are needed.
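The statistical test behind the reported preference p-values is not described on this page. As a point of reference only, one common choice for pairwise preference data is an exact two-sided sign test on a single axis, sketched below. The function name and the example counts are hypothetical and are not taken from the paper.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test; ties are assumed to have been dropped beforehand."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for one axis: raters preferred the model answer in
# 60 comparisons and the physician answer in 40, with ties excluded.
print(sign_test_p_value(60, 40))
```

A sign test treats each comparison as independent. If the same raters or questions contribute many comparisons, a method that accounts for clustering would be more appropriate.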

major comments (2)

[Abstract / Methods] Abstract and methods description: the central accuracy claims (e.g., 86.5% on MedQA) and p-values for human preferences rest on unreported details of the medical finetuning corpus, exact implementation of ensemble refinement, statistical test procedures (including multiple-comparison correction), and any data-leakage audits against the benchmark test sets; these omissions are load-bearing for independent verification of the reported gains.
[Human Evaluation] Human evaluation section: the selection criteria for the 1066 consumer questions and the protocol for generating the physician reference answers are not specified, which directly affects the interpretability of the 8/9-axis preference result and the claim of clinical utility.
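The first major comment asks about multiple-comparison correction across evaluation axes. Which correction, if any, the authors applied is not stated here. The sketch below shows the Holm step-down procedure as one standard option. The axis names and p-values are invented for illustration and are not results from the paper.

```python
from typing import Dict


def holm_bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return, per hypothesis, whether it is rejected under the Holm step-down procedure."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The p-value at 0-based rank r is compared with alpha / (m - r).
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return rejected


# Hypothetical per-axis p-values; NOT taken from the paper.
example = {"axis_1": 0.0004, "axis_2": 0.004, "axis_3": 0.03, "axis_4": 0.2}
print(holm_bonferroni(example))
```

With these made-up inputs the first two axes are rejected and the last two are retained, because 0.03 exceeds the third threshold of 0.05 / 2 = 0.025.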

minor comments (1)

[Abstract] The abstract would be clearer if it explicitly stated the total number of questions per evaluation axis rather than aggregating the 1066 figure.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback and for recognizing the scale of our human evaluation and the cautious framing of our results. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript's transparency and reproducibility.


Referee: [Abstract / Methods] Abstract and methods description: the central accuracy claims (e.g., 86.5% on MedQA) and p-values for human preferences rest on unreported details of the medical finetuning corpus, exact implementation of ensemble refinement, statistical test procedures (including multiple-comparison correction), and any data-leakage audits against the benchmark test sets; these omissions are load-bearing for independent verification of the reported gains.

Authors: We agree that greater methodological transparency will facilitate independent verification. In the revised manuscript we will expand the Methods section with: (i) a high-level characterization of the medical domain finetuning data (while noting that the precise corpus composition remains proprietary), (ii) a step-by-step description of the ensemble refinement procedure, (iii) the exact statistical tests employed together with any multiple-comparison corrections, and (iv) the data-leakage audit protocol applied to the benchmark test sets. These additions directly address the referee's concern.

Revision: partial

Referee: [Human Evaluation] Human evaluation section: the selection criteria for the 1066 consumer questions and the protocol for generating the physician reference answers are not specified, which directly affects the interpretability of the 8/9-axis preference result and the claim of clinical utility.

Authors: We acknowledge that explicit description of question selection and the physician-answer protocol is necessary for proper interpretation. We will revise the Human Evaluation section to specify the criteria used to curate the 1066 consumer questions (including relevance, diversity, and source distribution) and to detail the instructions and quality-control steps given to the physicians who produced the reference answers.

Revision: yes
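The data-leakage audit protocol mentioned in the first response is not specified on this page. A simple and widely used style of check is n-gram overlap between each benchmark item and the training text, sketched below. The function names, the whitespace tokenization, and the default n-gram length of 8 are assumptions for illustration, not the authors' protocol.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lower-cased, whitespace-tokenized n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in any training document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)


# Tiny illustration with 3-grams: two of the three test 3-grams appear in training.
print(overlap_fraction("a b c d e", ["x a b c d y"], n=3))
```

An audit along these lines would flag benchmark questions whose overlap exceeds a chosen threshold and report accuracy separately on flagged and unflagged items.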

standing simulated objections not resolved

Exact composition and provenance of the proprietary medical finetuning corpus

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks and independent evaluations


The paper reports empirical performance of Med-PaLM 2 on standard external benchmarks (MedQA at 86.5%, MedMCQA, PubMedQA, MMLU) and physician pairwise preferences on 1066 consumer and 240 adversarial long-form questions. These metrics are not derived from any self-referential definitions, fitted parameters renamed as predictions, or self-citation chains. The work explicitly caveats that further real-world studies are needed and frames results as progress toward expert-level performance rather than attainment. No equations, ansatzes, or uniqueness theorems are invoked in a load-bearing way; the derivation chain is absent because the contribution is empirical reporting against independent references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The work is empirical and likely contains many training hyperparameters and data choices that are not described here.



Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
cs.CV 2023-10 unverdicted novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
cs.CL 2026-05 unverdicted novelty 6.0

CuraView detects sentence-level faithfulness hallucinations in medical discharge summaries via GraphRAG knowledge graphs and multi-agent evidence grading, achieving 0.831 F1 on critical contradictions with a fine-tune...

Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompt...

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
cs.DB 2026-04 unverdicted novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
cs.CL 2024-12 unverdicted novelty 6.0

HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

Retrieval-Augmented Generation for Natural Language Processing: A Survey
cs.CL 2024-07 accept novelty 6.0

The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.

Capabilities of Gemini Models in Medicine
cs.AI 2024-04 unverdicted novelty 6.0

Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research
cs.AI 2026-05 unverdicted novelty 5.0

NeuroAgent uses a hierarchical LLM agent framework with Generate-Execute-Validate loops to automate neuroimaging preprocessing, reaching 84.8% end-to-end correctness and 0.9518 AUC for Alzheimer's classification on 14...

PRISM: Perception Reasoning Interleaved for Sequential Decision Making
cs.AI 2026-05 unverdicted novelty 5.0

PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images
cs.CV 2025-12 unverdicted novelty 4.0

Fine-tuned MedGemma outperforms untuned GPT-4 in zero-shot medical image disease classification, achieving 80.37% versus 69.58% mean test accuracy with higher sensitivity for cancer and pneumonia.

QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
cs.CL 2025-04 unverdicted novelty 4.0

QM-ToT applies Tree of Thoughts decomposition and evaluator layers to quantized LLMs, reporting accuracy gains from 34% to 50% on MedQAUSMLE for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b, plus an 86.27% im...

Query pipeline optimization for cancer patient question answering systems
cs.CL 2024-12 unverdicted novelty 4.0

Three-aspect RAG query pipeline optimization for cancer patient QA introduces HSRDR and SEOS and reports 5.24% accuracy gain on Claude-3-haiku versus chain-of-thought on a custom dataset.

Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
cs.CL 2024-08 unverdicted novelty 4.0

GPT-4o and Claude 3.5 Sonnet reach 73.7-74% accuracy on gastroenterology questions; VLMs gain nothing from images and lose accuracy with LLM-generated captions.

ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook
eess.SP 2026-04 unverdicted novelty 3.0

ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation
cs.CL 2025-02 unverdicted novelty 3.0

Fine-tuning and data augmentation improve LLM performance on medical jargon extraction and prioritization from EHR notes, with augmented open-source models sometimes outperforming closed-source ones on 106 annotated notes.

Benchmark Data Contamination of Large Language Models: A Survey
cs.CL 2024-06 unverdicted novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Data-Centric Foundation Models in Computational Healthcare: A Survey
cs.LG 2024-01 unverdicted novelty 3.0

The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

Entry-level guide to the use of large language models for medical research
cs.AI 2024-10 unverdicted novelty 2.0

A tutorial guide outlining phases for integrating LLMs into medical research, including task formulation, model choice, prompt engineering, fine-tuning, and deployment with ethical considerations.

Reference graph

Works this paper leans on

136 extracted references · 136 canonical work pages · cited by 23 Pith papers · 33 internal anchors

[1]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain of thought prompting elicits reasoning in large language models , author=. arXiv preprint arXiv:2201.11903 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

LaMDA: Language Models for Dialog Applications

Lamda: Language models for dialog applications , author=. arXiv preprint arXiv:2201.08239 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[3]

medRxiv , pages=

The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model , author=. medRxiv , pages=. 2023 , publisher=

work page 2023
[4]

International Conference on Machine Learning , pages=

Glam: Efficient scaling of language models with mixture-of-experts , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[5]

arXiv preprint arXiv:2107.13586 , year=

Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing , author=. arXiv preprint arXiv:2107.13586 , year=

work page arXiv
[6]

Scaling Laws for Neural Language Models

Scaling laws for neural language models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[7]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[8]

Canadian Journal of Philosophy , pages=

The Algorithmic Leviathan: Arbitrariness, Fairness, and Opportunity in Algorithmic Decision-Making Systems , author=. Canadian Journal of Philosophy , pages=. 2022 , publisher=

work page 2022
[9]

Medical care , volume=

Measuring harm in healthcare: optimizing adverse event review , author=. Medical care , volume=. 2017 , publisher=

work page 2017
[10]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Scaling language models: Methods, analysis & insights from training gopher , author=. arXiv preprint arXiv:2112.11446 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Training Compute-Optimal Large Language Models

Training Compute-Optimal Large Language Models , author=. arXiv preprint arXiv:2203.15556 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and others , journal=

work page
[13]

Zhang, Susan and Roller, Stephen and Goyal, Naman and Artetxe, Mikel and Chen, Moya and Chen, Shuohui and Dewan, Christopher and Diab, Mona and Li, Xian and Lin, Xi Victoria and others , journal=

work page
[14]

Chowdhery, Aakanksha and Narang, Sharan and Devlin, Jacob and Bosma, Maarten and Mishra, Gaurav and Roberts, Adam and Barham, Paul and Chung, Hyung Won and Sutton, Charles and Gehrmann, Sebastian and others , journal=

work page
[15]

Measuring Massive Multitask Language Understanding

Measuring massive multitask language understanding , author=. arXiv preprint arXiv:2009.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2009
[16]

Finetuned Language Models Are Zero-Shot Learners

Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Scaling Instruction-Finetuned Language Models

Scaling instruction-finetuned language models , author=. arXiv preprint arXiv:2210.11416 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Scao, Teven Le and Fan, Angela and Akiki, Christopher and Pavlick, Ellie and Ili. arXiv preprint arXiv:2211.05100 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Solving Quantitative Reasoning Problems with Language Models

Solving quantitative reasoning problems with language models , author=. arXiv preprint arXiv:2206.14858 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Applied Sciences , volume=

What disease does this patient have? a large-scale open domain question answering dataset from medical exams , author=. Applied Sciences , volume=. 2021 , publisher=

work page 2021
[21]

Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William W and Lu, Xinghua , journal=

work page
[22]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity , author=

work page
[23]

The Power of Scale for Parameter-Efficient Prompt Tuning

The power of scale for parameter-efficient prompt tuning , author=. arXiv preprint arXiv:2104.08691 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

arXiv preprint arXiv:2205.12689 , year=

Large Language Models are Zero-Shot Clinical Information Extractors , author=. arXiv preprint arXiv:2205.12689 , year=

work page arXiv
[25]

arXiv preprint arXiv:2207.08143 , year=

Can large language models reason about medical questions? , author=. arXiv preprint arXiv:2207.08143 , year=

work page arXiv
[26]

2015 , journal =

Tamara Williams and Marilyn Szekendi and Stephen Pavkovic and Wanda Clevenger and Julie Cerese , title =. 2015 , journal =. doi:10.1097/PTS.0b013e3182948ef9 , language =

work page doi:10.1097/pts.0b013e3182948ef9 2015
[27]

Feng, Steven Y and Khetan, Vivek and Sacaleanu, Bogdan and Gershman, Anatole and Hovy, Eduard , journal=

work page
[28]

Taylor, Ross and Kardas, Marcin and Cucurull, Guillem and Scialom, Thomas and Hartshorn, Anthony and Saravia, Elvis and Poulton, Andrew and Kerkez, Viktor and Stojnic, Robert , journal=

work page
[29]

arXiv preprint arXiv:2204.02329 , year=

Can language models learn from explanations in context? , author=. arXiv preprint arXiv:2204.02329 , year=

work page arXiv
[30]

Large Language Models Still Can't Plan (A Benchmark for

Valmeekam, Karthik and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao , journal=. Large Language Models Still Can't Plan (A Benchmark for

work page
[31]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. arXiv preprint arXiv:2205.10625 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

arXiv preprint arXiv:2202.01875 , year=

Rethinking Explainability as a Dialogue: A Practitioner's Perspective , author=. arXiv preprint arXiv:2202.01875 , year=

work page arXiv
[34]

Proceedings of the 2020 CHI conference on human factors in computing systems , pages=

Expert discussions improve comprehension of difficult cases in medical image assessment , author=. Proceedings of the 2020 CHI conference on human factors in computing systems , pages=

work page 2020
[35]

NPJ digital medicine , volume=

Deep learning-enabled medical computer vision , author=. NPJ digital medicine , volume=. 2021 , publisher=

work page 2021
[36]

Nature Protocols , volume=

Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records , author=. Nature Protocols , volume=. 2021 , publisher=

work page 2021
[37]

2022 , organization=

Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan , booktitle=. 2022 , organization=

work page 2022
[38]

Overview of the medical question answering task at

Abacha, Asma Ben and Agichtein, Eugene and Pinter, Yuval and Demner-Fushman, Dina , booktitle=. Overview of the medical question answering task at

work page
[39]

, author=

Bridging the Gap Between Consumers' Medication Questions and Trusted Answers. , author=. MedInfo , pages=

work page
[40]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models , author=. arXiv preprint arXiv:2206.04615 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3

Korngiebel, Diane M and Mooney, Sean D , journal=. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3. 2021 , publisher=

work page 2021
[42]

Operationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3

Sezgin, Emre and Sirrianni, Joseph and Linwood, Simon L and others , journal=. Operationalizing and Implementing Pretrained, Large Artificial Intelligence Linguistic Models in the US Health Care System: Outlook of Generative Pretrained Transformer 3. 2022 , publisher=

work page 2022
[43]

Hong, Zhi and Ajith, Aswathy and Pauloski, Gregory and Duede, Eamon and Malamud, Carl and Magoulas, Roger and Chard, Kyle and Foster, Ian , journal=

work page
[44]

arXiv preprint arXiv:2010.06060 , year=

BioMegatron: Larger biomedical domain language model , author=. arXiv preprint arXiv:2010.06060 , year=

work page arXiv 2010
[45]

Proceedings of the 3rd Clinical Natural Language Processing Workshop , pages=

Pretrained language models for biomedical and clinical tasks: Understanding and extending the state-of-the-art , author=. Proceedings of the 3rd Clinical Natural Language Processing Workshop , pages=

work page
[46]

Beltagy, Iz and Lo, Kyle and Cohan, Arman , journal=

work page
[47]

2022 , publisher=

Luo, Renqian and Sun, Liai and Xia, Yingce and Qin, Tao and Zhang, Sheng and Poon, Hoifung and Liu, Tie-Yan , journal=. 2022 , publisher=

work page 2022
[48]

arXiv preprint arXiv:2207.07411 , year=

Plex: Towards reliability using pretrained large model extensions , author=. arXiv preprint arXiv:2207.07411 , year=

work page arXiv
[49]

ACM Transactions on Computing for Healthcare (HEALTH) , volume=

Domain-specific language model pretraining for biomedical natural language processing , author=. ACM Transactions on Computing for Healthcare (HEALTH) , volume=. 2021 , publisher=

work page 2021
[50]

Language Models (Mostly) Know What They Know

Language models (mostly) know what they know , author=. arXiv preprint arXiv:2207.05221 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Teaching Models to Express Their Uncertainty in Words

Teaching Models to Express Their Uncertainty in Words , author=. arXiv preprint arXiv:2205.14334 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

Bioinformatics , volume=

BioBERT: a pre-trained biomedical language representation model for biomedical text mining , author=. Bioinformatics , volume=. 2020 , publisher=

work page 2020
[53]

Papanikolaou, Yannis and Pierleoni, Andrea , journal=

work page
[54]

Joshi, Mandar and Choi, Eunsol and Weld, Daniel S and Zettlemoyer, Luke , journal=

work page
[55]

Training language models to follow instructions with human feedback

Training language models to follow instructions with human feedback , author=. arXiv preprint arXiv:2203.02155 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[56]

Proceedings of Machine Learning and Systems , volume=

Pathways: Asynchronous distributed dataflow for ML , author=. Proceedings of Machine Learning and Systems , volume=

work page
[57]

Multitask Prompted Training Enables Zero-Shot Task Generalization

Multitask prompted training enables zero-shot task generalization , author=. arXiv preprint arXiv:2110.08207 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Emergent Abilities of Large Language Models

Emergent abilities of large language models , author=. arXiv preprint arXiv:2206.07682 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Nature Medicine , volume=

Predicting conversion to wet age-related macular degeneration using deep learning , author=. Nature Medicine , volume=. 2020 , publisher=

work page 2020
[60]

On the Opportunities and Risks of Foundation Models

On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Attention is all you need. Advances in Neural Information Processing Systems.
[62]

Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
[63]

Large Language Models are Zero-Shot Reasoners. arXiv preprint arXiv:2205.11916.
[64]

emrQA: A Large Corpus for Question Answering on Electronic Medical Records. arXiv preprint arXiv:1809.00732.
[65]

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 2015.
[66]

TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020.
[67]

Show Your Work: Scratchpads for Intermediate Computation with Language Models. arXiv preprint arXiv:2112.00114.
[68]

Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv preprint arXiv:2101.00190.
[69]

Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.
[70]

Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
[71]

Scale development: ten main limitations and recommendations to improve future research practices. Psicologia: Reflexao e Critica, 2017.
[72]

The reliability of AHRQ Common Format Harm Scales in rating patient safety events. Journal of Patient Safety, 2015.
[73]

Ptr: Prompt tuning with rules for text classification. AI Open.
[74]

Ppt: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332.
[75]

Retrieval of Soft Prompt Enhances Zero-Shot Task Generalization. arXiv preprint arXiv:2210.03029.
[76]

GPT understands, too. arXiv preprint arXiv:2103.10385.
[77]

Deep bidirectional language-knowledge graph pretraining. arXiv preprint arXiv:2210.09338.
[78]

Bolton, Elliot and Hall, David and Yasunaga, Michihiro and Lee, Tony and Manning, Chris and Liang, Percy.
[79]

LinkBERT: Pretraining Language Models with Document Links. arXiv preprint arXiv:2203.15827.
[80]

Black, Sid and Gao, Leo and Wang, Phil and Leahy, Connor and Biderman, Stella. doi:10.5281/zenodo.5297715.

