pith. machine review for the scientific record.

arxiv: 2404.18416 · v2 · submitted 2024-04-29 · 💻 cs.AI · cs.CL · cs.CV · cs.LG

Recognition: no theorem link

Capabilities of Gemini Models in Medicine

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.LG
keywords Med-Gemini · medical AI · multimodal models · MedQA · USMLE · Gemini models · state-of-the-art performance · medical benchmarks

The pith

Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Med-Gemini as a family of multimodal models built from Gemini foundations and specialized for medicine through web search integration and custom encoders for new data types. It reports new state-of-the-art results on 10 of 14 medical benchmarks, with the top model hitting 91.1 percent accuracy on MedQA using uncertainty-guided search and an average 44.5 percent relative gain over GPT-4V on seven multimodal tasks. A sympathetic reader would care because these results point to AI systems that could soon support diagnostic reasoning, long patient record analysis, medical education, and text summarization at or above expert levels. The work also shows strong in-context performance on needle-in-a-haystack retrieval from health records and medical video question answering without task-specific fine-tuning.

Core claim

Med-Gemini models, specialized for medicine with seamless web search and custom encoders, establish new state-of-the-art performance on 10 out of 14 medical benchmarks. The best model achieves 91.1 percent accuracy on the MedQA USMLE benchmark through a novel uncertainty-guided search strategy, surpasses the GPT-4 model family on every benchmark with a direct comparison, and improves over GPT-4V by an average relative margin of 44.5 percent across seven multimodal benchmarks, including NEJM Image Challenges and the health subset of MMMU. The models further demonstrate long-context strengths by achieving state-of-the-art results on retrieval from long de-identified health records and on medical video question answering.

What carries the argument

The Med-Gemini family of multimodal models that add medical specialization to Gemini's core strengths via web search access and custom encoders for novel modalities.

If this is right

  • AI systems could match or exceed human performance on medical text summarization and multimodal image interpretation tasks.
  • Uncertainty-guided search offers a practical way to raise accuracy on medical question answering without additional training.
  • Long-context capabilities enable effective in-context use of full patient histories and video data for research and education.
  • The same specialization pattern could be applied to other high-stakes domains that need up-to-date knowledge and multimodal reasoning.
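The abstract names the uncertainty-guided search strategy but does not spell out its mechanics. One plausible minimal form, sketched here under assumed details (the sample count, the entropy threshold, and the `ask`/`search` callables are illustrative stand-ins, not taken from the paper), is to sample several answers, measure their disagreement, and invoke retrieval only when the model is uncertain:

```python
import math
from collections import Counter

def vote_entropy(answers):
    """Shannon entropy (nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def uncertainty_guided_answer(ask, question, search=None, k=5, threshold=0.5):
    """Sample k answers; if they disagree too much, retry once with
    retrieved context prepended, then return the majority-vote answer."""
    answers = [ask(question) for _ in range(k)]
    if search is not None and vote_entropy(answers) > threshold:
        context = search(question)  # e.g. web-search snippets (assumed API)
        answers = [ask(f"{context}\n\n{question}") for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

With a model stub that always returns the same answer, the entropy is zero and the search branch never fires; disagreement among samples is what triggers retrieval.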

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Real-world deployment would still require separate safety testing on noisy clinical data, since benchmark gains do not automatically guarantee clinical reliability.
  • The 44.5 percent multimodal margin suggests similar gains may appear in other visual-heavy medical workflows such as radiology or pathology.
  • Because the models rely on web search, their outputs could be kept current with new guidelines more easily than static fine-tuned systems.
  • Strong needle-in-a-haystack retrieval performance opens the possibility of using the models to surface relevant prior cases from large hospital archives.
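A needle-in-a-haystack test plants a known fact at a random depth inside a long document and checks whether the model can answer a question about it. A minimal harness can be sketched as follows, with a toy substring-matching stand-in for the model (the filler text, the needle sentence, and `answer_fn` are all illustrative, not the paper's setup):

```python
import random

def make_haystack(filler_lines, needle, depth):
    """Insert the needle sentence at a relative depth (0..1) in the filler."""
    lines = list(filler_lines)
    lines.insert(int(depth * len(lines)), needle)
    return "\n".join(lines)

def needle_retrieval_score(answer_fn, needle, question, expected,
                           trials=10, seed=0):
    """Fraction of trials in which answer_fn recovers the planted fact."""
    rng = random.Random(seed)
    filler = [f"Routine note {i}: vitals stable, no acute events."
              for i in range(500)]
    hits = 0
    for _ in range(trials):
        doc = make_haystack(filler, needle, rng.random())
        hits += answer_fn(doc, question).strip().lower() == expected
    return hits / trials
```

A real evaluation would pass the model under test as `answer_fn` and vary the document length and needle depth systematically.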

Load-bearing premise

High accuracy on curated medical benchmarks will translate to reliable and safe performance when the models encounter noisy, incomplete, or out-of-distribution patient data in actual clinical settings.

What would settle it

A side-by-side comparison of Med-Gemini outputs against board-certified physicians on a held-out set of real de-identified hospital cases, measuring diagnostic accuracy and rate of unsafe recommendations.

read the original abstract

Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Med-Gemini, a family of multimodal models specialized for medicine by building on Gemini's core strengths in multimodal and long-context reasoning. It reports new state-of-the-art results on 10 of 14 medical benchmarks, including 91.1% accuracy on MedQA (USMLE) via a novel uncertainty-guided search strategy, consistent outperformance over the GPT-4 family on all directly comparable benchmarks, and a 44.5% average relative improvement over GPT-4V across 7 multimodal benchmarks (NEJM Image Challenges, MMMU health subset, etc.). Additional results highlight long-context retrieval from de-identified health records, medical video QA, and surpassing human experts on medical text summarization, while noting the need for further rigorous evaluation before clinical deployment.

Significance. If the benchmark results prove robust, the work establishes a meaningful advance in medical AI capabilities by showing how general multimodal models can be efficiently specialized for high-stakes domains. The combination of web search integration, custom encoders, and in-context long-context handling without bespoke training is a notable strength. These results provide concrete evidence of progress toward AI support for medical reasoning, education, and research, while the paper's explicit caution about real-world deployment aligns with the safety-critical context.

major comments (3)
  1. Abstract and results sections: The reported SoTA figures (91.1% on MedQA, 44.5% relative margin on multimodal tasks) are presented without error bars, confidence intervals, or multiple-run statistics. This omission makes it impossible to determine whether the gains over prior methods are statistically significant or sensitive to random seeds, directly affecting the reliability of the central performance claims.
  2. MedQA evaluation: The uncertainty-guided search strategy is described as key to reaching 91.1% accuracy, yet no ablation studies compare it against standard search or report its contribution in isolation. Without these controls, it remains unclear whether the performance stems from the strategy itself or from other unstated factors such as prompt engineering or data access.
  3. Multimodal benchmarks section: The average 44.5% relative improvement over GPT-4V is given across 7 tasks, but per-benchmark breakdowns with variance or standard deviations are not provided. This prevents assessment of whether the gains are consistent or driven by a subset of easier tasks, weakening the broad outperformance claim.
minor comments (2)
  1. The abstract could explicitly list the exact prior SoTA scores and the specific GPT-4 variants used for each direct comparison to improve transparency.
  2. Terminology such as 'needle-in-a-haystack retrieval' in the long-context experiments would benefit from a short definition or citation on first use for broader accessibility.
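The first major comment can be partly addressed without costly reruns: on a fixed test set, a percentile bootstrap over per-item correctness yields a confidence interval from a single evaluation. A minimal sketch, with illustrative item counts rather than numbers from the paper's evaluation logs:

```python
import random

def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from per-item 0/1 scores."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n
                   for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative: ~91.1% accuracy on a hypothetical 1,273-item test set.
scores = [1] * 1160 + [0] * 113
low, high = bootstrap_accuracy_ci(scores)
```

For a test set of this size the 95% interval spans roughly 0.90 to 0.93, which is exactly the spread the referee asks to see alongside single-point SoTA claims.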

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of result presentation and evaluation rigor. We address each major point below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: Abstract and results sections: The reported SoTA figures (91.1% on MedQA, 44.5% relative margin on multimodal tasks) are presented without error bars, confidence intervals, or multiple-run statistics. This omission makes it impossible to determine whether the gains over prior methods are statistically significant or sensitive to random seeds, directly affecting the reliability of the central performance claims.

    Authors: We agree that error bars and statistical measures would strengthen the presentation. The reported numbers reflect single-run evaluations on fixed benchmarks, which is standard practice for large multimodal models given the high computational cost of repeated full evaluations. In the revision we will add explicit discussion of this limitation in the results section, note that gains are consistent with prior single-run reports in the literature, and include any available variance from prompt-sensitivity checks performed during development. revision: partial

  2. Referee: MedQA evaluation: The uncertainty-guided search strategy is described as key to reaching 91.1% accuracy, yet no ablation studies compare it against standard search or report its contribution in isolation. Without these controls, it remains unclear whether the performance stems from the strategy itself or from other unstated factors such as prompt engineering or data access.

    Authors: We will add a targeted ablation in the revised manuscript (main text or supplementary) that isolates the uncertainty-guided search by comparing it directly against a standard search baseline and a chain-of-thought baseline using the identical Med-Gemini model, prompt template, and retrieval setup. This will quantify the incremental contribution of the uncertainty component. revision: yes

  3. Referee: Multimodal benchmarks section: The average 44.5% relative improvement over GPT-4V is given across 7 tasks, but per-benchmark breakdowns with variance or standard deviations are not provided. This prevents assessment of whether the gains are consistent or driven by a subset of easier tasks, weakening the broad outperformance claim.

    Authors: We will expand the multimodal results section to include a full per-benchmark table listing absolute accuracies for both Med-Gemini and GPT-4V on each of the seven tasks, together with the relative improvement for every individual benchmark. Any variance estimates obtainable from the evaluation logs will be reported; otherwise we will note the single-run nature consistently with the first comment. revision: yes
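The 44.5 percent figure is an average of per-benchmark relative margins, i.e. the mean of (model − baseline) / baseline across the seven tasks. A small sketch with purely illustrative scores (not the paper's actual per-benchmark results) shows why the referee's request for a breakdown matters: one outlier task can carry most of the average.

```python
def relative_margins(model_acc, baseline_acc):
    """Per-benchmark relative improvement (a - b) / b for paired scores."""
    assert len(model_acc) == len(baseline_acc)
    return [(a - b) / b for a, b in zip(model_acc, baseline_acc)]

def mean_relative_margin(model_acc, baseline_acc):
    """Average of the per-benchmark relative improvements."""
    m = relative_margins(model_acc, baseline_acc)
    return sum(m) / len(m)

# Illustrative only: uniform 10% gains vs. one task doubling the baseline.
even = mean_relative_margin([0.66, 0.55, 0.72], [0.60, 0.50, 0.65])
skew = mean_relative_margin([0.62, 0.51, 0.90], [0.60, 0.50, 0.45])
```

Here `even` is about 0.10 on every task, while `skew` averages about 0.35 even though two of its three tasks barely improve, which is why per-task tables are more informative than the mean alone.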

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are self-contained measurements

full rationale

The paper presents Med-Gemini as a family of specialized multimodal models evaluated directly on 14 fixed medical benchmarks (MedQA, NEJM Image Challenges, MMMU health subset, etc.). All reported results—91.1% MedQA accuracy via uncertainty-guided search, 44.5% relative gains over GPT-4V, long-context retrieval, and human-expert comparisons—are obtained by running the models on these curated test sets and recording accuracy or other metrics. No equations, parameter fits, uniqueness theorems, or ansatzes are invoked whose outputs are then relabeled as predictions; the central claims reduce only to the empirical measurements themselves. Self-citations, if present, are limited to prior Gemini work and do not carry the load of the performance numbers. The derivation chain is therefore empty of circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard large-language-model training assumptions and the premise that public medical benchmarks are representative; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5921 in / 1252 out tokens · 42701 ms · 2026-05-15T17:08:19.092721+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

    cs.LG 2026-04 unverdicted novelty 7.0

    ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.

  2. BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    BLEG enhances GNNs for fMRI brain network analysis by prompting LLMs for text augmentation, using cost-effective instruction tuning, and applying alignment losses during joint training.

  3. ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

    cs.CL 2026-05 conditional novelty 6.0

    Intent-aware retrieval over assertion-labeled knowledge graphs improves clinical QA accuracy by 22 percentage points on a new MIMIC-IV benchmark that stresses negation, temporality, and attribution.

  4. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  5. SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

    cs.AI 2026-05 conditional novelty 6.0

    In a large real-world randomized study, SymptomAI agents achieved higher differential diagnosis accuracy (OR 2.47) than clinicians and showed stronger results with dedicated symptom interviews.

  6. SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

    cs.AI 2026-05 unverdicted novelty 6.0

    Large real-world deployment found conversational AI agents for everyday symptom assessment more accurate than clinicians and improved by structured interviewing.

  7. GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

    cs.LG 2026-04 unverdicted novelty 6.0

    GAZE framework with viewer tools and literature retrieval achieves 58.2 mAP@0.3 lesion localization and 34.9% top-1 diagnostic accuracy on 906 rare brain MRI cases in zero-shot setting, with larger gains on rarest pat...

  8. Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

  9. MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

    cs.CL 2026-04 unverdicted novelty 6.0

    MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

  10. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  11. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    cs.CL 2024-12 unverdicted novelty 6.0

    HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

  12. LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

    cs.CL 2026-04 unverdicted novelty 5.0

    Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.

  13. Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

    cs.LG 2026-04 unverdicted novelty 5.0

    BMBE separates LLM language handling from a standalone Bayesian diagnostic engine, producing calibrated selective diagnosis, a performance gap over frontier LLMs, and robustness to adversarial inputs.

  14. Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.

  15. Predicting Post-Traumatic Epilepsy from Clinical Records using Large Language Model Embeddings

    cs.LG 2026-04 unverdicted novelty 5.0

    LLM embeddings from clinical records, fused with tabular data via gradient-boosted trees, predict post-traumatic epilepsy at AUC-ROC 0.892 and AUPRC 0.798.

  16. CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs

    cs.CY 2026-04 unverdicted novelty 5.0

    CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...

  17. Measuring the metacognition of AI

    cs.AI 2026-03 unverdicted novelty 5.0

    Meta-d' and signal detection theory provide quantitative tools to assess metacognitive sensitivity and risk-based regulation in large language models.

  18. Medical Reasoning with Large Language Models: A Survey and MR-Bench

    cs.CL 2026-03 accept novelty 5.0

    LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

  19. Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning

    cs.CL 2026-05 unverdicted novelty 4.0

    Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.

  20. Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines

    cs.CL 2026-05 unverdicted novelty 4.0

    A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.

Reference graph

Works this paper leans on

269 extracted references · 269 canonical work pages · cited by 19 Pith papers · 14 internal anchors

  1. [1]

    M. D. Abr \`a moff, M. E. Tarver, N. Loyo-Berrios, S. Trujillo, D. Char, Z. Obermeyer, M. B. Eydelman, F. P. of Ophthalmic Imaging, D. Algorithmic Interpretation Working Group of the Collaborative Community for Ophthalmic Imaging Foundation, Washington, and W. H. Maisel. Considerations for addressing bias in artificial intelligence for health equity. NPJ ...

  2. [2]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Alayrac, J

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022

  4. [4]

    R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  5. [5]

    Antaki, D

    F. Antaki, D. Milad, M. A. Chia, C.- \'E . Gigu \`e re, S. Touma, J. El-Khoury, P. A. Keane, and R. Duval. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering. British Journal of Ophthalmology, 2023

  6. [6]

    Azizi, L

    S. Azizi, L. Culp, J. Freyberg, B. Mustafa, S. Baur, S. Kornblith, T. Chen, N. Tomasev, J. Mitrovi \'c , P. Strachan, et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nature Biomedical Engineering, 7 0 (6): 0 756--779, 2023

  7. [7]

    Barham, A

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, et al. Pathways: Asynchronous distributed dataflow for ML . Proceedings of Machine Learning and Systems, 4: 0 430--449, 2022

  8. [8]

    Besta, N

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, and T. Hoefler. Graph of thoughts: Solving elaborate problems with large language models, 2024

  9. [9]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  10. [11]

    \'A . A. Cabrera, W. Epperson, F. Hohman, M. Kahng, J. Morgenstern, and D. H. Chau. Fairvis: Visual analytics for discovering intersectional bias in machine learning. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), pages 46--56. IEEE, 2019

  11. [13]

    D. S. Char, N. H. Shah, and D. Magnus. Implementing machine learning in health care—addressing ethical challenges. The New England journal of medicine, 378 0 (11): 0 981, 2018

  12. [14]

    W. Chen, J. Feng, J. Lu, and J. Zhou. Endo3d: Online workflow analysis for endoscopic surgeries based on 3d cnn and lstm. In OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis: First International Workshop, OR 2.0 2018, 5th International Workshop, CARE 2018, 7th Internatio...

  13. [15]

    X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. PaLI : A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022

  14. [16]

    Chowdhery, S

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM : Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

  15. [17]

    Cirillo, S

    D. Cirillo, S. Catuara-Solarz, C. Morey, E. Guney, L. Subirats, S. Mellino, A. Gigante, A. Valencia, M. J. Rementeria, A. S. Chadha, et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ digital medicine, 3 0 (1): 0 1--11, 2020

  16. [18]

    Claussnitzer, S

    M. Claussnitzer, S. N. Dankel, K.-H. Kim, G. Quon, W. Meuleman, C. Haugen, V. Glunk, I. S. Sousa, J. L. Beaudry, V. Puviindran, et al. Fto obesity variant circuitry and adipocyte browning in humans. New England Journal of Medicine, 373 0 (10): 0 895--907, 2015

  17. [19]

    Standards for reporting plain language summaries (pls) for cochrane diagnostic test accuracy reviews, 2014

    Cochrane. Standards for reporting plain language summaries (pls) for cochrane diagnostic test accuracy reviews, 2014. https://methods.cochrane.org/sites/methods.cochrane.org.sdt/files/uploads/Draft PLS document.pdf

  18. [20]

    E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702--703, 2020

  19. [22]

    Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer- XL : Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

  20. [23]

    P. Densen. Challenges and opportunities facing medical education. Transactions of the American Clinical and Climatological Association, 122: 0 48, 2011

  21. [24]

    Devaraj, I

    A. Devaraj, I. Marshall, B. Wallace, and J. J. Li. Paragraph-level simplification of medical texts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4972--4984. Association for Computational Linguistics, June 2021. URL https://www.aclweb.org/anthology/2021.naacl-main.395

  22. [25]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  23. [26]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. PaLM-E : An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  24. [27]

    A. V. Eriksen, S. M \"o ller, and J. Ryg. Use of GPT-4 to diagnose complex clinical cases, 2023

  25. [28]

    Feder, I

    A. Feder, I. Laish, S. Agarwal, U. Lerner, A. Atias, C. Cheung, P. Clardy, A. Peled-Cohen, R. Fellinger, H. Liu, et al. Building a clinically-focused problem list from medical notes. In Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI), pages 60--68, 2022

  26. [29]

    Fedus, B

    W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23 0 (120): 0 1--39, 2022

  27. [31]

    E. Ford, J. A. Carroll, H. E. Smith, D. Scott, and J. A. Cassell. Extracting information from the text of electronic medical records to improve case detection: a systematic review. Journal of the American Medical Informatics Association, 23 0 (5): 0 1007--1015, 2016

  28. [33]

    Ganapathi, J

    S. Ganapathi, J. Palmer, J. E. Alderman, M. Calvert, C. Espinoza, J. Gath, M. Ghassemi, K. Heller, F. Mckay, A. Karthikesalingam, et al. Tackling bias in ai health datasets through the standing together initiative. Nature Medicine, 28 0 (11): 0 2232--2233, 2022

  29. [34]

    Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey, 2024

  30. [36]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Google . Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf

  31. [37]

    J. W. Gichoya, I. Banerjee, A. R. Bhimireddy, J. L. Burns, L. A. Celi, L.-C. Chen, R. Correa, N. Dullerud, M. Ghassemi, S.-C. Huang, et al. Ai recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health, 4 0 (6): 0 e406--e414, 2022

  32. [38]

    Golany, A

    T. Golany, A. Aides, D. Freedman, N. Rabani, Y. Liu, E. Rivlin, G. S. Corrado, Y. Matias, W. Khoury, H. Kashtan, et al. Artificial intelligence for phase recognition in complex laparoscopic cholecystectomy. Surgical Endoscopy, 36 0 (12): 0 9215--9223, 2022

  33. [40]

    E. D. Goodman, K. K. Patel, Y. Zhang, W. Locke, C. J. Kennedy, R. Mehrotra, S. Ren, M. Guan, O. Zohar, M. Downing, et al. Analyzing surgical technique in diverse open surgical videos with multitask machine learning. JAMA surgery, 159 0 (2): 0 185--192, 2024

  34. [41]

    K. K. Grandage, D. C. Slawson, and A. F. Shaughnessy. When less is more: a practical approach to searching for evidence-based answers. Journal of the Medical Library Association, 90 0 (3): 0 298, 2002

  35. [42]

    L. D. Gruppen. Clinical reasoning: defining it, teaching it, assessing it, studying it. Western Journal of Emergency Medicine, 18 0 (1): 0 4, 2017

  36. [44]

    Gupta, K

    D. Gupta, K. Attal, and D. Demner-Fushman. A dataset for medical instructional video classification and question answering. Scientific Data, 10 0 (1): 0 158, 2023

  37. [45]

    S. Hao, T. Liu, Z. Wang, and Z. Hu. ToolkenGPT : Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36, 2024

  38. [46]

    X. He, Z. Cai, W. Wei, Y. Zhang, L. Mou, E. Xing, and P. Xie. PathVQA : 30000+ questions for medical visual question answering. arXiv preprint arXiv:2010.12435, 2020

  39. [47]

    Horvitz, D

    E. Horvitz, D. Heckerman, B. N. Nathwani, and L. M. Fagan. Diagnostic strategies in the hypothesis-directed pathfinder system. pages 630--636, January 1984. URL https://www.microsoft.com/en-us/research/publication/diagnostic-strategies-hypothesis-directed-pathfinder-system/

  40. [48]

    Hou and Z

    W. Hou and Z. Ji. GeneTuring tests GPT models in genomics. BioRxiv, 2023

  41. [49]

    Huang and K

    J. Huang and K. C.-C. Chang. Towards reasoning in large language models: A survey, 2023

  42. [50]

    Huang, L

    J. Huang, L. Neill, M. Wittbrodt, D. Melnick, M. Klug, M. Thompson, J. Bailitz, T. Loftus, S. Malik, A. Phull, et al. Generative artificial intelligence for chest radiograph interpretation in the emergency department. JAMA Network Open, 6 0 (10): 0 e2336100--e2336100, 2023

  43. [52]

    Irvin, P

    J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. Che X pert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590--597, 2019

  44. [53]

    A. Iyer, G. Sen, and P. \"O stlin. The intersections of gender and class in health status and health care. Global public health, 3 0 (S1): 0 13--24, 2008

  45. [54]

    S. E. Jackson, R. A. Hackett, and A. Steptoe. Associations between age discrimination and health and wellbeing: cross-sectional and prospective analysis of the english longitudinal study of ageing. The Lancet Public Health, 4 0 (4): 0 e200--e208, 2019

  46. [55]

    P. B. Jensen, L. J. Jensen, and S. Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395--405, 2012

  47. [56]

    D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  48. [57]

    Q. Jin, Y. Yang, Q. Chen, and Z. Lu. GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics, 40(2):btae075, 2024

  49. [58]

    A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1--9, 2016

  50. [59]

    A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019a

  51. [60]

    A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019b

  52. [61]

    Z. Kanjee, B. Crowe, and A. Rodman. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA, 330(1):78--80, 2023

  53. [62]

    J. A. Kent, V. Patel, and N. A. Varela. Gender disparities in health care. Mount Sinai Journal of Medicine: A Journal of Translational and Personalized Medicine, 79(5):555--559, 2012

  54. [63]

    I. Klerings, A. S. Weinhandl, and K. J. Thaler. Information overload in healthcare: too much of a good thing? Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen, 109(4-5):285--290, 2015

  55. [64]

    R. Kouzy, J. Abi Jaoude, A. Kraitem, M. B. El Alam, B. Karam, E. Adib, J. Zarka, C. Traboulsi, E. W. Akl, and K. Baddour. Coronavirus goes viral: quantifying the COVID-19 misinformation epidemic on Twitter. Cureus, 12(3), 2020

  56. [65]

    S. Laber, S. Forcisi, L. Bentley, J. Petzold, F. Moritz, K. S. Smirnov, L. Al Sadat, I. Williamson, S. Strobel, T. Agnew, et al. Linking the FTO obesity rs1421085 variant circuitry to cellular, metabolic, and organismal phenotypes in vivo. Science Advances, 7(30):eabg0108, 2021

  57. [66]

    T. Le Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. BLOOM: A 176B-parameter open-access multilingual language model. 2022

  58. [67]

    G. Leifman, A. Aides, T. Golany, D. Freedman, and E. Rivlin. Pixel-accurate segmentation of surgical tools based on bounding box annotations. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 5096--5103. IEEE, 2022

  59. [69]

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024

  60. [70]

    Y. Li, R. M. Wehbe, F. S. Ahmad, H. Wang, and Y. Luo. A comparative study of pretrained language models for long clinical text. Journal of the American Medical Informatics Association, 30(2):340--347, 2023

  61. [71]

    B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, and X.-M. Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650--1654. IEEE, 2021

  62. [72]

    M. Liu, Y. Ning, S. Teixayavong, M. Mertens, J. Xu, D. S. W. Ting, L. T.-E. Cheng, J. C. L. Ong, Z. L. Teo, T. F. Tan, et al. A translational perspective towards clinical AI fairness. NPJ Digital Medicine, 6(1):172, 2023

  63. [73]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157--173, 2024

  64. [74]

    R. J. Loos and G. S. Yeo. The genetics of obesity: from discovery to biology. Nature Reviews Genetics, 23(2):120--133, 2022

  65. [75]

    N. López and V. L. Gadsden. Health inequities, social determinants, and intersectionality. In Perspectives on Health Equity and Social Determinants of Health. National Academies Press (US), 2017

  66. [77]

    R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T.-Y. Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409, 2022

  67. [81]

    J. Medina-Martínez, C. Saus-Ortega, M. M. Sánchez-Lorente, E. M. Sosa-Palanca, P. García-Martínez, and M. I. Mármol-López. Health inequities in LGBT people and nursing interventions to reduce them: A systematic review. International Journal of Environmental Research and Public Health, 18(22):11801, 2021

  68. [82]

    Meta. Papers with Code - Medical, 2024. URL https://paperswithcode.com/area/medical. Accessed: 2024-04-26

  69. [83]

    M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, and P. Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259--265, 2023a

  70. [84]

    M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y. Dalmia, J. Leskovec, C. Zakka, E. P. Reis, and P. Rajpurkar. Med-Flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353--367. PMLR, 2023b

  71. [85]

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  72. [87]

    A. Novin and E. Meyers. Making sense of conflicting science information: Exploring bias in the search engine result page. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, pages 175--184, 2017

  73. [88]

    C. I. Nwoye, D. Mutter, J. Marescaux, and N. Padoy. Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos. International Journal of Computer Assisted Radiology and Surgery, 14:1059--1067, 2019

  74. [89]

    Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447--453, 2019

  75. [90]

    J. Oh, G. Lee, S. Bae, J.-m. Kwon, and E. Choi. ECG-QA: A comprehensive question answering dataset combined with electrocardiogram. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 66277--66288. Curran Associates, Inc., 2023. URL https://proceedings.neurips...

  76. [91]

    J. A. Omiye, J. C. Lester, S. Spichak, V. Rotemberg, and R. Daneshjou. Large language models propagate race-based medicine. NPJ Digital Medicine, 6(1):195, 2023

  77. [92]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  78. [93]

    A. G. Pacheco, G. R. Lima, A. S. Salomao, B. Krohling, I. P. Biral, G. G. de Angelo, F. C. Alves Jr, J. G. Esgario, A. C. Simora, P. B. Castro, et al. PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data in Brief, 32:106221, 2020

  79. [94]

    M. Parmar, A. Naik, H. Gupta, D. Agrawal, and C. Baral. LongBoX: Evaluating transformers on long-sequence clinical tasks, 2023

  80. [95]

    O. Pelka, S. Koitka, J. Rückert, F. Nensa, and C. M. Friedrich. Radiology objects in context (ROCO): a multimodal image dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 201...

Showing first 80 references.