pith. machine review for the scientific record.

arxiv: 2404.18416 · v2 · submitted 2024-04-29 · 💻 cs.AI · cs.CL · cs.CV · cs.LG

Recognition: no theorem link

Capabilities of Gemini Models in Medicine

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.LG
keywords Med-Gemini · medical AI · multimodal models · MedQA · USMLE · Gemini models · state-of-the-art performance · medical benchmarks

The pith

Med-Gemini models reach 91.1 percent accuracy on USMLE medical questions and surpass GPT-4 on medical benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Med-Gemini as a family of multimodal models built from Gemini foundations and specialized for medicine through web search integration and custom encoders for new data types. It reports new state-of-the-art results on 10 of 14 medical benchmarks, with the top model hitting 91.1 percent accuracy on MedQA using uncertainty-guided search and an average 44.5 percent relative gain over GPT-4V on seven multimodal tasks. A sympathetic reader would care because these results point to AI systems that could soon support diagnostic reasoning, long patient record analysis, medical education, and text summarization at or above expert levels. The work also shows strong in-context performance on needle-in-a-haystack retrieval from health records and medical video question answering without task-specific fine-tuning.

Core claim

Med-Gemini models, specialized for medicine with seamless web search and custom encoders, establish new state-of-the-art performance on 10 out of 14 medical benchmarks. The best model achieves 91.1 percent accuracy on the MedQA USMLE benchmark through a novel uncertainty-guided search strategy, surpasses the GPT-4 model family on every benchmark with a direct comparison, and improves over GPT-4V by an average relative margin of 44.5 percent across seven multimodal benchmarks, including NEJM Image Challenges and the health subset of MMMU. The models further demonstrate long-context strengths by achieving state-of-the-art results on retrieval from long de-identified health records and on medical video question answering.

What carries the argument

The Med-Gemini family of multimodal models that add medical specialization to Gemini's core strengths via web search access and custom encoders for novel modalities.

If this is right

  • AI systems could match or exceed human performance on medical text summarization and multimodal image interpretation tasks.
  • Uncertainty-guided search offers a practical way to raise accuracy on medical question answering without additional training.
  • Long-context capabilities enable effective in-context use of full patient histories and video data for research and education.
  • The same specialization pattern could be applied to other high-stakes domains that need up-to-date knowledge and multimodal reasoning.
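The abstract names the uncertainty-guided search strategy but does not spell out its mechanics. One plausible minimal form, sketched here under assumed details (the sample count, the entropy threshold, and the `ask`/`search` callables are illustrative stand-ins, not taken from the paper), is to sample several answers, measure their disagreement, and invoke retrieval only when the model is uncertain:

```python
import math
from collections import Counter

def vote_entropy(answers):
    """Shannon entropy (nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def uncertainty_guided_answer(ask, question, search=None, k=5, threshold=0.5):
    """Sample k answers; if they disagree too much, retry once with
    retrieved context prepended, then return the majority-vote answer."""
    answers = [ask(question) for _ in range(k)]
    if search is not None and vote_entropy(answers) > threshold:
        context = search(question)  # e.g. web-search snippets (assumed API)
        answers = [ask(f"{context}\n\n{question}") for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

With a model stub that always returns the same answer, the entropy is zero and the search branch never fires; disagreement among samples is what triggers retrieval.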

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Real-world deployment would still require separate safety testing on noisy clinical data, since benchmark gains do not automatically guarantee clinical reliability.
  • The 44.5 percent multimodal margin suggests similar gains may appear in other visual-heavy medical workflows such as radiology or pathology.
  • Because the models rely on web search, their outputs could be kept current with new guidelines more easily than static fine-tuned systems.
  • Strong needle-in-a-haystack retrieval performance opens the possibility of using the models to surface relevant prior cases from large hospital archives.
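A needle-in-a-haystack test plants a known fact at a random depth inside a long document and checks whether the model can answer a question about it. A minimal harness can be sketched as follows, with a toy substring-matching stand-in for the model (the filler text, the needle sentence, and `answer_fn` are all illustrative, not the paper's setup):

```python
import random

def make_haystack(filler_lines, needle, depth):
    """Insert the needle sentence at a relative depth (0..1) in the filler."""
    lines = list(filler_lines)
    lines.insert(int(depth * len(lines)), needle)
    return "\n".join(lines)

def needle_retrieval_score(answer_fn, needle, question, expected,
                           trials=10, seed=0):
    """Fraction of trials in which answer_fn recovers the planted fact."""
    rng = random.Random(seed)
    filler = [f"Routine note {i}: vitals stable, no acute events."
              for i in range(500)]
    hits = 0
    for _ in range(trials):
        doc = make_haystack(filler, needle, rng.random())
        hits += answer_fn(doc, question).strip().lower() == expected
    return hits / trials
```

A real evaluation would pass the model under test as `answer_fn` and vary the document length and needle depth systematically.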

Load-bearing premise

High accuracy on curated medical benchmarks will translate to reliable and safe performance when the models encounter noisy, incomplete, or out-of-distribution patient data in actual clinical settings.

What would settle it

A side-by-side comparison of Med-Gemini outputs against board-certified physicians on a held-out set of real de-identified hospital cases, measuring diagnostic accuracy and rate of unsafe recommendations.

read the original abstract

Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Med-Gemini, a family of multimodal models specialized for medicine by building on Gemini's core strengths in multimodal and long-context reasoning. It reports new state-of-the-art results on 10 of 14 medical benchmarks, including 91.1% accuracy on MedQA (USMLE) via a novel uncertainty-guided search strategy, consistent outperformance over the GPT-4 family on all directly comparable benchmarks, and a 44.5% average relative improvement over GPT-4V across 7 multimodal benchmarks (NEJM Image Challenges, MMMU health subset, etc.). Additional results highlight long-context retrieval from de-identified health records, medical video QA, and surpassing human experts on medical text summarization, while noting the need for further rigorous evaluation before clinical deployment.

Significance. If the benchmark results prove robust, the work establishes a meaningful advance in medical AI capabilities by showing how general multimodal models can be efficiently specialized for high-stakes domains. The combination of web search integration, custom encoders, and in-context long-context handling without bespoke training is a notable strength. These results provide concrete evidence of progress toward AI support for medical reasoning, education, and research, while the paper's explicit caution about real-world deployment aligns with the safety-critical context.

major comments (3)
  1. Abstract and results sections: The reported SoTA figures (91.1% on MedQA, 44.5% relative margin on multimodal tasks) are presented without error bars, confidence intervals, or multiple-run statistics. This omission makes it impossible to determine whether the gains over prior methods are statistically significant or sensitive to random seeds, directly affecting the reliability of the central performance claims.
  2. MedQA evaluation: The uncertainty-guided search strategy is described as key to reaching 91.1% accuracy, yet no ablation studies compare it against standard search or report its contribution in isolation. Without these controls, it remains unclear whether the performance stems from the strategy itself or from other unstated factors such as prompt engineering or data access.
  3. Multimodal benchmarks section: The average 44.5% relative improvement over GPT-4V is given across 7 tasks, but per-benchmark breakdowns with variance or standard deviations are not provided. This prevents assessment of whether the gains are consistent or driven by a subset of easier tasks, weakening the broad outperformance claim.
minor comments (2)
  1. The abstract could explicitly list the exact prior SoTA scores and the specific GPT-4 variants used for each direct comparison to improve transparency.
  2. Terminology such as 'needle-in-a-haystack retrieval' in the long-context experiments would benefit from a short definition or citation on first use for broader accessibility.
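The first major comment can be partly addressed without costly reruns: on a fixed test set, a percentile bootstrap over per-item correctness yields a confidence interval from a single evaluation. A minimal sketch, with illustrative item counts rather than numbers from the paper's evaluation logs:

```python
import random

def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from per-item 0/1 scores."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n
                   for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative: ~91.1% accuracy on a hypothetical 1,273-item test set.
scores = [1] * 1160 + [0] * 113
low, high = bootstrap_accuracy_ci(scores)
```

For a test set of this size the 95% interval spans roughly 0.90 to 0.93, which is exactly the spread the referee asks to see alongside single-point SoTA claims.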

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of result presentation and evaluation rigor. We address each major point below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: Abstract and results sections: The reported SoTA figures (91.1% on MedQA, 44.5% relative margin on multimodal tasks) are presented without error bars, confidence intervals, or multiple-run statistics. This omission makes it impossible to determine whether the gains over prior methods are statistically significant or sensitive to random seeds, directly affecting the reliability of the central performance claims.

    Authors: We agree that error bars and statistical measures would strengthen the presentation. The reported numbers reflect single-run evaluations on fixed benchmarks, which is standard practice for large multimodal models given the high computational cost of repeated full evaluations. In the revision we will add explicit discussion of this limitation in the results section, note that gains are consistent with prior single-run reports in the literature, and include any available variance from prompt-sensitivity checks performed during development. revision: partial

  2. Referee: MedQA evaluation: The uncertainty-guided search strategy is described as key to reaching 91.1% accuracy, yet no ablation studies compare it against standard search or report its contribution in isolation. Without these controls, it remains unclear whether the performance stems from the strategy itself or from other unstated factors such as prompt engineering or data access.

    Authors: We will add a targeted ablation in the revised manuscript (main text or supplementary) that isolates the uncertainty-guided search by comparing it directly against a standard search baseline and a chain-of-thought baseline using the identical Med-Gemini model, prompt template, and retrieval setup. This will quantify the incremental contribution of the uncertainty component. revision: yes

  3. Referee: Multimodal benchmarks section: The average 44.5% relative improvement over GPT-4V is given across 7 tasks, but per-benchmark breakdowns with variance or standard deviations are not provided. This prevents assessment of whether the gains are consistent or driven by a subset of easier tasks, weakening the broad outperformance claim.

    Authors: We will expand the multimodal results section to include a full per-benchmark table listing absolute accuracies for both Med-Gemini and GPT-4V on each of the seven tasks, together with the relative improvement for every individual benchmark. Any variance estimates obtainable from the evaluation logs will be reported; otherwise we will note the single-run nature consistently with the first comment. revision: yes
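The 44.5 percent figure is an average of per-benchmark relative margins, i.e. the mean of (model − baseline) / baseline across the seven tasks. A small sketch with purely illustrative scores (not the paper's actual per-benchmark results) shows why the referee's request for a breakdown matters: one outlier task can carry most of the average.

```python
def relative_margins(model_acc, baseline_acc):
    """Per-benchmark relative improvement (a - b) / b for paired scores."""
    assert len(model_acc) == len(baseline_acc)
    return [(a - b) / b for a, b in zip(model_acc, baseline_acc)]

def mean_relative_margin(model_acc, baseline_acc):
    """Average of the per-benchmark relative improvements."""
    m = relative_margins(model_acc, baseline_acc)
    return sum(m) / len(m)

# Illustrative only: uniform 10% gains vs. one task doubling the baseline.
even = mean_relative_margin([0.66, 0.55, 0.72], [0.60, 0.50, 0.65])
skew = mean_relative_margin([0.62, 0.51, 0.90], [0.60, 0.50, 0.45])
```

Here `even` is about 0.10 on every task, while `skew` averages about 0.35 even though two of its three tasks barely improve, which is why per-task tables are more informative than the mean alone.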

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are self-contained measurements

full rationale

The paper presents Med-Gemini as a family of specialized multimodal models evaluated directly on 14 fixed medical benchmarks (MedQA, NEJM Image Challenges, MMMU health subset, etc.). All reported results—91.1% MedQA accuracy via uncertainty-guided search, 44.5% relative gains over GPT-4V, long-context retrieval, and human-expert comparisons—are obtained by running the models on these curated test sets and recording accuracy or other metrics. No equations, parameter fits, uniqueness theorems, or ansatzes are invoked whose outputs are then relabeled as predictions; the central claims reduce only to the empirical measurements themselves. Self-citations, if present, are limited to prior Gemini work and do not carry the load of the performance numbers. The derivation chain is therefore empty of circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard large-language-model training assumptions and the premise that public medical benchmarks are representative; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5921 in / 1252 out tokens · 42701 ms · 2026-05-15T17:08:19.092721+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

    cs.LG 2026-04 unverdicted novelty 7.0

    ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.

  2. BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    BLEG enhances GNNs for fMRI brain network analysis by prompting LLMs for text augmentation, using cost-effective instruction tuning, and applying alignment losses during joint training.

  3. ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

    cs.CL 2026-05 conditional novelty 6.0

    Intent-aware retrieval over assertion-labeled knowledge graphs improves clinical QA accuracy by 22 percentage points on a new MIMIC-IV benchmark that stresses negation, temporality, and attribution.

  4. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  5. SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

    cs.AI 2026-05 conditional novelty 6.0

    In a large real-world randomized study, SymptomAI agents achieved higher differential diagnosis accuracy (OR 2.47) than clinicians and showed stronger results with dedicated symptom interviews.

  6. SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

    cs.AI 2026-05 unverdicted novelty 6.0

    Large real-world deployment found conversational AI agents for everyday symptom assessment more accurate than clinicians and improved by structured interviewing.

  7. GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

    cs.LG 2026-04 unverdicted novelty 6.0

    GAZE framework with viewer tools and literature retrieval achieves 58.2 mAP@0.3 lesion localization and 34.9% top-1 diagnostic accuracy on 906 rare brain MRI cases in zero-shot setting, with larger gains on rarest pat...

  8. Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

  9. MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

    cs.CL 2026-04 unverdicted novelty 6.0

    MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

  10. Towards an AI co-scientist

    cs.AI 2025-02 unverdicted novelty 6.0

    A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

  11. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

    cs.CL 2024-12 unverdicted novelty 6.0

    HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.

  12. LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

    cs.CL 2026-04 unverdicted novelty 5.0

    Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.

  13. Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

    cs.LG 2026-04 unverdicted novelty 5.0

    BMBE separates LLM language handling from a standalone Bayesian diagnostic engine, producing calibrated selective diagnosis, a performance gap over frontier LLMs, and robustness to adversarial inputs.

  14. Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.

  15. Predicting Post-Traumatic Epilepsy from Clinical Records using Large Language Model Embeddings

    cs.LG 2026-04 unverdicted novelty 5.0

    LLM embeddings from clinical records, fused with tabular data via gradient-boosted trees, predict post-traumatic epilepsy at AUC-ROC 0.892 and AUPRC 0.798.

  16. CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs

    cs.CY 2026-04 unverdicted novelty 5.0

    CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...

  17. Measuring the metacognition of AI

    cs.AI 2026-03 unverdicted novelty 5.0

    Meta-d' and signal detection theory provide quantitative tools to assess metacognitive sensitivity and risk-based regulation in large language models.

  18. Medical Reasoning with Large Language Models: A Survey and MR-Bench

    cs.CL 2026-03 accept novelty 5.0

    LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

  19. Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning

    cs.CL 2026-05 unverdicted novelty 4.0

    Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.

  20. Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines

    cs.CL 2026-05 unverdicted novelty 4.0

    A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.

Reference graph

Works this paper leans on

269 extracted references · 269 canonical work pages · cited by 19 Pith papers · 14 internal anchors

  1. [1]

    M. D. Abr \`a moff, M. E. Tarver, N. Loyo-Berrios, S. Trujillo, D. Char, Z. Obermeyer, M. B. Eydelman, F. P. of Ophthalmic Imaging, D. Algorithmic Interpretation Working Group of the Collaborative Community for Ophthalmic Imaging Foundation, Washington, and W. H. Maisel. Considerations for addressing bias in artificial intelligence for health equity. NPJ ...

  2. [2]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Alayrac, J

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 0 23716--23736, 2022

  4. [4]

    R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  5. [5]

    Antaki, D

    F. Antaki, D. Milad, M. A. Chia, C.- \'E . Gigu \`e re, S. Touma, J. El-Khoury, P. A. Keane, and R. Duval. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering. British Journal of Ophthalmology, 2023

  6. [6]

    Azizi, L

    S. Azizi, L. Culp, J. Freyberg, B. Mustafa, S. Baur, S. Kornblith, T. Chen, N. Tomasev, J. Mitrovi \'c , P. Strachan, et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nature Biomedical Engineering, 7 0 (6): 0 756--779, 2023

  7. [7]

    Barham, A

    P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, et al. Pathways: Asynchronous distributed dataflow for ML . Proceedings of Machine Learning and Systems, 4: 0 430--449, 2022

  8. [8]

    Besta, N

    M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, and T. Hoefler. Graph of thoughts: Solving elaborate problems with large language models, 2024

  9. [9]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  10. [11]

    \'A . A. Cabrera, W. Epperson, F. Hohman, M. Kahng, J. Morgenstern, and D. H. Chau. Fairvis: Visual analytics for discovering intersectional bias in machine learning. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), pages 46--56. IEEE, 2019

  11. [13]

    D. S. Char, N. H. Shah, and D. Magnus. Implementing machine learning in health care—addressing ethical challenges. The New England journal of medicine, 378 0 (11): 0 981, 2018

  12. [14]

    W. Chen, J. Feng, J. Lu, and J. Zhou. Endo3d: Online workflow analysis for endoscopic surgeries based on 3d cnn and lstm. In OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis: First International Workshop, OR 2.0 2018, 5th International Workshop, CARE 2018, 7th Internatio...

  13. [15]

    X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. PaLI : A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022

  14. [16]

    Chowdhery, S

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM : Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

  15. [17]

    Cirillo, S

    D. Cirillo, S. Catuara-Solarz, C. Morey, E. Guney, L. Subirats, S. Mellino, A. Gigante, A. Valencia, M. J. Rementeria, A. S. Chadha, et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ digital medicine, 3 0 (1): 0 1--11, 2020

  16. [18]

    Claussnitzer, S

    M. Claussnitzer, S. N. Dankel, K.-H. Kim, G. Quon, W. Meuleman, C. Haugen, V. Glunk, I. S. Sousa, J. L. Beaudry, V. Puviindran, et al. Fto obesity variant circuitry and adipocyte browning in humans. New England Journal of Medicine, 373 0 (10): 0 895--907, 2015

  17. [19]

    Standards for reporting plain language summaries (pls) for cochrane diagnostic test accuracy reviews, 2014

    Cochrane. Standards for reporting plain language summaries (pls) for cochrane diagnostic test accuracy reviews, 2014. https://methods.cochrane.org/sites/methods.cochrane.org.sdt/files/uploads/Draft PLS document.pdf

  18. [20]

    E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702--703, 2020

  19. [22]

    Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer- XL : Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019

  20. [23]

    P. Densen. Challenges and opportunities facing medical education. Transactions of the American Clinical and Climatological Association, 122: 0 48, 2011

  21. [24]

    Devaraj, I

    A. Devaraj, I. Marshall, B. Wallace, and J. J. Li. Paragraph-level simplification of medical texts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4972--4984. Association for Computational Linguistics, June 2021. URL https://www.aclweb.org/anthology/2021.naacl-main.395

  22. [25]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

  23. [26]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. PaLM-E : An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  24. [27]

    A. V. Eriksen, S. M \"o ller, and J. Ryg. Use of GPT-4 to diagnose complex clinical cases, 2023

  25. [28]

    Feder, I

    A. Feder, I. Laish, S. Agarwal, U. Lerner, A. Atias, C. Cheung, P. Clardy, A. Peled-Cohen, R. Fellinger, H. Liu, et al. Building a clinically-focused problem list from medical notes. In Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI), pages 60--68, 2022

  26. [29]

    Fedus, B

    W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23 0 (120): 0 1--39, 2022

  27. [31]

    E. Ford, J. A. Carroll, H. E. Smith, D. Scott, and J. A. Cassell. Extracting information from the text of electronic medical records to improve case detection: a systematic review. Journal of the American Medical Informatics Association, 23 0 (5): 0 1007--1015, 2016

  28. [33]

    Ganapathi, J

    S. Ganapathi, J. Palmer, J. E. Alderman, M. Calvert, C. Espinoza, J. Gath, M. Ghassemi, K. Heller, F. Mckay, A. Karthikesalingam, et al. Tackling bias in ai health datasets through the standing together initiative. Nature Medicine, 28 0 (11): 0 2232--2233, 2022

  29. [34]

    Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and H. Wang. Retrieval-augmented generation for large language models: A survey, 2024

  30. [36]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Google . Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf

  31. [37]

    J. W. Gichoya, I. Banerjee, A. R. Bhimireddy, J. L. Burns, L. A. Celi, L.-C. Chen, R. Correa, N. Dullerud, M. Ghassemi, S.-C. Huang, et al. Ai recognition of patient race in medical imaging: a modelling study. The Lancet Digital Health, 4 0 (6): 0 e406--e414, 2022

  32. [38]

    Golany, A

    T. Golany, A. Aides, D. Freedman, N. Rabani, Y. Liu, E. Rivlin, G. S. Corrado, Y. Matias, W. Khoury, H. Kashtan, et al. Artificial intelligence for phase recognition in complex laparoscopic cholecystectomy. Surgical Endoscopy, 36 0 (12): 0 9215--9223, 2022

  33. [40]

    E. D. Goodman, K. K. Patel, Y. Zhang, W. Locke, C. J. Kennedy, R. Mehrotra, S. Ren, M. Guan, O. Zohar, M. Downing, et al. Analyzing surgical technique in diverse open surgical videos with multitask machine learning. JAMA surgery, 159 0 (2): 0 185--192, 2024

  34. [41]

    K. K. Grandage, D. C. Slawson, and A. F. Shaughnessy. When less is more: a practical approach to searching for evidence-based answers. Journal of the Medical Library Association, 90 0 (3): 0 298, 2002

  35. [42]

    L. D. Gruppen. Clinical reasoning: defining it, teaching it, assessing it, studying it. Western Journal of Emergency Medicine, 18 0 (1): 0 4, 2017

  36. [44]

    Gupta, K

    D. Gupta, K. Attal, and D. Demner-Fushman. A dataset for medical instructional video classification and question answering. Scientific Data, 10 0 (1): 0 158, 2023

  37. [45]

    S. Hao, T. Liu, Z. Wang, and Z. Hu. ToolkenGPT : Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36, 2024

  38. [46]

    X. He, Z. Cai, W. Wei, Y. Zhang, L. Mou, E. Xing, and P. Xie. PathVQA : 30000+ questions for medical visual question answering. arXiv preprint arXiv:2010.12435, 2020

  39. [47]

    Horvitz, D

    E. Horvitz, D. Heckerman, B. N. Nathwani, and L. M. Fagan. Diagnostic strategies in the hypothesis-directed pathfinder system. pages 630--636, January 1984. URL https://www.microsoft.com/en-us/research/publication/diagnostic-strategies-hypothesis-directed-pathfinder-system/

  40. [48]

    Hou and Z

    W. Hou and Z. Ji. GeneTuring tests GPT models in genomics. BioRxiv, 2023

  41. [49]

    Huang and K

    J. Huang and K. C.-C. Chang. Towards reasoning in large language models: A survey, 2023

  42. [50]

    Huang, L

    J. Huang, L. Neill, M. Wittbrodt, D. Melnick, M. Klug, M. Thompson, J. Bailitz, T. Loftus, S. Malik, A. Phull, et al. Generative artificial intelligence for chest radiograph interpretation in the emergency department. JAMA Network Open, 6 0 (10): 0 e2336100--e2336100, 2023

  43. [52]

    Irvin, P

    J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. Che X pert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590--597, 2019

  44. [53]

    A. Iyer, G. Sen, and P. \"O stlin. The intersections of gender and class in health status and health care. Global public health, 3 0 (S1): 0 13--24, 2008

  45. [54]

    S. E. Jackson, R. A. Hackett, and A. Steptoe. Associations between age discrimination and health and wellbeing: cross-sectional and prospective analysis of the english longitudinal study of ageing. The Lancet Public Health, 4 0 (4): 0 e200--e208, 2019

  46. [55]

    P. B. Jensen, L. J. Jensen, and S. Brunak. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics, 13(6):395--405, 2012

  47. [56]

    D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  48. [57]

    Q. Jin, Y. Yang, Q. Chen, and Z. Lu. GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics, 40(2):btae075, 2024

  49. [58]

    A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1--9, 2016

  50. [59]

    A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019a

  51. [60]

    A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019b

  52. [61]

    Z. Kanjee, B. Crowe, and A. Rodman. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA, 330(1):78--80, 2023

  53. [62]

    J. A. Kent, V. Patel, and N. A. Varela. Gender disparities in health care. Mount Sinai Journal of Medicine: A Journal of Translational and Personalized Medicine, 79(5):555--559, 2012

  54. [63]

    I. Klerings, A. S. Weinhandl, and K. J. Thaler. Information overload in healthcare: too much of a good thing? Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen, 109(4-5):285--290, 2015

  55. [64]

    R. Kouzy, J. Abi Jaoude, A. Kraitem, M. B. El Alam, B. Karam, E. Adib, J. Zarka, C. Traboulsi, E. W. Akl, and K. Baddour. Coronavirus goes viral: quantifying the COVID-19 misinformation epidemic on Twitter. Cureus, 12(3), 2020

  56. [65]

    S. Laber, S. Forcisi, L. Bentley, J. Petzold, F. Moritz, K. S. Smirnov, L. Al Sadat, I. Williamson, S. Strobel, T. Agnew, et al. Linking the FTO obesity rs1421085 variant circuitry to cellular, metabolic, and organismal phenotypes in vivo. Science Advances, 7(30):eabg0108, 2021

  57. [66]

    T. Le Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. BLOOM: A 176B-parameter open-access multilingual language model. 2022

  58. [67]

    G. Leifman, A. Aides, T. Golany, D. Freedman, and E. Rivlin. Pixel-accurate segmentation of surgical tools based on bounding box annotations. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 5096--5103. IEEE, 2022

  59. [69]

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024

  60. [70]

    Y. Li, R. M. Wehbe, F. S. Ahmad, H. Wang, and Y. Luo. A comparative study of pretrained language models for long clinical text. Journal of the American Medical Informatics Association, 30(2):340--347, 2023

  61. [71]

    B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, and X.-M. Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650--1654. IEEE, 2021

  62. [72]

    M. Liu, Y. Ning, S. Teixayavong, M. Mertens, J. Xu, D. S. W. Ting, L. T.-E. Cheng, J. C. L. Ong, Z. L. Teo, T. F. Tan, et al. A translational perspective towards clinical AI fairness. NPJ Digital Medicine, 6(1):172, 2023

  63. [73]

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157--173, 2024

  64. [74]

    R. J. Loos and G. S. Yeo. The genetics of obesity: from discovery to biology. Nature Reviews Genetics, 23(2):120--133, 2022

  65. [75]

    N. López and V. L. Gadsden. Health inequities, social determinants, and intersectionality. In Perspectives on Health Equity and Social Determinants of Health. National Academies Press (US), 2017

  66. [77]

    R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T.-Y. Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409, 2022

  67. [81]

    J. Medina-Martínez, C. Saus-Ortega, M. M. Sánchez-Lorente, E. M. Sosa-Palanca, P. García-Martínez, and M. I. Mármol-López. Health inequities in LGBT people and nursing interventions to reduce them: A systematic review. International Journal of Environmental Research and Public Health, 18(22):11801, 2021

  68. [82]

    Meta. Papers with Code - Medical, 2024. URL https://paperswithcode.com/area/medical. Accessed: 2024-04-26

  69. [83]

    M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, and P. Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259--265, 2023a

  70. [84]

    M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y. Dalmia, J. Leskovec, C. Zakka, E. P. Reis, and P. Rajpurkar. Med-Flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353--367. PMLR, 2023b

  71. [85]

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  72. [87]

    A. Novin and E. Meyers. Making sense of conflicting science information: Exploring bias in the search engine result page. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, pages 175--184, 2017

  73. [88]

    C. I. Nwoye, D. Mutter, J. Marescaux, and N. Padoy. Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos. International Journal of Computer Assisted Radiology and Surgery, 14:1059--1067, 2019

  74. [89]

    Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447--453, 2019

  75. [90]

    J. Oh, G. Lee, S. Bae, J.-m. Kwon, and E. Choi. ECG-QA: A comprehensive question answering dataset combined with electrocardiogram. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 66277--66288. Curran Associates, Inc., 2023. URL https://proceedings.neurips...

  76. [91]

    J. A. Omiye, J. C. Lester, S. Spichak, V. Rotemberg, and R. Daneshjou. Large language models propagate race-based medicine. NPJ Digital Medicine, 6(1):195, 2023

  77. [92]

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  78. [93]

    A. G. Pacheco, G. R. Lima, A. S. Salomao, B. Krohling, I. P. Biral, G. G. de Angelo, F. C. Alves Jr, J. G. Esgario, A. C. Simora, P. B. Castro, et al. PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data in Brief, 32:106221, 2020

  79. [94]

    M. Parmar, A. Naik, H. Gupta, D. Agrawal, and C. Baral. LongBoX: Evaluating transformers on long-sequence clinical tasks, 2023

  80. [95]

    O. Pelka, S. Koitka, J. Rückert, F. Nensa, and C. M. Friedrich. Radiology objects in context (ROCO): a multimodal image dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 201...

Showing first 80 references.