Capabilities of Gemini Models in Medicine
Pith reviewed 2026-05-15 17:08 UTC · model grok-4.3
The pith
Med-Gemini models reach 91.1 percent accuracy on the MedQA (USMLE) benchmark and surpass the GPT-4 family on every directly comparable medical benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Med-Gemini models, specialized for medicine with seamless web search and custom encoders, establish new state-of-the-art performance on 10 of 14 medical benchmarks. The best model achieves 91.1 percent accuracy on the MedQA (USMLE) benchmark through a novel uncertainty-guided search strategy, surpasses the GPT-4 model family on every benchmark with a direct comparison, and improves over GPT-4V by an average relative margin of 44.5 percent across seven multimodal benchmarks, including the NEJM Image Challenges and the health subset of MMMU. The models further demonstrate long-context strength by achieving state-of-the-art results on needle-in-a-haystack retrieval from long de-identified health records and on medical video question answering, surpassing prior bespoke methods using only in-context learning.
What carries the argument
The Med-Gemini family of multimodal models that add medical specialization to Gemini's core strengths via web search access and custom encoders for novel modalities.
If this is right
- AI systems could match or exceed human performance on medical text summarization and multimodal image interpretation tasks.
- Uncertainty-guided search offers a practical way to raise accuracy on medical question answering without additional training.
- Long-context capabilities enable effective in-context use of full patient histories and video data for research and education.
- The same specialization pattern could be applied to other high-stakes domains that need up-to-date knowledge and multimodal reasoning.
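The uncertainty-guided search strategy is named but not spelled out here. A minimal sketch of one plausible reading (the paper's actual procedure may differ): sample several reasoning chains, measure disagreement as vote entropy over the sampled answers, and fall back to web-search-augmented prompting only when the model is uncertain. `sample_fn` and `search_fn` are hypothetical stand-ins for a model call and a search call.

```python
import math
from collections import Counter

def vote_entropy(answers):
    """Shannon entropy (bits) of the answer distribution across samples."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def uncertainty_guided_answer(question, sample_fn, search_fn,
                              n_samples=5, entropy_threshold=0.5):
    """Sample several reasoning chains; if the answers disagree
    (high entropy), retry with search results placed in context."""
    answers = [sample_fn(question) for _ in range(n_samples)]
    if vote_entropy(answers) <= entropy_threshold:
        # Confident: return the majority-vote answer directly.
        return Counter(answers).most_common(1)[0][0]
    # Uncertain: fetch external evidence and re-sample with it in context.
    evidence = search_fn(question)
    refined = [sample_fn(f"{question}\n\nSearch results:\n{evidence}")
               for _ in range(n_samples)]
    return Counter(refined).most_common(1)[0][0]
```

The design point this captures is that search is invoked conditionally, so the extra cost and the risk of search-result noise are incurred only on questions where the model's own samples already disagree.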
Where Pith is reading between the lines
- Real-world deployment would still require separate safety testing on noisy clinical data, since benchmark gains do not automatically guarantee clinical reliability.
- The 44.5 percent multimodal margin suggests similar gains may appear in other visual-heavy medical workflows such as radiology or pathology.
- Because the models rely on web search, their outputs could be kept current with new guidelines more easily than static fine-tuned systems.
- Strong needle-in-a-haystack retrieval performance opens the possibility of using the models to surface relevant prior cases from large hospital archives.
Load-bearing premise
High accuracy on curated medical benchmarks will translate to reliable and safe performance when the models encounter noisy, incomplete, or out-of-distribution patient data in actual clinical settings.
What would settle it
A side-by-side comparison of Med-Gemini outputs against board-certified physicians on a held-out set of real de-identified hospital cases, measuring diagnostic accuracy and rate of unsafe recommendations.
Original abstract
Excellence in a wide variety of medical applications poses considerable challenges for AI, requiring advanced reasoning, access to up-to-date medical knowledge and understanding of complex multimodal data. Gemini models, with strong general capabilities in multimodal and long-context reasoning, offer exciting possibilities in medicine. Building on these core strengths of Gemini, we introduce Med-Gemini, a family of highly capable multimodal models that are specialized in medicine with the ability to seamlessly use web search, and that can be efficiently tailored to novel modalities using custom encoders. We evaluate Med-Gemini on 14 medical benchmarks, establishing new state-of-the-art (SoTA) performance on 10 of them, and surpass the GPT-4 model family on every benchmark where a direct comparison is viable, often by a wide margin. On the popular MedQA (USMLE) benchmark, our best-performing Med-Gemini model achieves SoTA performance of 91.1% accuracy, using a novel uncertainty-guided search strategy. On 7 multimodal benchmarks including NEJM Image Challenges and MMMU (health & medicine), Med-Gemini improves over GPT-4V by an average relative margin of 44.5%. We demonstrate the effectiveness of Med-Gemini's long-context capabilities through SoTA performance on a needle-in-a-haystack retrieval task from long de-identified health records and medical video question answering, surpassing prior bespoke methods using only in-context learning. Finally, Med-Gemini's performance suggests real-world utility by surpassing human experts on tasks such as medical text summarization, alongside demonstrations of promising potential for multimodal medical dialogue, medical research and education. Taken together, our results offer compelling evidence for Med-Gemini's potential, although further rigorous evaluation will be crucial before real-world deployment in this safety-critical domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Med-Gemini, a family of multimodal models specialized for medicine by building on Gemini's core strengths in multimodal and long-context reasoning. It reports new state-of-the-art results on 10 of 14 medical benchmarks, including 91.1% accuracy on MedQA (USMLE) via a novel uncertainty-guided search strategy, consistent outperformance over the GPT-4 family on all directly comparable benchmarks, and a 44.5% average relative improvement over GPT-4V across 7 multimodal benchmarks (NEJM Image Challenges, MMMU health subset, etc.). Additional results highlight long-context retrieval from de-identified health records, medical video QA, and surpassing human experts on medical text summarization, while noting the need for further rigorous evaluation before clinical deployment.
Significance. If the benchmark results prove robust, the work establishes a meaningful advance in medical AI capabilities by showing how general multimodal models can be efficiently specialized for high-stakes domains. The combination of web search integration, custom encoders, and in-context long-context handling without bespoke training is a notable strength. These results provide concrete evidence of progress toward AI support for medical reasoning, education, and research, while the paper's explicit caution about real-world deployment aligns with the safety-critical context.
Major comments (3)
- Abstract and results sections: The reported SoTA figures (91.1% on MedQA, 44.5% relative margin on multimodal tasks) are presented without error bars, confidence intervals, or multiple-run statistics. This omission makes it impossible to determine whether the gains over prior methods are statistically significant or sensitive to random seeds, directly affecting the reliability of the central performance claims.
- MedQA evaluation: The uncertainty-guided search strategy is described as key to reaching 91.1% accuracy, yet no ablation studies compare it against standard search or report its contribution in isolation. Without these controls, it remains unclear whether the performance stems from the strategy itself or from other unstated factors such as prompt engineering or data access.
- Multimodal benchmarks section: The average 44.5% relative improvement over GPT-4V is given across 7 tasks, but per-benchmark breakdowns with variance or standard deviations are not provided. This prevents assessment of whether the gains are consistent or driven by a subset of easier tasks, weakening the broad outperformance claim.
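The kind of interval the first comment asks for can be approximated without rerunning the model, by resampling the per-question correctness indicators. A minimal sketch, not the paper's methodology; the function and parameter names are illustrative:

```python
import random

def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy, resampling the
    per-question 0/1 correctness indicators with replacement."""
    rng = random.Random(seed)
    n = len(correct)
    # One bootstrap accuracy per resample, sorted for percentile lookup.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]
```

On a MedQA-sized test set (1,273 questions) at 91.1% accuracy, an interval of this kind spans roughly ±1.6 points, which is the scale against which gaps to prior state of the art would need to be judged.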
Minor comments (2)
- The abstract could explicitly list the exact prior SoTA scores and the specific GPT-4 variants used for each direct comparison to improve transparency.
- Terminology such as 'needle-in-a-haystack retrieval' in the long-context experiments would benefit from a short definition or citation on first use for broader accessibility.
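For the term flagged in the second minor comment: a needle-in-a-haystack probe plants one target fact at a controlled depth in a long distractor context and checks whether the model retrieves it when asked. A minimal illustration of the construction (the helper names and filler text are invented):

```python
import random

def make_haystack_probe(needle, filler_sentences, depth, length, seed=0):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end)
    among `length` randomly chosen filler sentences."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(length)]
    haystack.insert(int(depth * length), needle)
    return " ".join(haystack)

def probe_passed(model_answer, expected):
    """Scoring: did the answer reproduce the planted fact?"""
    return expected.lower() in model_answer.lower()
```

Sweeping `depth` and `length` is what turns this into a retrieval profile over context position and context size, rather than a single pass/fail number.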
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of result presentation and evaluation rigor. We address each major point below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
-
Referee: Abstract and results sections: The reported SoTA figures (91.1% on MedQA, 44.5% relative margin on multimodal tasks) are presented without error bars, confidence intervals, or multiple-run statistics. This omission makes it impossible to determine whether the gains over prior methods are statistically significant or sensitive to random seeds, directly affecting the reliability of the central performance claims.
Authors: We agree that error bars and statistical measures would strengthen the presentation. The reported numbers reflect single-run evaluations on fixed benchmarks, which is standard practice for large multimodal models given the high computational cost of repeated full evaluations. In the revision we will add explicit discussion of this limitation in the results section, note that gains are consistent with prior single-run reports in the literature, and include any available variance from prompt-sensitivity checks performed during development. Revision planned: partial.
-
Referee: MedQA evaluation: The uncertainty-guided search strategy is described as key to reaching 91.1% accuracy, yet no ablation studies compare it against standard search or report its contribution in isolation. Without these controls, it remains unclear whether the performance stems from the strategy itself or from other unstated factors such as prompt engineering or data access.
Authors: We will add a targeted ablation in the revised manuscript (main text or supplementary) that isolates the uncertainty-guided search by comparing it directly against a standard search baseline and a chain-of-thought baseline using the identical Med-Gemini model, prompt template, and retrieval setup. This will quantify the incremental contribution of the uncertainty component. Revision planned: yes.
-
Referee: Multimodal benchmarks section: The average 44.5% relative improvement over GPT-4V is given across 7 tasks, but per-benchmark breakdowns with variance or standard deviations are not provided. This prevents assessment of whether the gains are consistent or driven by a subset of easier tasks, weakening the broad outperformance claim.
Authors: We will expand the multimodal results section to include a full per-benchmark table listing absolute accuracies for both Med-Gemini and GPT-4V on each of the seven tasks, together with the relative improvement for every individual benchmark. Any variance estimates obtainable from the evaluation logs will be reported; otherwise we will note the single-run nature of the evaluation, consistent with our response to the first comment. Revision planned: yes.
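On the aggregation question behind this exchange: an average of per-benchmark relative margins is sensitive to tasks with low baseline scores, which is exactly why the requested per-benchmark table matters. A toy sketch with invented scores:

```python
def relative_margin(ours, baseline):
    """Relative improvement of `ours` over `baseline`, as a fraction."""
    return (ours - baseline) / baseline

def average_relative_margin(pairs):
    """Mean of per-benchmark relative margins, the style of aggregate
    behind a figure like '44.5% average relative improvement'."""
    return sum(relative_margin(o, b) for o, b in pairs) / len(pairs)

# Invented (model, baseline) accuracy pairs: one low-baseline task
# dominates the average even though the other gains are modest.
pairs = [(0.80, 0.75), (0.70, 0.66), (0.45, 0.25)]
```

Here the first two tasks contribute roughly 6-7% each while the low-baseline task contributes 80%, pulling the average to about 31%, so a headline average alone cannot show whether gains are broad or concentrated.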
Circularity Check
No circularity: empirical benchmark results are self-contained measurements
Full rationale
The paper presents Med-Gemini as a family of specialized multimodal models evaluated directly on 14 fixed medical benchmarks (MedQA, NEJM Image Challenges, MMMU health subset, etc.). All reported results—91.1% MedQA accuracy via uncertainty-guided search, 44.5% relative gains over GPT-4V, long-context retrieval, and human-expert comparisons—are obtained by running the models on these curated test sets and recording accuracy or other metrics. No equations, parameter fits, uniqueness theorems, or ansatzes are invoked whose outputs are then relabeled as predictions; the central claims reduce only to the empirical measurements themselves. Self-citations, if present, are limited to prior Gemini work and do not carry the load of the performance numbers. The derivation chain is therefore empty of circular steps.
Forward citations
Cited by 20 Pith papers
-
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
-
BLEG: LLM Functions as Powerful fMRI Graph-Enhancer for Brain Network Analysis
BLEG enhances GNNs for fMRI brain network analysis by prompting LLMs for text augmentation, using cost-effective instruction tuning, and applying alignment losses during joint training.
-
ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
Intent-aware retrieval over assertion-labeled knowledge graphs improves clinical QA accuracy by 22 percentage points on a new MIMIC-IV benchmark that stresses negation, temporality, and attribution.
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment
In a large real-world randomized study, SymptomAI agents achieved higher differential diagnosis accuracy (OR 2.47) than clinicians and showed stronger results with dedicated symptom interviews.
-
SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment
Large real-world deployment found conversational AI agents for everyday symptom assessment more accurate than clinicians and improved by structured interviewing.
-
GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI
GAZE framework with viewer tools and literature retrieval achieves 58.2 mAP@0.3 lesion localization and 34.9% top-1 diagnostic accuracy on 906 rare brain MRI cases in zero-shot setting, with larger gains on rarest pat...
-
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
-
MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
-
LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation
Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.
-
Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine
BMBE separates LLM language handling from a standalone Bayesian diagnostic engine, producing calibrated selective diagnosis, a performance gap over frontier LLMs, and robustness to adversarial inputs.
-
Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
Multimodal LLMs performed similarly across models and better than standard care on diagnostic accuracy and patient safety in a real-world LMIC hospital dataset.
-
Predicting Post-Traumatic Epilepsy from Clinical Records using Large Language Model Embeddings
LLM embeddings from clinical records, fused with tabular data via gradient-boosted trees, predict post-traumatic epilepsy at AUC-ROC 0.892 and AUPRC 0.798.
-
CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs
CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...
-
Measuring the metacognition of AI
Meta-d' and signal detection theory provide quantitative tools to assess metacognitive sensitivity and risk-based regulation in large language models.
-
Medical Reasoning with Large Language Models: A Survey and MR-Bench
LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
-
Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning
Tag-based few-shot selection yields higher precision and stability than random or similarity-based methods when using LLMs to analyze medical incidents.
-
Teaching LLMs Brazilian Healthcare: Injecting Knowledge from Official Clinical Guidelines
A 14B model trained on synthetic data from Brazilian clinical guidelines outperforms larger LLMs on new benchmarks for Brazilian healthcare protocols.