MedGUIDE: Benchmarking clinical decision-making in large language models
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion
MuteBench evaluates the robustness of multimodal fusion to missing modalities and to within-modality missingness on 125,000 samples from 9 clinical datasets, finding that architecture family predicts tolerance better than parameter count.
- LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment
Scoping review of 134 studies on LLM-as-a-Judge in healthcare finds concentration in clinical decision support and NLP, frequent use of OpenAI models with prompt engineering, and moderate-to-strong human alignment where validated.