Alistair Johnson, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6
citing papers explorer
- MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
- MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
MILM fine-tunes LLMs on XML-encoded multimodal irregular time series via a two-stage process that exploits informative sampling patterns to achieve top performance on EHR classification datasets.
- Agentic clinical reasoning over longitudinal myeloma records: a retrospective evaluation against expert consensus
An agentic LLM reasoning system reached 79.6% agreement with expert consensus on myeloma care questions from longitudinal records, outperforming iterative RAG and full-context baselines by 3.8-4.2 points with larger gains on complex cases.
- RDMA: Cost Effective Agent-Driven Rare Disease Mining from Electronic Health Records
RDMA equips small LLMs with abbreviation resolution, phenotype reasoning, and ontology tools to mine rare diseases from EHR notes, outperforming fine-tuned and RAG baselines at up to 10x lower inference cost.
- MedMIX: Modality-Internal Expert Fusion for Multimodal Medical Diagnosis
MedMIX combines intra-modality expert fusion, learned inter-modality fusion, and training-only large-small collaboration to deliver robust multimodal medical prediction under incomplete modalities across three benchmarks.
- AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.