and Rudie, Jeffrey D
5 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (5)
roles: dataset (1)
polarities: use dataset (1)
representative citing papers
Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
Patient identity and clinical features predict brain tumor segmentation accuracy more strongly than model choice, with localized spatial biases consistent across models and no formal fairness guarantees in any.
Radiomics TabPFN matches or outperforms image foundation models for IDH prediction in glioma MRI, with results sensitive to cohort shifts and representation type.
Empirical comparison of graded MRI preprocessing levels for MAE and JEPA pretraining on brain scans shows moderate levels (P2) are often sufficient, with limited additional utility from stronger preprocessing on downstream tasks.
citing papers explorer
- MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
  MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
- UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
  Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
- Fairboard: a quantitative framework for equity assessment of healthcare models
  Patient identity and clinical features predict brain tumor segmentation accuracy more strongly than model choice, with localized spatial biases consistent across models and no formal fairness guarantees in any.
- A Benchmark of (MRI-) Foundation Models to Predict IDH Mutational Status in Glioma
  Radiomics TabPFN matches or outperforms image foundation models for IDH prediction in glioma MRI, with results sensitive to cohort shifts and representation type.
- How Much MRI Preprocessing Is Enough? A Cost-Utility Study for Brain MRI Foundation Models
  Empirical comparison of graded MRI preprocessing levels for MAE and JEPA pretraining on brain scans shows moderate levels (P2) are often sufficient, with limited additional utility from stronger preprocessing on downstream tasks.