MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
and Rudie, Jeffrey D
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3roles
dataset 1polarities
use dataset 1representative citing papers
Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
Patient identity and clinical features predict brain tumor segmentation accuracy more strongly than model choice, with localized spatial biases consistent across models and no formal fairness guarantees in any.
citing papers explorer
-
MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
-
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
-
Fairboard: a quantitative framework for equity assessment of healthcare models
Patient identity and clinical features predict brain tumor segmentation accuracy more strongly than model choice, with localized spatial biases consistent across models and no formal fairness guarantees in any.