MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
arXiv preprint arXiv:2505.24173 , year=
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
citing papers explorer
-
MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
- SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?