Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
Are vision language models ready for clinical diagno- sis? a 3d medical benchmark for tumor-centric visual question answering
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 4years
2026 4representative citing papers
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
CT-SpatialVQA benchmark reveals that eight 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks from CT data, frequently below random performance.
Medical image parsing is proposed as the central output for the field instead of masks, with an audit showing that none of eleven representative systems produces a well-formed parse containing attributes, relationships, and closure.
citing papers explorer
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
Beyond Masks: The Case for Medical Image Parsing
Medical image parsing is proposed as the central output for the field instead of masks, with an audit showing that none of eleven representative systems produces a well-formed parse containing attributes, relationships, and closure.