NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
hub Canonical reference
arXiv preprint arXiv:2404.00578 (2024)
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
years
2026 18representative citing papers
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
Agentic LLMs autonomously execute complex neuro-radiological workflows like glioma segmentation and multi-timepoint response assessment by directing off-the-shelf tools, without any model training.
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.
CA-GCL adds global contrastive separation and clinical text augmentation to fine-grained vision-language pretraining, reducing textual embedding collapse and prompt variance in 3D medical image tasks.
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
MedScribe reformulates CT radiology reporting as an agentic evidence-acquisition workflow using LLM-invoked diagnostic tools and pathology-aligned retrieval, yielding higher clinical accuracy and consistency than standard VLMs on CT-RATE and RadChestCT.
RadAgent generates stepwise, tool-augmented chest CT reports with traceable decisions, improving accuracy, robustness, and adding a 37% faithfulness score absent in standard 3D VLMs.
Mean pooling and multi-window RGB encoding optimize vision-language performance on CT enterography, with retrieval-augmented generation substantially improving automated report severity accuracy over fine-tuning alone.
Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visual question answering.
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
LLaBIT is a single instruction-finetuned LLM that performs report generation, VQA, segmentation, and translation on brain MRI images while outperforming task-specific models.
A multi-center whole-body FDG PET/CT foundation model with early fusion and masked autoencoding pretraining achieves label-efficient tumor segmentation on downstream tasks.
TIF-GRPO uses integral feedback on pseudo-temporal trajectories to regulate anatomy-aware rewards in RL for clinical faithfulness in volumetric CT analysis.
M3Net achieves state-of-the-art accuracies of 86.96% on LIDC-IDRI and 84.24% on USTC-FHLN for pulmonary nodule classification using a hierarchical multi-scale 3D network with cross-scale consistency.
citing papers explorer
-
NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
-
DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
-
Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis
Agentic LLMs autonomously execute complex neuro-radiological workflows like glioma segmentation and multi-timepoint response assessment by directing off-the-shelf tools, without any model training.
-
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
-
Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.
-
CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding
CA-GCL adds global contrastive separation and clinical text augmentation to fine-grained vision-language pretraining, reducing textual embedding collapse and prompt variance in 3D medical image tasks.
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
-
MedScribe: Clinically Grounded CT Reporting through Agentic Workflows
MedScribe reformulates CT radiology reporting as an agentic evidence-acquisition workflow using LLM-invoked diagnostic tools and pathology-aligned retrieval, yielding higher clinical accuracy and consistency than standard VLMs on CT-RATE and RadChestCT.
-
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography
RadAgent generates stepwise, tool-augmented chest CT reports with traceable decisions, improving accuracy, robustness, and adding a 37% faithfulness score absent in standard 3D VLMs.
-
Representation geometry shapes task performance in vision-language modeling for CT enterography
Mean pooling and multi-window RGB encoding optimize vision-language performance on CT enterography, with retrieval-augmented generation substantially improving automated report severity accuracy over fine-tuning alone.
-
Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visual question answering.
-
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
-
Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks
LLaBIT is a single instruction-finetuned LLM that performs report generation, VQA, segmentation, and translation on brain MRI images while outperforming task-specific models.
-
An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation
A multi-center whole-body FDG PET/CT foundation model with early fusion and masked autoencoding pretraining achieves label-efficient tumor segmentation on downstream tasks.
-
Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis
TIF-GRPO uses integral feedback on pseudo-temporal trajectories to regulate anatomy-aware rewards in RL for clinical faithfulness in volumetric CT analysis.
-
M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification
M3Net achieves state-of-the-art accuracies of 86.96% on LIDC-IDRI and 84.24% on USTC-FHLN for pulmonary nodule classification using a hierarchical multi-scale 3D network with cross-scale consistency.