ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.
arXiv preprint arXiv:2510.10052 (2025)
7 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (7)
years: 2026 (7)
roles: background (3)
polarities: background (3)
citing papers explorer
- Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning
  ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.
- CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs
  Introduces CORTEX benchmark supplying 76,177 validated four-stage diagnostic reasoning traces for open/closed VQA and report generation on chest CT to enable traceable MLLM supervision and evaluation.
- Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
  Chain-of-thought underperforms direct answering in medical VQA due to a perception bottleneck, but ROI cues and textual grounding interventions can improve results and reverse the gap.
- Grounded 3D-Aware Spatial Vision-Language Modeling
  GR3D is a VLM that combines explicit 2D, implicit 2D, and monocular 3D grounding mechanisms to improve performance on spatial understanding benchmarks.
- LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
  LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
- MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
  MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five other benchmarks.
- MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution