RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

2026 · cs.CV · arXiv 2605.15561

abstract

Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.

representative citing papers

RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

RoiMAM integrates a training-free ROI Generation Module with Semantic Selective Suppression and a Text Prompt Enhancer to produce a compact VLM that reports accuracy gains of approximately 2 percent on SLAKE and 4.6 percent on PMC-VQA at less than 20 percent of the size of MedVInT-TD.
