Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi · 2025 · cs.CV · arXiv 2503.02597

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem a MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance in 12 multimodal understanding benchmarks (+6.2% on average across 3 LLMs backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.

representative citing papers

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

cs.CV · 2026-04-17 · unverdicted · novelty 5.0

An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

cs.CV · 2026-05-18

citing papers explorer

Showing 2 of 2 citing papers.

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow cs.CV · 2026-04-17 · unverdicted · none · ref 39 · internal anchor
An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
RAVE: Re-Allocating Visual Attention in Large Multimodal Models cs.CV · 2026-05-18 · unreviewed · ref 10 · internal anchor

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

fields

years

verdicts

representative citing papers

citing papers explorer