Rethinking Causal Mask Attention for Vision-Language Inference

Chang Xu; Tao Huang; Xiaohuan Pei; Yanxiang Ma

arxiv: 2505.18605 · v1 · pith:EWGQECJFnew · submitted 2025-05-24 · 💻 cs.CV · cs.AI

classification 💻 cs.CV cs.AI

keywords causal · future · inference · vision-language · attention · context · masking · representations

Causal attention has become a foundational mechanism in autoregressive vision-language models (VLMs), unifying textual and visual inputs under a single generative framework. However, existing causal mask-based strategies are inherited from large language models (LLMs) where they are tailored for text-only decoding, and their adaptation to vision tokens is insufficiently addressed in the prefill stage. Strictly masking future positions for vision queries introduces overly rigid constraints, which hinder the model's ability to leverage future context that often contains essential semantic cues for accurate inference. In this work, we empirically investigate how different causal masking strategies affect vision-language inference and then propose a family of future-aware attentions tailored for this setting. We first empirically analyze the effect of previewing future tokens for vision queries and demonstrate that rigid masking undermines the model's capacity to capture useful contextual semantic representations. Based on these findings, we propose a lightweight attention family that aggregates future visual context into past representations via pooling, effectively preserving the autoregressive structure while enhancing cross-token dependencies. We evaluate a range of causal masks across diverse vision-language inference settings and show that selectively compressing future semantic context into past representations benefits the inference.
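The abstract contrasts strict causal masking with masks that let vision queries use future context, and describes pooling future visual context into past representations. The following is a minimal NumPy sketch of those three ideas, not the authors' implementation: the function names (`causal_mask`, `future_aware_mask`, `pool_future_into_past`), the choice of mean pooling, and the equal 0.5 mixing weight are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def causal_mask(n):
    """Standard causal mask: query i may attend to keys 0..i (True = visible)."""
    return np.tril(np.ones((n, n), dtype=bool))


def future_aware_mask(is_vision):
    """Causal mask relaxed so vision queries also see future vision keys.

    Illustrative assumption: only vision-to-vision attention is opened up;
    text queries stay strictly causal.
    """
    is_vision = np.asarray(is_vision, dtype=bool)
    mask = causal_mask(is_vision.shape[0])
    mask |= is_vision[:, None] & is_vision[None, :]
    return mask


def pool_future_into_past(x, is_vision):
    """Mix the mean of later vision tokens into each earlier vision token.

    Illustrative assumption: mean pooling with an equal 0.5 mixing weight.
    The output can then be fed to ordinary causal attention, so the
    autoregressive mask itself is left unchanged.
    """
    x = np.asarray(x, dtype=float)
    is_vision = np.asarray(is_vision, dtype=bool)
    out = x.copy()
    vision_idx = np.flatnonzero(is_vision)
    for k, i in enumerate(vision_idx):
        future = vision_idx[k + 1:]
        if future.size:  # the last vision token has no future vision context
            out[i] = 0.5 * (x[i] + x[future].mean(axis=0))
    return out


if __name__ == "__main__":
    # Hypothetical prefill sequence: three vision tokens followed by two text tokens.
    is_vision = [True, True, True, False, False]
    print(future_aware_mask(is_vision).astype(int))
    print(pool_future_into_past(np.arange(10.0).reshape(5, 2), is_vision))
```

In this sketch the pooling variant leaves the causal mask untouched and only changes the token representations, which is one way to read the abstract's claim that the autoregressive structure is preserved.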

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Differentiable Efficient Operator Search
cs.LG 2026-06 unverdicted novelty 7.0

Introduces Efficient Operator Search, a differentiable framework that jointly optimizes token reduction locations, retention budgets, and operator behaviors in multimodal models under cost constraints, recovering manu...

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
cs.CV 2026-03 unverdicted novelty 6.0

PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.