ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Anhao Zhao; Hao Wu; Wenjie Liu; Xiaoyu Shen; Xin Qiu; Xudong Wang; Yihan Zhang; Yingqi Fan; Yunpu Ma

arxiv: 2602.07574 · v2 · pith:NG6ZB7LZnew · submitted 2026-02-07 · 💻 cs.CV · cs.CL

classification 💻 cs.CV cs.CL

keywords vica, visual, cross-attention, inference, layers, multimodal, language, llms

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
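The data flow described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea of text tokens passing through every layer while visual embeddings stay fixed and are read through cross-attention at a few selected layers. All names here (`ToyLayer`, `toy_vica_forward`, `cross_layers`), the single-head attention without normalization, and the placement of cross-attention between self-attention and the feed-forward block are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is True where attending is allowed.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

class ToyLayer:
    """Hypothetical single-head layer; only text states are ever updated."""
    def __init__(self, d, rng, use_cross_attn):
        s = 1.0 / np.sqrt(d)
        self.wq, self.wk, self.wv = (rng.normal(0, s, (d, d)) for _ in range(3))
        self.w1 = rng.normal(0, s, (d, 4 * d))
        self.w2 = rng.normal(0, s / 2, (4 * d, d))
        self.use_cross_attn = use_cross_attn
        if use_cross_attn:
            self.cq, self.ck, self.cv = (rng.normal(0, s, (d, d)) for _ in range(3))

    def __call__(self, text, visual):
        n = text.shape[0]
        causal = np.tril(np.ones((n, n), dtype=bool))
        # Text-only causal self-attention: visual tokens are not in the sequence.
        text = text + attention(text @ self.wq, text @ self.wk, text @ self.wv, causal)
        if self.use_cross_attn:
            # Text queries read from the fixed visual embeddings (assumed placement).
            text = text + attention(text @ self.cq, visual @ self.ck, visual @ self.cv)
        # Feed-forward block applied to text tokens only.
        return text + np.maximum(text @ self.w1, 0) @ self.w2

def toy_vica_forward(text, visual, layers):
    # The visual embeddings are passed through unchanged at every layer.
    for layer in layers:
        text = layer(text, visual)
    return text

rng = np.random.default_rng(0)
d, n_text, n_vis, n_layers = 16, 5, 40, 8
cross_layers = {2, 5}  # illustrative choice, not taken from the paper
layers = [ToyLayer(d, rng, i in cross_layers) for i in range(n_layers)]
text = rng.normal(size=(n_text, d))
visual = rng.normal(size=(n_vis, d))
out = toy_vica_forward(text, visual, layers)
```

In this sketch the self-attention and feed-forward cost per layer depends only on the number of text tokens; the visual tokens contribute cost only at the layers listed in `cross_layers`, which is the kind of saving the abstract attributes to sparse cross-attention.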

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Recognition to Understanding: Unlocking Cognitive Time Series Reasoning with LLMs
cs.CL 2026-06 unverdicted novelty 7.0

Introduces the TSCognition benchmark for cognitive time series reasoning tasks and the TSAlign alignment framework, reporting outperformance over LLM, VLM, and time-series baselines on TSCognition and TimerBed with lo...

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference
cs.CV 2026-06 unverdicted novelty 5.0

ViCoStream is a new coordinated pipeline framework for streaming VideoLLMs that achieves 134 FPS video throughput and less than 50 ms TTFT on A100 while keeping accuracy near full-history baselines.