ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
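The data flow described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see their repository for that). The function names, the identity query/key/value projections, the residual add, the `tanh` stand-in for the text-side Transformer block, and the choice of cross-attention layers are all assumptions made for illustration. The only property it aims to show is that visual embeddings stay fixed and are read by text tokens solely through cross-attention at a few selected layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text, visual):
    """Text tokens (queries) attend over visual tokens (keys and values).

    Assumption: identity projections and a single head, to keep the sketch short.
    """
    dim = len(text[0])
    out = []
    for q in text:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in visual]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, visual)) for j in range(dim)])
    return out


def text_only_block(text):
    """Hypothetical placeholder for self-attention + FFN applied to text tokens only."""
    return [[math.tanh(x) for x in tok] for tok in text]


def vica_forward(text, visual, num_layers, cross_layers):
    """Run text through all layers; visual tokens are never updated.

    cross_layers: set of layer indices where text reads from the visual embeddings.
    """
    hidden = [list(tok) for tok in text]
    for layer in range(num_layers):
        if layer in cross_layers:
            attended = cross_attention(hidden, visual)
            # Residual add of the visual context (an assumption of this sketch).
            hidden = [[h + a for h, a in zip(ht, at)] for ht, at in zip(hidden, attended)]
        hidden = text_only_block(hidden)
    return hidden


text = [[0.1, 0.2], [0.3, -0.1]]                  # 2 text tokens, dim 2
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # 3 projected visual tokens
output = vica_forward(text, visual, num_layers=4, cross_layers={1, 3})
```

In this sketch the per-layer cost on the visual side is zero except at the layers in `cross_layers`, where it scales with the number of text tokens times the number of visual tokens rather than with the square of the full sequence length.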
Forward citations
Cited by 2 Pith papers
- From Recognition to Understanding: Unlocking Cognitive Time Series Reasoning with LLMs
Introduces the TSCognition benchmark for cognitive time series reasoning tasks and the TSAlign alignment framework, reporting outperformance over LLM, VLM, and time-series baselines on TSCognition and TimerBed with lo...
- ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference
ViCoStream is a coordinated pipeline framework for streaming VideoLLMs that achieves 134 FPS video throughput and a time to first token (TTFT) under 50 ms on an A100, while keeping accuracy near full-history baselines.