Attention Sink
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5
roles
background 1
polarities
background 1
representative citing papers
Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.
Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.
citing papers explorer
- From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
  AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.
- When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models
  Mechanistic analysis of GLMs shows graph sink tokens have high activation but low importance for predictions, indicating decoupling between saliency and graph-semantic utility.
- The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
  Centroid erasure shows language representations overshadow vision in multimodal models, and text-centroid contrastive decoding recovers substantial accuracy on visual reasoning tasks.
- InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation
  InCoM achieves 23-28% higher success rates in mobile manipulation tasks by inferring motion intent for adaptive perception and decoupling base-arm action generation.
- When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models