WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

Modality following is the ability to selectively leverage multimodal contexts based on user instructions. It is fundamental to the safety and reliability of multimodal large language models (MLLMs) in real-world deployments. However, the internal mechanisms governing this decision-making process remain largely unexplored. In this work, we investigate the mechanism underlying modality following from an information-flow perspective. Our findings reveal that instruction tokens serve as structural anchors for modality arbitration: shallow attention layers perform undifferentiated information transfer, aggregating multimodal cues into the instruction tokens, which act as a latent buffer; in contrast, deep attention layers selectively strengthen the instruction-compliant subspace and resolve modality arbitration according to the instruction-specified intent, with a sparse subset of attention heads driving this process. Targeted attention-head interventions further validate the functional specificity of these heads: blocking only $5\%$ of the identified heads substantially degrades modality following while preserving general visual and language capabilities, whereas targeted amplification can restore up to approximately $60\%$ of failed modality-following samples. Together, these results provide a mechanistic account of modality following and inform future efforts to improve how MLLMs integrate and utilize multimodal evidence under user instructions.

years

2026 5

verdicts

UNVERDICTED 5

roles

background 2

polarities

background 2
