AVLLMs store integrated audio-visual information mainly in a distinct subset of sink tokens called cross-modal sink tokens, which can be leveraged for training-free hallucination mitigation.
While the original VCD applies noise solely to the image modality, we extend this approach to the audio-visual domain
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Probing Cross-modal Information Hubs in Audio-Visual LLMs
AVLLMs store integrated audio-visual information mainly in a distinct subset of sink tokens called cross-modal sink tokens, which can be leveraged for training-free hallucination mitigation.