VISA ranks 2nd in the Interspeech 2026 ARC Agent Track by adding multi-modal feature extraction, consistency-checked model voting, and rubric-aligned routing to large audio language models, reaching 66.23% Rubrics score and 77.40% accuracy.
Rubrics” denotes the official CoT quality score, and “Acc
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
eess.AS 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track
VISA ranks 2nd in the Interspeech 2026 ARC Agent Track by adding multi-modal feature extraction, consistency-checked model voting, and rubric-aligned routing to large audio language models, reaching 66.23% Rubrics score and 77.40% accuracy.