To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

Abbas Haider; Chi-Ho Chan; Erfan Loweimi; Guanfeng Wu; Hui Wang; Josef Kittler; Kate Knill; Mark Gales; Mengjie Qian; Muhammad Awan

arXiv: 2606.05931 · v1 · pith:M4R5HCMMnew · submitted 2026-06-04 · cs.CL · cs.AI · cs.CV · cs.IR · cs.LG · cs.MM · eess.AS



keywords: modality, active, system, when, absent, broadcast, cross-modal, detection


When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both seen and heard. Fusing scores from an absent modality injects noise, degrading precision below that of the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (over 12,000 broadcast videos), the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), and recovering 64% of the gap between fixed fusion and an oracle with ground-truth modality labels (96.6%).
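The cross-modal consistency idea can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract describes trained classifiers over cross-modal features, whereas the code below substitutes a simple threshold rule. The function names (`zscore`, `consistency_features`, `adaptive_scores`), the top-`k` averaging, the `threshold` value, and the fallback that picks the modality with the most pronounced score peak are all assumptions made for illustration.

```python
import numpy as np

def zscore(x):
    """Normalise one query's scores over the archive (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-9)

def consistency_features(s_z, f_z, k=3):
    """Mean score in one modality over the top-k files retrieved by the other."""
    top_speaker = np.argsort(-s_z)[:k]   # files ranked highest by voice
    top_face = np.argsort(-f_z)[:k]      # files ranked highest by face
    face_on_speaker_top = f_z[top_speaker].mean()
    speaker_on_face_top = s_z[top_face].mean()
    return face_on_speaker_top, speaker_on_face_top

def adaptive_scores(speaker_scores, face_scores, k=3, threshold=1.0):
    """Fuse when the modalities agree, otherwise fall back to one modality.

    Assumption: a fixed threshold stands in for the trained classifier,
    and the fallback keeps the modality whose best score stands out most.
    """
    s_z, f_z = zscore(speaker_scores), zscore(face_scores)
    f_on_s, s_on_f = consistency_features(s_z, f_z, k)
    if min(f_on_s, s_on_f) >= threshold:
        return "both", 0.5 * (s_z + f_z)   # equal-weight score fusion
    if s_z.max() >= f_z.max():
        return "speaker", s_z
    return "face", f_z
```

Called with two per-file score vectors for a single query, `adaptive_scores` returns the detected mode and the scores to rank by. A real system would replace the threshold rule with a classifier trained on such features.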
