To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

2026 · cs.CL · arXiv 2606.05931

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below that of the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (over 12,000 broadcast videos), the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), and recovering 64% of the gap between fixed fusion and an oracle with ground-truth modality labels (96.6%).
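
The cross-modal consistency idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract mentions trained classifiers over cross-modal features, whereas the sketch below swaps in a hand-set threshold rule. The function names, the "lift" feature, the peak-margin fallback, the values of `k` and `threshold`, and the toy scores are all assumptions made for illustration.

```python
from statistics import mean


def top_k(scores, k):
    """Indices of the k highest-scoring files."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def cross_modal_lift(rank_by, score_on, k):
    """Mean `score_on` over the top-k files ranked by `rank_by`,
    minus the archive-wide mean of `score_on` (hypothetical feature)."""
    top = top_k(rank_by, k)
    return mean(score_on[i] for i in top) - mean(score_on)


def peak_margin(scores):
    """How far the best score stands above the average (hypothetical feature)."""
    return max(scores) - mean(scores)


def adaptive_scores(speaker, face, k=2, threshold=0.2):
    """Pick fused or unimodal scores for one query.

    Assumed rule, standing in for the paper's classifiers: treat both
    modalities as active only if each modality's top-k files also score
    well above average on the other modality.
    """
    both_active = (
        cross_modal_lift(speaker, face, k) >= threshold
        and cross_modal_lift(face, speaker, k) >= threshold
    )
    if both_active:
        # Simple average fusion; the source does not specify the fusion rule.
        return "both", [(s + f) / 2 for s, f in zip(speaker, face)]
    # Otherwise fall back to the modality with the more pronounced peak.
    if peak_margin(speaker) >= peak_margin(face):
        return "speaker", list(speaker)
    return "face", list(face)


# Toy per-file similarity scores for one query over six archive files.
strong = [0.9, 0.8, 0.1, 0.2, 0.1, 0.15]      # target clearly present
strong_b = [0.85, 0.9, 0.2, 0.1, 0.15, 0.1]   # agrees with `strong`
noise = [0.2, 0.25, 0.3, 0.22, 0.28, 0.24]    # modality absent

seen_and_heard = adaptive_scores(strong, strong_b)
heard_unseen = adaptive_scores(strong, noise)
seen_unheard = adaptive_scores(noise, strong)
print(seen_and_heard[0], heard_unseen[0], seen_unheard[0])
```

When the two score lists agree on the top files, the lift is large in both directions and the scores are fused; when one list is noise, the lift collapses and the sketch falls back to the modality with the clearer peak, so the absent modality does not contribute to the ranking.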

representative citing papers

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Query-adaptive audio-visual person retrieval detects active modalities via cross-modal score consistency, achieving 94.2% P@1 on BBC Rewind corpus and outperforming unimodal and fixed-fusion baselines.
