Learning Audio-Visual embedding for Person Verification in the Wild

Honggang Zhang; Peiwen Sun; Pengfei Hu; Shanshan Zhang; Taotao Zhang; Yougen Yuan; Zishan Liu

arxiv: 2209.04093 · v2 · pith:LJ7DJNBKnew · submitted 2022-09-09 · cs.CV · cs.MM · cs.SD · eess.AS

Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu

keywords: audio-visual, embedding, verification, person, pooling, proposed, attentive, first

It has already been observed that audio-visual embeddings are more robust than uni-modal embeddings for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. We then find that a strong correlation exists between modalities during pooling, so we propose joint attentive pooling, which uses cycle consistency to learn the implicit inter-frame weights. Finally, the modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All the proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification.
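
To make the pooling idea in the abstract more concrete, the following is a minimal sketch of plain attentive statistics pooling over a sequence of frame-level features. It is not the authors' weight-enhanced or joint variant, whose details the abstract does not give. The function names, the parameters `W`, `b`, and `v`, and the choice of `tanh` as the nonlinearity are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, W, b, v):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T vectors, each of length D.
    W (H x D), b (H), v (H): hypothetical attention parameters,
    normally learned; here they are passed in explicitly.
    Returns (pooled, alpha), where pooled has length 2 * D
    (weighted mean followed by weighted standard deviation)
    and alpha holds the T attention weights.
    """
    # Scalar attention score per frame: v . tanh(W h_t + b)
    scores = []
    for h in frames:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * z for v_i, z in zip(v, hidden)))

    # Normalise scores across frames so the weights sum to 1
    alpha = softmax(scores)

    dim = len(frames[0])
    # Attention-weighted mean per feature dimension
    mean = [sum(a * h[d] for a, h in zip(alpha, frames)) for d in range(dim)]
    # Attention-weighted variance: E[x^2] - (E[x])^2, clamped for stability
    var = [
        sum(a * h[d] ** 2 for a, h in zip(alpha, frames)) - mean[d] ** 2
        for d in range(dim)
    ]
    std = [math.sqrt(max(x, 1e-12)) for x in var]
    return mean + std, alpha
```

With `v` set to zeros, every frame receives the same weight and the output reduces to ordinary mean and standard deviation pooling; non-zero parameters let some frames contribute more than others.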
