Who said that?: Audio-visual speaker diarisation of real-world meetings

Bong-Jin Lee; Icksang Han; Joon Son Chung

arXiv: 1906.10042 · v1 · pith:NIWJHI77new · submitted 2019-06-24 · cs.SD · cs.CV · eess.AS

Pith reviewed 2026-05-25 16:52 UTC · model grok-4.3

classification cs.SD · cs.CV · eess.AS

keywords audio-visual speaker diarisation · real-world meetings · speaker enrollment · active speaker detection · AMI corpus · beamforming · multi-channel audio

The pith

An iterative audio-visual method enrolls speaker models via video-audio correspondence to determine who spoke when in meetings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a system to figure out who is speaking at each moment during meetings by using both video and audio. It starts by linking faces in the video to voices in the audio to create models for each speaker. Then it uses those models plus visual cues to spot the active speaker. The goal is to make this work reliably even when meetings have background noise or people talking over each other. If successful, it would allow more accurate automatic transcripts and analysis of conversations from everyday recordings.

Core claim

The paper claims that an iterative process of enrolling speaker models via audio-visual correspondence, followed by using those models with visual information to identify the active speaker, produces robust diarisation outputs on real-world meetings and surpasses all comparable methods on the AMI meeting corpus. When multi-channel audio is available, combining beamforming with the video further improves performance.

What carries the argument

Iterative enrollment of speaker models using audio-visual correspondence
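
As a rough illustration of the two stages named here, the Python sketch below enrols one voice model per face track and then fuses model similarity with the visual score. It is not the authors' implementation: the function names, the `enrol_threshold` and `visual_weight` parameters, and the use of mean-pooled embeddings scored by cosine similarity are all assumptions made for this example.

```python
import numpy as np


def l2_normalise(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def enrol_speakers(voice_emb, av_score, enrol_threshold=0.8):
    """Stage 1 (hypothetical): build one voice model per face track.

    voice_emb: (T, D) voice embedding for each audio segment.
    av_score:  (T, S) audio-visual correspondence score per segment and face track.
    Only segments where some face matches the audio with high confidence are used.
    """
    n_speakers = av_score.shape[1]
    best = av_score.argmax(axis=1)
    confident = av_score.max(axis=1) >= enrol_threshold
    models = np.zeros((n_speakers, voice_emb.shape[1]))
    for s in range(n_speakers):
        picked = voice_emb[confident & (best == s)]
        if len(picked):
            models[s] = l2_normalise(picked.mean(axis=0))
    return models


def diarise(voice_emb, av_score, models, visual_weight=0.5):
    """Stage 2 (hypothetical): fuse speaker-model similarity with the visual score."""
    similarity = l2_normalise(voice_emb) @ models.T  # cosine similarity, shape (T, S)
    fused = (1.0 - visual_weight) * similarity + visual_weight * av_score
    return fused.argmax(axis=1)
```

In this sketch, segments where the video is uninformative (equal scores for every face) are still labelled, because the enrolled voice models break the tie. That is the intuition behind enrolling first, although the paper's actual scoring and iteration scheme may differ.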

If this is right

Generates robust outputs on real-world meeting data.
Exceeds comparable methods on the AMI corpus.
Improves further when beamforming is applied to multi-channel audio.
Provides both strong quantitative and qualitative results.
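
The abstract does not say which beamformer is used, so the following is a generic delay-and-sum sketch rather than the paper's method. It assumes the video supplies a direction towards the active speaker (`direction`), that microphone coordinates (`mic_positions`) are known, and that the source is far enough away to be treated as a plane wave. All names are hypothetical.

```python
import numpy as np


def delay_and_sum(channels, mic_positions, direction, fs, c=343.0):
    """Steer a microphone array towards `direction` and average the channels.

    channels:      (M, N) array, one row of samples per microphone.
    mic_positions: (M, 3) microphone coordinates in metres.
    direction:     3-vector pointing from the array towards the speaker.
    fs:            sampling rate in Hz; c: speed of sound in m/s.
    """
    channels = np.asarray(channels, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)
    n_samples = channels.shape[1]
    # Plane-wave arrival time at each microphone relative to the origin:
    # microphones further along `direction` hear the speaker earlier.
    delays = -(np.asarray(mic_positions, dtype=float) @ direction) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(channels, axis=1)
    # Undo each channel's delay with a phase advance, then average.
    aligned = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)
```

Applying the delays in the frequency domain allows fractional-sample steering, at the cost of treating each block as circular. A real system would process short overlapping windows instead of a whole recording.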

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may extend to other multi-modal settings like video conferencing with multiple participants.
It suggests that visual information can compensate for audio ambiguities in speaker identification.
Future work could test the approach in fully online streaming scenarios without full video access.
Integration with speech recognition could yield speaker-attributed transcripts for meetings.

Load-bearing premise

Audio-visual correspondence can reliably enroll speaker models without significant errors from noise or multiple simultaneous speakers.

What would settle it

An evaluation on a dataset with frequent overlapping speech and high noise levels that shows no improvement over audio-only diarisation methods.

Figures

Figures reproduced from arXiv: 1906.10042 by Bong-Jin Lee, Icksang Han, Joon Son Chung.

**Figure 1.** Pipeline overview.

2.2.1. Audio-to-video correlation. Cross-modal embeddings of the audio and the mouth motion are used to represent the respective signals. The strategy to train this joint embedding is described in [28], but we give a brief overview here. The network consists of two streams: the audio stream that encodes Mel-frequency cepstral coefficients (MFCC) inputs into 512-dimensional vectors; and t…

**Figure 2.** Still image from the internal meeting dataset.

**Figure 3.** Still images from the AMI corpus.

… the authors and are not set up in any way with the diarisation task in mind. A large proportion of the dataset consists of very short utterances with frequent speaker changes, providing an extremely challenging condition. The video is recorded using a GoPro Fusion camera, which captures 360° videos of the meeting with two fish-eye lenses. The videos are stitched together i…

Original abstract

The goal of this work is to determine 'who spoke when' in real-world meetings. The method takes surround-view video and single or multi-channel audio as inputs, and generates robust diarisation outputs. To achieve this, we propose a novel iterative approach that first enrolls speaker models using audio-visual correspondence, then uses the enrolled models together with the visual information to determine the active speaker. We show strong quantitative and qualitative performance on a dataset of real-world meetings. The method is also evaluated on the public AMI meeting corpus, on which we demonstrate results that exceed all comparable methods. We also show that beamforming can be used together with the video to further improve the performance when multi-channel audio is available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's iterative AV enrollment for diarisation gives usable gains on AMI, but the enrollment reliability under overlap is not shown in enough detail to support the real-world claims.

The core contribution is an iterative two-stage method: first enroll speaker models by matching audio to visual tracks, then combine those models with visual active-speaker detection to produce the diarisation. This is presented as new relative to the cited prior work, and the authors report that it exceeds comparable methods on the AMI corpus while also running on their own real-meeting recordings. The addition of beamforming when multi-channel audio is present is a straightforward practical step that fits the setting. Those elements are the parts that actually move the needle for someone who needs a working system on meeting data. The evaluation includes both a public benchmark and a private real-world set, which is the right mix for this kind of applied work.

The main weakness is that the enrollment stage is load-bearing. The abstract states that AV correspondence produces the models, yet there is no reported breakdown of enrollment error rates, no explicit handling of overlap during enrollment, and no ablation showing how downstream diarisation degrades when enrollment is imperfect. If even a moderate fraction of segments produces bad models under realistic noise or simultaneous speech, the claimed robustness on real meetings rests on an untested precondition. The quantitative results are described as strong, but without error bars, confidence intervals, or per-condition breakdowns it is hard to judge how much of the improvement is stable.

This paper is aimed at researchers and engineers who build or benchmark audio-visual diarisation systems for meetings. A reader looking for a concrete baseline that already runs on AMI and real data will get something usable from it. It is coherent enough and grounded enough in a public corpus that a serious editor should send it to referees rather than desk-reject; the revisions would mainly need to add the missing enrollment diagnostics and ablations.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel iterative audio-visual method for speaker diarisation in real-world meetings. It first enrolls speaker models via audio-visual correspondence from surround-view video and single/multi-channel audio inputs, then uses the enrolled models together with visual information to determine active speakers. Strong quantitative and qualitative results are reported on a custom real-world meeting dataset; the method also exceeds all comparable approaches on the public AMI corpus, with further gains shown when beamforming is combined with video on multi-channel audio.

Significance. If the central claims hold, the work offers a practical advance in audio-visual diarisation by leveraging AV correspondence for enrollment in challenging real-world conditions, with demonstrated gains over prior methods on AMI. The dual evaluation on both proprietary real-world data and a public benchmark is a strength; the beamforming integration is a useful engineering contribution.

major comments (2)

[Method description (iterative enrollment stage)] The load-bearing claim that AV correspondence produces reliable speaker models for downstream diarisation (even under overlap or noise) is not supported by any quantitative enrollment error rates, overlap-handling description, or ablation showing diarisation degradation when enrollment is imperfect. This directly affects the reported gains on both the real-world dataset and AMI.
[AMI evaluation results] The claim of exceeding 'all comparable methods' on AMI lacks a table or section that lists the exact baselines, their diarisation error rates (DER), and error bars; without these, the superiority cannot be verified against the abstract's assertion.
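
For readers unfamiliar with the metric requested here, DER is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below computes a frame-level version under simplifying assumptions that are not taken from the paper: one speaker per frame, hypothesis labels already mapped to reference labels, and no scoring collar. The name `frame_der` and the `silence` label are hypothetical.

```python
import numpy as np


def frame_der(reference, hypothesis, silence=-1):
    """Frame-level diarisation error rate for single-speaker-per-frame labels."""
    reference = np.asarray(reference)
    hypothesis = np.asarray(hypothesis)
    ref_speech = reference != silence
    hyp_speech = hypothesis != silence
    missed = np.sum(ref_speech & ~hyp_speech)        # speech labelled as silence
    false_alarm = np.sum(~ref_speech & hyp_speech)   # silence labelled as speech
    confusion = np.sum(ref_speech & hyp_speech & (reference != hypothesis))
    return (missed + false_alarm + confusion) / np.sum(ref_speech)
```

Because false alarms are counted in the numerator but not the denominator, DER can exceed 100% on recordings with little speech. This is one reason per-condition breakdowns are more informative than a single figure.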

minor comments (2)

[Method] Notation for the enrolled speaker models and the iterative update rule should be defined more explicitly (e.g., with equations) to aid reproducibility.
[Experiments] The real-world dataset description should include details on number of meetings, total duration, and overlap statistics to contextualize the qualitative results.
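
One way the requested overlap statistic could be computed from a segment-level annotation is sketched below. The `(start, end, speaker)` tuple format and the function name `overlap_fraction` are assumptions for illustration, not the format of the authors' dataset.

```python
def overlap_fraction(segments):
    """Fraction of speech time during which two or more speakers are active.

    segments: iterable of (start_seconds, end_seconds, speaker_id) tuples.
    """
    events = []
    for start, end, _speaker in segments:
        events.append((start, 1))   # a speaker starts talking
        events.append((end, -1))    # a speaker stops talking
    # At equal times, ends (-1) sort before starts (+1), so segments that
    # merely touch are not counted as overlapping.
    events.sort()
    active = 0
    previous = None
    speech = 0.0
    overlap = 0.0
    for time, delta in events:
        if previous is not None and active > 0:
            speech += time - previous
            if active > 1:
                overlap += time - previous
        active += delta
        previous = time
    return overlap / speech if speech else 0.0
```

For example, segments `(0, 10, "A")`, `(8, 12, "B")` and `(12, 15, "A")` contain 15 seconds of speech, of which 2 seconds are overlapped.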

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to be made in the next version of the manuscript.

Referee: [Method description (iterative enrollment stage)] The load-bearing claim that AV correspondence produces reliable speaker models for downstream diarisation (even under overlap or noise) is not supported by any quantitative enrollment error rates, overlap-handling description, or ablation showing diarisation degradation when enrollment is imperfect. This directly affects the reported gains on both the real-world dataset and AMI.

Authors: We agree that the manuscript would be strengthened by explicit quantitative analysis of the enrollment stage. The end-to-end diarisation results on both the real-world meetings and AMI provide indirect support for the reliability of the AV correspondence, but we acknowledge the value of direct metrics. In the revision we will add enrollment accuracy figures computed on held-out data, a description of overlap handling during enrollment, and an ablation that measures diarisation degradation under controlled enrollment errors.

revision: yes

Referee: [AMI evaluation results] The claim of exceeding 'all comparable methods' on AMI lacks a table or section that lists the exact baselines, their diarisation error rates (DER), and error bars; without these, the superiority cannot be verified against the abstract's assertion.

Authors: The manuscript reports comparisons against prior methods on AMI, yet we accept that a single consolidated table listing every baseline, its DER, and any available error bars would improve verifiability. We will insert this table (with references to the original papers) into the AMI evaluation section of the revised manuscript.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in method description or evaluation

The paper proposes an iterative audio-visual enrollment and diarisation pipeline evaluated on external corpora (AMI and a real-world meeting dataset). No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The enrollment step is presented as a procedural component whose reliability is tested via downstream performance on held-out data, not derived from the target outputs by construction. This is a standard empirical ML pipeline with independent validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; central claim rests on unstated modeling assumptions typical of ML diarisation methods.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

[1] Introduction. Over the recent years, there has been a growing demand to be able to record and search human communications in a machine readable format. There has been significant advances in automatic speech recognition due to the availability of large-scale datasets [1, 2] and the accessibility of deep learning frameworks [3, 4, 5], but to give the tra...
[2] Who said that?: Audio-visual speaker diarisation of real-world meetings. System description. 2.1. Audio-only baseline. The baseline system provided for the second DIHARD challenge is used as our audio-only baseline. The system takes key components from the top-scoring systems in the first DIHARD challenge and shows state-of-the-art performance on audio-only diarisation. 2.1.1. Speech enhancement. The speech enhancement is based ...
[3] Experiments. The proposed method is evaluated on two independent datasets: our internal dataset of meetings recorded with 360° camera, and the publicly available AMI meeting corpus. Each will be described in the following paragraphs. 3.1. Internal meeting dataset. The internal meeting dataset consists of audio-visual recording of regular meetings in which...
[4] Conclusion. In this paper, we have introduced a multi-modal system which takes advantage of audio-visual correspondence to enroll speaker models. We have shown that speaker modelling with audio-visual enrollment have significant advantages over clustering methods typically used for diarisation. Areas for further research include learnable methods for mult...
[5] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP. IEEE, 2015, pp. 5206–5210.
[6] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines,” arXiv preprint arXiv:1803.10609, 2018.
[7] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
[8] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.
[9] A. Vedaldi and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” in Proc. ACMM, 2015.
[10] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[11] S. Cumani, O. Plchot, and P. Laface, “Probabilistic linear discriminant analysis of i-vector posterior distributions,” in Proc. ICASSP. IEEE, 2013, pp. 7644–7648.
[12] P. Matějka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. Černocký, “Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification,” in Proc. ICASSP. IEEE, 2011, pp. 4828–4831.
[13] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. ICASSP. IEEE, 2014, pp. 4052–4056.
[14] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. ICASSP. IEEE, 2014, pp. 1695–1699.
[15] S. H. Ghalehjegh and R. C. Rose, “Deep bottleneck features for i-vector based text-independent speaker verification,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 555–560.
[16] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” Proc. Interspeech, pp. 999–1003, 2017.
[17] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” ICASSP, Calgary, 2018.
[18] H. Hung and G. Friedland, “Towards audio-visual on-line diarization of participants in group meetings,” in Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications-M2SFA2 2008, 2008.
[19] G. Biagetti, P. Crippa, L. Falaschetti, S. Orcioni, and C. Turchetti, “Robust speaker identification in a meeting with short audio segments,” in Intelligent Decision Technologies 2016. Springer, 2016, pp. 465–477.
[20] G. Friedland, A. Janin, D. Imseng, X. Anguera, L. Gottlieb, M. Huijbregts, M. T. Knox, and O. Vinyals, “The ICSI RT-09 speaker diarization system,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 371–381, 2012.
[21] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2808–2812.
[22] G. Friedland, H. Hung, and C. Yeo, “Multi-modal speaker diarization of real-world meetings using compressed-domain video features,” in Proc. ICASSP. IEEE, 2009, pp. 4069–4072.
[23] N. Sarafianos, T. Giannakopoulos, and S. Petridis, “Audio-visual speaker diarization using Fisher linear semi-discriminant analysis,” Multimedia Tools and Applications, vol. 75, no. 1, pp. 115–130, 2016.
[24] V. Rozgic, K. J. Han, P. G. Georgiou, and S. Narayanan, “Multimodal speaker segmentation and identification in presence of overlapped speech segments,” Journal of Multimedia, vol. 5, no. 4, p. 322, 2010.
[25] J. H. DiBiase, A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Brown University Providence, RI, 2000.
[26] J. Schmalenstroeer, M. Kelling, V. Leutnant, and R. Haeb-Umbach, “Fusing audio and video information for online speaker diarization,” in Proc. Interspeech, 2009.
[27] P. Cabañas-Molero, M. Lucena, J. Fuertes, P. Vera-Candeas, and N. Ruiz-Reyes, “Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis,” Multimedia Tools and Applications, vol. 77, no. 20, pp. 27685–27707, 2018.
[28] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, “Speaker diarization with enhancing speech for the first DIHARD challenge,” Proc. Interspeech, pp. 2793–2797, 2018.
[29] A. B. Johnston and D. C. Burnett, WebRTC: APIs and RTCWEB protocols of the HTML5 real-time web. Digital Codex LLC, 2012.
[30] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017.
[31] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
[32] S.-W. Chung, J. S. Chung, and H.-G. Kang, “Perfect match: Improved cross-modal embeddings for audio-visual synchronisation,” in Proc. ICASSP, 2019.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
[34] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2021, September 2007.
[35] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proc. ECCV. Springer, 2016, pp. 21–37.
[36] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: a dataset for recognising faces across pose and age,” in Proc. Int. Conf. Autom. Face and Gesture Recog., 2018.
[37] D. Istrate, C. Fredouille, S. Meignier, L. Besacier, and J. F. Bonastre, “NIST RT05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 428–439.
[38] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The AMI meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39.
[39] G. Friedland, C. Yeo, and H. Hung, “Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 6, no. 4, p. 27, 2010.

[1] [1]

Introduction Over the recent years, there has been a growing demand to be able to record and search human communications in a machine readable format. There has been signiﬁcant advances in auto- matic speech recognition due to the availability of large-scale datasets [1, 2] and the accessibility of deep learning frame- works [3, 4, 5], but to give the tra...

work page

[2] [2]

Who said that?: Audio-visual speaker diarisation of real-world meetings

System description 2.1. Audio-only baseline The baseline system provided for the second DIHARD chal- lenge is used as our audio-only baseline. The system takes key components from the top-scoring systems in the ﬁrst DIHARD challenge and shows state-of-the-art performance on audio-only diarisation. 2.1.1. Speech enhancement The speech enhancement is based ...

work page internal anchor Pith review Pith/arXiv arXiv 1906

[3] [3]

Each will be de- scribed in the following paragraphs

Experiments The proposed method is evaluated on two independent datasets: our internal dataset of meetings recorded with 360◦ camera, and the publicly available AMI meeting corpus. Each will be de- scribed in the following paragraphs. 3.1. Internal meeting dataset The internal meeting dataset consists of audio-visual recording of regular meetings in which...

work page

[4] [4]

We have shown that speaker modelling with audio-visual enrollment have signiﬁcant advantages over clus- tering methods typically used for diarisation

Conclusion In this paper, we have introduced a multi-modal system which takes advantage of audio-visual correspondence to enroll speaker models. We have shown that speaker modelling with audio-visual enrollment have signiﬁcant advantages over clus- tering methods typically used for diarisation. Areas for further research include learnable methods for mult...

work page

[5] [5]

Lib- rispeech: an asr corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP. IEEE, 2015, pp. 5206–5210

work page 2015

[6] [6]

The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The ﬁfth’chime’speech separation and recognition challenge: Dataset, task and baselines,” arXiv preprint arXiv:1803.10609, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorﬂow: Large-scale machine learning on heterogeneous distributed sys- tems,” arXiv preprint arXiv:1603.04467, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[8] [8]

Automatic differ- entiation in pytorch,

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differ- entiation in pytorch,” 2017

work page 2017

[9] [9]

Matconvnet: Convolutional neural net- works for matlab,

A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural net- works for matlab,” in Proc. ACMM, 2015

work page 2015

[10] [10]

Front-end factor analysis for speaker veriﬁcation,

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker veriﬁcation,” IEEE Trans- actions on Audio, Speech, and Language Processing , vol. 19, no. 4, pp. 788–798, 2011

work page 2011

[11] [11]

Probabilistic linear dis- criminant analysis of i-vector posterior distributions,

S. Cumani, O. Plchot, and P. Laface, “Probabilistic linear dis- criminant analysis of i-vector posterior distributions,” in Proc. ICASSP. IEEE, 2013, pp. 7644–7648

work page 2013

[12] [12]

Full-covariance ubm and heavy-tailed plda in i-vector speaker veriﬁcation,

P. Mat ˇejka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. ˇCernocky, “Full-covariance ubm and heavy-tailed plda in i-vector speaker veriﬁcation,” in Proc. ICASSP. IEEE, 2011, pp. 4828–4831

work page 2011

[13] [13]

Deep neural networks for small footprint text- dependent speaker veriﬁcation,

E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez- Dominguez, “Deep neural networks for small footprint text- dependent speaker veriﬁcation,” inProc. ICASSP. IEEE, 2014, pp. 4052–4056

work page 2014

[14] [14]

A novel scheme for speaker recognition using a phonetically-aware deep neural network,

Y . Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. ICASSP. IEEE, 2014, pp. 1695–1699

work page 2014

[15] [15]

Deep bottleneck features for i-vector based text-independent speaker veriﬁcation,

S. H. Ghalehjegh and R. C. Rose, “Deep bottleneck features for i-vector based text-independent speaker veriﬁcation,” in Au- tomatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 555–560

work page 2015

[16] [16]

Deep neural network embeddings for text-independent speaker veriﬁcation,

D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker veriﬁcation,”Proc. Interspeech, pp. 999–1003, 2017

work page 2017

[17] [17]

X-vectors: Robust dnn embeddings for speaker recognition,

D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan- pur, “X-vectors: Robust dnn embeddings for speaker recognition,” ICASSP , Calgary, 2018

work page 2018

[18] [18]

Towards audio-visual on-line di- arization of participants in group meetings,

H. Hung and G. Friedland, “Towards audio-visual on-line di- arization of participants in group meetings,” in Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications-M2SFA2 2008, 2008

work page 2008

[19] [19]

Robust speaker identiﬁcation in a meeting with short audio seg- ments,

G. Biagetti, P. Crippa, L. Falaschetti, S. Orcioni, and C. Turchetti, “Robust speaker identiﬁcation in a meeting with short audio seg- ments,” in Intelligent Decision Technologies 2016 . Springer, 2016, pp. 465–477

work page 2016

[20] [20]

The icsi rt-09 speaker diarization system,

G. Friedland, A. Janin, D. Imseng, X. Anguera, L. Gottlieb, M. Huijbregts, M. T. Knox, and O. Vinyals, “The icsi rt-09 speaker diarization system,”IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 371–381, 2012

work page 2012

[21] [21]

Diarization is hard: Some experiences and lessons learned for the jhu team in the inaugural dihard challenge,

G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V . Manohar, N. Dehak, D. Povey, S. Watanabe et al., “Diarization is hard: Some experiences and lessons learned for the jhu team in the inaugural dihard challenge,” inProc. Inter- speech, 2018, pp. 2808–2812

work page 2018

[22] [22]

Multi-modal speaker diariza- tion of real-world meetings using compressed-domain video fea- tures,

G. Friedland, H. Hung, and C. Yeo, “Multi-modal speaker diariza- tion of real-world meetings using compressed-domain video fea- tures,” in Proc. ICASSP. IEEE, 2009, pp. 4069–4072

work page 2009

[23] [23]

Audio-visual speaker diarization using ﬁsher linear semi-discriminant analy- sis,

N. Saraﬁanos, T. Giannakopoulos, and S. Petridis, “Audio-visual speaker diarization using ﬁsher linear semi-discriminant analy- sis,” Multimedia Tools and Applications, vol. 75, no. 1, pp. 115– 130, 2016

work page 2016

[24] V. Rozgic, K. J. Han, P. G. Georgiou, and S. Narayanan, “Multimodal speaker segmentation and identification in presence of overlapped speech segments,” Journal of Multimedia, vol. 5, no. 4, p. 322, 2010

[25] J. H. DiBiase, A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Brown University, Providence, RI, 2000

[26] J. Schmalenstroeer, M. Kelling, V. Leutnant, and R. Haeb-Umbach, “Fusing audio and video information for online speaker diarization,” in Proc. Interspeech, 2009

[27] P. Cabañas-Molero, M. Lucena, J. Fuertes, P. Vera-Candeas, and N. Ruiz-Reyes, “Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis,” Multimedia Tools and Applications, vol. 77, no. 20, pp. 27685–27707, 2018

[28] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, “Speaker diarization with enhancing speech for the first DIHARD challenge,” in Proc. Interspeech, 2018, pp. 2793–2797

[29] A. B. Johnston and D. C. Burnett, WebRTC: APIs and RTCWEB protocols of the HTML5 real-time web. Digital Codex LLC, 2012

[30] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017

[31] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018

[32] S.-W. Chung, J. S. Chung, and H.-G. Kang, “Perfect match: Improved cross-modal embeddings for audio-visual synchronisation,” in Proc. ICASSP, 2019

[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016

[34] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2021, September 2007

[35] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proc. ECCV. Springer, 2016, pp. 21–37

[36] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: a dataset for recognising faces across pose and age,” in Proc. Int. Conf. Autom. Face and Gesture Recog., 2018

[37] D. Istrate, C. Fredouille, S. Meignier, L. Besacier, and J. F. Bonastre, “NIST RT05s evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 428–439

[38] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The AMI meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39

[39] G. Friedland, C. Yeo, and H. Hung, “Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 6, no. 4, p. 27, 2010