
arxiv: 1906.10042 · v1 · pith:NIWJHI77new · submitted 2019-06-24 · 💻 cs.SD · cs.CV · eess.AS

Who said that?: Audio-visual speaker diarisation of real-world meetings

Pith reviewed 2026-05-25 16:52 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV · eess.AS
keywords audio-visual speaker diarisation · real-world meetings · speaker enrollment · active speaker detection · AMI corpus · beamforming · multi-channel audio

The pith

An iterative audio-visual method enrolls speaker models via video-audio correspondence to determine who spoke when in meetings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a system to figure out who is speaking at each moment during meetings by using both video and audio. It starts by linking faces in the video to voices in the audio to create models for each speaker. Then it uses those models plus visual cues to spot the active speaker. The goal is to make this work reliably even when meetings have background noise or people talking over each other. If successful, it would allow more accurate automatic transcripts and analysis of conversations from everyday recordings.

Core claim

The paper claims that an iterative process of enrolling speaker models via audio-visual correspondence, followed by using those models with visual information to identify the active speaker, produces robust diarisation outputs on real-world meetings and surpasses all comparable methods on the AMI meeting corpus. When multi-channel audio is available, combining beamforming with the video improves performance further.
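The enroll-then-identify idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `enroll` and `diarise`, the confidence `threshold`, the fusion `weight`, and the use of cosine similarity are all assumptions made for illustration, and the single pass shown here omits whatever iteration the paper performs.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def enroll(speaker_emb, av_corr, threshold=0.5):
    """Build one speaker model per visible face (hypothetical helper).

    speaker_emb: (T, D) audio speaker embedding per segment.
    av_corr:     (T, N) audio-visual correspondence score per segment and face.
    A segment is used for face n only if n has the best score and the score
    exceeds `threshold` (an assumed confidence rule).
    """
    models = {}
    best_face = av_corr.argmax(axis=1)
    for n in range(av_corr.shape[1]):
        confident = (av_corr[:, n] > threshold) & (best_face == n)
        if confident.any():
            models[n] = speaker_emb[confident].mean(axis=0)
    return models

def diarise(speaker_emb, av_corr, models, weight=0.5):
    """Label each segment by fusing speaker-model similarity with the visual score."""
    labels = []
    for t in range(speaker_emb.shape[0]):
        scores = {
            n: weight * cosine(speaker_emb[t], m) + (1 - weight) * av_corr[t, n]
            for n, m in models.items()
        }
        labels.append(max(scores, key=scores.get))
    return labels

# Toy data: segments 0 and 1 have clear visual evidence, segments 2 and 3 do not.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
corr = np.array([[0.9, 0.1], [0.1, 0.9], [0.4, 0.45], [0.3, 0.3]])
labels = diarise(emb, corr, enroll(emb, corr))
print(labels)  # [0, 1, 0, 1]
```

In the toy data the visual score for segment 2 slightly favours the wrong face, yet the enrolled audio model still assigns it to speaker 0, which is the kind of compensation between modalities the claim describes.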

What carries the argument

Iterative enrollment of speaker models using audio-visual correspondence

If this is right

  • Generates robust outputs on real-world meeting data.
  • Exceeds comparable methods on the AMI corpus.
  • Improves further when beamforming is applied to multi-channel audio.
  • Provides both strong quantitative and qualitative results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may extend to other multi-modal settings like video conferencing with multiple participants.
  • It suggests that visual information can compensate for audio ambiguities in speaker identification.
  • Future work could test the approach in fully online streaming scenarios without full video access.
  • Integration with speech recognition could yield speaker-attributed transcripts for meetings.

Load-bearing premise

Audio-visual correspondence can reliably enroll speaker models without significant errors from noise or multiple simultaneous speakers.

What would settle it

An evaluation on a dataset with frequent overlapping speech and high noise levels in which the method shows no improvement over audio-only diarisation methods.

Figures

Figures reproduced from arXiv: 1906.10042 by Bong-Jin Lee, Icksang Han, Joon Son Chung.

Figure 1: Pipeline overview.
2.2.1. Audio-to-video correlation. Cross-modal embeddings of the audio and the mouth motion are used to represent the respective signals. The strategy to train this joint embedding is described in [28], but we give a brief overview here. The network consists of two streams: the audio stream that encodes Mel-frequency cepstral coefficients (MFCC) inputs into 512-dimensional vectors; and t…
Figure 2: Still image from the internal meeting dataset.
Figure 3: Still images from the AMI corpus.
… the authors and are not set up in any way with the diarisation task in mind. A large proportion of the dataset consists of very short utterances with frequent speaker changes, providing an extremely challenging condition. The video is recorded using a GoPro Fusion camera, which captures 360° videos of the meeting with two fish-eye lenses. The videos are stitched together i…

The goal of this work is to determine 'who spoke when' in real-world meetings. The method takes surround-view video and single or multi-channel audio as inputs, and generates robust diarisation outputs. To achieve this, we propose a novel iterative approach that first enrolls speaker models using audio-visual correspondence, then uses the enrolled models together with the visual information to determine the active speaker. We show strong quantitative and qualitative performance on a dataset of real-world meetings. The method is also evaluated on the public AMI meeting corpus, on which we demonstrate results that exceed all comparable methods. We also show that beamforming can be used together with the video to further improve the performance when multi-channel audio is available.
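The beamforming step mentioned in the abstract can be pictured with a delay-and-sum sketch. This page does not say which beamformer the paper uses or how the video informs it, so the sketch below is a generic illustration: the function name `delay_and_sum` is hypothetical, and the per-channel delays are assumed to be known already (for instance, derived from the direction of the active speaker).

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each microphone channel by its known delay, then average.

    channels: (M, T) array with one row per microphone.
    delays:   length-M sequence of integer sample lags relative to a reference.
    np.roll shifts circularly, which is a simplification that is only
    acceptable when the signal edges can be ignored.
    """
    channels = np.asarray(channels, dtype=float)
    out = np.zeros(channels.shape[1])
    for channel, delay in zip(channels, delays):
        out += np.roll(channel, -int(delay))  # advance the channel to undo its lag
    return out / channels.shape[0]

# Toy example: the second microphone hears the same signal 3 samples later.
rng = np.random.default_rng(0)
signal = rng.standard_normal(100)
mics = np.stack([signal, np.roll(signal, 3)])
enhanced = delay_and_sum(mics, [0, 3])
```

With correct delays the target signal adds coherently across channels, while uncorrelated noise on each microphone is averaged down.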

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a novel iterative audio-visual method for speaker diarisation in real-world meetings. It first enrolls speaker models via audio-visual correspondence from surround-view video and single/multi-channel audio inputs, then uses the enrolled models together with visual information to determine active speakers. Strong quantitative and qualitative results are reported on a custom real-world meeting dataset; the method also exceeds all comparable approaches on the public AMI corpus, with further gains shown when beamforming is combined with video on multi-channel audio.

Significance. If the central claims hold, the work offers a practical advance in audio-visual diarisation by leveraging AV correspondence for enrollment in challenging real-world conditions, with demonstrated gains over prior methods on AMI. The dual evaluation on both proprietary real-world data and a public benchmark is a strength; the beamforming integration is a useful engineering contribution.

major comments (2)
  1. [Method description (iterative enrollment stage)] The load-bearing claim that AV correspondence produces reliable speaker models for downstream diarisation (even under overlap or noise) is not supported by any quantitative enrollment error rates, overlap-handling description, or ablation showing diarisation degradation when enrollment is imperfect. This directly affects the reported gains on both the real-world dataset and AMI.
  2. [AMI evaluation results] The claim of exceeding 'all comparable methods' on AMI lacks a table or section that lists the exact baselines, their diarisation error rates (DER), and error bars; without these, the superiority cannot be verified against the abstract's assertion.
minor comments (2)
  1. [Method] Notation for the enrolled speaker models and the iterative update rule should be defined more explicitly (e.g., with equations) to aid reproducibility.
  2. [Experiments] The real-world dataset description should include details on number of meetings, total duration, and overlap statistics to contextualize the qualitative results.
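
For readers unfamiliar with the metric named in major comment 2, diarisation error rate is the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The sketch below is a simplified frame-level version written for illustration: it assumes one speaker per frame, hypothesis labels already mapped to reference labels, and no forgiveness collar, so it is not the scoring tool used in the paper. The name `frame_der` and the `silence` marker are hypothetical.

```python
def frame_der(reference, hypothesis, silence=-1):
    """Frame-level diarisation error rate under simplifying assumptions.

    reference, hypothesis: equal-length sequences of speaker labels per frame,
    with `silence` marking non-speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != silence:
            speech += 1
            if hyp == silence:
                missed += 1        # speech labelled as silence
            elif hyp != ref:
                confusion += 1     # speech attributed to the wrong speaker
        elif hyp != silence:
            false_alarm += 1       # silence labelled as speech
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech

ref = [0, 0, 1, 1, -1, -1, 0, 0, 1, 1]
hyp = [0, 0, 1, 0, -1, 1, 0, -1, 1, 1]
print(frame_der(ref, hyp))  # 0.375
```

Here one confused frame, one false alarm, and one missed frame over eight speech frames give 3/8.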

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to be made in the next version of the manuscript.

  1. Referee: [Method description (iterative enrollment stage)] The load-bearing claim that AV correspondence produces reliable speaker models for downstream diarisation (even under overlap or noise) is not supported by any quantitative enrollment error rates, overlap-handling description, or ablation showing diarisation degradation when enrollment is imperfect. This directly affects the reported gains on both the real-world dataset and AMI.

    Authors: We agree that the manuscript would be strengthened by explicit quantitative analysis of the enrollment stage. The end-to-end diarisation results on both the real-world meetings and AMI provide indirect support for the reliability of the AV correspondence, but we acknowledge the value of direct metrics. In the revision we will add enrollment accuracy figures computed on held-out data, a description of overlap handling during enrollment, and an ablation that measures diarisation degradation under controlled enrollment errors. revision: yes

  2. Referee: [AMI evaluation results] The claim of exceeding 'all comparable methods' on AMI lacks a table or section that lists the exact baselines, their diarisation error rates (DER), and error bars; without these, the superiority cannot be verified against the abstract's assertion.

    Authors: The manuscript reports comparisons against prior methods on AMI, yet we accept that a single consolidated table listing every baseline, its DER, and any available error bars would improve verifiability. We will insert this table (with references to the original papers) into the AMI evaluation section of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in method description or evaluation


The paper proposes an iterative audio-visual enrollment and diarisation pipeline evaluated on external corpora (AMI and a real-world meeting dataset). No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The enrollment step is presented as a procedural component whose reliability is tested via downstream performance on held-out data, not derived from the target outputs by construction. This is a standard empirical ML pipeline with independent validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; central claim rests on unstated modeling assumptions typical of ML diarisation methods.

pith-pipeline@v0.9.0 · 5655 in / 917 out tokens · 23969 ms · 2026-05-25T16:52:39.278930+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 3 internal anchors

  1. [1]

    Introduction Over the recent years, there has been a growing demand to be able to record and search human communications in a machine readable format. There has been significant advances in automatic speech recognition due to the availability of large-scale datasets [1, 2] and the accessibility of deep learning frameworks [3, 4, 5], but to give the tra...

  2. [2]

    Who said that?: Audio-visual speaker diarisation of real-world meetings

    System description 2.1. Audio-only baseline The baseline system provided for the second DIHARD challenge is used as our audio-only baseline. The system takes key components from the top-scoring systems in the first DIHARD challenge and shows state-of-the-art performance on audio-only diarisation. 2.1.1. Speech enhancement The speech enhancement is based ...

  3. [3]

    Each will be described in the following paragraphs

    Experiments The proposed method is evaluated on two independent datasets: our internal dataset of meetings recorded with 360° camera, and the publicly available AMI meeting corpus. Each will be described in the following paragraphs. 3.1. Internal meeting dataset The internal meeting dataset consists of audio-visual recording of regular meetings in which...

  4. [4]

    We have shown that speaker modelling with audio-visual enrollment have significant advantages over clustering methods typically used for diarisation

    Conclusion In this paper, we have introduced a multi-modal system which takes advantage of audio-visual correspondence to enroll speaker models. We have shown that speaker modelling with audio-visual enrollment have significant advantages over clustering methods typically used for diarisation. Areas for further research include learnable methods for mult...

  5. [5]

    Librispeech: an asr corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP. IEEE, 2015, pp. 5206–5210

  6. [6]

    The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines

    J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines,” arXiv preprint arXiv:1803.10609, 2018

  7. [7]

    TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

    M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016

  8. [8]

    Automatic differentiation in pytorch,

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017

  9. [9]

    Matconvnet: Convolutional neural networks for matlab,

    A. Vedaldi and K. Lenc, “Matconvnet: Convolutional neural networks for matlab,” in Proc. ACMM, 2015

  10. [10]

    Front-end factor analysis for speaker verification,

    N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011

  11. [11]

    Probabilistic linear discriminant analysis of i-vector posterior distributions,

    S. Cumani, O. Plchot, and P. Laface, “Probabilistic linear discriminant analysis of i-vector posterior distributions,” in Proc. ICASSP. IEEE, 2013, pp. 7644–7648

  12. [12]

    Full-covariance ubm and heavy-tailed plda in i-vector speaker verification,

    P. Matějka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. Černocký, “Full-covariance ubm and heavy-tailed plda in i-vector speaker verification,” in Proc. ICASSP. IEEE, 2011, pp. 4828–4831

  13. [13]

    Deep neural networks for small footprint text-dependent speaker verification,

    E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. ICASSP. IEEE, 2014, pp. 4052–4056

  14. [14]

    A novel scheme for speaker recognition using a phonetically-aware deep neural network,

    Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Proc. ICASSP. IEEE, 2014, pp. 1695–1699

  15. [15]

    Deep bottleneck features for i-vector based text-independent speaker verification,

    S. H. Ghalehjegh and R. C. Rose, “Deep bottleneck features for i-vector based text-independent speaker verification,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 555–560

  16. [16]

    Deep neural network embeddings for text-independent speaker verification,

    D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” Proc. Interspeech, pp. 999–1003, 2017

  17. [17]

    X-vectors: Robust dnn embeddings for speaker recognition,

    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” ICASSP, Calgary, 2018

  18. [18]

    Towards audio-visual on-line diarization of participants in group meetings,

    H. Hung and G. Friedland, “Towards audio-visual on-line diarization of participants in group meetings,” in Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications-M2SFA2 2008, 2008

  19. [19]

    Robust speaker identification in a meeting with short audio segments,

    G. Biagetti, P. Crippa, L. Falaschetti, S. Orcioni, and C. Turchetti, “Robust speaker identification in a meeting with short audio segments,” in Intelligent Decision Technologies 2016. Springer, 2016, pp. 465–477

  20. [20]

    The icsi rt-09 speaker diarization system,

    G. Friedland, A. Janin, D. Imseng, X. Anguera, L. Gottlieb, M. Huijbregts, M. T. Knox, and O. Vinyals, “The icsi rt-09 speaker diarization system,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 371–381, 2012

  21. [21]

    Diarization is hard: Some experiences and lessons learned for the jhu team in the inaugural dihard challenge,

    G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., “Diarization is hard: Some experiences and lessons learned for the jhu team in the inaugural dihard challenge,” in Proc. Interspeech, 2018, pp. 2808–2812

  22. [22]

    Multi-modal speaker diarization of real-world meetings using compressed-domain video features,

    G. Friedland, H. Hung, and C. Yeo, “Multi-modal speaker diarization of real-world meetings using compressed-domain video features,” in Proc. ICASSP. IEEE, 2009, pp. 4069–4072

  23. [23]

    Audio-visual speaker diarization using fisher linear semi-discriminant analysis,

    N. Sarafianos, T. Giannakopoulos, and S. Petridis, “Audio-visual speaker diarization using fisher linear semi-discriminant analysis,” Multimedia Tools and Applications, vol. 75, no. 1, pp. 115–130, 2016

  24. [24]

    Multimodal speaker segmentation and identification in presence of overlapped speech segments,

    V. Rozgic, K. J. Han, P. G. Georgiou, and S. Narayanan, “Multimodal speaker segmentation and identification in presence of overlapped speech segments,” Journal of Multimedia, vol. 5, no. 4, p. 322, 2010

  25. [25]

    J. H. DiBiase, A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Brown University, Providence, RI, 2000

  26. [26]

    Fusing audio and video information for online speaker diarization,

    J. Schmalenstroeer, M. Kelling, V. Leutnant, and R. Haeb-Umbach, “Fusing audio and video information for online speaker diarization,” in Proc. Interspeech, 2009

  27. [27]

    Multimodal speaker diarization for meetings using volume-evaluated srp-phat and video analysis,

    P. Cabañas-Molero, M. Lucena, J. Fuertes, P. Vera-Candeas, and N. Ruiz-Reyes, “Multimodal speaker diarization for meetings using volume-evaluated srp-phat and video analysis,” Multimedia Tools and Applications, vol. 77, no. 20, pp. 27685–27707, 2018

  28. [28]

    Speaker diarization with enhancing speech for the first dihard challenge,

    L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, “Speaker diarization with enhancing speech for the first dihard challenge,” Proc. Interspeech, pp. 2793–2797, 2018

  29. [29]

    A. B. Johnston and D. C. Burnett, WebRTC: APIs and RTCWEB protocols of the HTML5 real-time web. Digital Codex LLC, 2012

  30. [30]

    Voxceleb: a large-scale speaker identification dataset,

    A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” in INTERSPEECH, 2017

  31. [31]

    VoxCeleb2: Deep speaker recognition,

    J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018

  32. [32]

    Perfect match: Improved cross-modal embeddings for audio-visual synchronisation,

    S.-W. Chung, J. S. Chung, and H.-G. Kang, “Perfect match: Improved cross-modal embeddings for audio-visual synchronisation,” in Proc. ICASSP, 2019

  33. [33]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016

  34. [34]

    Acoustic beamforming for speaker diarization of meetings,

    X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2021, September 2007

  35. [35]

    SSD: Single shot multibox detector,

    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proc. ECCV. Springer, 2016, pp. 21–37

  36. [36]

    VGGFace2: a dataset for recognising faces across pose and age,

    Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: a dataset for recognising faces across pose and age,” in Proc. Int. Conf. Autom. Face and Gesture Recog., 2018

  37. [37]

    Nist rt05s evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings,

    D. Istrate, C. Fredouille, S. Meignier, L. Besacier, and J. F. Bonastre, “Nist rt05s evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 428–439

  38. [38]

    The ami meeting corpus: A pre-announcement,

    J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The ami meeting corpus: A pre-announcement,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28–39

  39. [39]

    Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem,

    G. Friedland, C. Yeo, and H. Hung, “Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 6, no. 4, p. 27, 2010