arxiv: 1906.10096 · v1 · pith:EPRIZN6Bnew · submitted 2019-06-24 · 💻 cs.CV

Audio-Visual Kinship Verification

Pith reviewed 2026-05-25 17:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords kinship verification · audio-visual fusion · Siamese network · face recognition · voice recognition · multi-modal learning · TALKIN dataset · contrastive loss

The pith

A Siamese network fusing face and voice modalities raises kinship verification accuracy on low-quality internet videos above uni-modal baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that human voice carries kin-related cues that are complementary to facial appearance, allowing multi-modal fusion to overcome the difficulties of uncontrolled pose, lighting, blur and occlusion in found video data. It introduces the TALKIN dataset of talking pairs and evaluates several verification approaches before proposing a deep Siamese network trained with contrastive loss that performs early or late fusion of the two streams. Experiments indicate this network outperforms both single-modality methods and standard fusion techniques on the new data. If correct, the result would mean kinship verification systems can become more reliable without requiring higher-quality visual input alone. The work treats the complementarity of voice and face as an empirical finding demonstrated on the collected pairs.

Core claim

The authors establish that audio-visual information from face and voice can be combined in a deep Siamese fusion network with contrastive loss to verify kinship relations, and that this approach yields significantly higher accuracy on the TALKIN dataset than uni-modal baselines or conventional early and late fusion methods. They further show that vocal information supplies complementary cues to facial information for this task.
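
To make the two conventional fusion baselines concrete, the sketch below contrasts early (feature-level) and late (score-level) fusion. It is an illustration under stated assumptions, not the paper's implementation: the function names, the plain concatenation, the fixed `face_weight`, and the toy values are all hypothetical.

```python
def early_fusion(face_feat, voice_feat):
    """Feature-level fusion: join both feature vectors before any classifier.

    Assumption: simple concatenation; other feature-level schemes exist.
    """
    return list(face_feat) + list(voice_feat)


def late_fusion(face_score, voice_score, face_weight=0.5):
    """Score-level fusion: combine the two per-modality similarity scores.

    Assumption: a fixed convex combination; face_weight is a hypothetical
    parameter that would normally be tuned on held-out data.
    """
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    return face_weight * face_score + (1.0 - face_weight) * voice_score


if __name__ == "__main__":
    # Toy inputs, not values from the TALKIN experiments.
    print(early_fusion([0.1, 0.2], [0.3]))
    print(late_fusion(1.0, 0.0, face_weight=0.25))
```

Early fusion lets a single model learn cross-modal interactions, while late fusion keeps the face and voice systems independent and only merges their decisions.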

What carries the argument

The deep Siamese fusion network with contrastive loss that merges face and voice feature streams for kinship verification.
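
As a rough illustration of this mechanism, the sketch below applies one shared projection to both members of a pair and scores the pair with a contrastive loss. It assumes the widely used margin-based form of that loss (half the squared distance for kin pairs, half the squared margin shortfall for non-kin pairs); the page does not state the paper's exact formulation, and `embed`, `pair_loss`, the linear projection, and `margin=1.0` are hypothetical stand-ins for the actual deep network.

```python
import math


def embed(features, weights):
    """Shared linear projection; the same weights serve both inputs (Siamese)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, is_kin, margin=1.0):
    """Margin-based contrastive loss (assumed form, see lead-in).

    Kin pairs are pulled together; non-kin pairs are pushed apart until
    their distance reaches the margin, after which they cost nothing.
    """
    d = euclidean_distance(emb_a, emb_b)
    if is_kin:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def pair_loss(feat_a, feat_b, weights, is_kin, margin=1.0):
    """Embed both fused feature vectors with shared weights, then score the pair."""
    return contrastive_loss(embed(feat_a, weights), embed(feat_b, weights),
                            is_kin, margin)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights, not trained values
    print(pair_loss([0.0, 0.0], [3.0, 4.0], identity, is_kin=True))
    print(pair_loss([0.0, 0.0], [3.0, 4.0], identity, is_kin=False))
```

At test time, a pair would be declared kin when the embedding distance falls below a threshold chosen on validation data.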

If this is right

  • Kinship verification accuracy improves when voice is added to face input on internet-quality video.
  • The TALKIN dataset serves as a benchmark that reveals the limitations of purely visual methods.
  • Contrastive-loss Siamese fusion outperforms both traditional feature-based and statistical multi-modal baselines.
  • Vocal cues remain useful even when visual conditions are uncontrolled.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion architecture could be tested on other pair-wise verification tasks such as speaker diarization or person re-identification where voice is available.
  • If voice cues prove stable across age or language, the method might extend to cross-generational or cross-lingual kinship checks.
  • Collecting controlled studio recordings of the same speaker pairs would isolate whether the reported gains depend on the noisy internet domain.

Load-bearing premise

That voice signals contain genuine kin-related cues that are independent of and additive to facial cues, and that the internet-sourced TALKIN videos provide an unbiased test of this complementarity.

What would settle it

Re-running the identical Siamese fusion architecture on a fresh, independently collected set of kin and non-kin talking pairs and obtaining no statistically significant accuracy lift from the addition of the audio stream would falsify the central claim.
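
One way such a check could be run is a paired significance test on per-pair outcomes. The sketch below uses an exact McNemar test, which is an assumption of this example rather than a procedure taken from the paper; the function name and the toy outcome lists are hypothetical.

```python
from math import comb


def mcnemar_exact(face_correct, fused_correct):
    """Two-sided exact McNemar p-value for two classifiers on the same pairs.

    Only discordant pairs matter: those where exactly one system is right.
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    b = sum(1 for f, g in zip(face_correct, fused_correct) if f and not g)
    c = sum(1 for f, g in zip(face_correct, fused_correct) if g and not f)
    n = b + c
    if n == 0:
        return 1.0  # the two systems never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy outcomes: fusion fixes 10 face-only errors and breaks nothing.
    face_only = [True] * 5 + [False] * 10
    fused = [True] * 15
    print(mcnemar_exact(face_only, fused))
```

A large p-value on a fresh set of pairs would indicate no detectable lift from the audio stream; a small one would support the complementarity claim on that data.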

Figures

Figures reproduced from arXiv: 1906.10096 by Eric Granger, Xiaoting Wu, Xiaoyi Feng.

Figure 1: The pipeline employed to collect and analyse the TALKIN database.
Figure 2: Kinship verification from a single modality. In 2(a) we determine …
Figure 3: Architecture of the proposed uni-modal methods. Both the face and voice modalities use the similar but specialized convolutional architectures trained …
Figure 4: Kinship verification from both face and voice modalities. We propose to fuse both visual information from face appearance and dynamics and vocal …
Figure 5: Architecture of the proposed deep Siamese fusion network. The facial …
Figure 6: ROC curves uni- and multi-modal techniques for kinship verification
Original abstract

Visual kinship verification entails confirming whether or not two individuals in a given pair of images or videos share a hypothesized kin relation. As a generalized face verification task, visual kinship verification is particularly difficult with low-quality found Internet data. Due to uncontrolled variations in background, pose, facial expression, blur, illumination and occlusion, state-of-the-art methods fail to provide high level of recognition accuracy. As with many other visual recognition tasks, kinship verification may benefit from combining visual and audio signals. However, voice-based kinship verification has received very little prior attention. We hypothesize that the human voice contains kin-related cues that are complementary to visual cues. In this paper, we address, for the first time, the use of audio-visual information from face and voice modalities to perform kinship verification. We first propose a new multi-modal kinship dataset, called TALking KINship (TALKIN), that contains several pairs of Internet-quality video sequences. Using TALKIN, we then study the utility of various kinship verification methods including traditional local feature based methods, statistical methods and more recent deep learning approaches. Then, early and late fusion methods are evaluated on the TALKIN dataset for the study of kinship verification with both face and voice modalities. Finally, we propose a deep Siamese fusion network with contrastive loss for multi-modal fusion of kinship relations. Extensive experiments on the TALKIN dataset indicate that by combining face and voice modalities, the proposed Siamese network can provide a significantly higher level of accuracy compared to baseline uni-modal and multi-modal fusion techniques. Experimental results also indicate that audio (vocal) information is complementary (to facial information) and useful for kinship verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the TALKIN dataset of internet-quality video pairs for kinship verification, evaluates uni-modal and multi-modal methods including local features, statistical approaches, and deep learning, and proposes a deep Siamese fusion network with contrastive loss. It claims that combining face and voice modalities via this network yields significantly higher accuracy than baselines on TALKIN, establishing that vocal cues are complementary to facial ones for kinship verification.

Significance. If the empirical results prove robust after proper statistical controls and bias checks, the work would constitute the first systematic exploration of audio-visual kinship verification and introduce a new multi-modal dataset. This could open a direction in biometrics where voice provides independent kinship signals. However, the absence of any reported quantitative results, error bars, ablations, or dataset validation in the manuscript makes the complementarity claim difficult to evaluate at present.

major comments (3)
  1. [Abstract] The central claim that the proposed Siamese network 'can provide a significantly higher level of accuracy' and that 'audio (vocal) information is complementary' is presented without any numerical results, error bars, statistical significance tests, dataset size/statistics, or ablation details. This renders the load-bearing empirical assertion unverifiable from the manuscript and leaves open the possibility that reported gains arise from dataset artifacts rather than true modality complementarity.
  2. [Abstract] The hypothesis that 'the human voice contains kin-related cues that are complementary to visual cues' is asserted without citation to prior evidence, independent validation, or controls for confounds (e.g., shared recording conditions, demographics, or selection effects in the internet-sourced TALKIN pairs). Because the entire contribution rests on this untested assumption, the lack of such grounding is a load-bearing gap.
  3. [Abstract, TALKIN dataset description] No details are provided on collection protocol, pair selection criteria, demographic balance, or quality controls for the 'several pairs of Internet-quality video sequences.' Without these, it is impossible to assess whether the dataset is representative or whether fusion gains could be explained by spurious correlations rather than kinship cues.
minor comments (1)
  1. [Abstract] The expansion of the TALKIN acronym ('TALking KINship') contains inconsistent capitalization that should be standardized.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions that will be incorporated into the next version of the manuscript.

  1. Referee: [Abstract] The central claim that the proposed Siamese network 'can provide a significantly higher level of accuracy' and that 'audio (vocal) information is complementary' is presented without any numerical results, error bars, statistical significance tests, dataset size/statistics, or ablation details. This renders the load-bearing empirical assertion unverifiable from the manuscript and leaves open the possibility that reported gains arise from dataset artifacts rather than true modality complementarity.

    Authors: We agree that the abstract would be strengthened by including key quantitative indicators. The full manuscript reports extensive experimental results with accuracy comparisons across uni-modal and multi-modal methods, ablations, and dataset statistics. We will revise the abstract to incorporate specific accuracy figures, dataset size, and references to the relevant tables and figures in the experimental section. Error bars and any statistical tests will be emphasized in the main text.
    Revision: yes

  2. Referee: [Abstract] The hypothesis that 'the human voice contains kin-related cues that are complementary to visual cues' is asserted without citation to prior evidence, independent validation, or controls for confounds (e.g., shared recording conditions, demographics, or selection effects in the internet-sourced TALKIN pairs). Because the entire contribution rests on this untested assumption, the lack of such grounding is a load-bearing gap.

    Authors: The hypothesis is motivated by the broader literature on heritable vocal traits in biometrics. We will add citations to relevant studies on voice heritability and include a discussion of potential confounds such as shared recording conditions and demographics. We will also report additional controls and analyses in the experiments to address selection effects in the TALKIN pairs.
    Revision: yes

  3. Referee: [Abstract, TALKIN dataset description] No details are provided on collection protocol, pair selection criteria, demographic balance, or quality controls for the 'several pairs of Internet-quality video sequences.' Without these, it is impossible to assess whether the dataset is representative or whether fusion gains could be explained by spurious correlations rather than kinship cues.

    Authors: We agree that expanded dataset documentation is required. The manuscript contains an introduction to TALKIN, but we will add a detailed subsection covering the collection protocol from internet sources, pair selection criteria, demographic balance, and quality controls applied to the video sequences. This will allow readers to better evaluate potential biases or spurious correlations.
    Revision: yes

Circularity Check

0 steps flagged

No circularity; experimental comparison on new dataset is self-contained

The paper introduces the TALKIN dataset and evaluates uni-modal, multi-modal, and a proposed Siamese fusion network via direct accuracy comparisons on that dataset. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claim rests on empirical results rather than any derivation that reduces to its own inputs by construction. The hypothesis about complementary cues is stated explicitly but is tested experimentally rather than assumed as a mathematical premise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, background axioms, or newly postulated entities; the central claim rests on the unstated assumption that the collected video pairs contain measurable kin cues in both modalities.
