Audio-Visual Kinship Verification

Eric Granger; Xiaoting Wu; Xiaoyi Feng

arxiv: 1906.10096 · v1 · pith:EPRIZN6Bnew · submitted 2019-06-24 · 💻 cs.CV

Xiaoting Wu, Eric Granger, Xiaoyi Feng

Pith reviewed 2026-05-25 17:20 UTC · model grok-4.3

keywords kinship verification · audio-visual fusion · Siamese network · face recognition · voice recognition · multi-modal learning · TALKIN dataset · contrastive loss

The pith

A Siamese network fusing face and voice modalities raises kinship verification accuracy on low-quality internet videos above uni-modal baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that human voice carries kin-related cues that are complementary to facial appearance, allowing multi-modal fusion to overcome the difficulties of uncontrolled pose, lighting, blur and occlusion in found video data. It introduces the TALKIN dataset of talking pairs and evaluates several verification approaches before proposing a deep Siamese network trained with contrastive loss that performs early or late fusion of the two streams. Experiments indicate this network outperforms both single-modality methods and standard fusion techniques on the new data. If correct, the result would mean kinship verification systems can become more reliable without requiring higher-quality visual input alone. The work treats the complementarity of voice and face as an empirical finding demonstrated on the collected pairs.
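To make the early-versus-late distinction concrete, here is a minimal sketch in Python. It assumes each person in a pair is already represented by a face embedding and a voice embedding (plain lists here). The cosine scorer, the function names, the `w_face` weight and the toy vectors are illustrative assumptions, not the paper's baselines or its network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def early_fusion_score(face_a, voice_a, face_b, voice_b):
    """Feature-level fusion: concatenate the modalities, then compare once."""
    return cosine(face_a + voice_a, face_b + voice_b)


def late_fusion_score(face_a, voice_a, face_b, voice_b, w_face=0.5):
    """Score-level fusion: compare each modality, then mix the two scores."""
    s_face = cosine(face_a, face_b)
    s_voice = cosine(voice_a, voice_b)
    return w_face * s_face + (1.0 - w_face) * s_voice


# Hypothetical toy embeddings: faces agree, voices are orthogonal.
face_a, voice_a = [2.0, 0.0], [0.0, 1.0]
face_b, voice_b = [2.0, 0.0], [1.0, 0.0]

early = early_fusion_score(face_a, voice_a, face_b, voice_b)
late = late_fusion_score(face_a, voice_a, face_b, voice_b)
print(f"early fusion: {early:.2f}, late fusion: {late:.2f}")
```

With these toy vectors the early-fusion score is 0.80 and the late-fusion score is 0.50: concatenation lets the larger-magnitude face features dominate, while score averaging weights the two modalities equally.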

Core claim

The authors establish that audio-visual information from face and voice can be combined in a deep Siamese fusion network with contrastive loss to verify kinship relations, and that this approach yields significantly higher accuracy on the TALKIN dataset than uni-modal baselines or conventional early and late fusion methods. They further show that vocal information supplies complementary cues to facial information for this task.

What carries the argument

The deep Siamese fusion network with contrastive loss that merges face and voice feature streams for kinship verification.
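As a rough illustration of the training signal, the sketch below computes a commonly used form of the contrastive loss on a pair of fused embeddings. The abstract does not state which variant the authors use, so the 0.5 scaling, the `margin` value, the concatenation in `fuse`, and the toy embeddings are all assumptions; in a real Siamese network the embeddings would come from two weight-sharing branches rather than being given directly.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fuse(face, voice):
    """Assumed fusion step: plain concatenation of the two embeddings."""
    return face + voice


def contrastive_loss(emb_a, emb_b, is_kin, margin=1.0):
    """Contrastive loss for one pair (assumed form).

    kin pair:      0.5 * d^2                  (pull the pair together)
    non-kin pair:  0.5 * max(0, margin - d)^2 (push apart up to the margin)
    """
    d = euclidean(emb_a, emb_b)
    if is_kin:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical fused embeddings for the two people in a pair.
person_a = fuse([0.2, 0.1], [0.4])
person_b = fuse([0.2, 0.1], [0.1])

loss_if_kin = contrastive_loss(person_a, person_b, is_kin=True)
loss_if_not_kin = contrastive_loss(person_a, person_b, is_kin=False)
print(loss_if_kin, loss_if_not_kin)
```

The two embeddings sit about 0.3 apart, so the pair costs about 0.045 if labelled kin and about 0.245 if labelled non-kin. A non-kin pair already farther apart than the margin contributes zero loss.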

If this is right

Kinship verification accuracy improves when voice is added to face input on internet-quality video.
The TALKIN dataset serves as a benchmark that reveals the limitations of purely visual methods.
Contrastive-loss Siamese fusion outperforms both traditional feature-based and statistical multi-modal baselines.
Vocal cues remain useful even when visual conditions are uncontrolled.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion architecture could be tested on other pair-wise verification tasks such as speaker diarization or person re-identification where voice is available.
If voice cues prove stable across age or language, the method might extend to cross-generational or cross-lingual kinship checks.
Collecting controlled studio recordings of the same speaker pairs would isolate whether the reported gains depend on the noisy internet domain.

Load-bearing premise

That voice signals contain genuine kin-related cues that are independent of and additive to facial cues, and that the internet-sourced TALKIN videos provide an unbiased test of this complementarity.

What would settle it

Re-running the identical Siamese fusion architecture on a fresh, independently collected set of kin and non-kin talking pairs and obtaining no statistically significant accuracy lift from the addition of the audio stream would falsify the central claim.
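One way to run the check described above is a paired test on per-pair outcomes. The sketch below uses an exact McNemar test, which is a choice made here for illustration and is not something the paper or this review prescribes. The function name and the toy correctness flags are hypothetical.

```python
from math import comb


def mcnemar_exact(face_only_correct, fused_correct):
    """Two-sided exact McNemar test on paired per-pair correctness flags.

    Returns (b, c, p_value), where b counts pairs only the face model got
    right and c counts pairs only the fused model got right.
    """
    pairs = list(zip(face_only_correct, fused_correct))
    b = sum(1 for f, m in pairs if f and not m)
    c = sum(1 for f, m in pairs if m and not f)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models never disagree
    # Under the null, disagreements split like fair coin flips.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes on 15 test pairs (True = verified correctly).
face_only = [True] * 5 + [False] * 10
fused = [True] * 5 + [True] * 8 + [False] * 2

b, c, p_value = mcnemar_exact(face_only, fused)
print(b, c, p_value)
```

Only the pairs on which the two systems disagree carry information here. A large p-value on a fresh dataset would correspond to the "no statistically significant lift" outcome described above.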

Figures

Figures reproduced from arXiv: 1906.10096 by Eric Granger, Xiaoting Wu, Xiaoyi Feng.

**Figure 2.** Kinship verification from a single modality. In 2(a) we determine …

**Figure 3.** Architecture of the proposed uni-modal methods. Both the face and voice modalities use the similar but specialized convolutional architectures trained …

**Figure 4.** Kinship verification from both face and voice modalities. We propose to fuse both visual information from face appearance and dynamics and vocal …

**Figure 5.** Architecture of the proposed deep Siamese fusion network. The facial …

**Figure 6.** ROC curves uni- and multi-modal techniques for kinship verification …

Original abstract

Visual kinship verification entails confirming whether or not two individuals in a given pair of images or videos share a hypothesized kin relation. As a generalized face verification task, visual kinship verification is particularly difficult with low-quality found Internet data. Due to uncontrolled variations in background, pose, facial expression, blur, illumination and occlusion, state-of-the-art methods fail to provide high level of recognition accuracy. As with many other visual recognition tasks, kinship verification may benefit from combining visual and audio signals. However, voice-based kinship verification has received very little prior attention. We hypothesize that the human voice contains kin-related cues that are complementary to visual cues. In this paper, we address, for the first time, the use of audio-visual information from face and voice modalities to perform kinship verification. We first propose a new multi-modal kinship dataset, called TALking KINship (TALKIN), that contains several pairs of Internet-quality video sequences. Using TALKIN, we then study the utility of various kinship verification methods including traditional local feature based methods, statistical methods and more recent deep learning approaches. Then, early and late fusion methods are evaluated on the TALKIN dataset for the study of kinship verification with both face and voice modalities. Finally, we propose a deep Siamese fusion network with contrastive loss for multi-modal fusion of kinship relations. Extensive experiments on the TALKIN dataset indicate that by combining face and voice modalities, the proposed Siamese network can provide a significantly higher level of accuracy compared to baseline uni-modal and multi-modal fusion techniques. Experimental results also indicate that audio (vocal) information is complementary (to facial information) and useful for kinship verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

First audio-visual kinship paper with a new dataset and Siamese fusion network, but the gains rest on untested assumptions about voice cues and dataset quality.

The main thing here is that the authors are the first to apply audio-visual fusion to kinship verification. They release the TALKIN dataset of internet videos and test a contrastive Siamese network alongside standard early and late fusion baselines, reporting better accuracy when both modalities are used. That is a reasonable next step from existing visual kinship work and from multi-modal methods in other tasks. They correctly note that visual-only methods struggle with low-quality found data and that voice might carry complementary family resemblance signals. The survey of local features, statistical methods, and deep approaches on their data is straightforward and useful as a baseline comparison.

The soft spots are in the evaluation. The abstract claims significantly higher accuracy but supplies no numbers, error bars, significance tests, or ablation results on the fusion components. More importantly, there is no discussion of possible confounds in TALKIN: shared recording conditions, demographics, or selection effects could produce apparent complementarity without voice actually encoding independent kinship information. The hypothesis is stated cleanly but not backed by prior evidence or controls.

This paper is mainly for people already working on kinship verification or multi-modal biometrics who want a new dataset and an initial fusion architecture to build on. It is worth sending to peer review so the experimental details and dataset can be examined properly.

Referee Report

3 major / 1 minor

Summary. The paper introduces the TALKIN dataset of internet-quality video pairs for kinship verification, evaluates uni-modal and multi-modal methods including local features, statistical approaches, and deep learning, and proposes a deep Siamese fusion network with contrastive loss. It claims that combining face and voice modalities via this network yields significantly higher accuracy than baselines on TALKIN, establishing that vocal cues are complementary to facial ones for kinship verification.

Significance. If the empirical results prove robust after proper statistical controls and bias checks, the work would constitute the first systematic exploration of audio-visual kinship verification and introduce a new multi-modal dataset. This could open a direction in biometrics where voice provides independent kinship signals. However, the absence of any reported quantitative results, error bars, ablations, or dataset validation in the manuscript makes the complementarity claim difficult to evaluate at present.

major comments (3)

[Abstract] The central claim that the proposed Siamese network 'can provide a significantly higher level of accuracy' and that 'audio (vocal) information is complementary' is presented without any numerical results, error bars, statistical significance tests, dataset size/statistics, or ablation details. This renders the load-bearing empirical assertion unverifiable from the manuscript and leaves open the possibility that reported gains arise from dataset artifacts rather than true modality complementarity.
[Abstract] The hypothesis that 'the human voice contains kin-related cues that are complementary to visual cues' is asserted without citation to prior evidence, independent validation, or controls for confounds (e.g., shared recording conditions, demographics, or selection effects in the internet-sourced TALKIN pairs). Because the entire contribution rests on this untested assumption, the lack of such grounding is a load-bearing gap.
[Abstract, TALKIN dataset description] No details are provided on collection protocol, pair selection criteria, demographic balance, or quality controls for the 'several pairs of Internet-quality video sequences.' Without these, it is impossible to assess whether the dataset is representative or whether fusion gains could be explained by spurious correlations rather than kinship cues.
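The error bars requested in the major comments could be produced in several ways. As one hedged example, the sketch below computes a percentile bootstrap confidence interval for verification accuracy from per-pair correctness flags. The function name, the resample count, the seed and the toy data are assumptions for illustration; nothing here reflects the paper's actual evaluation protocol or results.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    `correct` holds one 0/1 flag per test pair. Returns (accuracy, low, high).
    """
    rng = random.Random(seed)
    n = len(correct)
    # Resample the test pairs with replacement and recompute accuracy each time.
    accs = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = accs[int((alpha / 2) * n_resamples)]
    high = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, low, high


# Hypothetical outcomes: 80 of 100 test pairs verified correctly.
flags = [1] * 80 + [0] * 20
acc, low, high = bootstrap_accuracy_ci(flags)
print(f"accuracy = {acc:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Reporting such an interval for each uni-modal and fused system would let readers judge whether the intervals overlap, although a paired test on the same pairs is the stronger comparison.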

minor comments (1)

[Abstract] The expansion of the TALKIN acronym ('TALking KINship') contains inconsistent capitalization that should be standardized.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions that will be incorporated into the next version of the manuscript.

Referee: [Abstract] The central claim that the proposed Siamese network 'can provide a significantly higher level of accuracy' and that 'audio (vocal) information is complementary' is presented without any numerical results, error bars, statistical significance tests, dataset size/statistics, or ablation details. This renders the load-bearing empirical assertion unverifiable from the manuscript and leaves open the possibility that reported gains arise from dataset artifacts rather than true modality complementarity.

Authors: We agree that the abstract would be strengthened by including key quantitative indicators. The full manuscript reports extensive experimental results with accuracy comparisons across uni-modal and multi-modal methods, ablations, and dataset statistics. We will revise the abstract to incorporate specific accuracy figures, dataset size, and references to the relevant tables and figures in the experimental section. Error bars and any statistical tests will be emphasized in the main text.

revision: yes

Referee: [Abstract] The hypothesis that 'the human voice contains kin-related cues that are complementary to visual cues' is asserted without citation to prior evidence, independent validation, or controls for confounds (e.g., shared recording conditions, demographics, or selection effects in the internet-sourced TALKIN pairs). Because the entire contribution rests on this untested assumption, the lack of such grounding is a load-bearing gap.

Authors: The hypothesis is motivated by the broader literature on heritable vocal traits in biometrics. We will add citations to relevant studies on voice heritability and include a discussion of potential confounds such as shared recording conditions and demographics. We will also report additional controls and analyses in the experiments to address selection effects in the TALKIN pairs.

revision: yes

Referee: [Abstract, TALKIN dataset description] No details are provided on collection protocol, pair selection criteria, demographic balance, or quality controls for the 'several pairs of Internet-quality video sequences.' Without these, it is impossible to assess whether the dataset is representative or whether fusion gains could be explained by spurious correlations rather than kinship cues.

Authors: We agree that expanded dataset documentation is required. The manuscript contains an introduction to TALKIN, but we will add a detailed subsection covering the collection protocol from internet sources, pair selection criteria, demographic balance, and quality controls applied to the video sequences. This will allow readers to better evaluate potential biases or spurious correlations.

revision: yes

Circularity Check

0 steps flagged

No circularity; experimental comparison on new dataset is self-contained

The paper introduces the TALKIN dataset and evaluates uni-modal, multi-modal, and a proposed Siamese fusion network via direct accuracy comparisons on that dataset. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claim rests on empirical results rather than any derivation that reduces to its own inputs by construction. The hypothesis about complementary cues is stated explicitly but is tested experimentally rather than assumed as a mathematical premise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities; the central claim rests on the unstated assumption that the collected video pairs contain measurable kin cues in both modalities.


[1] [1]

Kin recognition signals in adult faces,

L. M. DeBruine, F. G. Smith, B. C. Jones, S. C. Roberts, M. Petrie, and T. D. Spector, “Kin recognition signals in adult faces,” Vision research, vol. 49, no. 1, pp. 38–43, 2009

work page 2009

[2] [2]

Lateralization of kin recognition signals in the human face,

M. DalMartello and L. Maloney, “Lateralization of kin recognition signals in the human face,” Journal of vision, vol. 10, no. 8, p. 9, 2010

work page 2010

[3] [3]

Where are kin recognition signals in the human face?

M. Dal-Martello and L. Maloney, “Where are kin recognition signals in the human face?” Journal of Vision, vol. 6, no. 12, p. 2, 2006

work page 2006

[4] H. Wu, S. Yang, S. Sun, C. Liu, and Y.-J. Luo, “The male advantage in child facial resemblance detection: Behavioral and erp evidence,” Social neuroscience, vol. 8, no. 6, pp. 555–567, 2013.

[5] R. Fang, K. D. Tang, N. Snavely, and T. Chen, “Towards computational models of kinship verification,” in ICIP 2010.

[6] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, and J. Zhou, “Neighborhood repulsed metric learning for kinship verification,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 2, pp. 331–345, 2014.

[7] N. Kohli, D. Yadav, M. Vatsa, R. Singh, and A. Noore, “Supervised mixed norm autoencoder for kinship verification in unconstrained videos,” IEEE Transactions on Image Processing, 2018.

[8] A. Ariyaeeinia, C. Morrison, A. Malegaonkar, and S. Black, “A test of the effectiveness of speaker verification for differentiating between identical twins,” Science & Justice: Journal of the Forensic Science Society, vol. 48, no. 4, pp. 182–186, Dec. 2008.

[9] H. Künzel, “Automatic Speaker Recognition of Identical Twins,” International Journal of Speech Language and the Law, vol. 17, no. 2, Feb. 2011. [Online]. Available: http://www.equinoxjournals.com/IJSLL/article/view/7829

[10] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017.

[11] A. Chowdhury, Y. Atoum, L. Tran, X. Liu, and A. Ross, “Msu-avis dataset: Fusing face and voice modalities for biometric recognition in indoor surveillance videos,” in ICPR 2018. IEEE, 2018.

[12] J. Liu, Z. Yuan, X. Wang, and C. Wang, “Towards good practices for multi-modal fusion in large-scale video classification,” arXiv preprint arXiv:1809.05848, 2018.

[13] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, “Moddrop: adaptive multi-modal gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692–1706, 2016.

[14] X. Wu, E. Granger, T. Kinnunen, X. Feng, and A. Hadid, “Audio-visual kinship verification in the wild,” in ICB 2019.

[15] S. Xia, M. Shao, and Y. Fu, “Kinship verification through transfer learning,” in IJCAI 2011.

[16] M. Shao, S. Xia, and Y. Fu, “Genealogical face recognition based on ub kinface database,” in CVPRw 2011.

[17] S. Xia, M. Shao, J. Luo, and Y. Fu, “Understanding kin relationships in a photo,” Multimedia, IEEE Transactions on, vol. 14, no. 4, pp. 1046–1056, 2012.

[18] H. Dibeklioğlu, A. Salah, and T. Gevers, “Are you really smiling at me? spontaneous versus posed enjoyment smiles,” in ECCV 2012.

[19] H. Dibeklioğlu, A. Salah, and T. Gevers, “Like father, like son: Facial expression dynamics for kinship verification,” in ICCV 2013.

[20] X. Qin, X. Tan, and S. Chen, “Tri-subject kinship verification: Understanding the core of a family,” arXiv preprint arXiv:1501.02555, 2015.

[21] H. Yan and J. Hu, “Video-based kinship verification using distance metric learning,” Pattern Recognition, vol. 75, pp. 15–24, 2018.

[22] J. P. Robinson, M. Shao, Y. Wu, H. Liu, T. Gillis, and Y. Fu, “Visual kinship recognition of families in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[23] J. Lu, J. Hu, X. Zhou, J. Zhou, M. Castrillón-Santana, J. Lorenzo-Navarro, L. Kou, Y. Shang, A. Bottino, and T. Figueiredo Vieira, “Kinship verification in the wild: The first kinship verification competition,” in IJCB 2014.

[24] J. Lu, J. Hu, V. E. Liong, X. Zhou, A. Bottino, I. U. Islam, T. F. Vieira, X. Qin, X. Tan, S. Chen et al., “The fg 2015 kinship verification in the wild evaluation,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1. IEEE, 2015, pp. 1–7.

[25] J. P. Robinson, M. Shao, H. Zhao, Y. Wu, T. Gillis, and Y. Fu, “Rfiw: Large-scale kinship recognition challenge,” 2017, pp. 1971–1973.

[26] J. P. Robinson, M. Shao, H. Zhao, Y. Wu, T. Gillis, and Y. Fu, “Recognizing families in the wild (rfiw): Data challenge workshop in conjunction with acm mm 2017,” in WRFW 2017.

[27] H. Yan, J. Lu, and X. Zhou, “Prototype-based discriminative feature learning for kinship verification,” IEEE Transactions on Cybernetics, vol. 45, no. 11, pp. 2535–2545, Nov 2015.

[28] X. Wu, E. Boutellaa, M. B. López, X. Feng, and A. Hadid, “On the usefulness of color for kinship verification from face images,” in WIFS 2016.

[29] H. Liu and C. Zhu, “Status-aware projection metric learning for kinship verification,” in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 319–324.

[30] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org

[31] K. Zhang, Y. Huang, C. Song, H. Wu, and L. Wang, “Kinship verification with deep convolutional neural networks,” in BMVC 2015.

[32] J. Lu, J. Hu, and Y.-P. Tan, “Discriminative deep metric learning for face and kinship verification,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4269–4282, 2017.

[33] H. Yan, J. Lu, W. Deng, and X. Zhou, “Discriminative multimetric learning for kinship verification,” Information Forensics and Security, IEEE Transactions on, vol. 9, no. 7, pp. 1169–1178, 2014.

[34] G. Peleg, G. Katzir, O. Peleg, M. Kamara, L. Brodsky, H. Hel-Or, D. Keren, and E. Nevo, “Hereditary family signature of facial expression,” Proceedings of the National Academy of Sciences, vol. 103, no. 43, pp. 15 921–15 926, 2006.

[35] E. Boutellaa, M. Bordallo, S. Ait-Aoudia, X. Feng, and A. Hadid, “Kinship verification from videos using texture spatio-temporal features and deep learning features,” in International Conference on Biometrics (ICB’16), 2016.

[36] A. Rosenberg, “Listener performance in speaker verification tasks,” IEEE Transactions on Audio and Electroacoustics, vol. 21, no. 3, pp. 221–225, Jun. 1973.

[37] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models,” Digital Signal Processing, vol. 10, no. 1, pp. 19–41, Jan. 2000. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1051200499903615

[38] D. Zuo and P. P. K. Mok, “Formant dynamics of bilingual identical twins,” Journal of Phonetics, vol. 52, pp. 1–12, Sep. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0095447015000182

[40] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[41] N. M. Correa, T. Adali, Y.-O. Li, and V. D. Calhoun, “Canonical correlation analysis for data fusion and group inferences,” IEEE signal processing magazine, vol. 27, no. 4, pp. 39–50, 2010.

[42] Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in ICCV 2017.

[43] M. B. López, E. Boutellaa, and A. Hadid, “Comments on the ‘kinship face in the wild’ data sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[44] M. Dawson, A. Zisserman, and C. Nellåker, “From same photo: Cheating on visual kinship challenges,” arXiv preprint arXiv:1809.06200, 2018.

[45] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, Oct 2016.

[46] J. Kannala and E. Rahtu, “BSIF: Binarized statistical image features,” in ICPR 2012.

[47] V. Ojansivu and J. Heikkilä, “Blur insensitive texture classification using local phase quantization,” in Image and Signal Processing, vol. 5099, 2008, pp. 236–243.

[48] T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on featured distributions,” Pattern recognition, vol. 29, no. 1, pp. 51–59, 1996.

[49] T. Ahonen, A. Hadid, and M. Pietikäinen, “Face description with local binary patterns: Application to face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.

[50] G. Zhao and M. Pietikäinen, “Dynamic texture recognition using local binary patterns with an application to facial expressions,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 6, pp. 915–928, 2007.

[51] A. Hyvärinen, J. Hurri, and P. O. Hoyer, Natural image statistics: A probabilistic approach to early computational vision. Springer Science & Business Media, 2009, vol. 39.

[52] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conf., 2015.

[53] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[54] L. Li, X. Feng, X. Wu, Z. Xia, and A. Hadid, “Kinship verification from faces via similarity metric based convolutional neural network,” in International Conference Image Analysis and Recognition. Springer, 2016, pp. 539–548.

[55] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[56] P. Kenny, “A small footprint i-vector extractor,” in Odyssey 2012: The Speaker and Language Recognition Workshop, Singapore, June 25-28, 2012, 2012, pp. 1–6. [Online]. Available: http://www.isca-speech.org/archive/odyssey_2012/od12_001.html

[57] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in Interspeech 2018.

[58] S. O. Sadjadi, M. Slaney, and L. Heck, “Msr identity toolbox v1.0: A matlab toolbox for speaker recognition research,” Tech. Rep., September 2013. [Online]. Available: https://www.microsoft.com/en-us/research/publication/msr-identity-toolbox-v1-0-a-matlab-toolbox-for-speaker-recognition-research-2/

[59] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous systems, 2015,” Software available from tensorflow.org, vol. 1, no. 2, 2015.

[60] A. Vedaldi and K. Lenc, “Matconvnet – convolutional neural networks for matlab,” in Proceedings of the ACM Int. Conf. on Multimedia, 2015.

Xiaoting Wu is currently working toward the Ph.D. degree in the Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland and School of Electronics and Information, Northwestern Polytechnical Uni...