arxiv: 1907.10081 · v1 · pith:KQW4HBWN · submitted 2019-07-23 · 💻 cs.CV

Multimodal Age and Gender Classification Using Ear and Profile Face Images

Pith reviewed 2026-05-24 17:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal biometrics · age classification · gender classification · ear images · profile face images · deep neural networks · fusion strategies · domain adaptation

The pith

Multimodal deep networks fusing ear and profile face images achieve higher age and gender classification accuracy than single-modality approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops end-to-end deep neural network frameworks that accept both a profile face image and an ear image to classify age and gender. It tests fusion at the data, feature, and score levels while adding domain adaptation and center loss to strengthen feature learning. Experiments on the UND-F, UND-J2, and FERET datasets show that profile faces already carry substantial age and gender cues, yet adding ear images produces higher accuracies that surpass prior single-modality methods. A sympathetic reader would care because the work targets practical soft-biometric extraction from side-view images where full frontal views may be unavailable. The central effort is to demonstrate that the two image types together yield better discrimination than either source alone.

Core claim

The authors establish that end-to-end multimodal deep neural network frameworks taking profile face and ear images as input, combined through data, feature, and score level fusion and strengthened by domain adaptation and center loss, attain very high age and gender classification accuracies on the UND-F, UND-J2, and FERET datasets while outperforming state-of-the-art methods based on profile face images or ear images alone.

What carries the argument

End-to-end multimodal deep learning frameworks that perform data, feature, and score level fusion of paired profile face and ear images, augmented by domain adaptation and center loss.
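
To make the three fusion levels concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the channel-stacking form of data-level fusion, the concatenation form of feature-level fusion, and the weighted-average form of score-level fusion are all assumptions chosen for illustration, and the toy inputs are made up.

```python
import math

def softmax(logits):
    """Softmax over one list of logits, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def data_level_fusion(face_img, ear_img):
    """Assumed variant: stack the two images along the channel axis.

    Images are nested lists indexed as [row][col][channel].
    """
    if len(face_img) != len(ear_img) or any(
        len(fr) != len(er) for fr, er in zip(face_img, ear_img)
    ):
        raise ValueError("images must have the same height and width")
    return [[fp + ep for fp, ep in zip(fr, er)]
            for fr, er in zip(face_img, ear_img)]

def feature_level_fusion(face_feat, ear_feat):
    """Assumed variant: concatenate the feature vectors of the two branches."""
    return list(face_feat) + list(ear_feat)

def score_level_fusion(face_logits, ear_logits, face_weight=0.5):
    """Assumed variant: weighted average of per-branch class probabilities."""
    p_face = softmax(face_logits)
    p_ear = softmax(ear_logits)
    return [face_weight * f + (1.0 - face_weight) * e
            for f, e in zip(p_face, p_ear)]

# Toy 1x2 RGB images (hypothetical values).
face_img = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
ear_img = [[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]]
stacked = data_level_fusion(face_img, ear_img)       # 6 channels per pixel

# Toy feature vectors and two-class logits (hypothetical values).
fused_feat = feature_level_fusion([0.2, 0.7], [0.5, 0.1])
fused_probs = score_level_fusion([2.0, 0.0], [0.0, 1.0])
predicted_class = fused_probs.index(max(fused_probs))
```

In a real network the data-level output would feed a single model, the feature-level output would feed a shared classifier head, and the score-level output would be used directly for the decision.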

If this is right

  • Profile face images alone contain a rich source of information for age and gender classification.
  • The multimodal system using both ear and profile face images reaches superior results compared to single-modality baselines.
  • Domain adaptation and center loss improve the representation and discrimination capability of the networks.
  • Extensive tests on three standard datasets confirm very high classification accuracies.
  • The multimodal approach beats prior state-of-the-art methods that use only profile faces or only ears.
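
The center loss mentioned above can be sketched as follows, using the standard formulation from Wen et al. (2016): half the sum of squared distances between each feature vector and the center of its class, added to the softmax loss with a weight. The function names, the weight value `lam`, and the toy numbers are assumptions for illustration; the paper's own hyperparameters are not given here.

```python
def center_loss(features, labels, centers):
    """0.5 * sum_i ||x_i - c_{y_i}||^2 over a batch.

    features: list of feature vectors, labels: list of class indices,
    centers: list of class-center vectors (one per class).
    """
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return 0.5 * total

def joint_loss(softmax_loss_value, features, labels, centers, lam=0.01):
    """Softmax loss plus lam times center loss (lam is an assumed value)."""
    return softmax_loss_value + lam * center_loss(features, labels, centers)

# Toy batch: two samples, two classes, 2-D features (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
centers = [[0.0, 0.0], [0.0, 1.0]]

lc = center_loss(features, labels, centers)   # first sample is 1 unit from its center
total = joint_loss(0.7, features, labels, centers, lam=0.01)
```

Some implementations average over the batch instead of summing; that only rescales the weight on the center term.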

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion strategy could be tested on other soft-biometric attributes such as ethnicity estimation from side views.
  • If the gains persist under domain shift, the approach may help surveillance systems that capture only profile images.
  • Alignment of ear and face regions across different cameras or resolutions would be a direct next step to check robustness.

Load-bearing premise

The ear images supply genuinely complementary information about age and gender that the profile face does not already provide.

What would settle it

A profile-face-only model trained and tested on identical data splits that matches or exceeds the multimodal accuracies would show that the ear modality adds no real value.
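
One way to run such a comparison is a paired test on the per-sample predictions of both models over the same test set. The sketch below uses an exact McNemar test, which is a choice made here for illustration and not something the paper reports; the function name and the label lists are hypothetical.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (only_a_correct, only_b_correct, p_value). Only samples where
    exactly one model is correct carry information about the difference.
    """
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b)
                 if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b)
                 if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2.0 * tail)

# Made-up labels for ten test images; not results from the paper.
y_true = [1] * 10
pred_multimodal = [1] * 10
pred_face_only = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

b, c, p = mcnemar_exact(y_true, pred_multimodal, pred_face_only)
```

A small p-value would suggest the two models genuinely differ on this split; a large one would be consistent with the ear modality adding little.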

Figures

Figures reproduced from arXiv: 1907.10081 by Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel.

Figure 1: Overview of the multimodal, multitask age and gender …
Figure 2: Multimodal fusion methods. (a) presents employed three different data fusion methods. In the first one, named as intensity fusion, …
Figure 3: Visualization of employed data fusion approaches. (a) …
Original abstract

In this paper, we present multimodal deep neural network frameworks for age and gender classification, which take input a profile face image as well as an ear image. Our main objective is to enhance the accuracy of soft biometric trait extraction from profile face images by additionally utilizing a promising biometric modality: ear appearance. For this purpose, we provided end-to-end multimodal deep learning frameworks. We explored different multimodal strategies by employing data, feature, and score level fusion. To increase representation and discrimination capability of the deep neural networks, we benefited from domain adaptation and employed center loss besides softmax loss. We conducted extensive experiments on the UND-F, UND-J2, and FERET datasets. Experimental results indicated that profile face images contain a rich source of information for age and gender classification. We found that the presented multimodal system achieves very high age and gender classification accuracies. Moreover, we attained superior results compared to the state-of-the-art profile face image or ear image-based age and gender classification methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents multimodal deep neural network frameworks for age and gender classification that combine profile face images with ear images. It explores data, feature, and score level fusion strategies, incorporates center loss and domain adaptation, and evaluates on the UND-F, UND-J2, and FERET datasets, claiming superior performance over state-of-the-art single-modality methods.

Significance. If the multimodal fusion demonstrably provides complementary information leading to statistically significant improvements, the work would be of interest to the biometrics and computer vision community as it highlights the potential of ear images to enhance profile face-based soft biometrics. The use of multiple fusion strategies and loss functions is a positive aspect, but the absence of detailed experimental protocols reduces the potential impact.

major comments (3)
  1. [Abstract] Abstract: The claim of 'extensive experiments' and 'superior results' is not supported by any reported error bars, exact data splits, ablation details, or statistical tests, making it impossible to verify the superiority claims or assess whether gains exceed what single-modality baselines achieve.
  2. [Experiments] Experiments section: No ablation studies are described that compare the multimodal system against single-modality (profile face only and ear only) baselines using identical network architectures and training procedures, which is required to substantiate that ear images supply genuinely complementary information rather than redundant cues.
  3. [Methods] Methods: There is no description of how the train/test splits were performed (e.g., subject-disjoint or random) or the number of subjects/images per split, which is load-bearing for claims of high accuracy and superiority in biometric classification tasks.
minor comments (1)
  1. [Abstract] Abstract: The specific numerical accuracy improvements (e.g., percentage gains over SOTA) are not stated, which would help readers quickly gauge the magnitude of the contribution.
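
The subject-disjoint protocol raised in major comment 3 can be sketched as below: subjects, not images, are assigned to train or test, so no person appears on both sides. This is a generic illustration, not the paper's protocol; the function name, the `(subject_id, image_path)` sample layout, the 20% test fraction, and the file names are all assumptions.

```python
import random

def subject_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split (subject_id, image_path) pairs so that subjects never overlap."""
    subjects = sorted({subject_id for subject_id, _ in samples})
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s[0] not in test_subjects]
    test = [s for s in samples if s[0] in test_subjects]
    return train, test

# Hypothetical dataset: 10 subjects with 3 images each.
samples = [(f"subj{i:02d}", f"img_{i:02d}_{k}.png")
           for i in range(10) for k in range(3)]
train, test = subject_disjoint_split(samples)

train_ids = {subject_id for subject_id, _ in train}
test_ids = {subject_id for subject_id, _ in test}
```

A random per-image split would instead let images of the same person land in both sets, which tends to inflate reported accuracy in biometric tasks.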

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional experimental details and controls would strengthen the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of 'extensive experiments' and 'superior results' is not supported by any reported error bars, exact data splits, ablation details, or statistical tests, making it impossible to verify the superiority claims or assess whether gains exceed what single-modality baselines achieve.

    Authors: We agree that the abstract uses strong phrasing that is not backed by the requested statistical elements within the abstract itself. The body of the manuscript reports results on the three datasets and comparisons to prior work, but lacks the specific controls noted. We will revise the abstract to employ more precise language and will add error bars, ablation details, and any applicable statistical tests to the experiments section in the revision. revision: yes

  2. Referee: [Experiments] Experiments section: No ablation studies are described that compare the multimodal system against single-modality (profile face only and ear only) baselines using identical network architectures and training procedures, which is required to substantiate that ear images supply genuinely complementary information rather than redundant cues.

    Authors: The referee is correct that controlled ablations with identical architectures are needed to isolate the contribution of each modality. The current manuscript focuses on multimodal fusion strategies and comparisons to existing single-modality state-of-the-art methods but does not include these specific same-architecture ablations. We will perform and report the requested ablation studies in the revised version. revision: yes

  3. Referee: [Methods] Methods: There is no description of how the train/test splits were performed (e.g., subject-disjoint or random) or the number of subjects/images per split, which is load-bearing for claims of high accuracy and superiority in biometric classification tasks.

    Authors: We acknowledge that explicit details on the train/test partitioning protocol are essential for reproducibility and to support biometric claims. The manuscript does not currently provide this information. We will add a clear description of the splitting method (including whether splits are subject-disjoint), along with the exact numbers of subjects and images per split for each dataset. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical classification results

full rationale

The paper is an empirical ML study reporting classification accuracies from end-to-end training of multimodal DNNs on public datasets (UND-F, UND-J2, FERET) using data/feature/score fusion, center loss, and domain adaptation. No mathematical derivations, equations, or 'predictions' exist that could reduce to inputs by construction. Central claims rest on measured performance numbers, not on any self-referential fitting or uniqueness theorems. Any self-citations (if present for loss functions or prior methods) are not load-bearing for the reported results, which are directly falsifiable via the experiments described.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of three fusion strategies and the assumption that ear images add independent signal; network weights constitute the main fitted parameters.

free parameters (1)
  • Fusion hyperparameters and network weights
    All model parameters are fitted to the training portions of UND-F, UND-J2, and FERET.
axioms (1)
  • [domain assumption] Ear appearance supplies complementary age and gender information to profile face images
    Invoked to justify multimodal fusion as the route to higher accuracy.

