Multimodal Age and Gender Classification Using Ear and Profile Face Images
Pith reviewed 2026-05-24 17:15 UTC · model grok-4.3
The pith
Multimodal deep networks fusing ear and profile face images achieve higher age and gender classification accuracy than single-modality approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that end-to-end multimodal deep neural network frameworks taking profile face and ear images as input, combined through data, feature, and score level fusion and strengthened by domain adaptation and center loss, attain very high age and gender classification accuracies on the UND-F, UND-J2, and FERET datasets while outperforming state-of-the-art methods based on profile face images or ear images alone.
What carries the argument
End-to-end multimodal deep learning frameworks that perform data, feature, and score level fusion of paired profile face and ear images, augmented by domain adaptation and center loss.
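The three fusion levels differ in where the two modalities meet: before the network (data), after feature extraction (feature), or after classification (score). The following is a minimal sketch, not the authors' implementation. The function names, the side-by-side image layout, and the weighted probability average are illustrative assumptions, and plain Python lists stand in for image tensors and network outputs.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def data_level_fusion(face_img: List[List[float]],
                      ear_img: List[List[float]]) -> List[List[float]]:
    """Join two images of equal height side by side, so that a single
    network receives one combined input (assumed layout)."""
    if len(face_img) != len(ear_img):
        raise ValueError("images must have the same number of rows")
    return [f_row + e_row for f_row, e_row in zip(face_img, ear_img)]


def feature_level_fusion(face_feat: Sequence[float],
                         ear_feat: Sequence[float]) -> List[float]:
    """Concatenate per-modality feature vectors for a shared classifier."""
    return list(face_feat) + list(ear_feat)


def score_level_fusion(face_logits: Sequence[float],
                       ear_logits: Sequence[float],
                       face_weight: float = 0.5) -> List[float]:
    """Weighted average of the class probabilities from two separate
    networks; face_weight is a hypothetical tuning parameter."""
    p_face = softmax(face_logits)
    p_ear = softmax(ear_logits)
    return [face_weight * a + (1.0 - face_weight) * b
            for a, b in zip(p_face, p_ear)]
```

Score-level fusion keeps the two networks independent, which makes it the easiest variant to compare against single-modality baselines, since each branch already is one.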
If this is right
- Profile face images alone contain a rich source of information for age and gender classification.
- The multimodal system using both ear and profile face images reaches superior results compared to single-modality baselines.
- Domain adaptation and center loss improve the representation and discrimination capability of the networks.
- Extensive tests on three standard datasets confirm very high classification accuracies.
- The multimodal approach beats prior state-of-the-art methods that use only profile faces or only ears.
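Center loss, as introduced by Wen et al. (2016) for face recognition, penalizes the squared distance between each feature vector and the center of its class, and is added to the softmax loss with a weighting factor. The sketch below illustrates that idea only. It assumes fixed class centers (in training they are updated per mini-batch), and the weight `lam` and all function names are hypothetical; the paper's actual settings are not stated in this review.

```python
from typing import Dict, List, Sequence


def center_loss(features: Sequence[Sequence[float]],
                labels: Sequence[int],
                centers: Dict[int, List[float]]) -> float:
    """Half the summed squared distance from each feature vector to the
    center of its own class, over one mini-batch."""
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return 0.5 * total


def joint_loss(softmax_loss: float, center: float, lam: float = 0.01) -> float:
    """Softmax loss plus weighted center loss; lam is an assumed value."""
    return softmax_loss + lam * center
```

The softmax term separates classes, while the center term pulls same-class features together, which is the sense in which the combination is said to improve discrimination.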
Where Pith is reading between the lines
- The same fusion strategy could be tested on other soft-biometric attributes such as ethnicity estimation from side views.
- If the gains persist under domain shift, the approach may help surveillance systems that capture only profile images.
- Alignment of ear and face regions across different cameras or resolutions would be a direct next step to check robustness.
Load-bearing premise
The ear images supply genuinely complementary information about age and gender that the profile face does not already provide.
What would settle it
A profile-face-only model trained and tested on identical data splits that matches or exceeds the multimodal accuracies would show that the ear modality adds no real value.
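One way to run this check is a paired test on the two models' per-sample outcomes over the same test set. The sketch below uses an exact McNemar test, which is a choice made here for illustration and not something the paper reports; the function and argument names are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(y_true: Sequence[int],
                  pred_a: Sequence[int],
                  pred_b: Sequence[int]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test for two classifiers evaluated on the
    same samples. Returns (b, c, p_value), where b counts samples only
    model A got right and c counts samples only model B got right."""
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        # The models never disagree in correctness: no evidence either way.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant samples split 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)
```

Here `pred_a` could hold the profile-face-only predictions and `pred_b` the multimodal ones. A large p-value would mean the observed accuracy gap is compatible with the ear modality adding nothing.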
Original abstract
In this paper, we present multimodal deep neural network frameworks for age and gender classification, which take as input a profile face image as well as an ear image. Our main objective is to enhance the accuracy of soft biometric trait extraction from profile face images by additionally utilizing a promising biometric modality: ear appearance. For this purpose, we provided end-to-end multimodal deep learning frameworks. We explored different multimodal strategies by employing data, feature, and score level fusion. To increase the representation and discrimination capability of the deep neural networks, we benefited from domain adaptation and employed center loss besides softmax loss. We conducted extensive experiments on the UND-F, UND-J2, and FERET datasets. Experimental results indicated that profile face images contain a rich source of information for age and gender classification. We found that the presented multimodal system achieves very high age and gender classification accuracies. Moreover, we attained superior results compared to state-of-the-art profile face image-based or ear image-based age and gender classification methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents multimodal deep neural network frameworks for age and gender classification that combine profile face images with ear images. It explores data, feature, and score level fusion strategies, incorporates center loss and domain adaptation, and evaluates on the UND-F, UND-J2, and FERET datasets, claiming superior performance over state-of-the-art single-modality methods.
Significance. If the multimodal fusion demonstrably provides complementary information leading to statistically significant improvements, the work would be of interest to the biometrics and computer vision community as it highlights the potential of ear images to enhance profile face-based soft biometrics. The use of multiple fusion strategies and loss functions is a positive aspect, but the absence of detailed experimental protocols reduces the potential impact.
major comments (3)
- [Abstract] Abstract: The claim of 'extensive experiments' and 'superior results' is not supported by any reported error bars, exact data splits, ablation details, or statistical tests, making it impossible to verify the superiority claims or assess whether gains exceed what single-modality baselines achieve.
- [Experiments] Experiments section: No ablation studies are described that compare the multimodal system against single-modality (profile face only and ear only) baselines using identical network architectures and training procedures, which is required to substantiate that ear images supply genuinely complementary information rather than redundant cues.
- [Methods] Methods: There is no description of how the train/test splits were performed (e.g., subject-disjoint or random) or the number of subjects/images per split, which is load-bearing for claims of high accuracy and superiority in biometric classification tasks.
minor comments (1)
- [Abstract] Abstract: The specific numerical accuracy improvements (e.g., percentage gains over SOTA) are not stated, which would help readers quickly gauge the magnitude of the contribution.
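The error bars requested in the first major comment could be produced in several ways. The following is a minimal sketch of one of them, a percentile bootstrap over per-sample correctness flags. It is an illustration under stated assumptions, not the authors' protocol: the function name, the number of resamples, and the fixed seed are all hypothetical choices.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(correct: Sequence[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap confidence interval for accuracy.

    correct holds 1 for a correctly classified test sample and 0 otherwise.
    Returns (accuracy, lower_bound, upper_bound)."""
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        accuracies.append(sum(resample) / n)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[min(n_resamples - 1,
                           int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper
```

For biometric data with several images per person, resampling whole subjects instead of individual images would likely give more honest intervals, because images of the same subject are not independent.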
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional experimental details and controls would strengthen the manuscript. We address each major comment below.
- Referee: [Abstract] Abstract: The claim of 'extensive experiments' and 'superior results' is not supported by any reported error bars, exact data splits, ablation details, or statistical tests, making it impossible to verify the superiority claims or assess whether gains exceed what single-modality baselines achieve.
  Authors: We agree that the abstract uses strong phrasing that is not backed by the requested statistical elements within the abstract itself. The body of the manuscript reports results on the three datasets and comparisons to prior work, but lacks the specific controls noted. We will revise the abstract to employ more precise language and will add error bars, ablation details, and any applicable statistical tests to the experiments section in the revision.
  revision: yes
- Referee: [Experiments] Experiments section: No ablation studies are described that compare the multimodal system against single-modality (profile face only and ear only) baselines using identical network architectures and training procedures, which is required to substantiate that ear images supply genuinely complementary information rather than redundant cues.
  Authors: The referee is correct that controlled ablations with identical architectures are needed to isolate the contribution of each modality. The current manuscript focuses on multimodal fusion strategies and comparisons to existing single-modality state-of-the-art methods but does not include these specific same-architecture ablations. We will perform and report the requested ablation studies in the revised version.
  revision: yes
- Referee: [Methods] Methods: There is no description of how the train/test splits were performed (e.g., subject-disjoint or random) or the number of subjects/images per split, which is load-bearing for claims of high accuracy and superiority in biometric classification tasks.
  Authors: We acknowledge that explicit details on the train/test partitioning protocol are essential for reproducibility and to support biometric claims. The manuscript does not currently provide this information. We will add a clear description of the splitting method (including whether splits are subject-disjoint), along with the exact numbers of subjects and images per split for each dataset.
  revision: yes
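The subject-disjoint protocol raised in the third comment means that no person appears in both the training and the test set, so a classifier cannot score well by recognizing identities. A minimal sketch follows, assuming each sample is a `(subject_id, item)` pair. The function name, the 20% test fraction, and the seed are hypothetical and do not describe the split used in the paper.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

Sample = Tuple[Hashable, object]  # (subject_id, image or image path)


def subject_disjoint_split(samples: Sequence[Sample],
                           test_fraction: float = 0.2,
                           seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that every subject lands entirely in train or
    entirely in test."""
    by_subject = defaultdict(list)
    for subject_id, item in samples:
        by_subject[subject_id].append(item)
    # Sort first so the shuffle is reproducible for a given seed.
    subjects = sorted(by_subject, key=str)
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train = [(s, x) for s, x in samples if s not in test_subjects]
    test = [(s, x) for s, x in samples if s in test_subjects]
    return train, test
```

Reporting the number of subjects and images in each returned list, per dataset, would cover the remaining details the referee asks for.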
Circularity Check
No significant circularity in empirical classification results
The paper is an empirical ML study reporting classification accuracies from end-to-end training of multimodal DNNs on public datasets (UND-F, UND-J2, FERET) using data/feature/score fusion, center loss, and domain adaptation. No mathematical derivations, equations, or 'predictions' exist that could reduce to inputs by construction. Central claims rest on measured performance numbers, not on any self-referential fitting or uniqueness theorems. Any self-citations (if present for loss functions or prior methods) are not load-bearing for the reported results, which are directly falsifiable via the experiments described.
Axiom & Free-Parameter Ledger
free parameters (1)
- Fusion hyperparameters and network weights
axioms (1)
- domain assumption: Ear appearance supplies complementary age and gender information to profile face images