Improving Borderline Adulthood Facial Age Estimation through Ensemble Learning
Pith reviewed 2026-05-25 11:02 UTC · model grok-4.3
The pith
An ensemble technique with a fine-tuned model reaches 68 percent accuracy for 16-17 year old faces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors report that an ensemble technique, used with their DS13K deep learning model (itself fine-tuned on the DEX model), reaches 68 percent accuracy for the 16-to-17-year-old age group. They state that this is four times the accuracy DEX achieves for the same group. The work is motivated by the persistent difficulty existing methods have with borderline-adulthood cases, and the paper also evaluates existing services side by side: Amazon Rekognition, Microsoft Azure Cognitive Services, How-Old.net, and DEX.
What carries the argument
An ensemble technique, applied to the DS13K model after it is fine-tuned on DEX, that aggregates predictions to improve handling of the 16-17 age band.
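The paper does not report how predictions are combined (the referee report below flags this), so the following is only a generic sketch of prediction aggregation, not the authors' method. It assumes each ensemble member outputs one scalar age estimate per face and combines them with a weighted mean. The function name `ensemble_age`, the weights, and the sample values are all hypothetical.

```python
from typing import Optional, Sequence


def ensemble_age(predictions: Sequence[float],
                 weights: Optional[Sequence[float]] = None) -> float:
    """Combine per-model age estimates for one face into a single estimate.

    Assumption: a weighted mean. The paper's actual combination rule
    is not reported.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if weights is None:
        # Default to an unweighted mean.
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("weights and predictions must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(predictions, weights)) / total


# Hypothetical estimates from three models for the same face.
estimate = ensemble_age([15.0, 17.0, 19.0])
print(round(estimate, 2))
```

Other rules (median, majority vote over age bins, a learned meta-model) would fit the same interface; which one the paper uses is exactly the detail the review asks for.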
If this is right
- The ensemble raises accuracy for 16-17 year olds to 68 percent.
- Accuracy in the target range is four times that of the DEX model alone.
- Existing commercial services exhibit the same weakness in the borderline age range as the base DEX model.
- The method focuses specifically on underage estimation within the broader facial age estimation task.
Where Pith is reading between the lines
- The same ensemble construction might be tested on other narrow age intervals where single models also fail.
- Integration into age-verification pipelines could lower misclassification rates for subjects near the legal threshold.
- Re-running the experiments with publicly fixed train-test splits would clarify how much of the gain is due to the modeling choices versus data handling.
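One way to make a train-test split publicly fixed is to derive it from the image identifiers rather than from a random shuffle. The sketch below is an illustration under that assumption, not something the paper describes; `assign_split`, the 20 percent test fraction, and the file names are hypothetical.

```python
import hashlib


def assign_split(image_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map an image ID to 'train' or 'test'.

    Hashing the ID (rather than shuffling) keeps the assignment stable
    across runs, machines, and dataset orderings.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    digest = hashlib.sha256(image_id.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_fraction else "train"


# Hypothetical image IDs; the same ID always lands in the same split.
ids = [f"face_{i:05d}.jpg" for i in range(1000)]
splits = {image_id: assign_split(image_id) for image_id in ids}
test_count = sum(1 for s in splits.values() if s == "test")
print(test_count, "of", len(ids), "images assigned to test")
```

Publishing the ID list together with such a rule would let anyone rerun both DEX and the ensemble on the same partition.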
Load-bearing premise
The accuracy gain for the 16-17 group comes from the ensemble method and fine-tuning rather than from dataset selection, split choices, or unreported post-processing steps.
What would settle it
Evaluating the original DEX model on the identical test images and age labels used for the ensemble and obtaining accuracy close to 68 percent for the 16-17 group would falsify the claimed improvement.
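Taken at face value, 68 percent being four times the DEX figure implies a DEX accuracy of about 17 percent on the same group. A like-for-like check could look like the sketch below, which scores two sets of predictions against the same ground-truth ages. It assumes "accuracy" means the rounded prediction falls inside the 16-17 band, which the paper does not confirm; `bin_accuracy` and all sample values are hypothetical and are not the paper's data.

```python
import math
from typing import Sequence


def bin_accuracy(true_ages: Sequence[int], predicted_ages: Sequence[float],
                 low: int = 16, high: int = 17) -> float:
    """Among subjects whose true age is in [low, high], return the share
    whose rounded prediction also falls in [low, high].

    Assumption: band-level accuracy; the paper's definition is unreported.
    """
    if len(true_ages) != len(predicted_ages):
        raise ValueError("true_ages and predicted_ages must align one-to-one")
    in_bin = [p for t, p in zip(true_ages, predicted_ages) if low <= t <= high]
    if not in_bin:
        raise ValueError("no test subjects in the requested age band")
    # Round half up, avoiding Python's round-half-to-even behaviour.
    hits = sum(1 for p in in_bin if low <= math.floor(p + 0.5) <= high)
    return hits / len(in_bin)


# Hypothetical predictions from two models on the SAME five test images.
true_ages = [16, 17, 16, 17, 25]
baseline_preds = [21.2, 19.8, 16.4, 22.0, 24.0]
ensemble_preds = [16.3, 17.4, 15.2, 16.8, 26.0]

baseline_acc = bin_accuracy(true_ages, baseline_preds)
ensemble_acc = bin_accuracy(true_ages, ensemble_preds)
print(f"baseline={baseline_acc:.2f} ensemble={ensemble_acc:.2f} "
      f"ratio={ensemble_acc / baseline_acc:.1f}x")
```

The key design point is that both calls share `true_ages`: any ratio computed from different image sets would not test the claim.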
Original abstract
Achieving high performance for facial age estimation with subjects in the borderline between adulthood and non-adulthood has always been a challenge. Several studies have used different approaches from the age of a baby to an elder adult and different datasets have been employed to measure the mean absolute error (MAE) ranging between 1.47 to 8 years. The weakness of the algorithms specifically in the borderline has been a motivation for this paper. In our approach, we have developed an ensemble technique that improves the accuracy of underage estimation in conjunction with our deep learning model (DS13K) that has been fine-tuned on the Deep Expectation (DEX) model. We have achieved an accuracy of 68% for the age group 16 to 17 years old, which is 4 times better than the DEX accuracy for such age range. We also present an evaluation of existing cloud-based and offline facial age prediction services, such as Amazon Rekognition, Microsoft Azure Cognitive Services, How-Old.net and DEX.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an ensemble technique, used in conjunction with DS13K (a deep learning model fine-tuned on DEX), to improve facial age estimation accuracy specifically for the borderline 16-17 age group. It reports 68% accuracy in this range (claimed to be 4x better than DEX) and evaluates several cloud-based and offline age prediction services.
Significance. If the numerical improvement can be shown to result from the ensemble and fine-tuning rather than from dataset or split choices, the work would address a documented weakness in age estimation for near-adult subjects and could support more reliable systems for age-restricted content and verification tasks.
major comments (2)
- [Abstract] The headline result of 68% accuracy for the 16-17 group (4x DEX) supplies no information on the size or composition of the test cohort for this bin, the exact definition of accuracy (exact-year match, ±1 year tolerance, etc.), the training/validation protocol, or whether the identical images and splits were used for the DEX baseline comparison. These omissions are load-bearing for the central claim that the gain is produced by the DS13K ensemble + fine-tuning.
- [Method / Experiments] Method and Experiments sections: No description is given of the DS13K architecture, the precise fine-tuning procedure and hyperparameters, the ensemble combination rule, or the source and labeling process for the 16-17 training and test images. Without these details the reported accuracy cannot be reproduced or attributed to the proposed technique.
minor comments (1)
- [Abstract] The cited MAE range of 1.47–8 years from prior work is stated without references to the specific studies.
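To illustrate why the definition of accuracy matters, the sketch below scores one set of predictions under an exact-year rule and under ±1 and ±2 year tolerances. It is a hypothetical illustration: `accuracy_within` and the sample ages are invented for this example and do not come from the paper.

```python
import math
from typing import Sequence


def accuracy_within(true_ages: Sequence[int], predicted_ages: Sequence[float],
                    tolerance: int = 0) -> float:
    """Share of predictions whose rounded value lies within `tolerance`
    years of the true age (tolerance=0 means exact-year match)."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for true_age, predicted in zip(true_ages, predicted_ages):
        rounded = math.floor(predicted + 0.5)  # round half up
        if abs(rounded - true_age) <= tolerance:
            hits += 1
    return hits / len(true_ages)


# Hypothetical 16-17 year old test subjects and model outputs.
true_ages = [16, 16, 17, 17]
predicted = [16.2, 17.3, 18.4, 15.1]

for tol in (0, 1, 2):
    print(f"tolerance ±{tol}: {accuracy_within(true_ages, predicted, tol):.2f}")
```

The same four predictions score very differently depending on the rule, so a headline figure is hard to interpret until the rule is stated.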
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify omissions that weaken the presentation of our central result. We will revise the manuscript to supply the missing information on experimental protocol, data, and implementation details so that the accuracy gain can be properly attributed and reproduced.
Point-by-point responses
- Referee: [Abstract] The headline result of 68% accuracy for the 16-17 group (4x DEX) supplies no information on the size or composition of the test cohort for this bin, the exact definition of accuracy (exact-year match, ±1 year tolerance, etc.), the training/validation protocol, or whether the identical images and splits were used for the DEX baseline comparison. These omissions are load-bearing for the central claim that the gain is produced by the DS13K ensemble + fine-tuning.
  Authors: We agree that the abstract as written does not contain these load-bearing details. In the revised version we will expand the abstract to report the number and source distribution of test images in the 16-17 bin, state that accuracy is defined as exact-year match, describe the cross-validation protocol, and explicitly confirm that the DEX baseline was run on the identical images and splits. These additions will allow readers to evaluate whether the reported improvement is attributable to the ensemble and fine-tuning.
  Revision: yes
- Referee: [Method / Experiments] Method and Experiments sections: No description is given of the DS13K architecture, the precise fine-tuning procedure and hyperparameters, the ensemble combination rule, or the source and labeling process for the 16-17 training and test images. Without these details the reported accuracy cannot be reproduced or attributed to the proposed technique.
  Authors: We acknowledge that the Method and Experiments sections omit these implementation specifics. The original submission emphasized the ensemble concept and the headline accuracy figure but did not include the required technical description. In revision we will insert a new subsection that specifies the DS13K architecture, the exact fine-tuning schedule and hyperparameters, the ensemble aggregation rule, and the provenance and labeling procedure for the 16-17 images. This will make the experimental protocol reproducible and permit attribution of the accuracy gain to the proposed method.
  Revision: yes
Circularity Check
No significant circularity; empirical accuracy claims are self-contained experimental outcomes
The paper presents an empirical ML study reporting measured accuracy on age estimation tasks after training an ensemble model fine-tuned from DEX. No derivation chain, equations, or first-principles results are claimed; the 68% figure is an observed performance metric on a test cohort, not a quantity defined in terms of itself or obtained by fitting a parameter that is then renamed as a prediction. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and no ansatz is smuggled. The result is therefore independent of the inputs in the sense required by the circularity criteria and can be externally falsified on the same datasets.
Axiom & Free-Parameter Ledger
free parameters (2)
- Fine-tuning hyperparameters
- Ensemble combination rule
axioms (1)
- domain assumption: The DEX model provides a viable starting point whose weaknesses can be mitigated by fine-tuning and ensembling