arxiv: 2606.20449 · v1 · pith:G63MOC7Rnew · submitted 2026-06-18 · 💻 cs.CV

InfantFace: Detecting infant faces in neonatal clinical environments

Pith reviewed 2026-06-26 18:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords infant face detection · neonatal clinical environments · YOLOv11 · domain adaptation · face localization · non-contact monitoring · neonatal video analysis

The pith

A YOLOv11m model fine-tuned on neonatal videos detects infant faces in clinical settings at 0.96 AP50.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build a face detector that works reliably on newborns inside hospitals, where standard models lose accuracy because of changing light, background clutter, and medical equipment blocking the view. It starts with a one-stage detector trained on large public face datasets, then adapts the model using 228 videos from 113 infants recorded in real neonatal sessions. If the adaptation succeeds, the detector becomes accurate enough to support camera-based checks for pain, breathing pauses, and heart-rate signals without attaching sensors to the baby. Before adaptation the model already beats three general-purpose face detectors at 0.87 AP50; after adaptation the score rises to 0.96.

Core claim

A one-stage YOLOv11m model trained on combined public face datasets and then fine-tuned on a neonatal dataset of 228 videos from 114 sessions of 113 infants reaches an AP50 of 0.96 for localizing infant faces, outperforming three state-of-the-art general face detectors evaluated on the same clinical data.
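Since the claim rests entirely on AP50, it helps to see what that number measures. The following is a minimal sketch, not the authors' evaluation code: a detection counts as correct when its intersection-over-union (IoU) with a not-yet-matched ground-truth face is at least 0.5, and AP is the area under the resulting precision-recall curve. The function names, the input layout and the all-point interpolation are assumptions made for illustration; evaluation toolkits differ in matching and interpolation details, so absolute values may not match the paper's tooling exactly.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Detection = Tuple[str, float, Box]       # (image_id, confidence, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ap50(detections: List[Detection],
         ground_truth: Dict[str, List[Box]],
         iou_thr: float = 0.5) -> float:
    """Average precision at a single IoU threshold (0.5 by default)."""
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}

    # Walk detections from most to least confident, greedily matching each
    # one to the best-overlapping ground-truth box that is still unmatched.
    is_tp: List[bool] = []
    for img, _conf, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(img, [])):
            if matched[img][j]:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[img][best_j] = True
            is_tp.append(True)
        else:
            is_tp.append(False)

    # Cumulative precision and recall after each detection.
    precisions, recalls = [], []
    tp = fp = 0
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)

    # Make precision non-increasing in recall, then integrate (all-point AP).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

Under this definition a score of 0.96 says that nearly all annotated faces are recovered with few confident false alarms at the lenient 0.5 overlap threshold; it says nothing about how tightly the boxes fit beyond that threshold.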

What carries the argument

The one-stage YOLOv11m-based detector with clinical-domain adaptation on neonatal video data.
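The two-stage recipe (train on public faces, then adapt on neonatal video) can be sketched with the Ultralytics Python package, which the paper's reference list cites for YOLOv11. This is an illustrative sketch under stated assumptions, not the authors' training script: the weights file, the dataset YAML name and every hyperparameter value below are hypothetical placeholders, since the abstract does not report them.

```python
# Placeholder settings: none of these values come from the paper.
FINE_TUNE_ARGS = {
    "data": "neonatal_faces.yaml",  # hypothetical dataset config (train/val/test paths, one "face" class)
    "epochs": 50,                   # placeholder
    "imgsz": 640,                   # placeholder input resolution
    "batch": 16,                    # placeholder
    "optimizer": "SGD",             # set explicitly so that lr0 below is honoured
    "lr0": 1e-3,                    # placeholder initial learning rate
}


def fine_tune(pretrained_weights: str, args: dict = FINE_TUNE_ARGS) -> float:
    """Adapt a face-pretrained detector to neonatal data and return test AP50.

    `pretrained_weights` is assumed to point at a YOLOv11m checkpoint that
    was already trained on the public face datasets (hypothetical file).
    """
    from ultralytics import YOLO  # third-party; imported lazily

    model = YOLO(pretrained_weights)
    model.train(**args)                                    # clinical-domain adaptation
    metrics = model.val(data=args["data"], split="test")   # held-out neonatal split
    return metrics.box.map50                               # AP at IoU 0.5


# Example call (requires the ultralytics package and real data on disk):
# ap50_after = fine_tune("face_pretrained_yolo11m.pt")
```

Whatever the actual settings were, the number returned by the final evaluation is only meaningful if the test split named in the dataset config contains infants that never appear in the training split.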

If this is right

  • Reliable face localization becomes available as the starting point for camera-based pain scoring and distress analysis in neonates.
  • Non-contact cardiorespiratory signal extraction and breathing-cessation alerts can operate without physical sensors attached to the infant.
  • The same adapted model can be evaluated against other general face detectors to quantify the gain from neonatal-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar domain-adaptation steps could be tested on older children or other pediatric wards where face detection is also obstructed.
  • Public release of additional neonatal face datasets, with privacy protections, would allow direct comparison of future detectors.
  • The performance jump after adaptation suggests that small targeted clinical datasets can close large gaps left by adult-trained models.

Load-bearing premise

The collection of 228 videos from 113 infants is varied enough to represent real clinical conditions and to support domain adaptation without overfitting or hidden bias.

What would settle it

Running the same fine-tuned model on a fresh collection of neonatal videos recorded under different hospitals, lighting, or equipment and finding that AP50 falls well below 0.96 would show the adaptation does not generalize.

Figures

Figures reproduced from arXiv: 2606.20449 by Abdullah Bin-Obaid, Lionel Tarassenko, Maria M. Cobo, Mauricio Villarroel, Rebeccah Slater.

Figure 1: Example detections from the evaluated face detectors across public datasets. Columns correspond to (a) VGGFace2, (b) CelebA, (c) FDDB and (d) WIDER FACE. The green bounding boxes correspond to the detections from the face detectors shown in the left column (MTCNN, SCRFD, DSFD). The red bounding boxes correspond to detections from our InfantFace model before domain adaptation. In (d), our model did not dete…
Figure 2: Schematic of the YOLOv11m architecture. It highlights the three main components of the network: the backbone, the neck and the multiscale detection head. Functional blocks are colour-coded to distinguish between different operations. The red arrows indicate the flow of multiscale feature maps, extracted at three different spatial resolutions, that are passed from the neck to the detection head. (CBS) stand…
original abstract

Reliable localisation of the neonatal face is the first step for several video-camera based non-contact assessments such as pain and distress related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a YOLOv11m-based one-stage detector for infant faces in neonatal clinical settings. It first trains on the union of VGGFace2, CelebA, FDDB and WIDER FACE, reporting AP50 = 0.87 that exceeds three unnamed general-purpose face detectors; it then fine-tunes on a private neonatal corpus of 228 videos from 114 sessions of 113 infants and reports an improved AP50 of 0.96. The work notes the absence of public neonatal benchmarks and calls for their creation under appropriate privacy safeguards.

Significance. A reliably validated domain-adapted detector would directly support downstream non-contact neonatal monitoring tasks (pain scoring, cardiorespiratory extraction, apnoea alerts). The reported 0.09 AP50 gain is potentially impactful, but its significance cannot be assessed without evidence that the fine-tuning evaluation used infant-disjoint, session-disjoint held-out data.
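What an infant-disjoint split looks like in practice can be shown in a few lines. The sketch below is an assumption-laden illustration, not the paper's protocol: it takes a hypothetical mapping from video ID to infant ID, assigns whole infants to either the training or the test side, and so guarantees that no infant contributes frames to both. If each recording session involves a single infant, an infant-disjoint split is automatically session-disjoint as well.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def infant_disjoint_split(video_to_infant: Dict[str, str],
                          test_fraction: float = 0.2,
                          seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split videos so that every infant lands entirely in train or in test."""
    videos_by_infant = defaultdict(list)
    for video, infant in video_to_infant.items():
        videos_by_infant[infant].append(video)

    # Shuffle infants (not videos) so that the unit of assignment is the infant.
    infants = sorted(videos_by_infant)
    random.Random(seed).shuffle(infants)
    n_test = max(1, round(len(infants) * test_fraction))
    test_infants, train_infants = infants[:n_test], infants[n_test:]

    train = sorted(v for i in train_infants for v in videos_by_infant[i])
    test = sorted(v for i in test_infants for v in videos_by_infant[i])
    return train, test


# Hypothetical metadata shaped like the reported corpus: 228 videos, 113 infants.
video_to_infant = {f"video_{k:03d}": f"infant_{k % 113:03d}" for k in range(228)}
train_videos, test_videos = infant_disjoint_split(video_to_infant)

# Leakage check: no infant may appear on both sides.
train_ids = {video_to_infant[v] for v in train_videos}
test_ids = {video_to_infant[v] for v in test_videos}
assert train_ids.isdisjoint(test_ids)
```

A split made at the video or frame level instead would let near-identical frames of the same infant appear on both sides, which is the identity-leakage scenario the referee's first major comment is concerned with.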

major comments (2)
  1. [Abstract] Abstract: the headline claim that fine-tuning raises AP50 from 0.87 to 0.96 is load-bearing for the central contribution, yet the abstract (and, by the provided text, the manuscript) supplies no information on train/test partitioning of the 228-video neonatal set, whether the 114 sessions were split by infant identity, or any cross-validation protocol. Without these details the 0.09 gain cannot be distinguished from overfitting or identity leakage.
  2. [Abstract] Abstract: the statement that the pre-fine-tuning model 'surpass[es] the performance of three state-of-the-art general face detectors' is unsupported by any named baselines, per-detector AP50 scores, or evaluation protocol on the public datasets, preventing verification of the claimed superiority.
minor comments (1)
  1. [Abstract] Abstract: the phrase '114 recording sessions of 113 independent infants' is internally inconsistent (one extra session) and should be clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight the need for greater transparency in our evaluation protocol and baseline comparisons, which we will address through revisions.

point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that fine-tuning raises AP50 from 0.87 to 0.96 is load-bearing for the central contribution, yet the abstract (and, by the provided text, the manuscript) supplies no information on train/test partitioning of the 228-video neonatal set, whether the 114 sessions were split by infant identity, or any cross-validation protocol. Without these details the 0.09 gain cannot be distinguished from overfitting or identity leakage.

    Authors: We agree that explicit details on the train/test partitioning are required to substantiate the reported performance gain and rule out identity leakage or overfitting. The neonatal corpus comprises 228 videos from 114 sessions of 113 independent infants. In the revision we will add a dedicated subsection describing the partitioning protocol, confirming that the splits are infant-disjoint (and session-disjoint where relevant) and specifying the cross-validation procedure employed. revision: yes

  2. Referee: [Abstract] Abstract: the statement that the pre-fine-tuning model 'surpass[es] the performance of three state-of-the-art general face detectors' is unsupported by any named baselines, per-detector AP50 scores, or evaluation protocol on the public datasets, preventing verification of the claimed superiority.

    Authors: We acknowledge that the manuscript does not name the three general face detectors, report their individual AP50 scores, or detail the evaluation protocol on the combined public datasets. In the revision we will explicitly identify the detectors, provide their per-detector AP50 values on the union of VGGFace2, CelebA, FDDB and WIDER FACE, and describe the evaluation protocol used. revision: yes

Circularity Check

0 steps flagged

No circularity: standard empirical ML training and fine-tuning with no self-referential derivations

full rationale

The paper reports training a YOLOv11m detector on public datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to AP50 0.87, followed by fine-tuning on the authors' private neonatal corpus (228 videos, 113 infants) to reach AP50 0.96. No equations, first-principles derivations, or 'predictions' are claimed. The performance numbers are direct empirical measurements after supervised training/fine-tuning; they do not reduce to fitted parameters by construction, nor do they rely on self-citation chains, uniqueness theorems, or ansatzes imported from prior work. The evaluation protocol is described as standard domain adaptation without any load-bearing self-referential steps. This is the expected non-finding for an applied computer-vision paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard transfer learning assumptions and the representativeness of a small private neonatal dataset; insufficient abstract detail prevents exhaustive enumeration of hyperparameters.

free parameters (1)
  • fine-tuning hyperparameters
    Learning rate, epochs, and augmentation choices for domain adaptation are unspecified in the abstract.
axioms (1)
  • domain assumption: General face datasets provide transferable features to infant faces in clinical settings
    Invoked by the pre-training then fine-tuning pipeline described.


