Tree-gated Deep Regressor Ensemble For Face Alignment In The Wild

Arnaud Dapogny; Estephe Arnaud; Kevin Bailly

arxiv: 1907.03248 · v2 · pith:SKC4OBF7new · submitted 2019-07-07 · 💻 cs.CV

Pith reviewed 2026-05-25 01:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords face alignment · deep regressors · ensemble methods · tree-structured gate · adaptive weighting · in-the-wild datasets · computer vision

The pith

An ensemble of deep regressors with a tree-structured gate for adaptive weighting aligns faces more accurately than single models or averaged ensembles on in-the-wild datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces a single large deep regressor for face alignment with an ensemble of smaller ones. Rather than averaging their outputs, it uses a tree-structured gate to adaptively weight each regressor according to the input image. This targets robustness on datasets that include large pose changes, expressions, lighting shifts, and occlusions where prior methods falter. The approach is positioned as preprocessing for tasks such as expression recognition, face tracking, and animation. Experiments on multiple challenging datasets are presented to show gains over existing state-of-the-art techniques.

Core claim

The central claim is that an ensemble of deep regressors combined with a tree-structured gate for adaptive weighting outperforms both a single large regressor and ensembles that rely on simple averaging, delivering higher accuracy for face alignment under real-world variations in pose, expression, illumination, and partial occlusions.

What carries the argument

The tree-structured gate, which adaptively weights the outputs of an ensemble of deep regressors instead of averaging them.
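
To make the idea of a tree-structured gate concrete, here is a minimal NumPy sketch of one common way to build such a gate: a complete soft binary tree whose leaf weights are products of sigmoid routing probabilities along each root-to-leaf path. The text above does not specify the paper's exact gate, so the one-regressor-per-leaf layout, the linear splits, and every name (`tree_gate_weights`, `split_w`, `split_b`, `gated_prediction`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_gate_weights(features, split_w, split_b, depth):
    """Return one weight per leaf of a complete soft binary tree.

    features: (n_features,) input vector.
    split_w:  (2**depth - 1, n_features) hypothetical linear split weights.
    split_b:  (2**depth - 1,) hypothetical split biases.
    """
    # Probability of routing to the right child at every internal node.
    p_right = sigmoid(split_w @ features + split_b)
    n_leaves = 2 ** depth
    weights = np.ones(n_leaves)
    for leaf in range(n_leaves):
        node = 0  # root, using heap-style indexing (children: 2n+1, 2n+2)
        for level in range(depth):
            # The bit of the leaf index at this level picks left (0) or right (1).
            go_right = (leaf >> (depth - 1 - level)) & 1
            p = p_right[node]
            weights[leaf] *= p if go_right else (1.0 - p)
            node = 2 * node + 1 + go_right
    return weights

def gated_prediction(regressor_outputs, weights):
    """Weighted sum of outputs with shape (n_regressors, n_coords)."""
    return weights @ regressor_outputs

# Usage with random, purely illustrative parameters.
rng = np.random.default_rng(0)
depth = 2                                  # 4 leaves, so 4 regressors (assumption)
x = rng.normal(size=5)
w = tree_gate_weights(x, rng.normal(size=(3, 5)), rng.normal(size=3), depth)
outputs = rng.normal(size=(4, 6))          # 4 regressors, 3 landmarks as (x, y) pairs
shape_estimate = gated_prediction(outputs, w)
```

Because the routing probabilities at each node sum to one, the leaf weights are non-negative and sum to one by construction. With all split parameters set to zero the gate reduces to uniform averaging.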

If this is right

The ensemble with tree gating handles greater variations in head pose, expression, illumination, and occlusions than prior single-model approaches.
It supplies a more reliable preprocessing step for downstream applications such as facial expression recognition, face recognition, tracking, and animation.
Adaptive weighting through the gate is presented as the source of improvement over simple averaging.
The method is shown to exceed state-of-the-art results across several challenging face datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same tree-gate mechanism could be tested on other landmark regression tasks such as hand or body pose estimation.
The tree structure might allow inspection of which regressors are active for different image conditions, offering a form of interpretability.
Combining the gated ensemble with data augmentation strategies not explored in the paper could yield further robustness gains.

Load-bearing premise

The tree-structured gate supplies adaptive weighting that is superior to both a single large regressor and to simple averaging of ensemble outputs.

What would settle it

An ablation study on the same datasets that shows no accuracy gain when the tree gate is replaced by either a single regressor or uniform averaging would falsify the advantage of the proposed scheme.
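
The comparison described above can be sketched as a small harness that scores the three alternatives on the same image. This is only an outline under stated assumptions: `ablation_errors`, `mean_point_error`, and all argument names are hypothetical, the single-regressor prediction is supplied by the caller, and the gate weights may come from any gating scheme.

```python
import numpy as np

def mean_point_error(pred, target):
    """Mean Euclidean distance between landmark sets of shape (n_points, 2)."""
    return float(np.linalg.norm(pred - target, axis=1).mean())

def ablation_errors(single_pred, regressor_preds, gate_weights, target):
    """Score a single regressor, uniform averaging, and gated weighting.

    single_pred:     (n_points, 2) prediction of a single-regressor baseline.
    regressor_preds: (K, n_points, 2) predictions of the K ensemble members.
    gate_weights:    (K,) non-negative weights summing to 1.
    target:          (n_points, 2) ground-truth landmarks.
    """
    k = regressor_preds.shape[0]
    uniform = np.full(k, 1.0 / k)
    # tensordot contracts the weight vector with the regressor axis.
    averaged = np.tensordot(uniform, regressor_preds, axes=1)
    gated = np.tensordot(gate_weights, regressor_preds, axes=1)
    return {
        "single_regressor": mean_point_error(single_pred, target),
        "uniform_average": mean_point_error(averaged, target),
        "gated": mean_point_error(gated, target),
    }
```

Averaging these three numbers over a full test set, with identical training data and protocol for every variant, would give the ablation table the paragraph asks for. The gate is only vindicated if the `gated` column is consistently the lowest.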

Figures

Figures reproduced from arXiv: 1907.03248 by Arnaud Dapogny, Estephe Arnaud, Kevin Bailly.

**Figure 2.** Overview of regressor ensemble layer. Regressors and gating operators are depicted in green and blue, respectively. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Cumulative top-scoring regressor distribution and comparison between softmax and tree gates. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 4.** Visualisations of the predictions outputted for each cascade step with only the top (maximum value of either softmax [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

Original abstract

Face alignment consists in aligning a shape model on a face in an image. It is an active domain in computer vision as it is a preprocessing for applications like facial expression recognition, face recognition and tracking, face animation, etc. Current state-of-the-art methods already perform well on "easy" datasets, i.e. those that present moderate variations in head pose, expression, illumination or partial occlusions, but may not be robust to "in-the-wild" data. In this paper, we address this problem by using an ensemble of deep regressors instead of a single large regressor. Furthermore, instead of averaging the outputs of each regressor, we propose an adaptive weighting scheme that uses a tree-structured gate. Experiments on several challenging face datasets demonstrate that our approach outperforms the state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The tree gate is a modest tweak on ensemble regressors for face alignment that claims better wild-data robustness than averaging, but the abstract leaves the size of the gain and the controls unclear.

The one thing to know is that this paper swaps simple averaging for a tree-structured gate that adaptively weights an ensemble of deep regressors on face alignment. The gate is the only real addition on top of standard ensemble practice in the field. Everything else follows the usual pipeline of training regressors on landmark coordinates and testing on pose/expression/occlusion variations. The experiments are said to beat prior methods on several in-the-wild datasets, which is the expected claim for an applied CV paper. That part is straightforward and the application target is sensible. The soft spot is exactly what the abstract hides: no numbers, no error bars, no explicit comparison showing the tree gate beats both a single big regressor and plain averaging, and no mention of statistical tests or fixed protocols. If those controls are missing from the full paper too, the central claim is hard to evaluate. If they are present and clean, the work is incremental but usable. This is for people who need reliable face landmarks in unconstrained images and are already running ensembles. A reader who wants a small robustness bump without changing the whole architecture might try the gate. It is not going to shift how anyone thinks about deep ensembles in general. I would send it to peer review because the benchmarks are standard, the method is easy to reimplement, and referees can check the tables directly.

Referee Report

0 major / 2 minor

Summary. The paper proposes replacing a single large deep regressor for face alignment with an ensemble of smaller deep regressors whose outputs are combined via an adaptive weighting scheme implemented by a tree-structured gate. The central empirical claim is that this tree-gated ensemble outperforms prior state-of-the-art methods on several challenging in-the-wild face-alignment benchmarks.

Significance. If the reported gains are reproducible under standard protocols, the work supplies a practical, modular improvement to ensemble regression for landmark localization that could benefit downstream tasks such as expression recognition and tracking. The tree gate is a lightweight, potentially interpretable mechanism for input-dependent weighting that avoids both the parameter cost of a single monolithic network and the rigidity of uniform averaging.

minor comments (2)

The abstract asserts outperformance without naming the datasets, metrics, or baseline methods; the introduction or experimental section should include a concise statement of the evaluation protocol (e.g., 300W, AFLW, COFW with inter-ocular normalization) so that the claim can be assessed without reading the full results tables.
Figure captions and the method diagram should explicitly label the tree gate’s input features and the number of leaves used, to clarify how the adaptive weighting differs from a simple gating network.
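
For readers unfamiliar with the metric mentioned in the first comment, the following is a minimal sketch of the mean point error normalized by inter-ocular distance. The paper's exact protocol is not given in the text above. The default indices 36 and 45 are an assumption corresponding to the outer eye corners in the common 68-point markup (0-indexed), and the function name is hypothetical.

```python
import numpy as np

def normalized_mean_error(pred, target, left_eye_idx=36, right_eye_idx=45):
    """Mean landmark error divided by the ground-truth inter-ocular distance.

    pred, target: arrays of shape (n_points, 2).
    left_eye_idx, right_eye_idx: assumed outer eye corner indices.
    """
    inter_ocular = np.linalg.norm(target[left_eye_idx] - target[right_eye_idx])
    if inter_ocular == 0:
        raise ValueError("inter-ocular distance is zero")
    per_point = np.linalg.norm(pred - target, axis=1)
    return float(per_point.mean() / inter_ocular)
```

Some benchmarks normalize by inter-pupil distance or bounding-box size instead, which is why stating the normalizer alongside the reported numbers matters when comparing against other methods.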

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and for recommending minor revision. The referee's description of the tree-gated ensemble approach and its potential benefits is accurate. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical method paper

The paper proposes an ensemble of deep regressors with a tree-structured adaptive gate for face alignment and claims superiority via experiments on challenging datasets. No equations, derivations, or first-principles results are present in the provided text that reduce any claim to a fitted quantity defined by the method itself or to a self-citation chain. The central claim is end-to-end empirical outperformance on established benchmarks, which is self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard deep-learning assumptions for regression plus the untested premise that tree gating yields better adaptation than averaging; no new physical entities or formal axioms beyond domain conventions.

free parameters (2)

ensemble size: number of deep regressors, chosen as a hyperparameter.
tree depth and split criteria: structural parameters of the gating tree, fitted or selected during training.

axioms (1)

domain assumption: Deep neural networks can be trained as regressors to predict facial landmark coordinates from image features. This is a background assumption shared with all prior deep face alignment work.

