Self-Supervised Adaptation of High-Fidelity Face Models for Monocular Performance Tracking

Hyun Soo Park; Jae Shin Yoon; Shoou-I Yu; Takaaki Shiratori

arxiv: 1907.10815 · v1 · pith:Y7VDB6A3new · submitted 2019-07-25 · 💻 cs.CV

Pith reviewed 2026-05-24 16:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords face tracking, domain adaptation, self-supervised learning, monocular performance capture, high-fidelity face models, texture consistency, 2D to 3D driving

The pith

Self-supervised adaptation lets high-fidelity face models track performance from cellphone videos without any new labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to animate detailed 3D face models using ordinary 2D video from consumer cameras. It replaces the need for special 3D input data by training a network that maps single images directly to model controls. Domain differences between controlled lab captures and real-world footage are then bridged through a self-supervised step that enforces texture consistency across consecutive frames. This removes the need to model new lighting or backgrounds, or to collect labeled examples in the target setting. The outcome is a system that drives complex facial motion from phone cameras.

Core claim

The central claim is that a network can be trained to drive a high-fidelity face model from single 2D images, after which self-supervised domain adaptation via consecutive frame texture consistency transfers the model to uncontrolled environments without labeled data from the new domain.

What carries the argument

Consecutive frame texture consistency, a self-supervised constraint that assumes constant face appearance across adjacent frames and uses that to adapt the driving network.
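
To make the constraint concrete, here is a minimal sketch of what a consecutive-frame texture consistency loss could look like. It is an illustration under stated assumptions, not the paper's formulation: textures are assumed to be already unwrapped into a shared UV layout and flattened to lists of RGB texels in [0, 1], the per-texel visibility masks are hypothetical inputs, and the L1 penalty and the name `texture_consistency_loss` are choices made here for illustration.

```python
from typing import Sequence, Tuple

Texel = Tuple[float, float, float]  # RGB values in [0, 1] (assumed layout)


def texture_consistency_loss(
    tex_prev: Sequence[Texel],
    tex_next: Sequence[Texel],
    visible_prev: Sequence[bool],
    visible_next: Sequence[bool],
) -> float:
    """Mean absolute RGB difference over texels visible in both frames.

    Hypothetical helper: the L1 penalty and the visibility masks are
    assumptions of this sketch, not details taken from the paper.
    """
    sizes = {len(tex_prev), len(tex_next), len(visible_prev), len(visible_next)}
    if len(sizes) != 1:
        raise ValueError("textures and masks must have the same length")

    total = 0.0
    count = 0
    for a, b, seen_a, seen_b in zip(tex_prev, tex_next, visible_prev, visible_next):
        # Only texels observed in both frames can be compared.
        if seen_a and seen_b:
            total += sum(abs(x - y) for x, y in zip(a, b))
            count += 3
    # With no co-visible texels there is nothing to penalize.
    return total / count if count else 0.0
```

A loss of this kind is zero when the two unwrapped textures agree on every co-visible texel, which is exactly the situation the premise assumes for adjacent frames of a correctly tracked face.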

If this is right

High-fidelity models become usable with standard 2D image input instead of meshes or unwrapped textures.
No explicit modeling of the target environment is required for domain transfer.
Complex facial motions can be captured from commodity cameras without domain-specific labels.
The adaptation step works on unlabeled video sequences from the new setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency signal could support adaptation for other time-varying tracking problems where appearance is stable over short intervals.
Mobile real-time performance capture becomes practical once the network is adapted.
The method may extend to objects other than faces if a comparable temporal consistency cue exists.

Load-bearing premise

The face's appearance stays consistent from one frame to the next even when the camera, lighting, or background changes.

What would settle it

A video sequence in which face texture visibly changes between consecutive frames due to lighting variation or motion would produce tracking errors after adaptation.

Figures

Figures reproduced from arXiv: 1907.10815 by Hyun Soo Park, Jae Shin Yoon, Shoou-I Yu, Takaaki Shiratori.

**Figure 1.** Results of high-fidelity 3D facial performance tracking from our method, which automatically adapts a high…

**Figure 2.** Illustration of the I2ZNet architecture. I2ZNet…

**Figure 3.** Overview of our self-supervised domain adapta…

**Figure 5.** Proposed method during testing phase.

Adjacent text from the paper: "… changes. Therefore, we incorporate an additional network T ← C(T) to convert the color of the predicted texture to the one of the currently observed texture. C(T) is also learned, and since training data is limited, we learn a single 1-by-1 convolutional filter which can be viewed as the color correction matrix and corrects the white-balance between the two textures.…"
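
The color-correction step mentioned in the Figure 5 text can be illustrated with a short sketch. A 1-by-1 convolution over an RGB texture applies the same 3x3 matrix to every texel, which is why it can be read as a color correction matrix. The code below assumes a bias-free filter and flattened RGB texels; the closed-form fit covers only the diagonal (per-channel gain) case, a simplification chosen here so the example stays dependency-free. The function names are hypothetical and the paper's actual training procedure is not shown on this page.

```python
from typing import List, Sequence, Tuple

Texel = Tuple[float, float, float]
Matrix3 = Sequence[Sequence[float]]  # 3x3; each row produces one output channel


def apply_color_correction(texture: Sequence[Texel], m: Matrix3) -> List[Texel]:
    """Apply one 3x3 matrix to every texel (a bias-free 1x1 convolution)."""
    return [
        tuple(row[0] * r + row[1] * g + row[2] * b for row in m)
        for r, g, b in texture
    ]


def fit_channel_gains(predicted: Sequence[Texel], observed: Sequence[Texel]) -> List[float]:
    """Least-squares gain per channel, i.e. a diagonal correction matrix.

    Simplification for illustration: a full 3x3 fit would also mix channels.
    """
    gains = []
    for c in range(3):
        num = sum(p[c] * o[c] for p, o in zip(predicted, observed))
        den = sum(p[c] * p[c] for p in predicted)
        gains.append(num / den if den > 0 else 1.0)
    return gains


def diagonal_matrix(gains: Sequence[float]) -> List[List[float]]:
    """Build the 3x3 diagonal matrix that applies the per-channel gains."""
    return [[gains[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
```

For example, fitting gains between a predicted texture and a warmer or cooler observed texture, then calling `apply_color_correction(predicted, diagonal_matrix(gains))`, shifts the predicted colors toward the observed white balance without touching geometry or expression.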

**Figure 6.** Temporal stability graph for subject 4. Note that…

**Figure 8.** Visualization of 3D face tracking for in-the-wild…

**Figure 9.** Ablation test on I2ZNet with a representative sub…

**Figure 10.** Ablation studies on the performance degradation…

**Figure 11.** I2ZNet directly regresses the latent facial state codes…

**Figure 12.** Visualization of the vertex-wise accuracy with…

Original abstract

Improvements in data-capture and face modeling techniques have enabled us to create high-fidelity realistic face models. However, driving these realistic face models requires special input data, e.g. 3D meshes and unwrapped textures. Also, these face models expect clean input data taken under controlled lab environments, which is very different from data collected in the wild. All these constraints make it challenging to use the high-fidelity models in tracking for commodity cameras. In this paper, we propose a self-supervised domain adaptation approach to enable the animation of high-fidelity face models from a commodity camera. Our approach first circumvents the requirement for special input data by training a new network that can directly drive a face model just from a single 2D image. Then, we overcome the domain mismatch between lab and uncontrolled environments by performing self-supervised domain adaptation based on "consecutive frame texture consistency" based on the assumption that the appearance of the face is consistent over consecutive frames, avoiding the necessity of modeling the new environment such as lighting or background. Experiments show that we are able to drive a high-fidelity face model to perform complex facial motion from a cellphone camera without requiring any labeled data from the new domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's self-supervised adaptation via consecutive-frame texture consistency aims to drive high-fidelity face models from phone video without labels, but the assumption looks too brittle for typical uncontrolled footage.

This paper's key move is to use self-supervised adaptation based on consecutive-frame texture consistency to drive high-fidelity face models from phone cameras. The authors first train a driver network that maps a 2D image to the model's controls, then adapt it without labels by assuming that texture stays consistent from frame to frame. The novelty lies in combining the driver network with this particular self-supervision signal to handle the domain shift from lab to in-the-wild footage. The paper does a good job of identifying the practical barriers: special inputs and controlled environments.

The main concern is that the consistency assumption will not hold reliably. In practice, even adjacent frames in cellphone video differ: head motion changes the shading, auto-exposure adjusts brightness, and the viewpoint shifts slightly. The method deliberately skips modeling lighting and background and relies on this signal instead, so if the signal is noisy the adaptation could fail or introduce artifacts. The stress-test note points this out, and since the abstract includes no error analysis or comparisons under varied conditions, it is not clear how well the method performs when the assumption is stressed. If the full paper has those details and they are solid, that would change the picture.

The paper is aimed at computer vision groups working on face performance capture who need to deploy lab models on consumer devices. Someone looking for domain adaptation tricks in tracking might also get an idea from it. I would recommend sending it for peer review: the problem is real, and the method is straightforward enough that referees can evaluate the experiments properly.

Referee Report

2 major / 1 minor

Summary. The paper claims a self-supervised domain adaptation method that first trains a network to drive a high-fidelity face model directly from a single 2D image (bypassing the need for 3D meshes or unwrapped textures) and then adapts the model to uncontrolled cellphone video by enforcing a texture-consistency loss between consecutive frames. The adaptation rests on the assumption that face appearance remains stationary across frames, thereby avoiding explicit modeling of lighting, background, or other environmental factors. Experiments are said to show successful driving of complex facial motion from commodity cameras without any labeled target-domain data.

Significance. If the central claim is substantiated, the work would enable practical deployment of lab-captured high-fidelity face models in everyday monocular settings, which is a meaningful step for performance capture and animation pipelines. The self-supervised formulation that sidesteps new labeled data collection is a clear methodological strength.

major comments (2)

[Method (domain adaptation subsection)] The domain-adaptation stage (described after the initial network training) defines the self-supervised loss exclusively via consecutive-frame texture consistency. No ablation or sensitivity analysis is provided that tests the loss under the illumination shifts, auto-exposure changes, or small viewpoint variations that routinely occur in cellphone video; because the adaptation step depends directly on this unverified premise, the absence of such validation is load-bearing for the central claim.
[Experiments] The experiments section asserts that the adapted model successfully drives complex facial motion from cellphone footage, yet reports no quantitative tracking or reconstruction error metrics, no comparison against supervised or lighting-aware baselines, and no failure-case analysis on sequences where the consistency assumption is violated. This leaves the empirical support for the “without requiring any labeled data” claim difficult to evaluate.

minor comments (1)

[Abstract] The abstract would be strengthened by inclusion of at least one quantitative result (e.g., a tracking error number or comparison) rather than a purely qualitative statement.
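
The sensitivity analysis requested in the first major comment could start from something as simple as the sketch below. It asks how large a residual a consistency loss would see if nothing changed between two frames except exposure. Everything here is an assumption made for illustration: the texture is reduced to grayscale intensities in [0, 1], auto-exposure is modeled as a single multiplicative gain with clipping, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def simulate_exposure(texture: Sequence[float], gain: float) -> List[float]:
    """Scale intensities by a gain and clip to [0, 1] (crude auto-exposure model)."""
    return [min(1.0, max(0.0, gain * v)) for v in texture]


def mean_abs_residual(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized intensity lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def exposure_sensitivity(texture: Sequence[float], gains: Sequence[float]) -> Dict[float, float]:
    """Residual per gain when the face is static and only exposure changes."""
    return {g: mean_abs_residual(texture, simulate_exposure(texture, g)) for g in gains}
```

Any nonzero value returned for a gain other than 1.0 is error that the adaptation step would attribute to tracking even though the face did not move, which is the failure mode the comment is concerned with.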

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [Method (domain adaptation subsection)] The domain-adaptation stage (described after the initial network training) defines the self-supervised loss exclusively via consecutive-frame texture consistency. No ablation or sensitivity analysis is provided that tests the loss under the illumination shifts, auto-exposure changes, or small viewpoint variations that routinely occur in cellphone video; because the adaptation step depends directly on this unverified premise, the absence of such validation is load-bearing for the central claim.

Authors: We agree that validating the texture consistency assumption under realistic variations strengthens the central claim. In the revised manuscript we will add an ablation study that applies controlled illumination shifts, auto-exposure simulation, and small viewpoint perturbations to consecutive-frame pairs and reports the resulting adaptation quality. This directly tests the load-bearing premise. revision: yes
Referee: [Experiments] The experiments section asserts that the adapted model successfully drives complex facial motion from cellphone footage, yet reports no quantitative tracking or reconstruction error metrics, no comparison against supervised or lighting-aware baselines, and no failure-case analysis on sequences where the consistency assumption is violated. This leaves the empirical support for the “without requiring any labeled data” claim difficult to evaluate.

Authors: We acknowledge the lack of quantitative metrics and comparisons. Because the method is deliberately self-supervised, direct reconstruction error on target labels is unavailable by design; however, we will add proxy quantitative evaluations (e.g., landmark reprojection error on held-out frames) together with comparisons against a supervised baseline trained on limited synthetic data and a lighting-augmented variant. We will also include a dedicated failure-case analysis for sequences that violate the consistency assumption (rapid lighting changes, large head motion). These additions will be incorporated in the revision. revision: yes
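
The proxy metric named in the second response, landmark reprojection error, can be sketched as follows. This is a generic formulation, not something taken from the paper: it assumes a pinhole camera with a single focal length and principal point (`focal`, `cx`, `cy` are illustrative parameters), 3D landmarks expressed in camera coordinates, and 2D detections supplied by some external landmark detector.

```python
import math
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def project_point(p: Point3, focal: float, cx: float, cy: float) -> Point2:
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


def mean_reprojection_error(
    landmarks_3d: Sequence[Point3],
    detections_2d: Sequence[Point2],
    focal: float,
    cx: float,
    cy: float,
) -> float:
    """Mean pixel distance between projected model landmarks and 2D detections."""
    if len(landmarks_3d) != len(detections_2d) or not landmarks_3d:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, (u, v) in zip(landmarks_3d, detections_2d):
        pu, pv = project_point(p, focal, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(landmarks_3d)
```

Because the detections themselves carry error, a number produced this way measures agreement with the detector rather than true 3D accuracy, so it is best read alongside the qualitative results rather than in place of them.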

Circularity Check

0 steps flagged

No circularity: adaptation uses explicit consistency assumption without reducing to self-definition or fitted inputs.

full rationale

The paper's core method trains a network to drive a face model from single 2D images, then applies self-supervised domain adaptation via a loss enforcing consecutive-frame texture consistency under the stated assumption that face appearance remains stationary across frames. This assumption is declared upfront and is not derived from or equivalent to the method's outputs; the adaptation step is a direct application of the loss rather than a prediction that collapses to fitted parameters or prior self-citations. No equations or steps in the provided text reduce the claimed result to its inputs by construction, and the approach remains falsifiable against external video data where the assumption may fail.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of consecutive-frame face appearance consistency to enable adaptation without explicit environment modeling; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: the appearance of the face is consistent over consecutive frames
This assumption underpins the self-supervised domain adaptation step and avoids the need to model lighting or background.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Proc. ACM SIGGRAPH, pages 187–194, 1999.
[2] James Booth, Epameinondas Antonakos, Stylianos Ploumpis, George Trigeorgis, Yannis Panagakis, and Stefanos Zafeiriou. 3D face morphable models “in-the-wild”. In Proc. CVPR, 2017.
[3] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3D morphable models. IJCV, 126(2-4):233–254,
[4] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. IEEE TVCG, 20(3):413–425, 2014.
[5] Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. Active appearance models. IEEE TPAMI, (6):681–685, 2001.
[6] Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. Active shape models-their training and application. CVIU, 61(1):38–59, 1995.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.
[8] Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, and Yaser Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In Proc. CVPR,
[9] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In Proc. ECCV, 2018.
[10] László A Jeni, Jeffrey F Cohn, and Takeo Kanade. Dense 3D face alignment from 2D video for real-time use. Image Vision Comput., 58(C):13–24, 2017.
[11] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proc. CVPR, 2018.
[12] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proc. ICLR, 2014.
[13] Stephen Lombardi, Tomas Simon, Jason Saragih, and Yaser Sheikh. Deep appearance models for face rendering. ACM TOG, 37(4), 2018.
[14] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Proc. ECCV, 2016.
[15] Elad Richardson, Matan Sela, and Ron Kimmel. 3D face reconstruction by learning from synthetic data. In Proc. 3DV, 2016.
[16] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face reconstruction from a single image. In Proc. CVPR, 2017.
[17] Sami Romdhani and Thomas Vetter. Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In Proc. CVPR, 2005.
[18] Joseph Roth, Yiying Tong, and Xiaoming Liu. Adaptive 3D face reconstruction from unconstrained photo collections. In Proc. CVPR, 2016.
[19] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: Database and results. Image Vision Comput., 47:3–18, 2016.
[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
[21] Ran Tao, Efstratios Gavves, and Arnold W. M. Smeulders. Siamese instance search for tracking. In Proc. CVPR, 2016.
[22] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In Proc. CVPR, 2018.
[23] Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, and Christian Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In Proc. ICCV, 2017.
[24] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gérard Medioni. Regressing robust and discriminative 3D morphable models with a very deep neural network. In Proc. CVPR, 2017.
[25] Levi Valgaerts, Chenglei Wu, Andrés Bruhn, Hans-Peter Seidel, and Christian Theobalt. Lightweight binocular facial performance capture under uncontrolled lighting. ACM TOG, 31(6):187–1, 2012.
[26] Jae Shin Yoon, Francois Rameau, Junsik Kim, Seokju Lee, Seunghak Shin, and In So Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In Proc. ICCV, 2017.
[27] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z. Li. Face alignment across large poses: A 3D solution. In Proc. CVPR, 2016.
[28] Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In Proc. CVPR, 2015.
[29] Xiangyu Zhu, Xiaoming Liu, Zhen Lei, and Stan Z. Li. Face alignment in full pose range: A 3D total solution. IEEE TPAMI, 2019.

Supplementary Material: Self-Supervised Adaptation of High-Fidelity Face Models for Monocular Performance Tracking. Jae Shin Yoon†, Takaaki Shiratori‡, Shoou-I Yu‡, Hyun Soo Park†. †University of Minnesota, ‡Facebook Reality Labs. {j...
