arxiv: 2605.06179 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision

Zejian Kang , Xuanyang Xu , Wentao Yang , Kai Zheng , Yuanchen Fei , Hongyuan Zou , Hui Shan , Shuo Yang

show 1 more author

Xiangru Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-08 13:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords facial expression estimationARKit blendshapespreference alignmentperceptual optimizationdigital human animationpseudo-label refinementsemantic facial actions

0 comments

The pith

Human preference feedback on rendered faces refines ARKit expression coefficients beyond noisy software labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that ARKit blendshape coefficient prediction can be improved by starting from pseudo labels but then optimizing directly against human judgments of how the rendered expressions look, rather than copying the labels as ground truth. Current supervised methods simply reproduce the errors, biases, and omissions in the software estimates, which limits how faithful the resulting facial animations appear. By collecting preference data on rendered outputs and using it to steer the model, the approach aims for coefficients that better match perceptual reality. If successful, this would produce digital human animations with more accurate and expressive facial actions without needing perfect capture labels.

Core claim

SuperFace treats software-estimated ARKit coefficients only as an initial reference and then refines the prediction model through human preference feedback on the visual quality of rendered facial expressions, shifting the optimization target from numerical imitation of pseudo labels to alignment with perceptual judgments of expression fidelity.

What carries the argument

Preference-driven refinement that collects human judgments on rendered facial expressions and uses them to adjust ARKit coefficient predictions away from the initial pseudo-label targets.

If this is right

Coefficient predictions avoid reproducing noise, magnitude biases, and missing actions present in the software pseudo labels.
Rendered facial animations exhibit higher visual fidelity and more complete expression coverage than models trained under direct supervision.
The method demonstrates that preference optimization can serve as a practical substitute for scarce or imperfect ground-truth data in semantic facial action prediction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same preference-refinement loop could be applied to other animation or rendering tasks where pseudo labels are known to be imperfect, such as body pose or hand tracking.
If the feedback loop scales with modest annotation budgets, it may reduce reliance on expensive motion-capture hardware for training expressive models.

Load-bearing premise

Human preference feedback collected on rendered facial expressions provides a reliable, unbiased signal capable of guiding optimization toward true perceptual fidelity.

What would settle it

A side-by-side perceptual study in which independent viewers rate the visual accuracy of animations produced by the preference-refined model versus the pseudo-label baseline on the same input sequences; if the ratings show no consistent preference or if inter-annotator agreement on the feedback is low, the central claim would be undermined.

Figures

Figures reproduced from arXiv: 2605.06179 by Hongyuan Zou, Hui Shan, Kai Zheng, Shuo Yang, Wentao Yang, Xiangru Huang, Xuanyang Xu, Yuanchen Fei, Zejian Kang.

**Figure 1.** Figure 1: SuperFace. SuperFace improves ARKit facial expression estimation beyond pseudo supervision by incorporating human preference feedback on rendered results. Instead of treating Live Link Face coefficients as fixed ground truth, our method uses them as initialization, constructs region-aware preference comparisons for upper-face and lower-face motions, and iteratively optimizes the predictor toward perceptual… view at source ↗

**Figure 2.** Figure 2: Overview of the proposed SuperFace pipeline. SuperFace starts from supervised initialization on chosen ARKit coefficients, which are initialized by Live Link Face in the first round. The policy model then generates candidate coefficients, which are combined with the current chosen coefficients through upper-face and lower-face mixing to construct localized comparison pairs. Human annotators provide prefere… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of ARKit facial expression estimation. SuperFace produces facial animations that are more perceptually consistent with the reference images, especially in local facial actions and overall expression semantics. This is notable because SuperFace uses only 2D RGB input, whereas Live Link Face benefits from depth sensing. The improvement mainly comes from DPO-based preference optimizatio… view at source ↗

read the original abstract

Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient prediction remains limited by the absence of reliable ground-truth supervision. Existing methods typically rely on capture software such as Live Link Face to provide pseudo labels, which may contain noisy activations, biased coefficient magnitudes, and missing or inaccurate facial actions. Consequently, models trained with supervised learning tend to reproduce imperfect pseudo labels rather than optimize for perceptual expression fidelity. In this paper, we propose SuperFace, a preference-driven framework that moves ARKit facial expression estimation from pseudo-label imitation toward human-aligned perceptual optimization. Instead of treating software-estimated coefficients as fixed ground truth, SuperFace uses them only as an initialization and further improves coefficient prediction through human preference feedback on rendered facial expressions. By aligning the model with perceptual judgments rather than numerical pseudo labels, SuperFace enables more visually faithful and expressive facial animation. Experiments show that SuperFace improves expression fidelity over Live Link Face supervision, demonstrating the effectiveness of preference-driven optimization for semantic facial action prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SuperFace shifts ARKit facial coeff prediction from copying Live Link Face pseudo-labels to human preference feedback on renders, but the abstract gives almost no method or result details to check if the shift actually works.

read the letter

The central point is that this paper takes the standard supervised setup for predicting ARKit blendshape coefficients and adds a second stage where human raters judge rendered expressions and the model is updated to match those judgments instead of the original noisy labels. That is a reasonable direction for getting animations that look better to people rather than just matching software outputs that have known biases and gaps in action coverage.

Referee Report

2 major / 1 minor

Summary. The paper proposes SuperFace, a preference-driven framework for predicting ARKit blendshape coefficients in facial expression estimation. It argues that pseudo-labels from Live Link Face are noisy and biased, so the method uses them only for initialization and refines predictions via human preference feedback on rendered expressions to achieve better perceptual fidelity rather than label imitation. Experiments are claimed to show improved expression fidelity over Live Link Face supervision.

Significance. If the central claim holds, the work would advance facial animation by demonstrating that preference alignment can outperform pseudo-supervision for semantic expression prediction, with potential impact on AR/VR and digital human applications. However, the absence of any methodological details prevents assessment of whether this advance is realized.

major comments (2)

[Abstract] Abstract: the claim of improved expression fidelity via preference-driven optimization is load-bearing for the paper's contribution, yet the abstract supplies no information on preference data collection protocol, the preference model or loss formulation, training procedure, or any quantitative metrics, rendering the experimental claim unevaluable.
[Abstract] Abstract: the assumption that human preference feedback on rendered expressions yields an unbiased, reliable optimization target superior to Live Link Face pseudo-labels is stated without supporting evidence, methodology, or analysis of potential annotation inconsistencies, which is central to validating the shift from imitation to perceptual alignment.

minor comments (1)

[Abstract] Abstract: the phrase 'SuperFace' is introduced without definition or motivation for the name.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's insightful comments on our paper. We address each major comment below and propose revisions to strengthen the abstract's clarity regarding our methodology and evidence.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of improved expression fidelity via preference-driven optimization is load-bearing for the paper's contribution, yet the abstract supplies no information on preference data collection protocol, the preference model or loss formulation, training procedure, or any quantitative metrics, rendering the experimental claim unevaluable.

Authors: We agree that the abstract could better convey these elements to allow readers to evaluate the claim immediately. The full details are provided in Sections 3 (Method) and 4 (Experiments) of the manuscript, including the protocol for collecting pairwise preferences on rendered faces, the Bradley-Terry preference model, the combined loss function, and metrics such as user preference win rates. In the revised manuscript, we will update the abstract to include concise descriptions of the preference collection protocol, model, and key quantitative results demonstrating improved fidelity. revision: yes
Referee: [Abstract] Abstract: the assumption that human preference feedback on rendered expressions yields an unbiased, reliable optimization target superior to Live Link Face pseudo-labels is stated without supporting evidence, methodology, or analysis of potential annotation inconsistencies, which is central to validating the shift from imitation to perceptual alignment.

Authors: The abstract states the motivation based on known limitations of pseudo-labels, but we acknowledge it does not detail the supporting evidence. The manuscript includes a user study and perceptual evaluations in Section 4 that demonstrate the superiority of the preference-aligned model. We also analyze annotation quality through agreement metrics. We will revise the abstract to reference that human preferences provide a more reliable target, as validated by our experiments, while noting the methodology in the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim rests on replacing Live Link Face pseudo-labels with an external human preference signal collected on rendered expressions. This preference feedback is described as an independent optimization target rather than a quantity derived from the model's own outputs or fitted parameters. No equations, loss formulations, or training procedures are shown that would reduce a claimed prediction back to the input labels by construction. No self-citations are used to establish uniqueness theorems or to smuggle in ansatzes. The derivation therefore remains self-contained against an external benchmark (human judgments) and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or new entities; the framework builds on standard ARKit representation and preference optimization concepts already present in the literature.

pith-pipeline@v0.9.0 · 5516 in / 1127 out tokens · 31055 ms · 2026-05-08T13:38:21.852093+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 17 canonical work pages · 7 internal anchors

[1]

https://developer.apple.com/documentation/arkit, 2023

Arkit. https://developer.apple.com/documentation/arkit, 2023. URL https://developer. apple.com/documentation/arkit. Apple Developer Documentation

2023
[2]

Aloni, R

Y . Aloni, R. Shalev-Arkushin, Y . Shafir, G. Tevet, O. Fried, and A. H. Bermano. Express4d: Expres- sive, friendly, and extensible 4d facial motion generation benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 485–494, October 2025

2025
[3]

Live Link Face

Apple Inc. Live Link Face. https://apps.apple.com/us/app/live-link-face/id1495370834,
[4]

Accessed: 2025-06-10

2025
[5]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

work page internal anchor Pith review arXiv 2025
[6]

M. Behzad. Facet-vlm: Facial emotion learning with text-guided multiview fusion via vision-language model for 3d/4d facial expression recognition.Neurocomputing, page 131621, 2025

2025
[7]

Blanz and T

V . Blanz and T. Vetter. A morphable model for the synthesis of 3d faces.Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194, 1999. URL https://dl.acm.org/doi/10.1145/311535.311556

work page doi:10.1145/311535.311556 1999
[8]

Chang, A

F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni. Expnet: Landmark-free, deep, 3d facial expressions. In2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 122–129. IEEE, 2018

2018
[9]

Danˇeˇcek, M

R. Danˇeˇcek, M. J. Black, and T. Bolkart. Emoca: Emotion driven monocular face capture and animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311– 20322, 2022

2022
[10]

L. Dong, X. Wang, S. Setlur, V . Govindaraju, and I. Nwogu. Ig3d: Integrating 3d face representations in facial expression inference. InEuropean Conference on Computer Vision, pages 404–421. Springer, 2024

2024
[11]

Egger, W

B. Egger, W. A. P. Smith, A. Tewari, S. Wuhrer, M. Zollhöfer, T. Beeler, F. Bernard, T. Bolkart, A. Ko- rtylewski, S. Romdhani, C. Theobalt, V . Blanz, and T. Vetter. 3d morphable face models – past, present and future.ACM Trans. Graph., 39(5):88:1–88:38, 2020

2020
[12]

Metahuman: High-fidelity digital humans made easy

Epic Games. Metahuman: High-fidelity digital humans made easy. https://www.metahuman.com/ en-US, 2025. Accessed: 2025-06-30

2025
[13]

Y . Feng, H. Feng, M. J. Black, and T. Bolkart. Learning an animatable detailed 3d face model from in-the-wild images.ACM Transactions on Graphics, 40(4):1–13, 2021. URL https://dl.acm.org/ doi/10.1145/3450626.3459936

work page doi:10.1145/3450626.3459936 2021
[14]

Mhr: Momentum human rig.arXiv preprint arXiv:2511.15586, 2025

A. Ferguson, A. A. A. Osman, B. Bescos, C. Stoll, C. Twigg, C. Lassner, D. Otte, E. Vignola, F. Prada, F. Bogo, et al. Mhr: Momentum human rig.arXiv preprint arXiv:2511.15586, 2025. URL https: //arxiv.org/abs/2511.15586

work page arXiv 2025
[15]

arXiv preprint arXiv:2505.00615 , year=

S. Giebenhain, T. Kirschstein, M. Rünz, L. Agapito, and M. Nießner. Pixel3dmm: Versatile screen- space priors for single-image 3d face reconstruction.arXiv preprint arXiv:2505.00615, 2025. URL https://arxiv.org/abs/2505.00615

work page arXiv 2025
[16]

Grishchenko, G

I. Grishchenko, G. Yan, E. G. Bazavan, A. Zanfir, N. Chinaev, K. Raveendran, M. Grundmann, and C. Sminchisescu. Blendshapes ghum: Real-time monocular facial blendshape prediction.arXiv preprint arXiv:2309.05782, 2023

work page arXiv 2023
[17]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review arXiv 2021
[18]

Z. Hu, K. Yuan, X. Liu, Z. Yu, Y . Zong, J. Shi, H. Yue, and J. Yang. Feallm: Advancing facial emotion analysis in multimodal large language models with emotional synergy and reasoning. InProceedings of the 33rd ACM International Conference on Multimedia, pages 5677–5686, 2025

2025
[19]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, X. Tang, Y . Hu, and S. Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025. 11

work page internal anchor Pith review arXiv 2025
[20]

H. Jin, Y . Lian, and J. Hua. Learning facial expressions with 3d mesh convolutional neural network.ACM Transactions on Intelligent Systems and Technology (TIST), 10(1):1–22, 2018

2018
[21]

Z. Kang, K. Zheng, Y . Fei, W. Yang, H. Zou, and X. Huang. Semanticface: Semantic facial action estimation via semantic distillation in interpretable space.arXiv preprint arXiv:2603.14827, 2026

work page arXiv 2026
[22]

H. Kim, P. Wang, and H. Lee. Action unit-based 3d face reconstruction using transformers. 2024

2024
[23]

J. P. Lewis, K. Anjyo, T. Rhee, M. Zhang, F. Pighin, and Z. Deng. Practice and theory of blendshape facial models.Eurographics State of the Art Reports, 2014. URL https://diglib.eg.org/items/ 8be51042-1c33-463e-954d-9221658e6172

2014
[24]

A. Li, L. Xu, C. Ling, J. Zhang, and P. Wang. Emoverse: Enhancing multimodal large language models for affective computing via multitask learning.Neurocomputing, 650:130810, 2025

2025
[25]

J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023
[26]

T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4d scans.ACM Trans. Graph., 36(6):194:1–194:17, 2017

2017
[27]

H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in Neural Information Processing Systems, 2023. URLhttps://arxiv.org/abs/2304.08485

work page internal anchor Pith review arXiv 2023
[28]

C. Lv, Z. Wu, X. Wang, and M. Zhou. 3d facial expression modeling based on facial landmarks in single image.Neurocomputing, 355:155–167, 2019

2019
[29]

F. Ma, Y . He, B. Sun, and S. Li. Multimodal prompt alignment for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12581–12591, 2025

2025
[30]

Menzel, M

T. Menzel, M. Botsch, and M. E. Latoschik. Automated blendshape personalization for faithful face animations using commodity smartphones.Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology (VRST ’22), pages 1–10, 2022. doi: 10.1145/3562939.3565622. Also available at: https://dl.acm.org/doi/10.1145/3562939.3565622

work page doi:10.1145/3562939.3565622 2022
[31]

Moeini, K

A. Moeini, K. Faez, H. Sadeghi, and H. Moeini. 2d facial expression recognition via 3d reconstruction and feature fusion.Journal of Visual Communication and Image Representation, 35:1–14, 2016

2016
[32]

F. I. Parke. A parametric model for human faces. Technical report, University of Utah, 1974. URL https://dl.acm.org/doi/10.5555/907312

work page doi:10.5555/907312 1974
[33]

Deadface: Arkit blendshape tracking and animation

Qaanaaq. Deadface: Arkit blendshape tracking and animation. https://github.com/Qaanaaq/ DeadFace, 2023. GitHub repository

2023
[34]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

2021
[35]

Rafailov, A

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36: 53728–53741, 2023

2023
[36]

Ramanathan, A

S. Ramanathan, A. Kassim, Y . Venkatesh, and W. S. Wah. Human facial expression recognition using a 3d morphable model. In2006 International conference on image processing, pages 661–664. IEEE, 2006

2006
[37]

arXiv preprint arXiv:2404.04104 , year=

G. Retsinas, P. P. Filntisis, R. Danecek, V . F. Abrevaya, A. Roussos, T. Bolkart, and P. Maragos. Smirk: 3d facial expressions through analysis-by-neural-synthesis.arXiv preprint arXiv:2404.04104, 2024. URL https://arxiv.org/abs/2404.04104

work page arXiv 2024
[38]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review arXiv 2017
[39]

H. O. Shahreza and S. Marcel. Facellm: A multimodal large language model for face understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3677–3687, 2025

2025
[40]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 12

work page internal anchor Pith review arXiv 2024
[41]

Vemulapalli and A

R. Vemulapalli and A. Agarwala. A compact embedding for facial expression similarity. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5683–5692, 2019

2019
[42]

J. Wu, Z. Kang, H. Liu, Y . Fei, and X. Huang. Keyframeface: From text to expressive facial keyframes,
[43]

URLhttps://arxiv.org/abs/2512.11321

work page internal anchor Pith review arXiv
[44]

arXiv preprint arXiv:2503.12937 , year =

J. Zhang, J. Huang, H. Yao, S. Liu, X. Zhang, S. Lu, and D. Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization.arXiv preprint arXiv:2503.12937, 2025

work page arXiv 2025
[45]

A”, “B”, and “Similar

H. Zhao, Z. Liu, Y . Liu, Z. Qin, J. Liu, and T. Gedeon. Facephi: Lightweight multimodal large language model for facial landmark emotion recognition. In5th Workshop on practical ML for limited/low resource settings, 2024. 13 Appendix A Evaluation Metrics Details We use the following metrics to evaluate model performance: Total votes.The total number of v...

2024