Recognition: unknown
SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision
Pith reviewed 2026-05-08 13:38 UTC · model grok-4.3
The pith
Human preference feedback on rendered faces refines ARKit expression coefficients beyond noisy software labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SuperFace treats software-estimated ARKit coefficients only as an initial reference and then refines the prediction model through human preference feedback on the visual quality of rendered facial expressions, shifting the optimization target from numerical imitation of pseudo labels to alignment with perceptual judgments of expression fidelity.
What carries the argument
Preference-driven refinement that collects human judgments on rendered facial expressions and uses them to adjust ARKit coefficient predictions away from the initial pseudo-label targets.
If this is right
- Coefficient predictions avoid reproducing noise, magnitude biases, and missing actions present in the software pseudo labels.
- Rendered facial animations exhibit higher visual fidelity and more complete expression coverage than models trained under direct supervision.
- The method demonstrates that preference optimization can serve as a practical substitute for scarce or imperfect ground-truth data in semantic facial action prediction.
Where Pith is reading between the lines
- The same preference-refinement loop could be applied to other animation or rendering tasks where pseudo labels are known to be imperfect, such as body pose or hand tracking.
- If the feedback loop scales with modest annotation budgets, it may reduce reliance on expensive motion-capture hardware for training expressive models.
Load-bearing premise
Human preference feedback collected on rendered facial expressions provides a reliable, unbiased signal capable of guiding optimization toward true perceptual fidelity.
What would settle it
A side-by-side perceptual study in which independent viewers rate the visual accuracy of animations produced by the preference-refined model versus the pseudo-label baseline on the same input sequences; if the ratings show no consistent preference or if inter-annotator agreement on the feedback is low, the central claim would be undermined.
Figures
read the original abstract
Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient prediction remains limited by the absence of reliable ground-truth supervision. Existing methods typically rely on capture software such as Live Link Face to provide pseudo labels, which may contain noisy activations, biased coefficient magnitudes, and missing or inaccurate facial actions. Consequently, models trained with supervised learning tend to reproduce imperfect pseudo labels rather than optimize for perceptual expression fidelity. In this paper, we propose SuperFace, a preference-driven framework that moves ARKit facial expression estimation from pseudo-label imitation toward human-aligned perceptual optimization. Instead of treating software-estimated coefficients as fixed ground truth, SuperFace uses them only as an initialization and further improves coefficient prediction through human preference feedback on rendered facial expressions. By aligning the model with perceptual judgments rather than numerical pseudo labels, SuperFace enables more visually faithful and expressive facial animation. Experiments show that SuperFace improves expression fidelity over Live Link Face supervision, demonstrating the effectiveness of preference-driven optimization for semantic facial action prediction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SuperFace, a preference-driven framework for predicting ARKit blendshape coefficients in facial expression estimation. It argues that pseudo-labels from Live Link Face are noisy and biased, so the method uses them only for initialization and refines predictions via human preference feedback on rendered expressions to achieve better perceptual fidelity rather than label imitation. Experiments are claimed to show improved expression fidelity over Live Link Face supervision.
Significance. If the central claim holds, the work would advance facial animation by demonstrating that preference alignment can outperform pseudo-supervision for semantic expression prediction, with potential impact on AR/VR and digital human applications. However, the absence of any methodological details prevents assessment of whether this advance is realized.
major comments (2)
- [Abstract] Abstract: the claim of improved expression fidelity via preference-driven optimization is load-bearing for the paper's contribution, yet the abstract supplies no information on preference data collection protocol, the preference model or loss formulation, training procedure, or any quantitative metrics, rendering the experimental claim unevaluable.
- [Abstract] Abstract: the assumption that human preference feedback on rendered expressions yields an unbiased, reliable optimization target superior to Live Link Face pseudo-labels is stated without supporting evidence, methodology, or analysis of potential annotation inconsistencies, which is central to validating the shift from imitation to perceptual alignment.
minor comments (1)
- [Abstract] Abstract: the phrase 'SuperFace' is introduced without definition or motivation for the name.
Simulated Author's Rebuttal
Thank you for the referee's insightful comments on our paper. We address each major comment below and propose revisions to strengthen the abstract's clarity regarding our methodology and evidence.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of improved expression fidelity via preference-driven optimization is load-bearing for the paper's contribution, yet the abstract supplies no information on preference data collection protocol, the preference model or loss formulation, training procedure, or any quantitative metrics, rendering the experimental claim unevaluable.
Authors: We agree that the abstract could better convey these elements to allow readers to evaluate the claim immediately. The full details are provided in Sections 3 (Method) and 4 (Experiments) of the manuscript, including the protocol for collecting pairwise preferences on rendered faces, the Bradley-Terry preference model, the combined loss function, and metrics such as user preference win rates. In the revised manuscript, we will update the abstract to include concise descriptions of the preference collection protocol, model, and key quantitative results demonstrating improved fidelity. revision: yes
-
Referee: [Abstract] Abstract: the assumption that human preference feedback on rendered expressions yields an unbiased, reliable optimization target superior to Live Link Face pseudo-labels is stated without supporting evidence, methodology, or analysis of potential annotation inconsistencies, which is central to validating the shift from imitation to perceptual alignment.
Authors: The abstract states the motivation based on known limitations of pseudo-labels, but we acknowledge it does not detail the supporting evidence. The manuscript includes a user study and perceptual evaluations in Section 4 that demonstrate the superiority of the preference-aligned model. We also analyze annotation quality through agreement metrics. We will revise the abstract to reference that human preferences provide a more reliable target, as validated by our experiments, while noting the methodology in the main text. revision: yes
Circularity Check
No significant circularity
full rationale
The paper's central claim rests on replacing Live Link Face pseudo-labels with an external human preference signal collected on rendered expressions. This preference feedback is described as an independent optimization target rather than a quantity derived from the model's own outputs or fitted parameters. No equations, loss formulations, or training procedures are shown that would reduce a claimed prediction back to the input labels by construction. No self-citations are used to establish uniqueness theorems or to smuggle in ansatzes. The derivation therefore remains self-contained against an external benchmark (human judgments) and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
https://developer.apple.com/documentation/arkit, 2023
Arkit. https://developer.apple.com/documentation/arkit, 2023. URL https://developer. apple.com/documentation/arkit. Apple Developer Documentation
2023
-
[2]
Aloni, R
Y . Aloni, R. Shalev-Arkushin, Y . Shafir, G. Tevet, O. Fried, and A. H. Bermano. Express4d: Expres- sive, friendly, and extensible 4d facial motion generation benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 485–494, October 2025
2025
-
[3]
Live Link Face
Apple Inc. Live Link Face. https://apps.apple.com/us/app/live-link-face/id1495370834,
-
[4]
Accessed: 2025-06-10
2025
-
[5]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...
work page internal anchor Pith review arXiv 2025
-
[6]
M. Behzad. Facet-vlm: Facial emotion learning with text-guided multiview fusion via vision-language model for 3d/4d facial expression recognition.Neurocomputing, page 131621, 2025
2025
-
[7]
V . Blanz and T. Vetter. A morphable model for the synthesis of 3d faces.Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194, 1999. URL https://dl.acm.org/doi/10.1145/311535.311556
-
[8]
Chang, A
F.-J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni. Expnet: Landmark-free, deep, 3d facial expressions. In2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 122–129. IEEE, 2018
2018
-
[9]
Danˇeˇcek, M
R. Danˇeˇcek, M. J. Black, and T. Bolkart. Emoca: Emotion driven monocular face capture and animation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311– 20322, 2022
2022
-
[10]
L. Dong, X. Wang, S. Setlur, V . Govindaraju, and I. Nwogu. Ig3d: Integrating 3d face representations in facial expression inference. InEuropean Conference on Computer Vision, pages 404–421. Springer, 2024
2024
-
[11]
Egger, W
B. Egger, W. A. P. Smith, A. Tewari, S. Wuhrer, M. Zollhöfer, T. Beeler, F. Bernard, T. Bolkart, A. Ko- rtylewski, S. Romdhani, C. Theobalt, V . Blanz, and T. Vetter. 3d morphable face models – past, present and future.ACM Trans. Graph., 39(5):88:1–88:38, 2020
2020
-
[12]
Metahuman: High-fidelity digital humans made easy
Epic Games. Metahuman: High-fidelity digital humans made easy. https://www.metahuman.com/ en-US, 2025. Accessed: 2025-06-30
2025
-
[13]
Y . Feng, H. Feng, M. J. Black, and T. Bolkart. Learning an animatable detailed 3d face model from in-the-wild images.ACM Transactions on Graphics, 40(4):1–13, 2021. URL https://dl.acm.org/ doi/10.1145/3450626.3459936
-
[14]
Mhr: Momentum human rig.arXiv preprint arXiv:2511.15586, 2025
A. Ferguson, A. A. A. Osman, B. Bescos, C. Stoll, C. Twigg, C. Lassner, D. Otte, E. Vignola, F. Prada, F. Bogo, et al. Mhr: Momentum human rig.arXiv preprint arXiv:2511.15586, 2025. URL https: //arxiv.org/abs/2511.15586
-
[15]
arXiv preprint arXiv:2505.00615 , year=
S. Giebenhain, T. Kirschstein, M. Rünz, L. Agapito, and M. Nießner. Pixel3dmm: Versatile screen- space priors for single-image 3d face reconstruction.arXiv preprint arXiv:2505.00615, 2025. URL https://arxiv.org/abs/2505.00615
-
[16]
I. Grishchenko, G. Yan, E. G. Bazavan, A. Zanfir, N. Chinaev, K. Raveendran, M. Grundmann, and C. Sminchisescu. Blendshapes ghum: Real-time monocular facial blendshape prediction.arXiv preprint arXiv:2309.05782, 2023
-
[17]
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021
work page internal anchor Pith review arXiv 2021
-
[18]
Z. Hu, K. Yuan, X. Liu, Z. Yu, Y . Zong, J. Shi, H. Yue, and J. Yang. Feallm: Advancing facial emotion analysis in multimodal large language models with emotional synergy and reasoning. InProceedings of the 33rd ACM International Conference on Multimedia, pages 5677–5686, 2025
2025
-
[19]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, X. Tang, Y . Hu, and S. Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025. 11
work page internal anchor Pith review arXiv 2025
-
[20]
H. Jin, Y . Lian, and J. Hua. Learning facial expressions with 3d mesh convolutional neural network.ACM Transactions on Intelligent Systems and Technology (TIST), 10(1):1–22, 2018
2018
- [21]
-
[22]
H. Kim, P. Wang, and H. Lee. Action unit-based 3d face reconstruction using transformers. 2024
2024
-
[23]
J. P. Lewis, K. Anjyo, T. Rhee, M. Zhang, F. Pighin, and Z. Deng. Practice and theory of blendshape facial models.Eurographics State of the Art Reports, 2014. URL https://diglib.eg.org/items/ 8be51042-1c33-463e-954d-9221658e6172
2014
-
[24]
A. Li, L. Xu, C. Ling, J. Zhang, and P. Wang. Emoverse: Enhancing multimodal large language models for affective computing via multitask learning.Neurocomputing, 650:130810, 2025
2025
-
[25]
J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023
2023
-
[26]
T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4d scans.ACM Trans. Graph., 36(6):194:1–194:17, 2017
2017
-
[27]
H. Liu, C. Li, Q. Wu, and Y . J. Lee. Visual instruction tuning.Advances in Neural Information Processing Systems, 2023. URLhttps://arxiv.org/abs/2304.08485
work page internal anchor Pith review arXiv 2023
-
[28]
C. Lv, Z. Wu, X. Wang, and M. Zhou. 3d facial expression modeling based on facial landmarks in single image.Neurocomputing, 355:155–167, 2019
2019
-
[29]
F. Ma, Y . He, B. Sun, and S. Li. Multimodal prompt alignment for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12581–12591, 2025
2025
-
[30]
T. Menzel, M. Botsch, and M. E. Latoschik. Automated blendshape personalization for faithful face animations using commodity smartphones.Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology (VRST ’22), pages 1–10, 2022. doi: 10.1145/3562939.3565622. Also available at: https://dl.acm.org/doi/10.1145/3562939.3565622
-
[31]
Moeini, K
A. Moeini, K. Faez, H. Sadeghi, and H. Moeini. 2d facial expression recognition via 3d reconstruction and feature fusion.Journal of Visual Communication and Image Representation, 35:1–14, 2016
2016
-
[32]
F. I. Parke. A parametric model for human faces. Technical report, University of Utah, 1974. URL https://dl.acm.org/doi/10.5555/907312
-
[33]
Deadface: Arkit blendshape tracking and animation
Qaanaaq. Deadface: Arkit blendshape tracking and animation. https://github.com/Qaanaaq/ DeadFace, 2023. GitHub repository
2023
-
[34]
Radford, J
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
2021
-
[35]
Rafailov, A
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36: 53728–53741, 2023
2023
-
[36]
Ramanathan, A
S. Ramanathan, A. Kassim, Y . Venkatesh, and W. S. Wah. Human facial expression recognition using a 3d morphable model. In2006 International conference on image processing, pages 661–664. IEEE, 2006
2006
-
[37]
arXiv preprint arXiv:2404.04104 , year=
G. Retsinas, P. P. Filntisis, R. Danecek, V . F. Abrevaya, A. Roussos, T. Bolkart, and P. Maragos. Smirk: 3d facial expressions through analysis-by-neural-synthesis.arXiv preprint arXiv:2404.04104, 2024. URL https://arxiv.org/abs/2404.04104
-
[38]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review arXiv 2017
-
[39]
H. O. Shahreza and S. Marcel. Facellm: A multimodal large language model for face understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3677–3687, 2025
2025
-
[40]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 12
work page internal anchor Pith review arXiv 2024
-
[41]
Vemulapalli and A
R. Vemulapalli and A. Agarwala. A compact embedding for facial expression similarity. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5683–5692, 2019
2019
-
[42]
J. Wu, Z. Kang, H. Liu, Y . Fei, and X. Huang. Keyframeface: From text to expressive facial keyframes,
-
[43]
URLhttps://arxiv.org/abs/2512.11321
work page internal anchor Pith review arXiv
-
[44]
arXiv preprint arXiv:2503.12937 , year =
J. Zhang, J. Huang, H. Yao, S. Liu, X. Zhang, S. Lu, and D. Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization.arXiv preprint arXiv:2503.12937, 2025
-
[45]
A”, “B”, and “Similar
H. Zhao, Z. Liu, Y . Liu, Z. Qin, J. Liu, and T. Gedeon. Facephi: Lightweight multimodal large language model for facial landmark emotion recognition. In5th Workshop on practical ML for limited/low resource settings, 2024. 13 Appendix A Evaluation Metrics Details We use the following metrics to evaluate model performance: Total votes.The total number of v...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.