arxiv: 2601.07603 · v3 · pith:7VZGHT4Pnew · submitted 2026-01-12 · 💻 cs.CV

UIKA: Fast Universal Head Avatar from Pose-Free Images

Pith reviewed 2026-05-22 11:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords head avatar · Gaussian splatting · feed-forward reconstruction · UV space mapping · animatable model · pose-free images · synthetic training data · facial correspondence

The pith

UIKA creates animatable Gaussian head avatars from any number of pose-free images via a single forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents UIKA as a method to create animatable head avatars from an arbitrary number of pose-free input images, including single photos or videos. The approach relies on estimating pixel-wise facial correspondences to reproject colors into a pose-independent UV space. Learnable UV tokens are then used with attention mechanisms to aggregate information across views at both screen and UV levels. These tokens are decoded into canonical Gaussian attributes for the avatar model. The model is trained on a large-scale synthetic dataset to handle diverse identities, leading to better performance than previous methods in both single-view and multi-view scenarios.

Core claim

UIKA is a feed-forward animatable Gaussian head model that processes any number of pose-free images by associating each image with a pixel-wise facial correspondence estimate. This allows valid pixel colors to be reprojected from screen space to UV space, independent of camera pose and expression. Learnable UV tokens enable attention at both the screen and UV levels to aggregate information across inputs; the tokens are then decoded into canonical Gaussian attributes. A large-scale, identity-rich synthetic dataset supports training the large avatar model.
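The reprojection step described above can be illustrated with a minimal NumPy sketch. It assumes a per-pixel UV map and validity mask are already available from some correspondence estimator; the function name `reproject_to_uv`, the nearest-texel binning, and the averaging rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reproject_to_uv(image, uv_map, valid_mask, uv_res=8):
    """Scatter valid screen-space pixel colors into a UV texture.

    image:      (H, W, 3) float colors
    uv_map:     (H, W, 2) per-pixel UV coordinates in [0, 1]
                (hypothetical input, assumed to come from a correspondence estimator)
    valid_mask: (H, W) bool, True where the pixel lies on the face
    Returns (texture, weight): per-texel mean color and per-texel hit count.
    """
    colors = image[valid_mask]                        # (N, 3)
    uv = uv_map[valid_mask]                           # (N, 2)
    # Nearest-texel assignment; clip so that a coordinate of exactly 1.0 stays in range.
    idx = np.clip((uv * uv_res).astype(int), 0, uv_res - 1)
    flat = idx[:, 1] * uv_res + idx[:, 0]             # row = v, column = u

    tex_sum = np.zeros((uv_res * uv_res, 3))
    weight = np.zeros(uv_res * uv_res)
    np.add.at(tex_sum, flat, colors)                  # unbuffered accumulation
    np.add.at(weight, flat, 1.0)

    texture = np.zeros_like(tex_sum)
    hit = weight > 0
    texture[hit] = tex_sum[hit] / weight[hit, None]   # average colors per texel
    return texture.reshape(uv_res, uv_res, 3), weight.reshape(uv_res, uv_res)

# Toy check: a mirrored "view" of the same surface yields the same UV texture.
res = 4
centers = (np.arange(res) + 0.5) / res
u, v = np.meshgrid(centers, centers)
uv_a = np.stack([u, v], axis=-1)
img_a = np.stack([u, v, np.full_like(u, 0.5)], axis=-1)
mask = np.ones((res, res), dtype=bool)
tex_a, _ = reproject_to_uv(img_a, uv_a, mask, uv_res=res)
tex_b, _ = reproject_to_uv(img_a[:, ::-1], uv_a[:, ::-1], mask, uv_res=res)
print(np.allclose(tex_a, tex_b))
```

The weight map records which texels received no observation, which is the kind of signal an aggregation stage could use to discount unobserved regions.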

What carries the argument

The UV-guided avatar modeling strategy, where pixel-wise facial correspondence enables reprojection to pose-independent UV space, combined with learnable UV tokens for attention-based aggregation across inputs.
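To make the aggregation idea concrete, here is a hedged sketch of a fixed set of UV tokens reading from a variable number of views through cross-attention. The single-head attention, the residual update, the screen-then-UV ordering, and the names `cross_attend` and `aggregate_views` are assumptions chosen for illustration; the summary above does not specify the paper's layer design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries (T, D) read from context (M, D)."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (T, M), rows sum to 1
    return queries + attn @ v                         # residual update

def aggregate_views(uv_tokens, screen_feats, uv_feats, params):
    """Update UV tokens from any number of views.

    screen_feats / uv_feats: lists holding one (M_i, D) feature array per view.
    """
    screen_ctx = np.concatenate(screen_feats, axis=0)
    uv_ctx = np.concatenate(uv_feats, axis=0)
    tokens = cross_attend(uv_tokens, screen_ctx, *params["screen"])
    tokens = cross_attend(tokens, uv_ctx, *params["uv"])
    return tokens

rng = np.random.default_rng(0)
D, T = 8, 5
params = {name: [rng.normal(size=(D, D)) * 0.1 for _ in range(3)]
          for name in ("screen", "uv")}
tokens = rng.normal(size=(T, D))
screen = [rng.normal(size=(m, D)) for m in (4, 6, 3)]
uv = [rng.normal(size=(7, D)) for _ in range(3)]

single = aggregate_views(tokens, screen[:1], uv[:1], params)
multi = aggregate_views(tokens, screen, uv, params)
print(single.shape, multi.shape)   # token count is independent of the view count
```

Because the context is a concatenation of per-view features and no positional encoding is added here, the output of this sketch does not depend on the order in which views are supplied.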

If this is right

  • Supports creation of avatars from a single image or smartphone videos without requiring pose information.
  • Outperforms existing approaches in both monocular and multi-view settings.
  • Produces a universal model that can be animated after training on synthetic data.
  • Replaces long optimization processes with a single forward pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the UV token approach to other body parts could enable full-body avatars from casual captures.
  • Improving correspondence estimation accuracy might further boost performance on challenging expressions.
  • The reliance on synthetic data suggests potential for domain adaptation techniques to handle real-world lighting variations better.

Load-bearing premise

The method depends on having accurate pixel-wise facial correspondence estimation for each input image to enable color reprojection to UV space.
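One rough way to see why this premise is load-bearing is to measure how often a noisy UV estimate sends a pixel to the wrong texel. The sketch below is a toy experiment on uniformly sampled synthetic UV coordinates with Gaussian noise; the noise model, the texture resolution, and the function name `texel_mismatch_rate` are assumptions, not quantities reported by the paper.

```python
import numpy as np

def texel_mismatch_rate(uv, noise_std, uv_res, rng):
    """Fraction of pixels whose noisy UV lands in a different texel."""
    def to_texel(coords):
        idx = np.clip((coords * uv_res).astype(int), 0, uv_res - 1)
        return idx[:, 1] * uv_res + idx[:, 0]
    noisy = np.clip(uv + rng.normal(0.0, noise_std, uv.shape), 0.0, 1.0)
    return float(np.mean(to_texel(uv) != to_texel(noisy)))

rng = np.random.default_rng(0)
uv = rng.uniform(0.0, 1.0, size=(10_000, 2))   # synthetic "ground-truth" UVs
for std in (0.0, 0.002, 0.01, 0.05):
    rate = texel_mismatch_rate(uv, std, uv_res=256, rng=rng)
    print(f"noise std {std:<6} -> mismatch rate {rate:.3f}")
```

Misassigned pixels deposit their colors in the wrong part of the UV texture, so any downstream aggregation inherits that error.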

What would settle it

The central claim would be falsified if avatars generated from input images with varying expressions, and without precise correspondence maps, show significant artifacts or fail to animate correctly.

Figures

Figures reproduced from arXiv: 2601.07603 by Boyao Zhou, Hao Zhu, Hongyu Liu, Liangxiao Hu, Xuan Wang, Xun Cao, Yuan Sun, Yujun Shen, Zijian Wu.

Figure 1. We present UIKA, a novel feed-forward approach for high-fidelity 3D Gaussian head avatar reconstruction from an arbitrary number of input images (e.g., a single portrait image or multi-view captures) without requiring extra camera or expression annotations.
Figure 2. Pipeline Overview. Given a set of unposed input images, our pipeline begins with a facial correspondence estimator that predicts UV coordinates for valid facial pixels, and the corresponding colors are reprojected onto the shared UV space. The source images (screen space) and reprojected images (UV space) are encoded through two dedicated encoders, producing multi-scale features from both screen space and …
Figure 3. Qualitative results for comparison to baselines in both monocular and multi-view settings in NeRSemble-v2 datasets. Adjacent text from the paper: In both cases, we focus on two reenactment scenarios: self reenactment and cross reenactment, and report performance across multiple quantitative metrics. For self reenactment, where ground-truth images are available, we measure image reconstruction quality using PSNR, SSIM, and LPIPS. Ident…
Figure 4. Qualitative results of different numbers of input views in VFHQ and NeRSemble-v2 dataset. Panel labels extracted with the caption: (a) Input, (b) w/o aggr, (c) w/o uv_attn, (d) w/o synth, (e) Ours, (f) GT.
Figure 5. Qualitative results for ablation study in the monocular settings in NeRSemble-v2 dataset. Panel labels extracted with the caption: (a) Inputs, Reenactments; (b) Input, Reenactments.
Figure 6. Qualitative results for in-the-wild cases. Adjacent text from the paper: Self-adaptive fusion strategy. In the ablated version, we do not add the aggregated UV map into our decoding stage. As shown in …
Original abstract

We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of pose-free inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise facial correspondence estimation. Such correspondence estimation allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces UIKA, a feed-forward animatable Gaussian head avatar model that accepts an arbitrary number of pose-free inputs (single image, multi-view captures, or smartphone videos). It proposes a UV-guided modeling strategy that associates each input with pixel-wise facial correspondence maps to reproject screen-space colors into a pose- and expression-independent UV space, aggregates information via learnable UV tokens and attention at both screen and UV levels, and decodes the tokens into canonical Gaussian attributes. The model is trained on a large-scale synthetic identity-rich dataset and claims significant outperformance over existing methods in monocular and multi-view settings.

Significance. If the performance claims and underlying assumptions are rigorously validated, the work could enable practical, optimization-free avatar creation from casual captures, advancing universal head modeling for AR/VR and animation applications. The combination of UV-space reprojection with attention-based aggregation and synthetic data training represents a promising direction for handling variable input counts without per-subject optimization.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central claim that the method 'significantly outperforms existing approaches in both monocular and multi-view settings' is asserted without quantitative metrics, specific baselines, error analysis, ablation studies, or statistical significance tests. This prevents verification of the outperformance and is load-bearing for the paper's primary contribution.
  2. [§3.1] §3.1 (UV-guided avatar modeling strategy): The reprojection of screen-space colors to UV space via per-image pixel-wise facial correspondence maps is presented as enabling consistent canonical Gaussians, yet no quantitative evaluation of correspondence accuracy, failure cases under expression/identity variation, or ablation on map quality is provided. Errors in these maps would directly corrupt aggregated UV tokens and the feed-forward reconstruction, making this assumption critical to the monocular and multi-view claims.
minor comments (2)
  1. [§3.2] Notation for 'learnable UV tokens' and their attention application at screen vs. UV levels could be formalized with equations to improve clarity of the aggregation process.
  2. [Abstract] The abstract mentions 'associated with a pixel-wise facial correspondence estimation' without specifying the source or method used to obtain these maps on arbitrary inputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the emphasis on strengthening the quantitative validation of our claims and the robustness of the UV-guided modeling assumptions. Below we provide point-by-point responses to the major comments. We will incorporate the suggested additions in the revised version to improve clarity and verifiability.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim that the method 'significantly outperforms existing approaches in both monocular and multi-view settings' is asserted without quantitative metrics, specific baselines, error analysis, ablation studies, or statistical significance tests. This prevents verification of the outperformance and is load-bearing for the paper's primary contribution.

    Authors: We acknowledge that the abstract summarizes the key finding and that §4 would benefit from more explicit quantitative support to allow direct verification. The current experiments section includes comparisons against existing methods, but we agree that additional detail is warranted. In the revision we will expand §4 with dedicated tables reporting specific metrics (e.g., PSNR, SSIM, LPIPS), list the exact baselines used, include error analysis and ablation studies on core components, and add statistical significance tests where appropriate. These changes will make the outperformance claim fully substantiated and easier to evaluate. revision: yes

  2. Referee: [§3.1] §3.1 (UV-guided avatar modeling strategy): The reprojection of screen-space colors to UV space via per-image pixel-wise facial correspondence maps is presented as enabling consistent canonical Gaussians, yet no quantitative evaluation of correspondence accuracy, failure cases under expression/identity variation, or ablation on map quality is provided. Errors in these maps would directly corrupt aggregated UV tokens and the feed-forward reconstruction, making this assumption critical to the monocular and multi-view claims.

    Authors: We agree that a dedicated quantitative assessment of the correspondence maps is important given their central role in the pipeline. The present manuscript demonstrates the overall effectiveness through end-to-end results and qualitative examples, but does not isolate correspondence accuracy. In the revised version we will add an evaluation of correspondence quality (using available ground-truth landmarks on synthetic data), a discussion of observed failure cases under large expression and identity changes, and an ablation that measures the impact of map quality on final Gaussian reconstruction metrics. This will directly address the concern about error propagation. revision: yes
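The evaluations discussed in these responses rest on simple, well-defined quantities. As a hedged sketch, the block below implements PSNR in its standard form and a mean per-pixel UV error against ground truth; the dense-UV formulation and the name `mean_uv_error` are assumptions made for illustration (the response itself mentions ground-truth landmarks), and SSIM and LPIPS are omitted because they need dedicated implementations.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mean_uv_error(pred_uv, gt_uv, valid_mask):
    """Mean Euclidean UV error over valid pixels, in UV units."""
    diff = pred_uv[valid_mask] - gt_uv[valid_mask]
    return float(np.linalg.norm(diff, axis=-1).mean())

# Toy inputs: a uniform 0.1 intensity offset and a constant (0.03, 0.04) UV shift.
target = np.zeros((16, 16, 3))
pred = target + 0.1
gt_uv = np.full((16, 16, 2), 0.5)
pred_uv = gt_uv + np.array([0.03, 0.04])
mask = np.ones((16, 16), dtype=bool)
print(f"PSNR: {psnr(pred, target):.2f} dB")
print(f"mean UV error: {mean_uv_error(pred_uv, gt_uv, mask):.3f}")
```

Reporting the UV error alongside the image metrics, at several controlled noise levels, is one way an ablation on map quality could be organized.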

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core pipeline—UV-guided reprojection via assumed pixel-wise facial correspondence maps, learnable UV tokens with cross-level attention, aggregation into canonical Gaussian attributes, and training on externally prepared synthetic identity-rich data—does not reduce any claimed prediction or output to quantities defined by the inputs or by self-citation. The correspondence estimation is treated as an available input rather than derived internally, and performance claims rest on architectural and data choices without tautological fitting or renaming of prior results. This is the common case of an independent feed-forward model.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on reliable facial correspondence estimation as a domain assumption and on the generalization power of the synthetic training set; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Pixel-wise facial correspondence estimation can be performed reliably on arbitrary input images to reproject colors to UV space independent of pose and expression.
    Invoked in the first contribution to enable pose-free modeling.

pith-pipeline@v0.9.0 · 5741 in / 1100 out tokens · 54572 ms · 2026-05-22T11:57:42.687009+00:00 · methodology


Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · 6 internal anchors

  1. [1]

    Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures

    Marcel C Buehler, Gengyan Li, Erroll Wood, Leonhard Helminger, Xu Chen, Tanmay Shah, Daoye Wang, Stephan Garbin, Sergio Orts-Escolano, Otmar Hilliges, et al. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. InSIGGRAPH Asia 2024 Con- ference Papers, pages 1–12, 2024. 3

  2. [2]

    How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks)

    Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). InInternational Conference on Computer Vision, 2017. 6

  3. [3]

    Neural head reenactment with latent pose descriptors

    Egor Burkov, Igor Pasechnik, Artur Grigorev, and Vic- tor Lempitsky. Neural head reenactment with latent pose descriptors. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13786– 13795, 2020. 1, 2

  4. [4]

    Hera: Hybrid explicit representation for ultra-realistic head avatars

    Hongrui Cai, Yuting Xiao, Xuan Wang, Jiafei Li, Yudong Guo, Yanbo Fan, Shenghua Gao, and Juyong Zhang. Hera: Hybrid explicit representation for ultra-realistic head avatars. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 260–270, 2025. 1

  5. [5]

    Efficient geometry-aware 3d generative adversar- ial networks

    Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversar- ial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16123– 16133, 2022. 3

  6. [6]

    Mono- gaussianavatar: Monocular gaussian point-based head avatar

    Yufan Chen, Lizhen Wang, Qijing Li, Hongjiang Xiao, Shengping Zhang, Hongxun Yao, and Yebin Liu. Mono- gaussianavatar: Monocular gaussian point-based head avatar. InACM SIGGRAPH 2024 Conference Papers, pages 1–9, 2024. 2, 3

  7. [7]

    Generalizable and an- imatable gaussian head avatar

    Xuangeng Chu and Tatsuya Harada. Generalizable and an- imatable gaussian head avatar. InThe Thirty-eighth An- nual Conference on Neural Information Processing Sys- tems, 2024. 2, 3, 5, 7

  8. [8]

    Gpavatar: Generaliz- able and precise head avatar from image (s).arXiv preprint arXiv:2401.10215, 2024

    Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, and Tatsuya Harada. Gpavatar: Generaliz- able and precise head avatar from image (s).arXiv preprint arXiv:2401.10215, 2024. 2, 3, 7

  9. [9]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 4690–4699, 2019. 6

  10. [10]

    Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set

    Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition workshops, pages 0–0,

  11. [11]

    Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data

    Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, and Baoyuan Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7130, 2024. 2

  12. [12]

    Portrait4d- v2: Pseudo multi-view data creates better 4d head synthe- sizer.arXiv preprint arXiv:2403.13570, 2024

    Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4d- v2: Pseudo multi-view data creates better 4d head synthe- sizer.arXiv preprint arXiv:2403.13570, 2024. 2, 3, 7

  13. [13]

    Diffusionrig: Learning personalized priors for facial appearance editing

    Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, and Xiuming Zhang. Diffusionrig: Learning personalized priors for facial appearance editing. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12736–12746, 2023. 2, 3, 7

  14. [14]

    Scaling rec- tified flow transformers for high-resolution image synthe- sis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rec- tified flow transformers for high-resolution image synthe- sis. InProceedings of the 41st International Conference on Machine Learning. ...

  15. [15]

    Learning an animatable detailed 3d face model from in-the-wild images.ACM Transactions on Graphics (ToG), 40(4):1–13, 2021

    Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. Learning an animatable detailed 3d face model from in-the-wild images.ACM Transactions on Graphics (ToG), 40(4):1–13, 2021. 3

  16. [16]

    Dynamic neural radiance fields for monocu- lar 4d facial avatar reconstruction

    Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. Dynamic neural radiance fields for monocu- lar 4d facial avatar reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8649–8658, 2021. 3

  17. [17]

    Stylegan-nada: Clip- guided domain adaptation of image generators.ACM Transactions on Graphics (TOG), 41(4):1–13, 2022

    Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip- guided domain adaptation of image generators.ACM Transactions on Graphics (TOG), 41(4):1–13, 2022. 2

  18. [18]

    Constructing diffusion avatar with learnable embeddings

    Xuan Gao, Jingtao Zhou, Dongyu Liu, Yuqi Zhou, and Juy- ong Zhang. Constructing diffusion avatar with learnable embeddings. InACM SIGGRAPH Asia Conference Pro- ceedings, 2025. 1

  19. [19]

    Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction,

    Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lour- des Agapito, and Matthias Nießner. Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction,

  20. [20]

    Toontalker: Cross-domain face reenactment

    Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, and Yujiu Yang. Toontalker: Cross-domain face reenactment. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 7690–7700, 2023. 1

  21. [21]

    Generative adversarial networks.Com- munications of the ACM, 63(11):139–144, 2020

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.Com- munications of the ACM, 63(11):139–144, 2020. 1, 2

  22. [22]

    Neural head avatars from monocular rgb videos

    Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. Neural head avatars from monocular rgb videos. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18653–18664, 2022. 3

  23. [23]

    Liveportrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Live- portrait: Efficient portrait animation with stitching and re- targeting control.arXiv preprint arXiv:2407.03168, 2024. 5

  24. [24]

    Lam: Large avatar model for one-shot animatable gaus- sian head

    Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng 9 Bo. Lam: Large avatar model for one-shot animatable gaus- sian head. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–13, 2025. 2, 3, 7

  25. [25]

    Depth-aware generative adversarial network for talking head video generation

    Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3397–3406, 2022. 2

  26. [26]

    Headnerf: A real-time nerf-based parametric head model

    Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20374– 20384, 2022. 3

  27. [27]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d.arXiv preprint arXiv:2311.04400, 2023. 2

  28. [28]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. 1, 2

  29. [29]

    Alias-free generative adversarial networks

    Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. InProc. NeurIPS, 2021. 1, 2

  30. [30]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023. 3

  31. [31]

    Kingma and Jimmy Ba

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. 5

  32. [32]

    Nersemble: Multi-view ra- diance field reconstruction of human heads.ACM Trans

    Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view ra- diance field reconstruction of human heads.ACM Trans. Graph., 42(4), 2023. 3, 5, 15

  33. [33]

    FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

    Tobias Kirschstein, Simon Giebenhain, and Matthias Nießner. Flexavatar: Learning complete 3d head avatars with partial supervision.arXiv preprint arXiv:2512.15599,

  34. [34]

    Avat3r: Large ani- matable gaussian reconstruction model for high-fidelity 3d head avatars.arXiv preprint arXiv:2502.20220, 2025

    Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, and Shunsuke Saito. Avat3r: Large ani- matable gaussian reconstruction model for high-fidelity 3d head avatars.arXiv preprint arXiv:2502.20220, 2025. 2, 3, 7, 15

  35. [35]

    Surfhead: Affine rig blending for geometri- cally accurate 2d gaussian surfel head avatars

    Jaeseong Lee, Taewoong Kang, Marcel Buehler, Min-Jung Kim, Sungwon Hwang, Junha Hyung, Hyojin Jang, and Jaegul Choo. Surfhead: Affine rig blending for geometri- cally accurate 2d gaussian surfel head avatars. InThe Thir- teenth International Conference on Learning Representa- tions, 2025. 1

  36. [36]

    Spherehead: Stable 3d full-head synthesis with spherical tri-plane representa- tion, 2024

    Heyuan Li, Ce Chen, Tianhao Shi, Yuda Qiu, Sizhe An, Guanying Chen, and Xiaoguang Han. Spherehead: Stable 3d full-head synthesis with spherical tri-plane representa- tion, 2024. 5

  37. [37]

    Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and ex- pression from 4D scans.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017. 4, 5, 16

  38. [38]

    One-shot high- fidelity talking-head synthesis with deformable neural ra- diance field

    Weichuang Li, Longhao Zhang, Dong Wang, Bin Zhao, Zhigang Wang, Mulin Chen, Bang Zhang, Zhongjian Wang, Liefeng Bo, and Xuelong Li. One-shot high- fidelity talking-head synthesis with deformable neural ra- diance field. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17969– 17978, 2023. 3

  39. [39]

    Generalizable one-shot 3d neu- ral head avatar.Advances in Neural Information Processing Systems, 36, 2024

    Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, and Jan Kautz. Generalizable one-shot 3d neu- ral head avatar.Advances in Neural Information Processing Systems, 36, 2024. 3

  40. [40]

    Fastavatar: Instant 3d gaussian splatting for faces from single unconstrained poses.arXiv preprint arXiv:2508.18389, 2025

    Hao Liang, Zhixuan Ge, Ashish Tiwari, Soumendu Ma- jee, GM Godaliyadda, Ashok Veeraraghavan, and Guha Balakrishnan. Fastavatar: Instant 3d gaussian splatting for faces from single unconstrained poses.arXiv preprint arXiv:2508.18389, 2025. 2, 3

  41. [41]

    Hhavatar: Gaussian head avatar with dynamic hairs.IEEE Transactions on Pattern Analysis and Machine Intelligence,

    Zhanfeng Liao, Yuelang Xu, Zhe Li, Qijing Li, Boyao Zhou, Ruifeng Bai, Di Xu, Hongwen Zhang, and Yebin Liu. Hhavatar: Gaussian head avatar with dynamic hairs.IEEE Transactions on Pattern Analysis and Machine Intelligence,

  42. [42]

    Human motionformer: Transferring human motions with vision transformers.arXiv preprint arXiv:2302.11306, 2023

    Hongyu Liu, Xintong Han, Chengbin Jin, Lihui Qian, Huawei Wei, Zhe Lin, Faqiang Wang, Haoye Dong, Yib- ing Song, Jia Xu, et al. Human motionformer: Transferring human motions with vision transformers.arXiv preprint arXiv:2302.11306, 2023. 1

  43. [43]

    Avatarartist: Open-domain 4d avatarization

    Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, and Qifeng Chen. Avatarartist: Open-domain 4d avatarization. InProceed- ings of the Computer Vision and Pattern Recognition Con- ference, pages 10758–10769, 2025. 2, 3

  44. [44]

    Follow your pose: Pose- guided text-to-video generation using pose-free videos

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose- guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 4117–4125, 2024. 1

  45. [45]

    Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

    Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung- Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine- controllable and expressive freestyle portrait animation. arXiv preprint arXiv:2406.01900, 2024. 1, 3

  46. [46]

    Follow-your-emoji-faster: To- wards efficient, fine-controllable, and expressive freestyle portrait animation.International Journal of Computer Vi- sion (IJCV), 2025

    Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Zhifeng Li, Wei Liu, Zhang lin- feng, and Qifeng Chen. Follow-your-emoji-faster: To- wards efficient, fine-controllable, and expressive freestyle portrait animation.International Journal of Computer Vi- sion (IJCV), 2025. 1, 3

  47. [47]

    Otavatar: One-shot talking face avatar with control- lable tri-plane rendering

    Zhiyuan Ma, Xiangyu Zhu, Guo-Jun Qi, Zhen Lei, and Lei Zhang. Otavatar: One-shot talking face avatar with control- lable tri-plane rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16910, 2023. 3 10

  48. [48]

    Jewett, Si- mon Venshtain, Christopher Heilman, Yueh-Tung Chen, Sidi Fu, Mohamed Ezzeldin A

    Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Ander- son, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Si- mon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Si- mon Venshtain, Christopher He...

  49. [49]

    Srinivasan, Matthew Tancik, Jonathan T

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. InECCV, 2020. 3

  50. [50]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1, 3

  51. [51]

    Dit- head: High-resolution talking head synthesis using diffu- sion transformers, 2023

    Aaron Mir, Eduardo Alonso, and Esther Mondragón. Dit- head: High-resolution talking head synthesis using diffu- sion transformers, 2023. 3

  52. [52]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervi- sion.arXiv preprint arXiv:2304.07193, 2023. 14

  53. [53]

    Perchead: Perceptual head model for single-image 3d head reconstruction & editing, 2025

    Antonio Oroz, Matthias Nießner, and Tobias Kirschstein. Perchead: Perceptual head model for single-image 3d head reconstruction & editing, 2025. 2, 3

  54. [54]

    Renderme-360: a large digital asset library and benchmarks towards high-fidelity head avatars

    Dongwei Pan, Long Zhuo, Jingtan Piao, Huiwen Luo, Wei Cheng, Yuxin Wang, Siming Fan, Shengqi Liu, Lei Yang, Bo Dai, et al. Renderme-360: a large digital asset library and benchmarks towards high-fidelity head avatars. Advances in Neural Information Processing Systems, 36,

  55. [55]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-per...

  56. [56]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748,

  57. [57]

    Flexavatar: Flexible large reconstruction model for animatable gaussian head avatars with detailed deformation, 2025

    Cheng Peng, Zhuo Su, Liao Wang, Chen Guo, Zhaohu Li, Chengjiang Long, Zheng Lv, Jingxiang Sun, Chenyangguang Zhang, and Yebin Liu. Flexavatar: Flexible large reconstruction model for animatable gaussian head avatars with detailed deformation, 2025. 3

  58. [58]

    Vhap: Versatile head alignment with adaptive appearance priors, 2024

    Shenhan Qian. Vhap: Versatile head alignment with adaptive appearance priors, 2024. 5

  59. [59]

    Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians

    Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. IEEE Conf. Comput. Vis. Pattern Recog., 2024. 5

  60. [60]

    Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians

    Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024. 1

  61. [61]

    Lhm: Large animatable human reconstruction model for single image to 3d in seconds

    Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, and Liefeng Bo. Lhm: Large animatable human reconstruction model for single image to 3d in seconds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14184–14194, 2025. 2

  62. [62]

    Pf-lhm: 3d animatable avatar reconstruction from pose-free articulated human images

    Lingteng Qiu, Peihao Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Siyu Zhu, Xiaoguang Han, Guanying Chen, and Zilong Dong. Pf-lhm: 3d animatable avatar reconstruction from pose-free articulated human images. arXiv preprint arXiv:2506.13766, 2025. 2

  63. [63]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020. 3, 5, 14

  64. [64]

    Vision transformers for dense prediction. ArXiv preprint,

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. ArXiv preprint,

  65. [65]

    Accelerating 3D Deep Learning with PyTorch3D

    Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020. 5

  66. [66]

    Animating arbitrary objects via deep motion transfer

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2377–2386, 2019. 2

  67. [67]

    First order motion model for image animation. Advances in neural information processing systems, 32, 2019

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in neural information processing systems, 32, 2019. 1

  68. [68]

    Motion representations for articulated animation

    Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. Motion representations for articulated animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13653–13662, 2021. 2

  69. [69]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Jul...

  70. [70]

    Diffused heads: Diffusion models beat gans on talking-face generation

    Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zięba, Stavros Petridis, and Maja Pantic. Diffused heads: Diffusion models beat gans on talking-face generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5091–5100, 2024. 3

  71. [71]

    Next3d: Generative neural texture rasterization for 3d-aware head avatars

    Jingxiang Sun, Xuan Wang, Lizhen Wang, Xiaoyu Li, Yong Zhang, Hongwen Zhang, and Yebin Liu. Next3d: Generative neural texture rasterization for 3d-aware head avatars. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20991–21002, 2023. 2

  72. [72]

    Felix Taubner, Ruihang Zhang, Mathieu Tuli, Sherwin Bahmani, and David B. Lindell. Mvp4d: Multi-view portrait video diffusion for animatable 4d avatars. New York, NY, USA, 2025. Association for Computing Machinery. 3

  73. [73]

    Felix Taubner, Ruihang Zhang, Mathieu Tuli, and David B. Lindell. CAP4D: Creating animatable 4D portrait avatars with morphable multi-view diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5318–5330, 2025. 2, 3

  74. [74]

    Real-time radiance fields for single-image portrait view synthesis

    Alex Trevithick, Matthew Chan, Michael Stengel, Eric Chan, Chao Liu, Zhiding Yu, Sameh Khamis, Ravi Ramamoorthi, and Koki Nagano. Real-time radiance fields for single-image portrait view synthesis. 2023. 3

  75. [75]

    Progressive disentangled representation learning for fine-grained controllable talking head synthesis

    Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, and Baoyuan Wang. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17979–17989, 2023. 1, 2

  76. [76]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3, 14, 15

  77. [77]

    Towards real-world blind face restoration with generative facial prior

    Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In IEEE Conf. Comput. Vis. Pattern Recog., 2021. 2

  78. [78]

    3d gaussian head avatars with expressive dynamic appearances by compact tensorial representations

    Yating Wang, Xuan Wang, Ran Yi, Yanbo Fan, Jichen Hu, Jingcheng Zhu, and Lizhuang Ma. 3d gaussian head avatars with expressive dynamic appearances by compact tensorial representations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21117–21126,

  79. [79]

    AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

    Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animations. arXiv:2403.17694, 2024. 3

  80. [80]

    Fastavatar: Towards unified fast high-fidelity 3d avatar reconstruction with large gaussian reconstruction transformers, 2025

    Yue Wu, Yufan Wu, Wen Li, Yuxi Lu, Kairui Feng, and Xuanhong Chen. Fastavatar: Towards unified fast high-fidelity 3d avatar reconstruction with large gaussian reconstruction transformers, 2025. 2, 3

Showing first 80 references.