
arxiv: 2512.15599 · v2 · submitted 2025-12-17 · 💻 cs.CV

FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision

Pith reviewed 2026-05-16 21:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D head avatars · monocular training · multi-view supervision · transformer animation model · bias sinks · portrait animation · view extrapolation

The pith

FlexAvatar produces complete 3D head avatars from one image by training a transformer on mixed monocular and multi-view data via bias-sink tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix incomplete 3D head reconstructions that arise when models train only on monocular videos. The root problem is that motion driving signals become entangled with the target camera viewpoint in such data. To break the entanglement the authors insert learnable data-source tokens called bias sinks into a transformer-based animation model. These tokens let the same network train jointly on monocular videos and multi-view captures, so the learned avatars inherit strong generalization from the first source and full 3D geometry from the second. The same training also produces a smooth latent space that supports identity interpolation and fitting to any number of input views.

Core claim

FlexAvatar is a transformer-based 3D portrait animation model equipped with learnable data-source tokens, termed bias sinks. The tokens allow unified training on monocular videos and multi-view image sets, thereby disentangling driving signals from target viewpoints. The resulting model generates complete 3D head avatars that support realistic facial animation even when conditioned on a single input photograph.

What carries the argument

Learnable data-source tokens (bias sinks) placed inside a transformer-based 3D portrait animation network that separate motion signals from viewpoint during mixed-dataset training.
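The mechanism can be illustrated with a minimal sketch, assuming one learnable token per data source that is appended to the expression tokens before the decoder attends to them. The names `BiasSinkTable` and `SOURCES`, the token width, and the initialization scale are hypothetical; the text reproduced here only states that the bias sinks are learnable and tied to the data source.

```python
import random

# Hypothetical labels for the two data sources named in the paper.
SOURCES = ("monocular", "multi_view")


class BiasSinkTable:
    """Holds one token per data source (plain float lists stand in for learnable parameters)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Small random initialization; the real initialization is not given in the source.
        self.tokens = {s: [rng.gauss(0.0, 0.02) for _ in range(dim)] for s in SOURCES}

    def attach(self, expression_tokens, source):
        """Return a new token list: the expression tokens plus the sink token of `source`."""
        return expression_tokens + [self.tokens[source]]


table = BiasSinkTable(dim=4)
z_exp = [[0.1, 0.2, 0.3, 0.4]]  # one toy expression token
tokens_mono = table.attach(z_exp, "monocular")
tokens_mv = table.attach(z_exp, "multi_view")
```

In this sketch the expression tokens are identical for both sources and only the appended token differs, which is what would let the network route source-specific bias into that token. Which token is selected at inference is not stated in the text reproduced here.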

If this is right

  • Single-image avatars exhibit full 3D completeness rather than the partial reconstructions typical of monocular-only training.
  • View extrapolation yields realistic facial animations without the holes or distortions seen in prior methods.
  • The latent avatar space remains smooth enough for identity interpolation and for fitting to arbitrary numbers of input observations.
  • Joint training combines the generalization strength of monocular video with the geometric fidelity of multi-view supervision.
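The identity-interpolation point can be made concrete with a small sketch. It assumes avatar codes are fixed-length vectors and that interpolation is linear; both are illustrative assumptions, and `interpolate_codes` is a hypothetical helper rather than a function from the paper.

```python
def interpolate_codes(code_a, code_b, t):
    """Linearly blend two avatar codes; t=0 returns code_a, t=1 returns code_b."""
    if len(code_a) != len(code_b):
        raise ValueError("avatar codes must have the same length")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(code_a, code_b)]


# Toy two-dimensional codes; real avatar codes would be much larger.
halfway = interpolate_codes([0.0, 2.0], [1.0, 4.0], 0.5)
```

If the latent space is smooth in the sense the paper claims, decoding such intermediate codes should give plausible in-between identities.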

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-based disentanglement could be applied to full-body avatar creation if analogous motion-viewpoint biases exist in body datasets.
  • Lowering the need for dense multi-view captures would make high-quality personalized 3D characters more accessible for consumer applications.
  • Bias sinks may prove useful in other conditional synthesis tasks where training data come from sources with mismatched statistical properties.

Load-bearing premise

Learnable data-source tokens can reliably disentangle driving signals from target viewpoints without introducing artifacts or requiring carefully balanced dataset mixtures.

What would settle it

Remove the bias-sink tokens, retrain on the same mixed data, and check whether single-image avatars exhibit missing geometry or fail to produce coherent novel views.
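A minimal sketch of how such a comparison could be scored, assuming renders and references are available as flat lists of pixel intensities in [0, 1]. The helper names and the toy values are placeholders for illustration only and are not results from the paper.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def mean_psnr(renders, references):
    """Average PSNR over a set of held-out viewpoints."""
    scores = [psnr(r, ref) for r, ref in zip(renders, references)]
    return sum(scores) / len(scores)


# Placeholder data: one held-out view, two hypothetical model variants.
reference = [[0.5, 0.5, 0.5, 0.5]]
with_sinks = [[0.5, 0.5, 0.5, 0.6]]
without_sinks = [[0.5, 0.5, 0.9, 0.1]]

gap = mean_psnr(with_sinks, reference) - mean_psnr(without_sinks, reference)
```

A consistently positive gap on viewpoints that differ from the driving camera would support the premise; PSNR alone does not detect missing geometry, so a perceptual metric and visual inspection would still be needed.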

Figures

Figures reproduced from arXiv: 2512.15599 by Matthias Nießner, Simon Giebenhain, Tobias Kirschstein.

Figure 1
Figure 1: FlexAvatar. From just a single portrait image of a person, FlexAvatar creates a high-quality 3D head avatar representation that can be freely animated and rendered from diverse viewpoints. Our model can be flexibly applied to other scenarios, including creating avatars from a phone scan or from monocular videos. The entire avatar creation process can be executed within minutes.
Figure 2
Figure 2: Method Overview of FlexAvatar. Given the single input image I, our method allows changing both the viewpoint π and the facial expression z_exp. The transformer-based encoder E first produces a compressed avatar code A via cross-attention. The decoder D then incorporates the effect of the facial expression z_exp into the avatar representation. Crucially, the corresponding bias sinks are concatenated to the expressio…
Figure 3
Figure 3: Architecture of the StyleGAN-PixelShuffle block.
… $f_{\mathrm{img}}$ with a pre-trained DINOv2 [37] model and a shallow learnable ViT: $f_{\mathrm{img}} = \mathrm{MLP}([\mathrm{DINO}(I), \mathrm{ViT}([I, I_{\mathrm{pluck}}])])$ (4), where $I_{\mathrm{pluck}}$ are the Plücker embeddings of the camera viewpoint of the input image $I$. To map the image features into the template's UV space, we define queries Q anchored in UV space. This is done by uniformly sampling 3D surface positions in …
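The Plücker embeddings mentioned in the excerpt above can be sketched per camera ray. This is a minimal illustration of the standard Plücker ray parameterization (unit direction plus moment); the ordering of the six components and the function name `plucker_ray` are assumptions, since the excerpt does not spell out the convention the authors use.

```python
import math


def cross(a, b):
    """3D cross product of two length-3 sequences."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def plucker_ray(origin, direction):
    """Return the 6D Plücker coordinates (d, o x d) of a ray, with d normalized."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    d = [c / norm for c in direction]
    moment = cross(origin, d)
    return d + moment


# The moment does not change when the origin slides along the ray.
ray_a = plucker_ray([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
ray_b = plucker_ray([1.0, 5.0, 0.0], [0.0, 2.0, 0.0])
```

Evaluating this for every pixel ray of the input camera gives a per-pixel six-channel map that can be stacked with the image, which matches the role that the Plücker term plays in equation (4) as quoted.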
Figure 4
Figure 4: Entanglement of driving signal and target viewpoint. Naive training on monocular data works well as long as both expression code z_drive and rendering camera π_target are transferred to the avatar (π_target = π_drive). Artifacts occur when the rendering camera is moved, i.e., rendering and driving viewpoint differ (π_target ≠ π_drive). This issue is fixed by our proposed bias sinks.
Figure 5
Figure 5: Qualitative Single-image Avatar Creation comparison on the Ava256 dataset. We compare our method to the recent state-of-the-art on 3D head avatar creation from a single portrait image. Our method produces more complete 3D head avatars and re-enacts the target expression more faithfully.
Method            PSNR↑  SSIM↑  LPIPS↓  JOD↑  AKD↓  CSIM↑
INSTA [70]        15.8   0.771  0.344   4.83  5.22  0.631
FlashAvatar [52]  16.3   0.731  0.386   4.15  19.…
Figure 6
Figure 6: Comparison on the NeRSemble Benchmark.
Figure 7
Figure 7: Qualitative Ablation of method components on the Ava256 dataset.
Variant         2D  3D  B  U  F  PSNR↑  SSIM↑  LPIPS↓  AKD↓  CSIM↑
only 2D         ⊠   □   □  ⊠  □  13.7   0.736  0.358   6.59  0.593
only 3D         □   ⊠   □  ⊠  □  13.2   0.699  0.378   10.4  0.119
w/o bias sinks  ⊠   ⊠   □  ⊠  □  14.5   0.747  0.351   5.98  0.583
w/o StyleGAN    ⊠   ⊠   ⊠  □  □  17.1   0.765  0.287   7.03  0.614
Ours_ref        ⊠   ⊠   ⊠  ⊠  □  17.2   0.768  0.285   6.34  0.621
Ours + fitting  ⊠   ⊠   ⊠  ⊠  ⊠  16.9   0.771  0.280   5.59  0.682
B = bia…
Figure 8
Figure 10
Figure 10: Analysis of Data Efficiency during fitting. We plot the performance of our method on the NeRSemble Benchmark [25] in relation to how many frames of a person were used during fitting to create the avatar. Note that the two most competitive baselines on the benchmark, RGBAvatar [28] and CAP4D [48], use all available frames, while our method requires only ∼1/10 of the frames for a competitive performance. By…
Figure 11
Figure 11: Qualitative Portrait Animation with cross-reenactment on the VFHQ test split.
Figure 12
Figure 12: Qualitative Few-shot Avatar Creation comparison on the Ava256 dataset.
Original abstract

We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FlexAvatar, a transformer-based 3D portrait animation model for high-quality complete head avatars from a single image. It identifies entanglement between driving signals and target viewpoints in monocular training as the root cause of incomplete 3D reconstructions and proposes learnable data source tokens (bias sinks) to enable unified training across monocular and multi-view datasets, allowing the model to inherit strong generalization from monocular data and full 3D completeness from multi-view supervision at inference. The approach also produces a smooth latent avatar space for identity interpolation and flexible fitting, with evaluations on single-view, few-shot, and monocular tasks showing improved view extrapolation over prior methods.

Significance. If the bias-sink mechanism is shown to reliably disentangle signals without new artifacts, the work would meaningfully advance single-image 3D avatar creation by combining the complementary strengths of the two data regimes, yielding more complete and animatable avatars than monocular-only baselines while retaining their generalization. The smooth latent space and support for variable input counts are additional practical strengths.

major comments (2)
  1. [Method (bias-sink formulation) and Experiments] The central claim that bias sinks enable unified training and artifact-free inheritance of 3D completeness from multi-view data during monocular inference is load-bearing, yet the manuscript provides no targeted ablation or quantitative verification (e.g., view-extrapolation error with/without sinks, or artifact metrics on held-out viewpoints) demonstrating that the tokens separate driving signals from viewpoint without reintroducing entanglement or new failure modes.
  2. [Training procedure and §4] The training procedure description does not specify how the mixture of monocular and multi-view data is balanced or whether the learned tokens remain effective when the test distribution differs from the training mixture; this directly affects whether the claimed leverage of both data sources holds at inference.
minor comments (2)
  1. [Model architecture] Clarify the exact architecture placement of the bias sinks within the transformer (e.g., which attention layers and how they are initialized) to allow reproduction.
  2. [Experiments and results] The abstract states 'extensive evaluations' on multiple tasks; the results section should include error bars, statistical significance tests against baselines, and explicit discussion of any data-selection criteria to address potential post-hoc concerns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate clarifications and additional analysis where appropriate.

Point-by-point responses
  1. Referee: [Method (bias-sink formulation) and Experiments] The central claim that bias sinks enable unified training and artifact-free inheritance of 3D completeness from multi-view data during monocular inference is load-bearing, yet the manuscript provides no targeted ablation or quantitative verification (e.g., view-extrapolation error with/without sinks, or artifact metrics on held-out viewpoints) demonstrating that the tokens separate driving signals from viewpoint without reintroducing entanglement or new failure modes.

    Authors: We agree that isolating the effect of the bias sinks with targeted quantitative ablations would strengthen the central claim. In the revised manuscript we will add an ablation in Section 4 that reports view-extrapolation metrics (PSNR, LPIPS, and perceptual scores on held-out viewpoints) with and without the learned tokens, together with qualitative inspection for new artifacts or re-entanglement. While our existing single-view and few-shot results already demonstrate the combined benefit, we acknowledge the value of this explicit verification. revision: yes

  2. Referee: [Training procedure and §4] The training procedure description does not specify how the mixture of monocular and multi-view data is balanced or whether the learned tokens remain effective when the test distribution differs from the training mixture; this directly affects whether the claimed leverage of both data sources holds at inference.

    Authors: We will revise Section 4 to explicitly state the sampling ratios used to balance monocular and multi-view data during training. Our evaluations already span single-view, few-shot, and monocular test regimes whose input distributions differ from the training mixture; the consistent gains in view extrapolation and completeness indicate that the tokens remain effective. We will add a short discussion of this robustness in the revised text. revision: yes
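What an explicit sampling ratio could look like is sketched below, assuming each batch element is drawn from one of the two sources with a fixed probability. The knob `p_multi_view`, the camera count, and both helper functions are hypothetical; the paper's actual ratios are not given in the text reproduced here. The sketch also shows the structural difference behind the entanglement: a monocular sample always has the target camera equal to the driving camera.

```python
import random


def sample_sources(batch_size, p_multi_view, rng):
    """Pick a data source per batch element; p_multi_view is a hypothetical mixing knob."""
    return [
        "multi_view" if rng.random() < p_multi_view else "monocular"
        for _ in range(batch_size)
    ]


def draw_cameras(source, rng, num_cameras=16):
    """Monocular clips have a single camera, so target and driving views coincide."""
    if source == "monocular":
        return {"source": source, "drive_cam": 0, "target_cam": 0}
    return {
        "source": source,
        "drive_cam": rng.randrange(num_cameras),
        "target_cam": rng.randrange(num_cameras),
    }


rng = random.Random(0)
batch = [draw_cameras(s, rng) for s in sample_sources(8, 0.5, rng)]
```

Reporting the value of such a ratio, and repeating the evaluation for a few different values, would address the referee's question about sensitivity to the mixture.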

Circularity Check

0 steps flagged

No significant circularity; bias sinks are externally optimized parameters

full rationale

The paper introduces learnable data source tokens (bias sinks) within a transformer architecture to enable unified training across monocular and multi-view datasets. These tokens are optimized during training on external data sources and the resulting model is evaluated on separate held-out tasks (single-view, few-shot, monocular avatar creation). No equation or claim reduces a 'prediction' to a fitted quantity defined by the same data, nor does the central disentanglement mechanism rely on a self-citation chain, uniqueness theorem from prior author work, or smuggled ansatz. The derivation chain is self-contained: the architectural choice is trained end-to-end and its efficacy is demonstrated through independent experimental benchmarks rather than by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The method rests on the introduction of bias sinks as learnable parameters and the assumption that a transformer can jointly model animation and viewpoint disentanglement.

free parameters (1)
  • bias sinks
    Learnable data-source tokens whose values are optimized during training to encode dataset origin.
axioms (1)
  • [domain assumption] Transformer architecture can represent 3D portrait animation and viewpoint disentanglement when augmented with data-source tokens
    Core modeling choice stated in the abstract.
invented entities (1)
  • bias sinks (no independent evidence)
    purpose: Learnable tokens that allow unified training across monocular and multi-view data by encoding data source
    New component introduced to solve the identified entanglement problem

pith-pipeline@v0.9.0 · 5509 in / 1151 out tokens · 20184 ms · 2026-05-16T21:28:52.117535+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UIKA: Fast Universal Head Avatar from Pose-Free Images

    cs.CV · 2026-01 · conditional · novelty 7.0

    UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.

  2. Learning a Delighting Prior for Facial Appearance Capture in the Wild

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    A delighting network trained via Dataset Latent Modulation on heterogeneous OLAT and Light Stage data enables high-quality in-the-wild facial reflectance capture from video and produces the NeRSemble-Scan dataset.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    A morphable model for the synthesis of 3d faces

    Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proc. SIGGRAPH, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.

  2. [2]

    Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures

    Marcel C Buehler, Gengyan Li, Erroll Wood, Leonhard Helminger, Xu Chen, Tanmay Shah, Daoye Wang, Stephan Garbin, Sergio Orts-Escolano, Otmar Hilliges, et al. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.

  3. [3]

    Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting

    Jianchuan Chen, Jingchuan Hu, Gaige Wang, Zhonghua Jiang, Tiansong Zhou, Zhiwen Chen, and Chengfei Lv. Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10723–10734, 2025.

  4. [4]

    Generalizable and animatable gaussian head avatar

    Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. Advances in Neural Information Processing Systems, 37:57642–57670, 2024.

  5. [5]

    Gpavatar: Generalizable and precise head avatar from image(s)

    Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, and Tatsuya Harada. Gpavatar: Generalizable and precise head avatar from image(s). arXiv preprint arXiv:2401.10215, 2024.

  6. [6]

    Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer

    Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, and Siyu Zhu. Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21086–21095, 2025.

  7. [7]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019.

  8. [8]

    Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set

    Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.

  9. [9]

    Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data

    Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, and Baoyuan Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7130, 2024.

  10. [10]

    Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer

    Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. In European Conference on Computer Vision, pages 316–333. Springer, 2024.

  11. [11]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

  12. [12]

    Learning neural parametric head models

    Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Learning neural parametric head models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21003–21012, 2023.

  13. [13]

    Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction

    Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction. arXiv preprint arXiv:2505.00615, 2025.

  14. [14]

    Sega: Drivable 3d gaussian head avatar from a single image

    Chen Guo, Zhuo Su, Jian Wang, Shuang Li, Xu Chang, Zhaohu Li, Yang Zhao, Guidong Wang, and Ruqi Huang. Sega: Drivable 3d gaussian head avatar from a single image. arXiv preprint arXiv:2504.14373, 2025.

  15. [15]

    Lam: Large avatar model for one-shot animatable gaussian head

    Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–13, 2025.

  16. [16]

    Headnerf: A real-time nerf-based parametric head model

    Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20374–20384, 2022.

  17. [17]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR, 2021.

  18. [18]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European conference on computer vision, pages 709–727. Springer, 2022.

  19. [19]

    Pixel-in-pixel net: Towards efficient facial landmark detection in the wild

    Haibo Jin, Shengcai Liao, and Ling Shao. Pixel-in-pixel net: Towards efficient facial landmark detection in the wild. International Journal of Computer Vision, 2021.

  20. [20]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.

  21. [21]

    Modnet: Real-time trimap-free portrait matting via objective decomposition

    Zhanghan Ke, Jiayu Sun, Kaican Li, Qiong Yan, and Rynson WH Lau. Modnet: Real-time trimap-free portrait matting via objective decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1140–1147, 2022.

  22. [22]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  23. [23]

    Realistic one-shot mesh-based head avatars

    Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. In European Conference on Computer Vision, pages 345–

  24. [24]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  25. [25]

    Nersemble: Multi-view radiance field reconstruction of human heads

    Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023.

  26. [26]

    Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars

    Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, and Shunsuke Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12089–12100, 2025.

  27. [27]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

  28. [28]

    Rgbavatar: Reduced gaussian blendshapes for online modeling of head avatars

    Linzhou Li, Yumeng Li, Yanlin Weng, Youyi Zheng, and Kun Zhou. Rgbavatar: Reduced gaussian blendshapes for online modeling of head avatars. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10747–10757, 2025.

  29. [29]

    Learning a model of facial shape and expression from 4d scans

    Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017.

  30. [30]

    One-shot high-fidelity talking-head synthesis with deformable neural radiance field

    Weichuang Li, Longhao Zhang, Dong Wang, Bin Zhao, Zhigang Wang, Mulin Chen, Bang Zhang, Zhongjian Wang, Liefeng Bo, and Xuelong Li. One-shot high-fidelity talking-head synthesis with deformable neural radiance field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17969–17978, 2023.

  31. [31]

    Generalizable one-shot 3d neural head avatar

    Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, and Jan Kautz. Generalizable one-shot 3d neural head avatar. Advances in Neural Information Processing Systems, 36:47239–47250, 2023.

  32. [32]

    Fovvideovdp: A visible difference predictor for wide field-of-view video

    Rafał K Mantiuk, Gyorgy Denes, Alexandre Chapiro, Anton Kaplanyan, Gizem Rufo, Romain Bachy, Trisha Lian, and Anjul Patney. Fovvideovdp: A visible difference predictor for wide field-of-view video. ACM Transactions on Graphics (TOG), 40(4):1–19, 2021.

  33. [33]

    Nerf in the wild: Neural radiance fields for unconstrained photo collections

    Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021.

  34. [34]

    Codec avatar studio: Paired human captures for complete, driveable, and generalizable avatars

    Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, et al. Codec avatar studio: Paired human captures for complete, driveable, and generalizable avatars. Advances in Neural Information Processing Systems, 37:83008–83023, 2024.

  35. [35]

    Detection hub: Unifying object detection datasets via query adaptation on language embedding

    Lingchen Meng, Xiyang Dai, Yinpeng Chen, Pengchuan Zhang, Dongdong Chen, Mengchen Liu, Jianfeng Wang, Zuxuan Wu, Lu Yuan, and Yu-Gang Jiang. Detection hub: Unifying object detection datasets via query adaptation on language embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11402–11411, 2023.

  36. [36]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  37. [37]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  38. [38]

    PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

    Antonio Oroz, Matthias Nießner, and Tobias Kirschstein. Perchead: Perceptual head model for single-image 3d head reconstruction & editing. arXiv preprint arXiv:2511.02777,

  39. [39]

    Deepsdf: Learning continuous signed distance functions for shape representation

    Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019.

  40. [40]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021.

  41. [41]

    Imhead: A large-scale implicit morphable model for localized head modeling

    Rolandos Alexandros Potamias, Stathis Galanakis, Jiankang Deng, Athanasios Papaioannou, and Stefanos Zafeiriou. Imhead: A large-scale implicit morphable model for localized head modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10196–10206,

  42. [42]

    Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians

    Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309,

  43. [43]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 5

  44. [44]

    Fine-tuning image transformers using learnable memory

    Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, and Andrew Jackson. Fine-tuning image transformers using learnable memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12155–12164, 2022. 2

  45. [45]

    Gasp: Gaussian avatars with synthetic priors

    Jack Saunders, Charlie Hewitt, Yanan Jian, Marek Kowalski, Tadas Baltrusaitis, Yiye Chen, Darren Cosker, Virginia Estellers, Nicholas Gydé, Vinay P Namboodiri, et al. Gasp: Gaussian avatars with synthetic priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 271–280, 2025. 2

  46. [46]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 4

  47. [47]

    Mvp4d: Multi-view portrait video diffusion for animatable 4d avatars

    Felix Taubner, Ruihang Zhang, Mathieu Tuli, Sherwin Bahmani, and David B Lindell. Mvp4d: Multi-view portrait video diffusion for animatable 4d avatars. arXiv preprint arXiv:2510.12785, 2025. 2

  48. [48]

    Cap4d: Creating animatable 4d portrait avatars with morphable multi-view diffusion models

    Felix Taubner, Ruihang Zhang, Mathieu Tuli, and David B Lindell. Cap4d: Creating animatable 4d portrait avatars with morphable multi-view diffusion models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5318–5330. IEEE Computer Society, 2025. 2, 7, 1

  49. [49]

    Voodoo xp: Expressive one-shot head reenactment for vr telepresence

    Phong Tran, Egor Zakharov, Long-Nhat Ho, Liwen Hu, Adilbek Karmanov, Aviral Agarwal, McLean Goldwhite, Ariana Bermudez Venegas, Anh Tuan Tran, and Hao Li. Voodoo xp: Expressive one-shot head reenactment for vr telepresence. arXiv preprint arXiv:2405.16204, 2024. 2, 8

  50. [50]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 3

  51. [51]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 6

  52. [52]

    Flashavatar: High-fidelity head avatar with efficient gaussian embedding

    Jun Xiang, Xuan Gao, Yudong Guo, and Juyong Zhang. Flashavatar: High-fidelity head avatar with efficient gaussian embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1802–1812, 2024. 7

  53. [53]

    Vfhq: A high-quality dataset and benchmark for video face super-resolution

    Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. Vfhq: A high-quality dataset and benchmark for video face super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 657–666, 2022. 6

  54. [54]

    Vasa-1: Lifelike audio-driven talking faces generated in real time

    Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems, 37:660–684, 2024. 8

  55. [55]

    Vasa-3d: Lifelike audio-driven gaussian head avatars from a single image

    Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Stephen Lin, and Baining Guo. Vasa-3d: Lifelike audio-driven gaussian head avatars from a single image. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 2

  56. [56]

    3d gaussian parametric head model

    Yuelang Xu, Lizhen Wang, Zerong Zheng, Zhaoqi Su, and Yebin Liu. 3d gaussian parametric head model. In European Conference on Computer Vision, pages 129–147. Springer,

  57. [57]

    Vrmm: A volumetric relightable morphable head model

    Haotian Yang, Mingwu Zheng, Chongyang Ma, Yu-Kun Lai, Pengfei Wan, and Haibin Huang. Vrmm: A volumetric relightable morphable head model. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 2

  58. [58]

    Matanyone: Stable video matting with consistent memory propagation

    Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, and Chen Change Loy. Matanyone: Stable video matting with consistent memory propagation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7299–7308, 2025. 2

  59. [59]

    gsplat: An open-source library for gaussian splatting

    Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, et al. gsplat: An open-source library for gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025. 4

  60. [60]

    Real3d-portrait: One-shot realistic 3d talking portrait synthesis

    Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, et al. Real3d-portrait: One-shot realistic 3d talking portrait synthesis. arXiv preprint arXiv:2401.08503,

  61. [61]

    Celebv-text: A large-scale facial text-video dataset

    Jianhui Yu, Hao Zhu, Liming Jiang, Chen Change Loy, Weidong Cai, and Wayne Wu. Celebv-text: A large-scale facial text-video dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14805–14814, 2023. 5

  62. [62]

    One2avatar: Generative implicit head avatar for few-shot user adaptation

    Zhixuan Yu, Ziqian Bai, Abhimitra Meka, Feitong Tan, Qiangeng Xu, Rohit Pandey, Sean Fanello, Hyun Soo Park, and Yinda Zhang. One2avatar: Generative implicit head avatar for few-shot user adaptation. arXiv preprint arXiv:2402.11909, 2024. 2

  63. [63]

    Hravatar: High-quality and relightable gaussian head avatar

    Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Kangjie Chen, Minghan Qin, Yu Li, and Haoqian Wang. Hravatar: High-quality and relightable gaussian head avatar. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26285–26296, 2025. 7

  64. [64]

    Fate: Full-head gaussian avatar with textural editing from monocular video

    Jiawei Zhang, Zijian Wu, Zhiyang Liang, Yicheng Gong, Dongfang Hu, Yao Yao, Xun Cao, and Hao Zhu. Fate: Full-head gaussian avatar with textural editing from monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5535–5545, 2025. 7

  65. [65]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 6

  66. [66]

    Invertavatar: Incremental gan inversion for generalized head avatars

    Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Jinli Suo, and Yebin Liu. Invertavatar: Incremental gan inversion for generalized head avatars. In ACM SIGGRAPH 2024 Conference Papers, pages 1–10, 2024. 6, 4

  67. [67]

    Headgap: Few-shot 3d head avatar via generalizable gaussian priors

    Xiaozheng Zheng, Chao Wen, Zhaohu Li, Weiyi Zhang, Zhuo Su, Xu Chang, Yang Zhao, Zheng Lv, Xiaoyuan Zhang, Yongjie Zhang, et al. Headgap: Few-shot 3d head avatar via generalizable gaussian priors. In 2025 International Conference on 3D Vision (3DV), pages 946–957. IEEE, 2025. 2, 5

  68. [68]

    Prompt vision transformer for domain generalization

    Zangwei Zheng, Xiangyu Yue, Kai Wang, and Yang You. Prompt vision transformer for domain generalization. arXiv preprint arXiv:2208.08914, 2022. 2

  69. [69]

    Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks

    Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Hongsheng Li, Xiaohua Wang, and Jifeng Dai. Uni-perceiver: Pre-training unified architecture for generic perception for zero-shot and few-shot tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16804–16815, 2022. 2

  70. [70]

    Instant volumetric head avatars

    Wojciech Zielonka, Timo Bolkart, and Justus Thies. Instant volumetric head avatars. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4574–4584, 2023. 7

FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
Supplementary Material

Figure 9. Interpolation of 3D Head Avatars. FlexAvatar can produce real...