FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
Pith reviewed 2026-05-16 21:28 UTC · model grok-4.3
The pith
FlexAvatar produces complete 3D head avatars from one image by training a transformer on mixed monocular and multi-view data via bias-sink tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlexAvatar is a transformer-based 3D portrait animation model equipped with learnable data-source tokens, termed bias sinks. The tokens allow unified training on monocular videos and multi-view image sets, thereby disentangling driving signals from target viewpoints. The resulting model generates complete 3D head avatars that support realistic facial animation even when conditioned on a single input photograph.
What carries the argument
Learnable data-source tokens (bias sinks) placed inside a transformer-based 3D portrait animation network that separate motion signals from viewpoint during mixed-dataset training.
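As a rough illustration of what a data-source token could look like, the sketch below prepends one learnable vector per data source to an otherwise unchanged token sequence. Everything here is an assumption made for illustration: the names `SOURCES`, `make_source_tokens`, and `prepend_source_token`, the token dimension, the initialization scale, and the choice to prepend rather than inject the token elsewhere. The text above does not specify where the paper places the tokens or which token is selected at inference.

```python
import random

D_MODEL = 8                              # hypothetical token width
SOURCES = ("monocular", "multiview")     # the two data regimes named in the text


def make_source_tokens(dim, seed=0):
    """Create one small random vector per data source.

    In a real model these would be trainable parameters updated by the
    optimizer; here they are plain lists so the sketch stays stdlib-only.
    """
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in SOURCES}


def prepend_source_token(tokens, source_tokens, source):
    """Return a new sequence with the chosen source token placed in front."""
    if source not in source_tokens:
        raise KeyError(f"unknown data source: {source}")
    return [list(source_tokens[source])] + [list(t) for t in tokens]


source_tokens = make_source_tokens(D_MODEL)
image_tokens = [[0.0] * D_MODEL for _ in range(4)]   # stand-in for encoder output

# During training, each batch would carry the token of the dataset it came from.
mono_seq = prepend_source_token(image_tokens, source_tokens, "monocular")
# Selecting the multi-view token here is an illustrative assumption only.
multi_seq = prepend_source_token(image_tokens, source_tokens, "multiview")
```

The design idea this is meant to convey is that dataset-specific bias has a dedicated place to go (the token), so the shared weights do not have to absorb it.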
If this is right
- Single-image avatars exhibit full 3D completeness rather than the partial reconstructions typical of monocular-only training.
- View extrapolation yields realistic facial animations without the holes or distortions seen in prior methods.
- The latent avatar space remains smooth enough for identity interpolation and for fitting to arbitrary numbers of input observations.
- Joint training combines the generalization strength of monocular video with the geometric fidelity of multi-view supervision.
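The identity-interpolation claim in the list above can be pictured with a minimal sketch that blends two latent avatar codes. It assumes the codes are plain fixed-length vectors and that linear interpolation is the blending rule; neither detail is given in the text above, and `lerp_latents` and `interpolation_path` are hypothetical helper names.

```python
def lerp_latents(z_a, z_b, t):
    """Linearly blend two latent codes; t=0 gives z_a, t=1 gives z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent codes must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """Return `steps` evenly spaced codes from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    return [lerp_latents(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Toy two-dimensional codes standing in for two identities.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], steps=5)
```

If the latent space is as smooth as claimed, decoding each code along such a path should yield plausible intermediate identities; the decoder itself is outside the scope of this sketch.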
Where Pith is reading between the lines
- The same token-based disentanglement could be applied to full-body avatar creation if analogous motion-viewpoint biases exist in body datasets.
- Lowering the need for dense multi-view captures would make high-quality personalized 3D characters more accessible for consumer applications.
- Bias sinks may prove useful in other conditional synthesis tasks where training data come from sources with mismatched statistical properties.
Load-bearing premise
Learnable data-source tokens can reliably disentangle driving signals from target viewpoints without introducing artifacts or requiring carefully balanced dataset mixtures.
What would settle it
Remove the bias-sink tokens, retrain on the same mixed data, and check whether single-image avatars exhibit missing geometry or fail to produce coherent novel views.
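One way to make this check quantitative is to compare an image metric on held-out viewpoints for the model with and without the tokens. The sketch below uses PSNR, which the simulated rebuttal further down also names; the flat pixel lists, the `[0, 1]` value range, and the helper names `psnr` and `mean_psnr_gap` are assumptions for illustration, not the paper's evaluation code.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf              # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_psnr_gap(with_sinks, without_sinks, targets):
    """Average PSNR difference (with minus without) over held-out views.

    Each argument is a list of flat images, one per held-out viewpoint.
    A positive value means the variant with tokens is closer to the target.
    """
    gaps = [
        psnr(a, t) - psnr(b, t)
        for a, b, t in zip(with_sinks, without_sinks, targets)
    ]
    return sum(gaps) / len(gaps)


# Toy single-pixel "renders" for two held-out views (made-up numbers).
targets = [[1.0], [0.5]]
with_sinks = [[0.9], [0.4]]
without_sinks = [[0.0], [0.0]]
gap = mean_psnr_gap(with_sinks, without_sinks, targets)
```

PSNR alone would not reveal missing geometry that happens to be off-screen, so a check like this is only meaningful on viewpoints far from the input view.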
Original abstract
We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FlexAvatar, a transformer-based 3D portrait animation model that creates high-quality, complete head avatars from a single image. It identifies the entanglement between driving signal and target viewpoint in monocular training as the root cause of incomplete 3D reconstructions. To address this, it proposes learnable data source tokens (bias sinks) that enable unified training across monocular and multi-view datasets, so that at inference the model inherits strong generalization from monocular data and full 3D completeness from multi-view supervision. The approach also produces a smooth latent avatar space for identity interpolation and flexible fitting. Evaluations on single-view, few-shot, and monocular tasks show improved view extrapolation over prior methods.
Significance. If the bias-sink mechanism is shown to reliably disentangle signals without new artifacts, the work would meaningfully advance single-image 3D avatar creation by combining the complementary strengths of the two data regimes, yielding more complete and animatable avatars than monocular-only baselines while retaining their generalization. The smooth latent space and support for variable input counts are additional practical strengths.
major comments (2)
- [Method (bias-sink formulation) and Experiments] The central claim that bias sinks enable unified training and artifact-free inheritance of 3D completeness from multi-view data during monocular inference is load-bearing, yet the manuscript provides no targeted ablation or quantitative verification (e.g., view-extrapolation error with/without sinks, or artifact metrics on held-out viewpoints) demonstrating that the tokens separate driving signals from viewpoint without reintroducing entanglement or new failure modes.
- [Training procedure and §4] The training procedure description does not specify how the mixture of monocular and multi-view data is balanced or whether the learned tokens remain effective when the test distribution differs from the training mixture; this directly affects whether the claimed leverage of both data sources holds at inference.
minor comments (2)
- [Model architecture] Clarify the exact architecture placement of the bias sinks within the transformer (e.g., which attention layers and how they are initialized) to allow reproduction.
- [Experiments and results] The abstract states 'extensive evaluations' on multiple tasks; the results section should include error bars, statistical significance tests against baselines, and explicit discussion of any data-selection criteria to address potential post-hoc concerns.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate clarifications and additional analysis where appropriate.
- Referee: [Method (bias-sink formulation) and Experiments] The central claim that bias sinks enable unified training and artifact-free inheritance of 3D completeness from multi-view data during monocular inference is load-bearing, yet the manuscript provides no targeted ablation or quantitative verification (e.g., view-extrapolation error with/without sinks, or artifact metrics on held-out viewpoints) demonstrating that the tokens separate driving signals from viewpoint without reintroducing entanglement or new failure modes.
  Authors: We agree that isolating the effect of the bias sinks with targeted quantitative ablations would strengthen the central claim. In the revised manuscript we will add an ablation in Section 4 that reports view-extrapolation metrics (PSNR, LPIPS, and perceptual scores on held-out viewpoints) with and without the learned tokens, together with qualitative inspection for new artifacts or re-entanglement. While our existing single-view and few-shot results already demonstrate the combined benefit, we acknowledge the value of this explicit verification.
  Revision: yes
- Referee: [Training procedure and §4] The training procedure description does not specify how the mixture of monocular and multi-view data is balanced or whether the learned tokens remain effective when the test distribution differs from the training mixture; this directly affects whether the claimed leverage of both data sources holds at inference.
  Authors: We will revise Section 4 to explicitly state the sampling ratios used to balance monocular and multi-view data during training. Our evaluations already span single-view, few-shot, and monocular test regimes whose input distributions differ from the training mixture; the consistent gains in view extrapolation and completeness indicate that the tokens remain effective. We will add a short discussion of this robustness in the revised text.
  Revision: yes
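The sampling ratio discussed in the second exchange can be pictured as a per-step choice of data source. The sketch below is a minimal illustration under stated assumptions: the helper name `sample_sources`, the independent per-step draw, and the ratio `0.25` are placeholders, since the actual ratios are exactly what the referee notes the manuscript does not report.

```python
import random


def sample_sources(n_steps, p_multiview, seed=0):
    """Pick a data source for each training step.

    p_multiview is the probability of drawing a multi-view batch; the
    remaining steps draw monocular video batches.
    """
    if not 0.0 <= p_multiview <= 1.0:
        raise ValueError("p_multiview must lie in [0, 1]")
    rng = random.Random(seed)
    return [
        "multiview" if rng.random() < p_multiview else "monocular"
        for _ in range(n_steps)
    ]


# 0.25 is a placeholder ratio for illustration, not a value from the paper.
schedule = sample_sources(1000, p_multiview=0.25)
multiview_fraction = schedule.count("multiview") / len(schedule)
```

Reporting a value like `p_multiview`, and how results change as it varies, is the kind of detail that would let a reader judge how sensitive the tokens are to the training mixture.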
Circularity Check
No significant circularity; bias sinks are externally optimized parameters
The paper introduces learnable data source tokens (bias sinks) within a transformer architecture to enable unified training across monocular and multi-view datasets. These tokens are optimized during training on external data sources and the resulting model is evaluated on separate held-out tasks (single-view, few-shot, monocular avatar creation). No equation or claim reduces a 'prediction' to a fitted quantity defined by the same data, nor does the central disentanglement mechanism rely on a self-citation chain, uniqueness theorem from prior author work, or smuggled ansatz. The derivation chain is self-contained: the architectural choice is trained end-to-end and its efficacy is demonstrated through independent experimental benchmarks rather than by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- bias sinks
axioms (1)
- domain assumption: A transformer architecture can represent 3D portrait animation and viewpoint disentanglement when augmented with data-source tokens.
invented entities (1)
- bias sinks: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "entanglement between driving signal and target viewpoint"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- UIKA: Fast Universal Head Avatar from Pose-Free Images
  UIKA is a feed-forward animatable Gaussian head model using UV-guided correspondence estimation and learnable UV tokens with dual-level attention, trained on large-scale synthetic data to handle pose-free inputs.
- Learning a Delighting Prior for Facial Appearance Capture in the Wild
  A delighting network trained via Dataset Latent Modulation on heterogeneous OLAT and Light Stage data enables high-quality in-the-wild facial reflectance capture from video and produces the NeRSemble-Scan dataset.