pith. machine review for the scientific record.

arxiv: 2604.04787 · v2 · submitted 2026-04-06 · 💻 cs.CV

Recognition: no theorem link

AvatarPointillist: AutoRegressive 4D Gaussian Avatarization


Pith reviewed 2026-05-10 18:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D Gaussian avatars · autoregressive generation · Gaussian Splatting · single image avatarization · Transformer decoder · point cloud generation · animation binding prediction

The pith

A decoder-only Transformer autoregressively builds point clouds for 4D Gaussian avatars from one portrait image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method that turns a single photo into a dynamic, animatable 4D avatar by generating its underlying points one at a time. A Transformer model decides on the fly how many points to use and where they should go, while also figuring out how each point should move with the body. After the points are created, a separate decoder converts them into full Gaussian attributes, using hidden information from the first model to sharpen the final result. The authors show that this step-by-step process produces avatars that look realistic when rendered and can be controlled for animation without extra fixes. If the approach holds up, it could simplify the creation of personalized moving avatars for games, video, and virtual reality.

Core claim

The central claim is that autoregressively generating the points of a 3D Gaussian Splatting representation with a decoder-only Transformer, while jointly predicting per-point binding information, enables adaptive point density and produces high-fidelity, controllable 4D avatars from a single image once the Gaussian decoder is conditioned on the autoregressive latent features.

What carries the argument

Decoder-only Transformer that autoregressively outputs points and their animation bindings for a 3D Gaussian Splatting point cloud, followed by a latent-conditioned Gaussian decoder.
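
To make the machinery concrete, here is a minimal sketch of what such a geometry stage could look like in PyTorch. It is illustrative only: the interface, dimensions, and the stop-flag mechanism for adaptive point count are assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the AR geometry stage (not the paper's code).
# A decoder-only Transformer, conditioned on image tokens, emits one point per
# step: its 3D position, binding logits over driver nodes, and a stop flag.
import torch
import torch.nn as nn

class ARPointGenerator(nn.Module):
    def __init__(self, d_model=512, n_bindings=64, n_layers=12, n_heads=8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.point_embed = nn.Linear(3, d_model)          # embed previous point
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))
        self.to_xyz = nn.Linear(d_model, 3)               # next point position
        self.to_binding = nn.Linear(d_model, n_bindings)  # per-point binding logits
        self.to_stop = nn.Linear(d_model, 1)              # adaptive point count

    @torch.no_grad()
    def generate(self, image_tokens, max_points=4096):
        """image_tokens: (1, S, d_model) encoder features of the input portrait."""
        seq = self.start
        points, bindings, feats = [], [], []
        for _ in range(max_points):
            mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
            h = self.decoder(seq, image_tokens, tgt_mask=mask)[:, -1]
            if points and torch.sigmoid(self.to_stop(h)) > 0.5:
                break  # the model itself decides how many points the subject needs
            xyz = self.to_xyz(h)
            points.append(xyz)
            bindings.append(self.to_binding(h).softmax(-1))
            feats.append(h)  # latents later fed to the Gaussian decoder
            seq = torch.cat([seq, self.point_embed(xyz).unsqueeze(1)], dim=1)
        return (torch.stack(points, 1),    # (1, N, 3) positions
                torch.stack(bindings, 1),  # (1, N, n_bindings) binding weights
                torch.stack(feats, 1))     # (1, N, d_model) AR latent features
```

The per-step latents are retained in this sketch because the paper's second stage is conditioned on exactly these features.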

If this is right

  • Point count and density adjust automatically to match the complexity of each subject.
  • Animation becomes possible directly from the predicted per-point binding data (a mechanics sketch follows this list).
  • Conditioning the Gaussian decoder on autoregressive latent features improves final render quality.
  • The full pipeline works from a single portrait without requiring multi-view input or post-processing.
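
On the animation point, the paper does not specify the binding representation. One plausible reading, consistent with rigged-Gaussian work such as GaussianAvatars and the FLAME template mentioned in the Figure 5 caption, is a linear-blend scheme in which each Gaussian center follows a weighted set of driver nodes. The sketch below shows only the mechanics such predicted bindings would enable; the weighting scheme and FLAME drivers are assumptions.

```python
# Hypothetical linear-blend use of predicted per-point bindings (illustrative).
import torch

def animate_centers(offsets, weights, driver_nodes):
    """
    offsets:      (N, K, 3) predicted offsets of each point from its K bound nodes
    weights:      (N, K)    predicted binding weights over those nodes (sum to 1)
    driver_nodes: (N, K, 3) posed positions of the bound nodes for this frame,
                  gathered from the deformed FLAME mesh
    returns:      (N, 3)    animated Gaussian centers
    """
    posed = driver_nodes + offsets                 # each point moves with its nodes
    return (weights.unsqueeze(-1) * posed).sum(1)  # blend across the K bindings
```

Each frame, the driver nodes would be re-gathered from the posed template and the centers recomputed, while the decoder-predicted scales, rotations, and colors are reused unchanged.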

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sequential generation style could support incremental updates, such as adding detail to an avatar over time.
  • Training data requirements might decrease because the model learns local point decisions rather than global arrangements at once.
  • The same autoregressive structure could be tested on related tasks like generating dynamic scenes or objects beyond human avatars.

Load-bearing premise

That sequential autoregressive point generation plus joint binding prediction will produce higher-fidelity and more controllable avatars than generating all points simultaneously.

What would settle it

A head-to-head test on the same single-image input set where a non-autoregressive Gaussian avatar method matches or exceeds the autoregressive version on photorealism and animation quality metrics.
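
Concretely, the settling experiment is a paired evaluation: render the autoregressive and non-autoregressive methods on identical inputs and driving sequences, then score frames against ground truth. A minimal PSNR harness is sketched below; perceptual metrics such as LPIPS or FID, standard in this literature, would slot into the same loop (the metric choice here is an assumption, not the paper's protocol).

```python
# Minimal paired-evaluation sketch for the proposed head-to-head (illustrative).
import torch

def psnr(pred, target, max_val=1.0):
    """pred, target: (T, 3, H, W) renders in [0, 1]; returns per-frame PSNR (T,)."""
    mse = torch.mean((pred - target) ** 2, dim=(1, 2, 3)).clamp_min(1e-12)
    return 10.0 * torch.log10((max_val ** 2) / mse)

def head_to_head(renders_ar, renders_baseline, frames_gt):
    """All arguments (T, 3, H, W); same subjects and driving poses for both methods."""
    return psnr(renders_ar, frames_gt).mean(), psnr(renders_baseline, frames_gt).mean()
```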

Figures

Figures reproduced from arXiv: 2604.04787 by Boyao Zhou, Hongyu Liu, Qifeng Chen, Runtao Liu, Xuan Wang, Yating Wang, Yue Ma, Yujun Shen, Zijian Wu, Ziyu Wan.

Figure 1. Gallery of the proposed AvatarPointillist. The leftmost column shows the input image, the middle column displays the Gaussian point cloud generated by our AR model, and the rightmost column presents the final drivable 4D Gaussian avatar. The generation order proceeds from bottom to top and left to right. It can be seen that our AR model directly models the Gaussian point cloud, allowing it to simulate the a…

Figure 2. Comparison of different Gaussian point cloud modeling …

Figure 3. Overview of our framework. It consists of two modules: an autoregressive (AR) model for Gaussian geometry generation and …

Figure 4. Qualitative comparison with state-of-the-art methods. The leftmost column shows the input images, with the target image …

Figure 5. Visualization of ablation study on input setting of Gaussian decoder. The leftmost column shows the input. The FLAME Positions baseline, similar to the LAM method, uses the canonical FLAME mesh vertices as a template and only applies decoder-predicted offsets to deform this template into a final Gaussian point cloud. Pointwise AR Feature refers to using only the AR features (F_p^n) without positional info…
read the original abstract

We introduce AvatarPointillist, a novel framework for generating dynamic 4D Gaussian avatars from a single portrait image. At the core of our method is a decoder-only Transformer that autoregressively generates a point cloud for 3D Gaussian Splatting. This sequential approach allows for precise, adaptive construction, dynamically adjusting point density and the total number of points based on the subject's complexity. During point generation, the AR model also jointly predicts per-point binding information, enabling realistic animation. After generation, a dedicated Gaussian decoder converts the points into complete, renderable Gaussian attributes. We demonstrate that conditioning the decoder on the latent features from the AR generator enables effective interaction between stages and markedly improves fidelity. Extensive experiments validate that AvatarPointillist produces high-quality, photorealistic, and controllable avatars. We believe this autoregressive formulation represents a new paradigm for avatar generation, and we will release our code to inspire future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces AvatarPointillist, a framework for generating dynamic 4D Gaussian avatars from a single portrait image. A decoder-only Transformer autoregressively produces a point cloud for 3D Gaussian Splatting while jointly predicting per-point binding information to enable animation. A latent-conditioned Gaussian decoder then converts these points into full renderable Gaussian attributes, with the authors claiming that the autoregressive sequential construction allows adaptive point density and number based on subject complexity, yielding photorealistic and controllable results.

Significance. If the empirical claims hold, the autoregressive formulation for adaptive point generation combined with joint binding prediction and latent conditioning offers a potentially more flexible pipeline for controllable 4D avatar creation than non-autoregressive alternatives. The explicit plan to release code is a strength that supports reproducibility and future work in the area.

minor comments (2)
  1. [Experiments] The abstract states that 'extensive experiments validate' improved fidelity and controllability, but the main text should include explicit quantitative tables with baselines, ablations on the autoregressive component versus non-autoregressive variants, and error bars or statistical significance to make the central empirical claim fully verifiable.
  2. [Method] In the method description, the precise mechanism by which latent features from the AR generator are injected into the Gaussian decoder (e.g., cross-attention, concatenation, or FiLM) should be specified with an equation or diagram to ensure the claimed 'effective interaction between stages' is reproducible. (One such candidate is sketched below.)
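
To illustrate that second point, one of the candidates the report names, FiLM, would look roughly as follows: the AR latent for each point predicts a feature-wise scale and shift that modulates the Gaussian decoder's hidden activations. This is a sketch of that single option under assumed layer sizes, not the paper's (unspecified) mechanism.

```python
# Hypothetical FiLM-style injection of AR latents into the Gaussian decoder.
import torch
import torch.nn as nn

class FiLMGaussianDecoder(nn.Module):
    def __init__(self, d_point=3, d_ar=512, d_hidden=256, d_gauss=14):
        super().__init__()
        self.inp = nn.Linear(d_point, d_hidden)
        self.film = nn.Linear(d_ar, 2 * d_hidden)  # per-point scale and shift
        self.out = nn.Sequential(nn.ReLU(), nn.Linear(d_hidden, d_gauss))

    def forward(self, xyz, ar_feat):
        """xyz: (N, 3) generated points; ar_feat: (N, d_ar) AR latents.
        Returns (N, d_gauss) Gaussian attributes (e.g., opacity, scale,
        rotation quaternion, color)."""
        h = self.inp(xyz)
        gamma, beta = self.film(ar_feat).chunk(2, dim=-1)
        return self.out(gamma * h + beta)  # FiLM: feature-wise affine modulation
```

Concatenation would instead feed the point and its AR latent through one MLP; cross-attention would let point queries attend over all AR latents. Any of the three would satisfy the referee's request once written down.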

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of AvatarPointillist, the recognition of the potential advantages of the autoregressive adaptive point generation combined with joint binding prediction, and the recommendation for minor revision. The explicit note on code release as a reproducibility strength is appreciated. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces AvatarPointillist as a new autoregressive pipeline: a decoder-only Transformer generates sequential points and joint binding predictions for 3D Gaussian Splatting, followed by a latent-conditioned Gaussian decoder. No equations, fitted parameters, or derivations are shown that reduce by construction to prior outputs or self-citations. The abstract and method description present the autoregressive formulation and stage interaction as an independent design choice without invoking uniqueness theorems, ansatzes from prior self-work, or renaming of known results. The central claim of improved fidelity and controllability therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method implicitly assumes standard Transformer training and Gaussian splatting rendering work as described without additional unstated priors.

pith-pipeline@v0.9.0 · 5483 in / 1085 out tokens · 29958 ms · 2026-05-10T18:44:35.707426+00:00 · methodology


Reference graph

Works this paper leans on

90 extracted references · 18 canonical work pages · 4 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pages 1877–1901, 2020.

  2. [2]

    Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures

    Marcel C Buehler, Gengyan Li, Erroll Wood, Leonhard Helminger, Xu Chen, Tanmay Shah, Daoye Wang, Stephan Garbin, Sergio Orts-Escolano, Otmar Hilliges, et al. Cafca: High-quality novel view synthesis of expressive faces from casual few-shot captures. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024.

  3. [3]

    Neural head reenactment with latent pose descriptors

    Egor Burkov, Igor Pasechnik, Artur Grigorev, and Victor Lempitsky. Neural head reenactment with latent pose descriptors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13786–13795, 2020.

  4. [4]

    Efficient geometry-aware 3d generative adversarial networks

    Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J. Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.

  5. [5]

    Monogaussianavatar: Monocular gaussian point-based head avatar

    Yufan Chen, Lizhen Wang, Qijing Li, Hongjiang Xiao, Shengping Zhang, Hongxun Yao, and Yebin Liu. Monogaussianavatar: Monocular gaussian point-based head avatar. In ACM SIGGRAPH 2024 Conference Papers, pages 1–9, 2024.

  6. [6]

    Generalizable and animatable gaussian head avatar

    Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. In The Thirty-eighth Annual Conference on Neural Information Processing Systems,

  7. [7]

    Gpavatar: Generalizable and precise head avatar from image(s)

    Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, and Tatsuya Harada. Gpavatar: Generalizable and precise head avatar from image(s). arXiv preprint arXiv:2401.10215, 2024.

  8. [8]

    Emoca: Emotion driven monocular face capture and animation

    Radek Daněček, Michael J Black, and Timo Bolkart. Emoca: Emotion driven monocular face capture and animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20311–20322, 2022.

  9. [9]

    Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set

    Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.

  10. [10]

    Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data

    Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, and Baoyuan Wang. Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7119–7130, 2024.

  11. [11]

    Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer

    Yu Deng, Duomin Wang, and Baoyuan Wang. Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. arXiv preprint arXiv:2403.13570, 2024.

  12. [12]

    Megaportraits: One-shot megapixel neural head avatars

    Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. Megaportraits: One-shot megapixel neural head avatars. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2663–2671, 2022.

  13. [13]

    Learning an animatable detailed 3d face model from in-the-wild images

    Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (ToG), 40(4):1–13, 2021.

  14. [14]

    Dynamic neural radiance fields for monocular 4d facial avatar reconstruction

    Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8649–8658, 2021.

  15. [15]

    Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction

    Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Pixel3dmm: Versatile screen-space priors for single-image 3d face reconstruction,

  16. [16]

    Toontalker: Cross-domain face reenactment

    Yuan Gong, Yong Zhang, Xiaodong Cun, Fei Yin, Yanbo Fan, Xuan Wang, Baoyuan Wu, and Yujiu Yang. Toontalker: Cross-domain face reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7690–7700, 2023.

  17. [17]

    Generative adversarial networks

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.

  18. [18]

    Neural head avatars from monocular rgb videos

    Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. Neural head avatars from monocular rgb videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18653–18664, 2022.

  19. [19]

    Liveportrait: Efficient portrait animation with stitching and retargeting control

    Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, and Di Zhang. Liveportrait: Efficient portrait animation with stitching and retargeting control. arXiv preprint arXiv:2407.03168, 2024.

  20. [20]

    AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.

  21. [21]

    Meshtron: High-fidelity, artist-like 3d mesh generation at scale

    Zekun Hao, David W Romero, Tsung-Yi Lin, and Ming-Yu Liu. Meshtron: High-fidelity, artist-like 3d mesh generation at scale. arXiv preprint arXiv:2412.09548, 2024.

  22. [22]

    Lam: Large avatar model for one-shot animatable gaussian head

    Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Papers, pages 1–13, 2025.

  23. [23]

    Toward robust and unconstrained full range of rotation head pose estimation

    Thorsten Hempel, Ahmed A. Abdelrahman, and Ayoub Al-Hamadi. Toward robust and unconstrained full range of rotation head pose estimation. IEEE Transactions on Image Processing, 33:2377–2387, 2024.

  24. [24]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

  25. [25]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  26. [26]

    Depth-aware generative adversarial network for talking head video generation

    Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. Depth-aware generative adversarial network for talking head video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3397–3406, 2022.

  27. [27]

    Headnerf: A real-time nerf-based parametric head model

    Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20374–20384, 2022.

  28. [28]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

  29. [29]

    Alias-free generative adversarial networks

    Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In Proc. NeurIPS, 2021.

  30. [30]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  31. [31]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  32. [32]

    Nersemble: Multi-view radiance field reconstruction of human heads

    Tobias Kirschstein, Shenhan Qian, Simon Giebenhain, Tim Walter, and Matthias Nießner. Nersemble: Multi-view radiance field reconstruction of human heads. ACM Trans. Graph., 42(4), 2023.

  33. [33]

    Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars

    Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, and Shunsuke Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. arXiv preprint arXiv:2502.20220, 2025.

  34. [34]

    ARMesh: Autoregressive mesh generation via next-level-of-detail prediction

    Jiabao Lei, Kewei Shi, Zhihao Liang, and Kui Jia. ARMesh: Autoregressive mesh generation via next-level-of-detail prediction. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  35. [35]

    Learning a model of facial shape and expression from 4D scans

    Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017.

  36. [36]

    One-shot high-fidelity talking-head synthesis with deformable neural radiance field

    Weichuang Li, Longhao Zhang, Dong Wang, Bin Zhao, Zhigang Wang, Mulin Chen, Bang Zhang, Zhongjian Wang, Liefeng Bo, and Xuelong Li. One-shot high-fidelity talking-head synthesis with deformable neural radiance field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17969–17978, 2023.

  37. [37]

    Generalizable one-shot 3d neural head avatar

    Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, and Jan Kautz. Generalizable one-shot 3d neural head avatar. Advances in Neural Information Processing Systems, 36, 2024.

  38. [38]

    Human motionformer: Transferring human motions with vision transformers

    Hongyu Liu, Xintong Han, Chengbin Jin, Lihui Qian, Huawei Wei, Zhe Lin, Faqiang Wang, Haoye Dong, Yibing Song, Jia Xu, et al. Human motionformer: Transferring human motions with vision transformers. arXiv preprint arXiv:2302.11306, 2023.

  39. [39]

    Headartist: Text-conditioned 3d head generation with self score distillation

    Hongyu Liu, Xuan Wang, Ziyu Wan, Yujun Shen, Yibing Song, Jing Liao, and Qifeng Chen. Headartist: Text-conditioned 3d head generation with self score distillation. In ACM SIGGRAPH 2024 Conference Papers, New York, NY, USA, 2024. Association for Computing Machinery.

  40. [40]

    Avatarartist: Open-domain 4d avatarization

    Hongyu Liu, Xuan Wang, Ziyu Wan, Yue Ma, Jingye Chen, Yanbo Fan, Yujun Shen, Yibing Song, and Qifeng Chen. Avatarartist: Open-domain 4d avatarization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10758–10769, 2025.

  41. [41]

    Headartist-vl: Vision/language guided 3d head generation with self score distillation

    Hongyu Liu, Xuan Wang, Ziyu Wan, Yujun Shen, Yibing Song, Jing Liao, and Qifeng Chen. Headartist-vl: Vision/language guided 3d head generation with self score distillation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  42. [42]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2019.

  43. [43]

    MediaPipe: A framework for building perception pipelines

    Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.

  44. [44]

    Follow your pose: Pose-guided text-to-video generation using pose-free videos

    Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4117–4125, 2024.

  45. [45]

    Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

    Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. arXiv preprint arXiv:2406.01900, 2024.

  46. [46]

    Otavatar: One-shot talking face avatar with controllable tri-plane rendering

    Zhiyuan Ma, Xiangyu Zhu, Guo-Jun Qi, Zhen Lei, and Lei Zhang. Otavatar: One-shot talking face avatar with controllable tri-plane rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16910, 2023.

  47. [47]

    Jewett, Simon Venshtain, Christopher Heilman, Yueh-Tung Chen, Sidi Fu, Mohamed Ezzeldin A

    Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Simon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Simon Venshtain, Christopher Heil...

  48. [48]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  49. [49]

    AutoSDF: Shape priors for 3d completion, reconstruction and generation

    Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shubham Tulsiani. AutoSDF: Shape priors for 3d completion, reconstruction and generation. In CVPR, 2022.

  50. [50]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  51. [51]

    Renderme-360: A large digital asset library and benchmarks towards high-fidelity head avatars

    Dongwei Pan, Long Zhuo, Jingtan Piao, Huiwen Luo, Wei Cheng, Yuxin Wang, Siming Fan, Shengqi Liu, Lei Yang, Bo Dai, et al. Renderme-360: A large digital asset library and benchmarks towards high-fidelity head avatars. Advances in Neural Information Processing Systems, 36, 2024.

  52. [52]

    Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians

    Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024.

  53. [53]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  54. [54]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  55. [55]

    Animating arbitrary objects via deep motion transfer

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. Animating arbitrary objects via deep motion transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2377–2386, 2019.

  56. [56]

    First order motion model for image animation

    Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. First order motion model for image animation. Advances in Neural Information Processing Systems, 32, 2019.

  57. [57]

    Motion representations for articulated animation

    Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. Motion representations for articulated animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13653–13662, 2021.

  58. [58]

    Meshgpt: Generating triangle meshes with decoder-only transformers

    Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, and Matthias Nießner. Meshgpt: Generating triangle meshes with decoder-only transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  59. [59]

    Pointgrow: Autoregressively learned point cloud generation with self-attention

    Yongbin Sun, Yue Wang, Ziwei Liu, Joshua Siegel, and Sanjay Sarma. Pointgrow: Autoregressively learned point cloud generation with self-attention. In The IEEE Winter Conference on Applications of Computer Vision, pages 61–70, 2020.

  60. [60]

    Gaf: Gaussian avatar reconstruction from monocular videos via multi-view diffusion

    Jiapeng Tang, Davide Davoli, Tobias Kirschstein, Liam Schoneveld, and Matthias Niessner. Gaf: Gaussian avatar reconstruction from monocular videos via multi-view diffusion. arXiv preprint arXiv:2412.10209, 2024.

  61. [61]

    Cap4d: Creating animatable 4d portrait avatars with morphable multi-view diffusion models

    Felix Taubner, Ruihang Zhang, Mathieu Tuli, and David B Lindell. Cap4d: Creating animatable 4d portrait avatars with morphable multi-view diffusion models. arXiv preprint arXiv:2412.12093, 2024.

  62. [62]

    Real-time radiance fields for single-image portrait view synthesis

    Alex Trevithick, Matthew Chan, Michael Stengel, Eric Chan, Chao Liu, Zhiding Yu, Sameh Khamis, Ravi Ramamoorthi, and Koki Nagano. Real-time radiance fields for single-image portrait view synthesis. 2023.

  63. [63]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems,

  64. [64]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

  65. [65]

    Progressive disentangled representation learning for fine-grained controllable talking head synthesis

    Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, and Baoyuan Wang. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17979–17989, 2023.

  66. [66]

    3d gaussian head avatars with expressive dynamic appearances by compact tensorial representations

    Yating Wang, Xuan Wang, Ran Yi, Yanbo Fan, Jichen Hu, Jingcheng Zhu, and Lizhuang Ma. 3d gaussian head avatars with expressive dynamic appearances by compact tensorial representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21117–21126, 2025.

  67. [67]

    Aniportrait: Audio-driven synthesis of photorealistic portrait animation

    Huawei Wei, Zejun Yang, and Zhisheng Wang. Aniportrait: Audio-driven synthesis of photorealistic portrait animation. arXiv:2403.17694, 2024.

  68. [68]

    Flashavatar: High-fidelity head avatar with efficient gaussian embedding

    Jun Xiang, Xuan Gao, Yudong Guo, and Juyong Zhang. Flashavatar: High-fidelity head avatar with efficient gaussian embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1802–1812, 2024.

  69. [69]

    Vfhq: A high-quality dataset and benchmark for video face super-resolution

    Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, and Ying Shan. Vfhq: A high-quality dataset and benchmark for video face super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022.

  70. [70]

    X-portrait: Expressive portrait animation with hierarchical motion attention

    You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, and Linjie Luo. X-portrait: Expressive portrait animation with hierarchical motion attention. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  71. [71]

    Hallo: Hierarchical audio-driven visual synthesis for portrait image animation

    Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, and Siyu Zhu. Hallo: Hierarchical audio-driven visual synthesis for portrait image animation. arXiv preprint arXiv:2406.08801, 2024.

  72. [72]

    VASA-1: Lifelike audio-driven talking faces generated in real time

    Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo. VASA-1: Lifelike audio-driven talking faces generated in real time. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

  73. [73]

    Avatarmav: Fast 3d head avatar reconstruction using motion-aware neural voxels

    Yuelang Xu, Lizhen Wang, Xiaochen Zhao, Hongwen Zhang, and Yebin Liu. Avatarmav: Fast 3d head avatar reconstruction using motion-aware neural voxels. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–10,

  74. [74]

    3d gaussian parametric head model

    Yuelang Xu, Lizhen Wang, Zerong Zheng, Zhaoqi Su, and Yebin Liu. 3d gaussian parametric head model. In European Conference on Computer Vision, pages 129–147. Springer,

  75. [75]

    Facescape: A large-scale high quality 3d face dataset and detailed riggable 3d face prediction

    Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. Facescape: A large-scale high quality 3d face dataset and detailed riggable 3d face prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 601–610,

  76. [76]

    Real3d-portrait: One-shot realistic 3d talking portrait synthesis

    Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, et al. Real3d-portrait: One-shot realistic 3d talking portrait synthesis. arXiv preprint arXiv:2401.08503,

  77. [77]

    Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan

    Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. In European Conference on Computer Vision, pages 85–101. Springer,

  78. [78]

    Nofa: Nerf-based one-shot facial avatar reconstruction

    Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, et al. Nofa: Nerf-based one-shot facial avatar reconstruction. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–12, 2023.

  79. [79]

    One2avatar: Generative implicit head avatar for few-shot user adaptation

    Zhixuan Yu, Ziqian Bai, Abhimitra Meka, Feitong Tan, Qiangeng Xu, Rohit Pandey, Sean Fanello, Hyun Soo Park, and Yinda Zhang. One2avatar: Generative implicit head avatar for few-shot user adaptation. arXiv preprint arXiv:2402.11909, 2024.

  80. [80]

    Few-shot adversarial learning of realistic neural talking head models

    Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9459–9468, 2019.

Showing first 80 references.