FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

Gordon Guocheng Qian; Hao Li; Jiahao Luo; Jian Wang; Thuan Hoang Nguyen; Yinyu Nie

arxiv: 2605.15320 · v1 · pith:RQSG5Y2Bnew · submitted 2026-05-14 · 💻 cs.GR · cs.CV · cs.LG

Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian, Jian Wang

Pith reviewed 2026-05-19 15:35 UTC · model grok-4.3

classification 💻 cs.GR cs.CV cs.LG

keywords avatar reconstruction, 3D Gaussian, few-shot learning, feed-forward, FLAME parameters, head avatar, generalizable priors, NeRSemble benchmark

The pith

A feed-forward model reconstructs animatable 3D Gaussian head avatars from few unposed photos in seconds without per-subject optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Avatar reconstruction has long depended on slow per-subject fitting or heavy preprocessing that scales poorly. FFAvatar instead trains a single network to map a handful of unposed portrait images directly into a high-quality, animatable 3D Gaussian avatar. It fuses the input views through a Multi-View Query-Former into one canonical representation and predicts the animation parameters end-to-end from pixels. A three-stage curriculum first pretrains on more than a million monocular video identities, then refines geometry on multi-view captures, and finally allows quick personalization. The result is faster reconstruction, stronger identity preservation, and higher animation fidelity than prior specialized methods.

Core claim

FFAvatar fuses multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former and animates that representation via FLAME parameters predicted directly from the input pixels, all learned through a three-stage curriculum of large-scale monocular pretraining, multi-view fine-tuning, and optional fast personalization.

What carries the argument

Multi-View Query-Former that fuses information from multiple source images into a unified canonical Gaussian representation while predicting FLAME animation parameters end-to-end from pixels.
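The Figure 3 caption describes the fusion step as attention with canonical FLAME vertices as queries and source-image features as keys and values. A minimal NumPy sketch of that pattern follows. It is not the authors' implementation: the single attention head, the random projection matrices, the function name `fuse_views`, and all tensor sizes are assumptions chosen only to show why the fused output has a fixed size regardless of how many source views are supplied.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(queries, view_feats, w_q, w_k, w_v):
    """Single-head cross-attention from canonical queries to all source views.

    queries:    (N, D)    one embedding per canonical query point (hypothetical)
    view_feats: (V, T, D) T image tokens for each of V source views
    returns:    fused features (N, D) and attention weights (N, V*T)
    """
    # Flatten all views into one token set so every query can attend to every view.
    tokens = view_feats.reshape(-1, view_feats.shape[-1])
    q = queries @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn

rng = np.random.default_rng(0)
N, V, T, D = 8, 3, 16, 32  # toy sizes, not taken from the paper
queries = rng.normal(size=(N, D))
view_feats = rng.normal(size=(V, T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

fused, attn = fuse_views(queries, view_feats, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Because the queries, not the inputs, set the output length, the same function accepts one view or many (`view_feats[:1]` works unchanged), which is the property a few-shot model with a variable number of inputs would need.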

If this is right

Reconstructs avatars in roughly two seconds without personalization and ten seconds with it.
Delivers a 5.5 dB PSNR gain over the prior leader LAM on the NeRSemble benchmark.
Supports real-time animation at 49 frames per second on a single NVIDIA A100 GPU.
Removes the need for hours-long per-subject optimization or expensive multi-view capture rigs at inference time.
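To put the headline numbers above in perspective, the short sketch below works through the arithmetic. It assumes the standard PSNR definition on images scaled to [0, 1]; the helper name `psnr` and the synthetic images are illustrative and do not come from the paper.

```python
import math
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)

# A gain of g dB corresponds to dividing the mean squared error by 10 ** (g / 10).
gain_db = 5.5
mse_ratio = 10 ** (gain_db / 10)   # roughly 3.5x lower MSE

# Per-frame time budget implied by the reported 49 FPS.
frame_ms = 1000.0 / 49             # roughly 20 ms per frame

# Sanity check on synthetic data: shrinking the error by sqrt(mse_ratio)
# should raise PSNR by exactly gain_db.
rng = np.random.default_rng(0)
target = rng.uniform(size=(64, 64, 3))
noise = rng.normal(scale=0.05, size=target.shape)
baseline = target + noise
improved = target + noise / math.sqrt(mse_ratio)
print(round(psnr(improved, target) - psnr(baseline, target), 2))
```

Read this way, a 5.5 dB gain means the mean squared pixel error is roughly 3.5 times lower than the baseline's, and 49 FPS leaves about 20 ms to animate and render each frame.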

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feed-forward pattern could be retrained on full-body or non-human subjects by swapping the underlying parametric model and expanding the pretraining corpus.
Because reconstruction is fast and pose-free, the approach may enable on-device avatar creation in consumer apps without cloud offloading.
If the learned priors remain stable across new camera hardware, the method could lower the barrier for casual users to generate personal avatars from ordinary smartphone selfies.

Load-bearing premise

The three-stage training on monocular video data from over one million identities plus later multi-view fine-tuning produces priors that generalize to arbitrary few-shot unposed inputs without separate offline pose or FLAME extraction.

What would settle it

A held-out test collection of unposed portraits recorded under lighting, poses, and identities absent from the million-identity pretraining set, where the model yields visibly broken geometry or low animation quality when compared against ground-truth multi-view reconstructions.

Figures

Figures reproduced from arXiv: 2605.15320 by Gordon Guocheng Qian, Hao Li, Jiahao Luo, Jian Wang, Thuan Hoang Nguyen, Yinyu Nie.

**Figure 2.** Three-stage training of FFAvatar. Scalable pretraining fosters generalization across unseen identities by training on our private large-scale multi-frame-per-identity dataset MFHQ-1M, multi-view fine-tuning enhances geometric fidelity by optimizing the pretrained weights on a small-scale set of 360° multi-view captures (e.g. Ava256 [18]), and lightweight personalization efficiently improves identity preser…

**Figure 3.** FFAvatar pipeline. FFAvatar reconstructs a canonical Gaussian head avatar from few-shot views using a Multi-view Query-Former, with canonical FLAME vertices as queries and source features as keys/values. An end-to-end FLAME Estimator predicts expression ψ, local articulation θ, and head pose π from driving frames, avoiding offline FLAME preprocessing. A few-to-many objective further improves generalization…

**Figure 4.** FFAvatar qualitative comparison for self-reenactment on the Ava256 test set (top two rows) and cross-reenactment on the NeRSemble benchmark (bottom two rows). FFAvatar-1 view achieves more faithful and geometrically consistent results than the baselines. GAGAvatar [2] often produces over-smoothed textures and pose misalignment, while LAM [9] shows geometry artifacts under challenging views. Additional inpu…

**Figure 5.** Qualitative ablation study. FFAvatar with personalization achieves the most realistic and faithful reconstructions. Personalization enhances identity. Without scalable pretraining, the model trained only on Ava256 fails to generalize to NeRSemble, degrading geometry and identity consistency. Removing high-quality fine-tuning or the GAN loss reduces visual detail.

**Figure 6.** Personalization dynamics. Feed-forward initialization improves quality and converges within 500 steps, while random initialization (from scratch) remains blurry and poorly preserves identity. Personalization Analysis. We use 500 personalization steps because most examples converge by this point …

Original abstract

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces FFAvatar, a feed-forward framework for reconstructing high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images. It employs a Multi-View Query-Former to create a unified canonical Gaussian representation and predicts FLAME parameters directly from pixels in an end-to-end manner. A three-stage training curriculum is proposed: pretraining on over 1M monocular video identities, multi-view fine-tuning on 360-degree captures, and optional personalization. Experiments on the NeRSemble benchmark show a 5.5 dB PSNR improvement over LAM, with reconstruction in 2 seconds without personalization and animation at 49 FPS.

Significance. If validated, this work would represent a significant advance in scalable avatar reconstruction by eliminating per-subject optimization and offline preprocessing, enabling real-time applications. The large-scale pretraining on monocular data and the end-to-end FLAME prediction are particularly noteworthy for achieving generalization. The reported performance gains and efficiency metrics suggest practical impact in graphics and computer vision applications.

major comments (1)

Abstract and Methods (three-stage training description): The central claim that the three-stage curriculum produces priors enabling generalization to arbitrary few-shot unposed inputs without offline FLAME extraction is load-bearing for the 'no offline extraction' advantage and the 5.5 PSNR gain. However, since monocular pretraining offers only weak geometric supervision and the fine-tuning set is small, the manuscript should include quantitative evaluation of the pixel-to-FLAME head's accuracy (e.g., mean pose and expression errors compared to offline FLAME on held-out data) to confirm that misalignment does not occur in the fused Gaussians.
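The accuracy check requested here could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code. It assumes head pose is stored as axis-angle vectors, measures pose error as the geodesic angle in degrees, and measures expression error as the mean L2 distance between coefficient vectors. The function names, the 4-dimensional expression vectors, and the toy values are all hypothetical.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def geodesic_deg(aa_pred, aa_ref):
    """Angle in degrees of the relative rotation between two poses."""
    rel = rodrigues(aa_pred).T @ rodrigues(aa_ref)
    cos = np.clip((np.trace(rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)))

def flame_errors(pose_pred, pose_ref, expr_pred, expr_ref):
    """Mean pose error (degrees) and mean expression error (L2) over frames."""
    pose_err = np.mean([geodesic_deg(p, r) for p, r in zip(pose_pred, pose_ref)])
    expr_err = np.mean(np.linalg.norm(expr_pred - expr_ref, axis=1))
    return float(pose_err), float(expr_err)

# Toy data: reference poses from a hypothetical offline tracker, predictions
# off by 2 and 4 degrees about the same axis.
pose_ref = np.array([[0.0, 0.0, 0.0], [0.0, 0.3, 0.0]])
pose_pred = pose_ref + np.array([[0.0, np.radians(2.0), 0.0],
                                 [0.0, np.radians(4.0), 0.0]])
expr_ref = np.zeros((2, 4))
expr_pred = np.full((2, 4), 0.5)

pose_err, expr_err = flame_errors(pose_pred, pose_ref, expr_pred, expr_ref)
print(f"pose error: {pose_err:.2f} deg, expression error: {expr_err:.2f}")
```

Using the geodesic angle rather than a raw difference of axis-angle vectors keeps the pose metric meaningful when the predicted and reference rotation axes differ.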

minor comments (3)

Abstract: The runtime claims (2 seconds without personalization, 10 seconds with) would benefit from specifying the hardware configuration and whether these include all preprocessing steps.
Experiments: It would be helpful to report error bars or standard deviations for the PSNR and other metrics across multiple runs or subjects to strengthen the statistical significance of the 5.5 dB improvement.
Related Work: Ensure comprehensive citation of recent works on feed-forward 3D reconstruction and Gaussian splatting for avatars.
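The error bars suggested in the Experiments comment could be reported as in the sketch below. It assumes one PSNR score per test subject for each method and uses a paired bootstrap over subjects for the difference. The scores are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
import numpy as np

def summarize(scores):
    """Mean, sample standard deviation, and standard error of the mean."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    std = scores.std(ddof=1)
    sem = std / np.sqrt(len(scores))
    return float(mean), float(std), float(sem)

def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Confidence interval for mean(a - b), resampling subjects with replacement."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Placeholder per-subject PSNR values (dB) for two hypothetical methods.
method_a = [30.0, 31.0, 29.0, 32.0]
method_b = [28.0, 30.0, 28.0, 29.0]

mean_a, std_a, sem_a = summarize(method_a)
lo, hi = paired_bootstrap_ci(method_a, method_b)
print(f"method A: {mean_a:.2f} +/- {std_a:.2f} dB (SEM {sem_a:.2f})")
print(f"95% CI for the mean difference: [{lo:.2f}, {hi:.2f}] dB")
```

Pairing by subject matters because both methods are evaluated on the same identities; resampling the per-subject differences avoids overstating the uncertainty of the gap.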

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of FFAvatar in enabling scalable, feed-forward avatar reconstruction. We address the major comment point by point below.

Referee: Abstract and Methods (three-stage training description): The central claim that the three-stage curriculum produces priors enabling generalization to arbitrary few-shot unposed inputs without offline FLAME extraction is load-bearing for the 'no offline extraction' advantage and the 5.5 PSNR gain. However, since monocular pretraining offers only weak geometric supervision and the fine-tuning set is small, the manuscript should include quantitative evaluation of the pixel-to-FLAME head's accuracy (e.g., mean pose and expression errors compared to offline FLAME on held-out data) to confirm that misalignment does not occur in the fused Gaussians.

Authors: We appreciate the referee's emphasis on validating the pixel-to-FLAME prediction accuracy, as this directly supports the end-to-end advantage. Our three-stage curriculum uses large-scale monocular pretraining to learn robust appearance and coarse geometric priors, with the subsequent multi-view fine-tuning on 360-degree captures providing stronger supervision for precise FLAME regression. The observed 5.5 PSNR gain and improved animation quality on NeRSemble already indicate that any potential misalignment is not detrimental to the fused canonical Gaussians. To further address the concern, we will add quantitative results in the revised manuscript: mean pose and expression errors of the pixel-to-FLAME head versus offline FLAME on held-out identities from both pretraining and fine-tuning distributions. This evaluation will be reported in the Methods and Experiments sections to explicitly confirm alignment.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in the feed-forward avatar reconstruction pipeline

The paper describes a standard supervised neural architecture (Multi-View Query-Former plus pixel-to-FLAME regressor) trained via a three-stage curriculum on external monocular video data and a separate multi-view fine-tuning set, then evaluated on the independent NeRSemble benchmark. No equation or claim reduces a reported prediction or generalization result to a fitted parameter or self-citation by construction; the performance numbers and “no offline extraction” advantage are presented as empirical outcomes of the learned model rather than tautological re-statements of the training inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted or audited from the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Multi-View Query-Former ... fuses information from multiple source images into a unified canonical Gaussian representation ... FLAME parameters predicted end-to-end directly from pixels

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

[1] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164, 2023
[2] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=gVM2AZ5xA6
[3] Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, and Tatsuya Harada. GPAvatar: Generalizable and precise head avatar from image(s). In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hgehGq2bDv
[4] Radek Danecek, Michael J. Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 20311–20322, 2022
[5] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019
[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. OpenReview.net, 2021
[7] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. In ACM Transactions on Graphics, (Proc. SIGGRAPH), volume 40, 2021
[8] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8649–8658, June 2021
[9] Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–13, 2025
[10] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[11] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3d. In The Twelfth International Conference on Learning Representations, 2024
[12] Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. In Computer Vision – ECCV 2022, pages 345–362, 2022
[13] Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. In Computer Vision – ECCV 2024, pages 206–228, 2025
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015
[15] Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Niessner, and Shunsuke Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025
[16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023
[17] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017
[18] Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Simon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Simon Venshtain, Christopher Heilman,...
[19] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...
[20] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International C...
[21] Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, and Kfir Aberman. Omni-id: Holistic identity representation designed for generative tasks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8786–8795, 2025
[22] Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, and Kfir Aberman. Composeme: Attribute-specific image prompts for controllable human image generation. In SIGGRAPH Asia 2025 Conference Papers, 2025
[23] Shenhan Qian. Vhap: Versatile head alignment with adaptive appearance priors, sep 2024. URL https://github.com/ShenhanQian/VHAP
[24] Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024
[25] Axel Sauer, Kashyap Chitta, Jens Muller, and Andreas Geiger. Projected gans converge faster. In Advances in Neural Information Processing Systems (NeurIPS), 2021
[26] Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[27] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Niessner. Face2face: Real-time face capture and reenactment of rgb videos. Commun. ACM, 62(1):96–104, December 2018. ISSN 0001-0782. doi: 10.1145/3292039. URL http://doi.acm.org/10.1145/3292039
[28] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024
[29] Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[30] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018
[31] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient gan training. In Conference on Neural Information Processing Systems (NeurIPS), 2020
[32] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C. Bühler, Xu Chen, Michael J. Black, and Otmar Hilliges. I M Avatar: Implicit morphable head avatars from videos. In Computer Vision and Pattern Recognition (CVPR), 2022
[33] Wojciech Zielonka, Timo Bolkart, and Justus Thies. Instant volumetric head avatars. In Conference on Computer Vision and Pattern Recognition, 2023

A Experiment Setup Details

Baselines. We compare FFAvatar with state-of-the-art feed-forward head avatar generation methods including GAGAvatar [2] and LAM [9]. Avat3r [15] is not compared because its checkpoi...

[1] [1]

A morphable model for the synthesis of 3d faces.Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164, 2023

V olker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces.Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164, 2023

work page 2023

[2] [2]

Generalizable and animatable gaussian head avatar

Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. InThe Thirty- eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview. net/forum?id=gVM2AZ5xA6

work page 2024

[3] [3]

GPA- vatar: Generalizable and precise head avatar from image(s)

Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, and Tatsuya Harada. GPA- vatar: Generalizable and precise head avatar from image(s). InInternational Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=hgehGq2bDv

work page 2024

[4] [4]

Black, and Timo Bolkart

Radek Danecek, Michael J. Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. InConference on Computer Vision and Pattern Recognition (CVPR), pages 20311–20322, 2022

work page 2022

[5] [5]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

work page 2019

[6] [6]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR. OpenReview.net, 2021

work page 2021

[7] [7]

Black, and Timo Bolkart

Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. InACM Transactions on Graphics, (Proc. SIGGRAPH), volume 40, 2021

work page 2021

[8] [8]

Dynamic neural radiance fields for monocular 4d facial avatar reconstruction

Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8649–8658, June 2021

work page 2021

[9] [9]

Lam: Large avatar model for one-shot animatable gaussian head

Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. InProceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–13, 2025

work page 2025

[10] [10]

Headnerf: A real-time nerf-based parametric head model

Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022

[11] [11]

LRM: Large reconstruction model for single image to 3d

Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3d. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[12] [12]

Realistic one-shot mesh-based head avatars

Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. InComputer Vision – ECCV 2022, pages 345–362, 2022

work page 2022

[13] [13]

Sapiens: Foundation for human vision models

Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. InComputer Vision – ECCV 2024, pages 206–228, 2025

work page 2024

[14] [14]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InICLR (Poster), 2015

work page 2015

[15] [15]

Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars

Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Niessner, and Shunsuke Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025

work page 2025

[16] [16]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023. 10

work page 2023

[17] [17]

Learning a model of facial shape and expression from 4d scans.ACM Trans

Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans.ACM Trans. Graph., 36(6):194–1, 2017

work page 2017

[18] [18]

Jewett, Simon Venshtain, Christopher Heilman, Yueh-Tung Chen, Sidi Fu, Mohamed Ezzeldin A

Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Simon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Simon Venshtain, Christopher Heilman,...

work page 2024

[19] [19]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

work page 2024

[20] [20]

Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors

Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. In B. Kim, Y . Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y . Sun, editors,International C...

work page 2024

[21] Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, and Kfir Aberman. Omni-id: Holistic identity representation designed for generative tasks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8786–8795, 2025.

[22] Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, and Kfir Aberman. Composeme: Attribute-specific image prompts for controllable human image generation. In SIGGRAPH Asia 2025 Conference Papers, 2025.

[23] Shenhan Qian. Vhap: Versatile head alignment with adaptive appearance priors, September 2024. URL https://github.com/ShenhanQian/VHAP

[24] Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024.

[25] Axel Sauer, Kashyap Chitta, Jens Muller, and Andreas Geiger. Projected gans converge faster. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[26] Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[27] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Niessner. Face2face: Real-time face capture and reenactment of rgb videos. Commun. ACM, 62(1):96–104, December 2018. ISSN 0001-0782. doi: 10.1145/3292039. URL http://doi.acm.org/10.1145/3292039

[28] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

[29] Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[30] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

[31] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient gan training. In Conference on Neural Information Processing Systems (NeurIPS), 2020.

[32] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C. Bühler, Xu Chen, Michael J. Black, and Otmar Hilliges. I M Avatar: Implicit morphable head avatars from videos. In Computer Vision and Pattern Recognition (CVPR), 2022.

[33] Wojciech Zielonka, Timo Bolkart, and Justus Thies. Instant volumetric head avatars. In Conference on Computer Vision and Pattern Recognition, 2023.

A Experiment Setup Details

Baselines. We compare FFAvatar with state-of-the-art feed-forward head avatar generation methods including GAGAvatar [2] and LAM [9]. Avat3r [15] is not compared because its checkpoi...