arxiv: 2605.15320 · v1 · pith:RQSG5Y2Bnew · submitted 2026-05-14 · 💻 cs.GR · cs.CV · cs.LG

FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

Pith reviewed 2026-05-19 15:35 UTC · model grok-4.3

classification 💻 cs.GR · cs.CV · cs.LG
keywords avatar reconstruction · 3D Gaussian · few-shot learning · feed-forward · FLAME parameters · head avatar · generalizable priors · NeRSemble benchmark

The pith

A feed-forward model reconstructs animatable 3D Gaussian head avatars from few unposed photos in seconds without per-subject optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Avatar reconstruction has long depended on slow per-subject fitting or heavy preprocessing that scales poorly. FFAvatar instead trains a single network to map a handful of unposed portrait images directly into a high-quality, animatable 3D Gaussian avatar. It fuses the input views through a Multi-View Query-Former into one canonical representation and predicts the animation parameters end-to-end from pixels. A three-stage curriculum first pretrains on more than a million monocular video identities, then refines geometry on multi-view captures, and finally allows quick personalization. The result is faster reconstruction, stronger identity preservation, and higher animation fidelity than prior specialized methods.

Core claim

FFAvatar fuses multiple source images into a unified canonical Gaussian representation through a Multi-View Query-Former and animates that representation via FLAME parameters predicted directly from the input pixels, all learned through a three-stage curriculum of large-scale monocular pretraining, multi-view fine-tuning, and optional fast personalization.

What carries the argument

Multi-View Query-Former that fuses information from multiple source images into a unified canonical Gaussian representation while predicting FLAME animation parameters end-to-end from pixels.
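
As a rough illustration of this fusion step: the Figure 3 caption describes canonical FLAME vertices acting as queries and source-image features acting as keys and values. The sketch below shows single-head scaled dot-product cross-attention in plain Python. The function names, the toy dimensions, and the absence of learned projections are assumptions made for illustration; this is not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per canonical vertex (hypothetical vertex embeddings).
    keys/values: one vector per source-image token, pooled over all views.
    Returns one fused feature vector per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return fused

# Toy example: 2 vertex queries, 3 tokens gathered from any number of views.
vertex_queries = [[1.0, 0.0], [0.0, 1.0]]
view_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fused_features = cross_attention(vertex_queries, view_keys, view_values)
```

In this formulation the number of output vectors is fixed by the number of queries, while the number of key/value tokens can grow with the number of input photos. That property is one plausible reason a query-based design suits a variable number of views, though the paper's actual architecture may differ in its details.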

If this is right

  • Reconstructs avatars in roughly two seconds without personalization and ten seconds with it.
  • Delivers a 5.5 dB PSNR gain over the prior leader LAM on the NeRSemble benchmark.
  • Supports real-time animation at 49 frames per second on a single NVIDIA A100 GPU.
  • Removes the need for hours-long per-subject optimization or expensive multi-view capture rigs at inference time.
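
To put the headline numbers in perspective, the short sketch below converts a PSNR gain into the equivalent reduction in mean squared error and a frame rate into a per-frame time budget. It assumes both methods are scored with the same peak signal value and treats PSNR as computed from a single MSE; benchmark PSNR is typically averaged per image, so the MSE ratio is only approximate. The helper names are hypothetical.

```python
import math

def psnr_db(mse, peak=1.0):
    """PSNR in decibels for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)

def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

ratio = mse_ratio_for_gain(5.5)   # about 3.55x lower MSE
budget = frame_budget_ms(49)      # about 20.4 ms per frame
```

Under these assumptions, a 5.5 dB gain corresponds to roughly a 3.5-fold reduction in squared pixel error, and 49 frames per second leaves about 20 ms to animate and render each frame.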

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feed-forward pattern could be retrained on full-body or non-human subjects by swapping the underlying parametric model and expanding the pretraining corpus.
  • Because reconstruction is fast and pose-free, the approach may enable on-device avatar creation in consumer apps without cloud offloading.
  • If the learned priors remain stable across new camera hardware, the method could lower the barrier for casual users to generate personal avatars from ordinary smartphone selfies.

Load-bearing premise

The three-stage training on monocular video data from over one million identities plus later multi-view fine-tuning produces priors that generalize to arbitrary few-shot unposed inputs without separate offline pose or FLAME extraction.
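
The curriculum this premise rests on can be written down as a small configuration, which makes it easier to see what the abstract does and does not specify. The sketch below is a hypothetical encoding: the `Stage` class, its field names, and the description of the personalization data are assumptions, and only the 500-step limit for personalization is stated in the abstract.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    goal: str
    max_steps: Optional[int] = None  # None: not stated in the abstract
    optional: bool = False

CURRICULUM = (
    Stage("pretraining",
          "monocular video, over 1M identities",
          "strong generalizable priors"),
    Stage("multi-view fine-tuning",
          "small, high-quality set of 360-degree captures",
          "geometric fidelity and extreme-view awareness"),
    Stage("personalization",
          "images of the target identity (assumed)",
          "maximum fidelity for one subject",
          max_steps=500,
          optional=True),
)

def stages_to_run(personalize: bool):
    """Return the stages in order, skipping optional ones unless requested."""
    return [s for s in CURRICULUM if personalize or not s.optional]
```

Calling `stages_to_run(False)` yields only the two training stages, which matches the feed-forward path; `stages_to_run(True)` appends the optional per-subject adaptation.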

What would settle it

A held-out test collection of unposed portraits recorded under lighting, poses, and identities absent from the million-identity pretraining set, where the model yields visibly broken geometry or low animation quality when compared against ground-truth multi-view reconstructions.

Figures

Figures reproduced from arXiv: 2605.15320 by Gordon Guocheng Qian, Hao Li, Jiahao Luo, Jian Wang, Thuan Hoang Nguyen, Yinyu Nie.

Figure 1. FFAvatar full pipeline reconstructs animatable avatars in 10 seconds on a single A100, …
Figure 2. Three-stage training of FFAvatar. Scalable pretraining fosters generalization across unseen identities by training on our private large-scale multi-frame-per-identity dataset MFHQ-1M, multi-view fine-tuning enhances geometric fidelity by optimizing the pretrained weights on a small-scale set of 360° multi-view captures (e.g. Ava256 [18]), and lightweight personalization efficiently improves identity preser…
Figure 3. FFAvatar pipeline. FFAvatar reconstructs a canonical Gaussian head avatar from few-shot views using a Multi-view Query-Former, with canonical FLAME vertices as queries and source features as keys/values. An end-to-end FLAME Estimator predicts expression ψ, local articulation θ, and head pose π from driving frames, avoiding offline FLAME preprocessing. A few-to-many objective further improves generalization…
Figure 4. FFAvatar qualitative comparison for self-reenactment on the Ava256 test set (top two rows) and cross-reenactment on the NeRSemble benchmark (bottom two rows). FFAvatar-1 view achieves more faithful and geometrically consistent results than the baselines. GAGAvatar [2] often produces over-smoothed textures and pose misalignment, while LAM [9] shows geometry artifacts under challenging views. Additional inpu…
Figure 5. Qualitative ablation study. FFAvatar with personalization achieves the most realistic and faithful reconstructions. Personalization enhances identity. Without scalable pretraining, the model trained only on Ava256 fails to generalize to NeRSemble, degrading geometry and identity consistency. Removing high-quality fine-tuning or the GAN loss reduces visual detail.
Figure 6. Personalization dynamics. Feed-forward initialization improves quality and converges within 500 steps, while random initialization (from scratch), remains blurry and poorly preserves identity. Personalization Analysis. We use 500 personalization steps because most examples converge by this point (…
Original abstract

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces FFAvatar, a feed-forward framework for reconstructing high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images. It employs a Multi-View Query-Former to create a unified canonical Gaussian representation and predicts FLAME parameters directly from pixels in an end-to-end manner. A three-stage training curriculum is proposed: pretraining on over 1M monocular video identities, multi-view fine-tuning on 360-degree captures, and optional personalization. Experiments on the NeRSemble benchmark show a 5.5 dB PSNR improvement over LAM, with reconstruction in 2 seconds and animation at 49 FPS.

Significance. If validated, this work would represent a significant advance in scalable avatar reconstruction by eliminating per-subject optimization and offline preprocessing, enabling real-time applications. The large-scale pretraining on monocular data and the end-to-end FLAME prediction are particularly noteworthy for achieving generalization. The reported performance gains and efficiency metrics suggest practical impact in graphics and computer vision applications.

major comments (1)
  1. Abstract and Methods (three-stage training description): The central claim that the three-stage curriculum produces priors enabling generalization to arbitrary few-shot unposed inputs without offline FLAME extraction is load-bearing for the 'no offline extraction' advantage and the 5.5 dB PSNR gain. However, since monocular pretraining offers only weak geometric supervision and the fine-tuning set is small, the manuscript should include quantitative evaluation of the pixel-to-FLAME head's accuracy (e.g., mean pose and expression errors compared to offline FLAME on held-out data) to confirm that misalignment does not occur in the fused Gaussians.
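
The evaluation requested in this comment could take a form like the sketch below, which computes a mean head-pose error (geodesic angle between rotation matrices) and a mean expression error (Euclidean distance between coefficient vectors) against offline FLAME fits. The choice of these two metrics, the data layout, and all function names are assumptions for illustration; neither the paper nor the report prescribes them, and the inputs here are synthetic.

```python
import math

def rotation_angle_deg(r_pred, r_ref):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T R_ref) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_ref[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def l2_distance(a, b):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_flame_errors(pred, ref):
    """pred/ref: lists of (rotation_matrix, expression_vector), one pair per frame."""
    n = len(pred)
    pose = sum(rotation_angle_deg(p[0], r[0]) for p, r in zip(pred, ref)) / n
    expr = sum(l2_distance(p[1], r[1]) for p, r in zip(pred, ref)) / n
    return pose, expr

def rot_z(deg):
    """Rotation about the z-axis, used only to build synthetic test poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Synthetic check only: a 10-degree pose offset and a small expression offset.
predicted = [(rot_z(10.0), [0.1, 0.0])]
reference = [(rot_z(0.0), [0.0, 0.0])]
pose_err, expr_err = mean_flame_errors(predicted, reference)
```

Reporting both numbers on held-out identities would show whether the end-to-end estimator tracks the offline fits closely enough for the canonical Gaussians to stay aligned.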
minor comments (3)
  1. Abstract: The runtime claims (2 seconds without personalization, 10 seconds with) would benefit from specifying the hardware configuration and whether these include all preprocessing steps.
  2. Experiments: It would be helpful to report error bars or standard deviations for the PSNR and other metrics across multiple runs or subjects to strengthen the statistical significance of the 5.5 dB improvement.
  3. Related Work: Ensure comprehensive citation of recent works on feed-forward 3D reconstruction and Gaussian splatting for avatars.
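
For minor comment 2, one simple way to attach uncertainty to a PSNR improvement is to summarize the per-subject gains of one method over another. The sketch below reports their mean, sample standard deviation, and standard error. The function name is hypothetical and the input numbers are placeholders chosen for illustration, not results from the paper.

```python
import math
import statistics

def paired_gain_summary(psnr_a, psnr_b):
    """Mean, sample std, and standard error of per-subject PSNR gains (a minus b)."""
    gains = [a - b for a, b in zip(psnr_a, psnr_b)]
    mean = statistics.mean(gains)
    std = statistics.stdev(gains)
    sem = std / math.sqrt(len(gains))
    return mean, std, sem

# Placeholder numbers for illustration only; they are NOT results from the paper.
method_psnr = [25.0, 26.0, 27.0, 24.0]
baseline_psnr = [20.0, 20.0, 22.0, 20.0]
mean_gain, std_gain, sem_gain = paired_gain_summary(method_psnr, baseline_psnr)
```

Pairing the scores per subject removes subject-to-subject difficulty from the comparison, so the spread reflects how consistent the gain is rather than how varied the test set is.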

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of FFAvatar in enabling scalable, feed-forward avatar reconstruction. We address the major comment point by point below.

Point-by-point responses
  1. Referee: Abstract and Methods (three-stage training description): The central claim that the three-stage curriculum produces priors enabling generalization to arbitrary few-shot unposed inputs without offline FLAME extraction is load-bearing for the 'no offline extraction' advantage and the 5.5 dB PSNR gain. However, since monocular pretraining offers only weak geometric supervision and the fine-tuning set is small, the manuscript should include quantitative evaluation of the pixel-to-FLAME head's accuracy (e.g., mean pose and expression errors compared to offline FLAME on held-out data) to confirm that misalignment does not occur in the fused Gaussians.

    Authors: We appreciate the referee's emphasis on validating the pixel-to-FLAME prediction accuracy, as this directly supports the end-to-end advantage. Our three-stage curriculum uses large-scale monocular pretraining to learn robust appearance and coarse geometric priors, with the subsequent multi-view fine-tuning on 360-degree captures providing stronger supervision for precise FLAME regression. The observed 5.5 dB PSNR gain and improved animation quality on NeRSemble already indicate that any potential misalignment is not detrimental to the fused canonical Gaussians. To further address the concern, we will add quantitative results in the revised manuscript: mean pose and expression errors of the pixel-to-FLAME head versus offline FLAME on held-out identities from both pretraining and fine-tuning distributions. This evaluation will be reported in the Methods and Experiments sections to explicitly confirm alignment.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the feed-forward avatar reconstruction pipeline

Full rationale

The paper describes a standard supervised neural architecture (Multi-View Query-Former plus pixel-to-FLAME regressor) trained via a three-stage curriculum on external monocular video data and a separate multi-view fine-tuning set, then evaluated on the independent NeRSemble benchmark. No equation or claim reduces a reported prediction or generalization result to a fitted parameter or self-citation by construction; the performance numbers and “no offline extraction” advantage are presented as empirical outcomes of the learned model rather than tautological re-statements of the training inputs. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted or audited from the provided text.

pith-pipeline@v0.9.0 · 5812 in / 1212 out tokens · 56230 ms · 2026-05-19T15:35:09.932931+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages

  [1] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 157–164, 2023
  [2] Xuangeng Chu and Tatsuya Harada. Generalizable and animatable gaussian head avatar. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=gVM2AZ5xA6
  [3] Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, and Tatsuya Harada. GPAvatar: Generalizable and precise head avatar from image(s). In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hgehGq2bDv
  [4] Radek Danecek, Michael J. Black, and Timo Bolkart. EMOCA: Emotion driven monocular face capture and animation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 20311–20322, 2022
  [5] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019
  [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. OpenReview.net, 2021
  [7] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. In ACM Transactions on Graphics, (Proc. SIGGRAPH), volume 40, 2021
  [8] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8649–8658, June 2021
  [9] Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, and Liefeng Bo. Lam: Large avatar model for one-shot animatable gaussian head. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–13, 2025
  [10] Yang Hong, Bo Peng, Haiyao Xiao, Ligang Liu, and Juyong Zhang. Headnerf: A real-time nerf-based parametric head model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
  [11] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3d. In The Twelfth International Conference on Learning Representations, 2024
  [12] Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. Realistic one-shot mesh-based head avatars. In Computer Vision – ECCV 2022, pages 345–362, 2022
  [13] Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. In Computer Vision – ECCV 2024, pages 206–228, 2025
  [14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015
  [15] Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Niessner, and Shunsuke Saito. Avat3r: Large animatable gaussian reconstruction model for high-fidelity 3d head avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025
  [16] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023
  [17] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194–1, 2017
  [18] Julieta Martinez, Emily Kim, Javier Romero, Timur Bagautdinov, Shunsuke Saito, Shoou-I Yu, Stuart Anderson, Michael Zollhöfer, Te-Li Wang, Shaojie Bai, Chenghui Li, Shih-En Wei, Rohan Joshi, Wyatt Borsos, Tomas Simon, Jason Saragih, Paul Theodosis, Alexander Greene, Anjani Josyula, Silvio Mano Maeta, Andrew I. Jewett, Simon Venshtain, Christopher Heilman,...
  [19] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...
  [20] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, and Bernard Ghanem. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, editors, International C...
  [21] Guocheng Qian, Kuan-Chieh Wang, Or Patashnik, Negin Heravi, Daniil Ostashev, Sergey Tulyakov, Daniel Cohen-Or, and Kfir Aberman. Omni-id: Holistic identity representation designed for generative tasks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8786–8795, 2025
  [22] Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, and Kfir Aberman. Composeme: Attribute-specific image prompts for controllable human image generation. In SIGGRAPH Asia 2025 Conference Papers, 2025
  [23] Shenhan Qian. Vhap: Versatile head alignment with adaptive appearance priors, sep 2024. URL https://github.com/ShenhanQian/VHAP
  [24] Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20299–20309, 2024
  [25] Axel Sauer, Kashyap Chitta, Jens Muller, and Andreas Geiger. Projected gans converge faster. In Advances in Neural Information Processing Systems (NeurIPS), 2021
  [26] Zhijing Shao, Zhaolong Wang, Zhuang Li, Duotun Wang, Xiangru Lin, Yu Zhang, Mingming Fan, and Zeyu Wang. SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  [27] Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Niessner. Face2face: Real-time face capture and reenactment of rgb videos. Commun. ACM, 62(1):96–104, December 2018. ISSN 0001-0782. doi: 10.1145/3292039. URL http://doi.acm.org/10.1145/3292039
  [28] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024
  [29] Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  [30] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018
  [31] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient gan training. In Conference on Neural Information Processing Systems (NeurIPS), 2020
  [32] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C. Bühler, Xu Chen, Michael J. Black, and Otmar Hilliges. I M Avatar: Implicit morphable head avatars from videos. In Computer Vision and Pattern Recognition (CVPR), 2022
  [33] Wojciech Zielonka, Timo Bolkart, and Justus Thies. Instant volumetric head avatars. In Conference on Computer Vision and Pattern Recognition, 2023

A Experiment Setup Details

Baselines. We compare FFAvatar with state-of-the-art feed-forward head avatar generation methods including GAGAvatar [2] and LAM [9]. Avat3r [15] is not compared because its checkpoi...