
arxiv: 2605.25751 · v1 · pith:HY3M2QPAnew · submitted 2026-05-25 · 💻 cs.CV

SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

Pith reviewed 2026-06-29 22:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splatting · one-shot head avatar · autoregressive splitting · graph neural network · animatable avatar · density control

The pith

An autoregressive network splits Gaussians progressively to add fine expression detail to one-shot head avatars.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a graph neural network can guide the repeated splitting of 3D Gaussians from coarse to fine, overcoming the detail shortfall that arises when image-based and 3DMM-based avatar methods produce mismatched Gaussian counts. By extending mesh topology after each split and adding a gated density controller, the method claims to keep the underlying graph consistent while allowing region-specific refinement. The autoregressive process is presented as the mechanism that lets the model synthesize precise facial features during rendering from a single input image. If the claim holds, single-view avatar reconstruction would gain the ability to represent subtle expression changes without requiring additional views or manual detail injection.

Core claim

The autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.
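
A rough count model helps make "progressively splitting" concrete. It is only a sketch under stated assumptions: every Gaussian is assumed to split into the same number k of children at each autoregressive layer, and the symbols N_0 (initial expression Gaussian count), N_ℓ (count after layer ℓ), and m_i^(ℓ) (soft mask of Gaussian i at layer ℓ) are illustrative names, not notation taken from the paper.

```latex
N_\ell = k^{\ell} N_0,
\qquad
N_\ell^{\mathrm{eff}} = \sum_{i=1}^{N_\ell} m_i^{(\ell)} \le N_\ell,
\qquad
m_i^{(\ell)} \in [0, 1]
```

Under this model, k = 1 gives N_ℓ = N_0 for every layer, so no detail is added by splitting, which is consistent with the single-layer ablation described in the Figure 5 caption. The effective count N_ℓ^eff is where a gating mechanism would bound growth in regions that do not need extra Gaussians.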

What carries the argument

Graph splitting network that performs GNN-guided autoregressive splitting of Gaussians, paired with mesh topology extension to maintain connectivity after each split.
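
The source does not describe the topology extension algorithm, so the following is a minimal sketch of one way connectivity could be kept consistent after a split, not the authors' method. The function name `extend_topology`, the child indexing `p * k + j`, and the two edge families (edges inherited from connected parents, plus edges between siblings) are all assumptions made for illustration.

```python
from itertools import combinations


def extend_topology(edges, num_nodes, k):
    """Return (new_edges, new_num_nodes) after every node splits into k children.

    Hypothetical scheme: child j of parent p gets index p * k + j.
    - inherited edges: if parents u and v were connected, child j of u is
      connected to child j of v
    - sibling edges: children of the same parent are fully connected
      (an assumption, not stated in the source)
    """
    new_edges = set()
    for u, v in edges:
        for j in range(k):
            a, b = u * k + j, v * k + j
            new_edges.add((min(a, b), max(a, b)))
    for p in range(num_nodes):
        children = range(p * k, (p + 1) * k)
        for a, b in combinations(children, 2):
            new_edges.add((a, b))
    return sorted(new_edges), num_nodes * k


# A triangle (3 nodes, 3 edges) where every node splits into k = 2 children.
edges, n = extend_topology([(0, 1), (1, 2), (0, 2)], num_nodes=3, k=2)
```

In this scheme the edge count grows as k·|E| + N·k(k−1)/2, and every new node index stays below N·k, so a GNN layer that consumes an edge list never sees a dangling index. Whether the paper's extension has the same property is exactly what the load-bearing premise asks.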

If this is right

  • Progressive splitting captures finer facial details that fixed-count methods miss.
  • Gated density control limits over-densification while preserving real-time rendering speed.
  • Delayed filtering avoids repeated topology recomputation and stabilizes training.
  • The resulting avatars remain animatable from a single input image.
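
The gating and delayed-filtering bullets can be illustrated with a small sketch. It assumes a sigmoid gate that scales opacity during training and a single hard threshold applied only after all splits are done; the function names, the 0.5 threshold, and the use of opacity as the masked attribute are hypothetical choices, not details confirmed by the source.

```python
import math


def soft_mask(gate_logits):
    """Map per-Gaussian gate logits to soft masks in (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]


def apply_soft_mask(opacities, masks):
    """Training-time path: scale opacity but keep every Gaussian,
    so node indices (and therefore the graph topology) stay fixed."""
    return [o * m for o, m in zip(opacities, masks)]


def delayed_filter(masks, threshold=0.5):
    """Deferred path: return indices of Gaussians to keep, applied once
    after splitting rather than after every layer."""
    return [i for i, m in enumerate(masks) if m >= threshold]


logits = [4.0, -4.0, 0.0, 2.0]      # hypothetical gate outputs
opacities = [0.9, 0.9, 0.8, 0.5]    # hypothetical Gaussian opacities
masks = soft_mask(logits)
soft_opacities = apply_soft_mask(opacities, masks)
kept = delayed_filter(masks)
```

Because the soft path never removes a Gaussian, the edge list built for the current layer remains valid throughout training; only the final `delayed_filter` call changes the node set.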

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coarse-to-fine splitting logic could be tested on full-body or dynamic scene reconstruction to check whether the GNN guidance transfers.
  • Running the method on inputs with large expression changes would test whether the autoregressive steps remain stable outside the training distribution.
  • If splitting improves detail without extra parameters, similar refinement stages might be added to other point-based rendering pipelines.

Load-bearing premise

The mesh topology extension successfully aligns the GNN connectivity with the new Gaussian count after each split without introducing inconsistencies or artifacts.

What would settle it

Train the model and then render a sequence of test expressions; if the split version shows no measurable gain in fine facial detail or visible artifacts appear relative to a non-split baseline, the improvement claim is falsified.
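
A comparison of this kind could be scripted roughly as follows. This is a sketch that assumes PSNR as the metric and frames stored as flat lists of pixel values in [0, 1]; both are simplifications chosen for illustration, and a test aimed at fine facial detail would more plausibly crop the mouth and eye regions or use a perceptual metric.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two equally sized flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_psnr_gain(split_frames, baseline_frames, gt_frames):
    """Average per-frame PSNR difference (split minus baseline) over a sequence."""
    gains = [psnr(s, g) - psnr(b, g)
             for s, b, g in zip(split_frames, baseline_frames, gt_frames)]
    return sum(gains) / len(gains)


# Toy one-frame "sequence" with four pixels; real use would pass rendered frames.
gt = [[0.0, 0.5, 1.0, 0.5]]
split = [[0.0, 0.5, 0.9, 0.5]]
baseline = [[0.0, 0.5, 0.8, 0.5]]
gain = mean_psnr_gain(split, baseline, gt)
```

A gain at or below zero across a held-out expression sequence would count against the improvement claim; a positive gain alone would not confirm it without a visual check for artifacts.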

Figures

Figures reproduced from arXiv: 2605.25751 by Chuhua Xian, Fa-Ting Hong, Haiyang Liu, Hongmin Cai, Hongzhe Liao.

Figure 1
Figure 1: SplitAvatar achieves high-quality one-shot head avatar synthesis by generating identity Gaussians and expression Gaussians separately. Expression details are then progressively refined from coarse to fine using an autoregressive architecture combined with a masked Gaussian split network.
Figure 2
Figure 2: Overview of SplitAvatar. (a) We use a Gaussian generator to predict initial expression Gaussians using the 3DMM estimator and identity features. We utilize a Gaussian feature adapter to predict identity Gaussians from visual features extracted from DINOv2. (b) The autoregressive graph splitting network predicts the next layer of expression Gaussians based on the previous layer (Sec. 3.3). We also apply a sof…
Figure 3
Figure 3: Graph Splitting Network. We embed the Gaussian attribute features from the previous layer into the GSN and predict Gaussian features through a GNN with an attention mechanism. The features are then decoded to obtain the split Gaussian attributes for the next layer. Given the graph information of the current layer, we define the edge topology of the next layer of GSN as two components: Topological Inheritan…
Figure 4
Figure 4: Cross-reenactment qualitative results of different methods on the VFHQ dataset. Our method shows more facial details.
Figure 5
Figure 5: Ablation results on the VFHQ dataset. w/o AR NGSN means replacing the autoregressive GSN with a single layer and setting the splitting factor k to 1. w/o GNN means replacing the GNN with an MLP. FE denotes the feature embedding applied to positions, rotations, scales and opacities. w/o NMask means removing the mask network at training and using all the predicted expression Gaussians. Our full method captures the most accurat…
Figure 6
Figure 6: Gaussian visualization of our method. Results denotes the final image after neural rendering. Final Expression Gaussians presents the multi-view distribution of all masked expression Gaussians G_exp. Total Gaussians visualizes both identity Gaussians and expression Gaussians. The Gaussian distribution is concentrated around the facial features, indicating that our method dynamically controls the density of the ex…
Abstract

3D Gaussian Splatting (3DGS) provides an efficient method for high-quality scene reconstruction using anisotropic Gaussians. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while enabling real-time performance. However, existing methods suffer from a magnitude mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches. This discrepancy results in reconstructed expressions that lack fine-grained detail. In this paper, we introduce a novel method for reconstructing an animatable head avatar from a single image. We propose a Graph splitting network to progressively generate Gaussians from coarse to fine using an autoregressive architecture. To address the graph inconsistency caused by split Gaussians, we employ a mesh topology extension method to align the GNN's connectivity with the increased Gaussian count. Furthermore, we introduce a novel density control method that includes a gating mechanism that generates soft masks for Gaussians, preventing over-densification after the splitting operation. This allows for dynamic control over Gaussian density across different facial regions. For smooth and rapid training, we employ a delayed filtering strategy to avoid re-computing the graph topology during training. Experimental results demonstrate that our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents SplitAvatar, a one-shot method for reconstructing animatable head avatars from a single image via 3D Gaussian Splatting. It introduces a Graph splitting network with an autoregressive architecture that progressively generates Gaussians from coarse to fine, a mesh topology extension to maintain GNN connectivity after splits, a gating mechanism within density control to produce soft masks and avoid over-densification, and a delayed filtering strategy to enable efficient training without repeated topology recomputation. The central claim is that this autoregressive GNN-guided splitting improves expression representation by synthesizing finer facial details and yields higher reconstruction quality than prior 3DGS-based avatar methods.

Significance. If the results hold, the work could advance one-shot avatar reconstruction by providing a mechanism to dynamically increase Gaussian density in expression-critical regions through progressive splitting, addressing the noted mismatch between image-based and 3DMM-based Gaussian counts. The integration of autoregressive refinement with GNN guidance and gated density control represents a potentially useful architectural direction for controllable facial detail.

major comments (2)
  1. Abstract: The claim that 'our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians' and that 'GNN-guided splitting synthesizes more precise facial details' is unsupported by any quantitative results, ablation studies, or implementation details. Without these, it is impossible to verify whether the progressive splitting delivers measurable gains in expression fidelity or reconstruction quality.
  2. Abstract: The manuscript provides no description of the mesh topology extension algorithm, no proof or analysis that it preserves GNN connectivity without inconsistencies or artifacts after each split, and no ablation isolating its effect. This is load-bearing for the central claim, as unreliable GNN messages would undermine the autoregressive refinement process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the presentation of results and algorithmic details.

  1. Referee: Abstract: The claim that 'our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians' and that 'GNN-guided splitting synthesizes more precise facial details' is unsupported by any quantitative results, ablation studies, or implementation details. Without these, it is impossible to verify whether the progressive splitting delivers measurable gains in expression fidelity or reconstruction quality.

    Authors: The abstract summarizes findings whose supporting evidence appears in the experimental section, which reports quantitative comparisons against prior 3DGS avatar methods and ablation studies on the autoregressive splitting component. To make this link explicit and address the concern directly, we will revise the abstract to reference the relevant quantitative metrics and ablation results, and we will expand the implementation details subsection for clarity. revision: yes

  2. Referee: Abstract: The manuscript provides no description of the mesh topology extension algorithm, no proof or analysis that it preserves GNN connectivity without inconsistencies or artifacts after each split, and no ablation isolating its effect. This is load-bearing for the central claim, as unreliable GNN messages would undermine the autoregressive refinement process.

    Authors: We agree that the mesh topology extension requires explicit documentation. In the revised manuscript we will add a dedicated subsection with the full algorithm description (including pseudocode), a connectivity-preservation argument, and an ablation study that isolates its contribution to reconstruction quality. This will directly support the central claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper introduces a novel autoregressive Graph splitting network, mesh topology extension, gating-based density control, and delayed filtering on top of 3DGS. The central claim that progressive splitting improves expression detail is presented as an empirical outcome of these new architectural choices rather than a reduction to fitted parameters, self-citations, or renamed inputs. No equations or steps in the abstract or described method reduce by construction to prior results from the same authors; the topology extension and GNN guidance are described as independent engineering solutions without load-bearing self-referential justification.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard 3DGS assumptions and introduces new network modules whose hyperparameters are not enumerated in the abstract.

free parameters (1)
  • splitting and gating thresholds
    Parameters that control when and how aggressively Gaussians are split and masked; these are typically fitted or hand-tuned in such pipelines.
axioms (1)
  • domain assumption: 3D Gaussian Splatting provides an efficient base for high-quality scene and avatar reconstruction
    The entire approach is built on top of 3DGS as stated in the opening sentences of the abstract.



Reference graph

Works this paper leans on

49 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    Cgs-gan: 3d consistent gaussian splatting gans for high resolution human head synthesis

    Barthel, F., Morgenstern, W., Hinzer, P., Hilsmann, A., Eisert, P.: Cgs-gan: 3d consistent gaussian splatting gans for high resolution human head synthesis. arXiv preprint arXiv:2505.17590 (2025)

  2. [2]

    In: ICCV (2017)

    Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In: ICCV (2017)

  3. [3]

    Muse: Text-to-image generation via masked generative transformers

    Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.H., Murphy, K., Freeman, W.T., Rubinstein, M., et al.: Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704 (2023)

  4. [4]

    In: CVPR (2022)

    Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. In: CVPR (2022)

  5. [5]

    In: ICLR (2020)

    Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I.: Generative pretraining from pixels. In: ICLR (2020)

  6. [6]

    NeurIPS (2024)

    Chu, X., Harada, T.: Generalizable and animatable gaussian head avatar. NeurIPS (2024)

  7. [7]

    In: ICLR (2024)

    Chu, X., Li, Y., Zeng, A., Yang, T., Lin, L., Liu, Y., Harada, T.: Gpavatar: Generalizable and precise head avatar from image(s). In: ICLR (2024)

  8. [8]

    NeurIPS (2015)

    Courbariaux, M., Bengio, Y., David, J.P.: Binaryconnect: Training deep neural networks with binary weights during propagations. NeurIPS (2015)

  9. [9]

    Vision Transformers Need Registers

    Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. arXiv preprint arXiv:2309.16588 (2023)

  10. [10]

    In: CVPR (2019)

    Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR (2019)

  11. [11]

    In: CVPR (2024)

    Deng, Y., Wang, D., Ren, X., Chen, X., Wang, B.: Portrait4d: Learning one-shot 4d head avatar synthesis using synthetic data. In: CVPR (2024)

  12. [12]

    In: ECCV (2024)

    Deng, Y., Wang, D., Wang, B.: Portrait4d-v2: Pseudo multi-view data creates better 4d head synthesizer. In: ECCV (2024)

  13. [13]

    In: CVPR Workshops (2019)

    Deng, Y., Yang, J., Xu, S., Chen, D., Jia, Y., Tong, X.: Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In: CVPR Workshops (2019)

  14. [14]

    In: CVPR (2021)

    Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR (2021)

  15. [15]

    In: SIGGRAPH (2025)

    He, Y., Gu, X., Ye, X., Xu, C., Zhao, Z., Dong, Y., Yuan, W., Dong, Z., Bo, L.: Lam: large avatar model for one-shot animatable gaussian head. In: SIGGRAPH (2025)

  16. [16]

    In: CVPR (2024)

    Hu, L., Zhang, H., Zhang, Y., Zhou, B., Liu, B., Zhang, S., Nie, L.: Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. In: CVPR (2024)

  17. [17]

    NeurIPS (2024)

    Hyun, S., Heo, J.P.: Gsgan: Adversarial learning for hierarchical generation of 3d gaussian splats. NeurIPS (2024)

  18. [18]

    In: ECCV (2016)

    Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV (2016)

  19. [19]

    TOG 42(4), 139–1 (2023)

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. TOG 42(4), 139–1 (2023)

  20. [20]

    In: ECCV (2022)

    Khakhulin, T., Sklyarova, V., Lempitsky, V., Zakharov, E.: Realistic one-shot mesh-based head avatars. In: ECCV (2022)

  21. [21]

    Adam: A Method for Stochastic Optimization

    Kingma, D.P.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  22. [22]

    In: CVPR (2025)

    Kumbong, H., Liu, X., Lin, T.Y., Liu, M.Y., Liu, X., Liu, Z., Fu, D.Y., Re, C., Romero, D.W.: Hmar: Efficient hierarchical masked auto-regressive image generation. In: CVPR (2025)

  23. [23]

    In: ECCV (2024)

    Li, J., Zhang, J., Bai, X., Zheng, J., Ning, X., Zhou, J., Gu, L.: Talkinggaussian: Structure-persistent 3d talking head synthesis via gaussian splatting. In: ECCV (2024)

  24. [24]

    In: CVPR (2023)

    Li, T., Chang, H., Mishra, S., Zhang, H., Katabi, D., Krishnan, D.: Mage: Masked generative encoder to unify representation learning and image synthesis. In: CVPR (2023)

  25. [25]

    TOG 36(6), 194–1 (2017)

    Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4d scans. TOG 36(6), 194–1 (2017)

  26. [26]

    NeurIPS (2023)

    Li, X., De Mello, S., Liu, S., Nagano, K., Iqbal, U., Kautz, J.: Generalizable one-shot 3d neural head avatar. NeurIPS (2023)

  27. [27]

    In: WACV (2024)

    Ma, H., Zhang, T., Sun, S., Yan, X., Han, K., Xie, X.: Cvthead: One-shot controllable head avatar with vertex-feature transformer. In: WACV (2024)

  28. [28]

    In: SIGGRAPH (2024)

    Ma, S., Weng, Y., Shao, T., Zhou, K.: 3d gaussian blendshapes for head avatar animation. In: SIGGRAPH (2024)

  29. [29]

    In: CVPR (2023)

    Ma, Z., Zhu, X., Qi, G.J., Lei, Z., Zhang, L.: Otavatar: One-shot talking face avatar with controllable tri-plane rendering. In: CVPR (2023)

  30. [30]

    Communications of the ACM 65(1), 99–106 (2021)

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)

  31. [31]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  32. [32]

    In: CVPR (2024)

    Qian, Z., Wang, S., Mihajlovic, M., Geiger, A., Tang, S.: 3dgs-avatar: Animatable avatars via deformable 3d gaussian splatting. In: CVPR (2024)

  33. [33]

    NeurIPS (2019)

    Razavi, A., Van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with vq-vae-2. NeurIPS (2019)

  34. [34]

    In: ICCV (2021)

    Ren, Y., Li, G., Chen, Y., Li, T.H., Liu, S.: Pirenderer: Controllable portrait image generation via semantic neural rendering. In: ICCV (2021)

  35. [35]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  36. [36]

    NeurIPS (2024)

    Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. NeurIPS (2024)

  37. [37]

    NeurIPS (2017)

    Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. NeurIPS (2017)

  38. [38]

    In: ICLR (2018)

    Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: ICLR (2018)

  39. [39]

    In: CVPR (2025)

    Wang, C., Kang, D., Sun, H., Qian, S., Wang, Z., Bao, L., Zhang, S.H.: Mega: Hybrid mesh-gaussian head avatar for high-fidelity rendering and head editing. In: CVPR (2025)

  40. [40]

    TIP 13(4), 600–612 (2004)

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP 13(4), 600–612 (2004)

  41. [41]

    In: AAAI (2025)

    Wei, X., Chen, P., Lu, M., Chen, H., Tian, F.: Graphavatar: Compact head avatars with gnn-generated 3d gaussians. In: AAAI (2025)

  42. [42]

    In: CVPR (2022)

    Xie, L., Wang, X., Zhang, H., Dong, C., Shan, Y.: Vfhq: A high-quality dataset and benchmark for video face super-resolution. In: CVPR (2022)

  43. [43]

    In: CVPR (2020)

    Xu, S., Yang, J., Chen, D., Wen, F., Deng, Y., Jia, Y., Tong, X.: Deep 3d portrait from a single image. In: CVPR (2020)

  44. [44]

    In: CVPR (2024)

    Xu, Y., Chen, B., Li, Z., Zhang, H., Wang, L., Zheng, Z., Liu, Y.: Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. In: CVPR (2024)

  45. [45]

    arXiv preprint arXiv:2401.08503 (2024)

    Ye, Z., Zhong, T., Ren, Y., Yang, J., Li, W., Huang, J., Jiang, Z., He, J., Huang, R., Liu, J., et al.: Real3d-portrait: One-shot realistic 3d talking portrait synthesis. arXiv preprint arXiv:2401.08503 (2024)

  46. [46]

    In: ECCV (2022)

    Yin, F., Zhang, Y., Cun, X., Cao, M., Fan, Y., Wang, X., Bai, Q., Wu, B., Wang, J., Yang, Y.: Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. In: ECCV (2022)

  47. [48]

    In: CVPR (2018)

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  48. [49]

    NeurIPS (2024)

    Zhang, S., Fei, X., Liu, F., Song, H., Duan, Y.: Gaussian graph network: Learning efficient and generalizable gaussian representations from multi-view images. NeurIPS (2024)

  49. [50]

    In: CVPR (2021)

    Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: CVPR (2021)

A More Details on Experiments

A.1 Implementation

The Gaussian feature adapter comprises two trainable Vision Transformers [31], both configured with depth = 12 and heads = 8. We extract features fro...