Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars
Pith reviewed 2026-05-13 22:14 UTC · model grok-4.3
The pith
Replacing SMPL with the Momentum Human Rig yields higher PSNR in a minimal Gaussian avatar pipeline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that body model representational capacity has been a primary bottleneck in avatar reconstruction. A minimal pipeline built on the Momentum Human Rig estimated via SAM-3D-Body, without learned deformations or pose-dependent corrections, reaches the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. Two ablations, one translating SAM-3D-Body meshes into SMPL-X and the other translating the datasets' original SMPL poses into MHR, are each retrained under identical conditions. The paper takes them to show that the gains arise from both improved mesh capacity and better pose estimation.
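PSNR is the headline metric in this claim. For reference, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), in plain Python. The function name `psnr`, the flat-list pixel layout, and the [0, 1] intensity range are assumptions made for illustration. The paper's own evaluation code is not shown here and may differ, for example in how foreground masks or crops are applied.

```python
import math
from typing import Sequence


def psnr(rendered: Sequence[float], reference: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Assumption: both images are flattened to lists of intensities in
    [0, max_value]; this layout is illustrative, not taken from the paper.
    """
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error, so small PSNR differences between methods can still reflect a meaningful change in pixel error.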
What carries the argument
The Momentum Human Rig (MHR) body model estimated via SAM-3D-Body, which supplies greater expressiveness than SMPL while requiring no additional learned deformation networks.
If this is right
- Simpler pipelines without learned corrections can surpass elaborate networks when the underlying body model is more expressive.
- Both mesh representational capacity and pose estimation quality contribute independently to reconstruction fidelity.
- Performance gains from the rig change appear consistently on standard human avatar benchmarks.
Where Pith is reading between the lines
- Adopting more expressive rigs may reduce reliance on large deformation networks in real-time avatar systems.
- The same ablation logic could be applied to other parametric models or reconstruction pipelines beyond Gaussians.
- Further improvements to rig expressiveness might produce additional gains without any increase in network size.
Load-bearing premise
The two controlled ablations fully isolate pose estimation quality from the body model's intrinsic representational capacity when everything else is retrained identically.
What would settle it
An experiment that applies the identical pose estimates to both SMPL and MHR rigs under the same Gaussian pipeline and checks whether the quality gap remains.
Original abstract
Recent 3D Gaussian splatting methods built atop SMPL achieve remarkable visual fidelity while continually increasing the complexity of the overall training architecture. We demonstrate that much of this complexity is unnecessary: by replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, a minimal pipeline with no learned deformations or pose-dependent corrections achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. To disentangle pose estimation quality from body model representational capacity, we perform two controlled ablations: translating SAM-3D-Body meshes to SMPL-X, and translating the original dataset's SMPL poses into MHR, both retrained under identical conditions. These ablations confirm that body model expressiveness has been a primary bottleneck in avatar reconstruction, with both mesh representational capacity and pose estimation quality contributing meaningfully to the full pipeline's gains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that replacing SMPL with the Momentum Human Rig (MHR) estimated via SAM-3D-Body enables a minimal Gaussian splatting pipeline (no learned deformations or pose-dependent corrections) to achieve the highest reported PSNR and competitive/superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. Two controlled ablations (translating SAM-3D-Body meshes to SMPL-X and original SMPL poses to MHR, both retrained identically) are presented to disentangle pose estimation quality from body model representational capacity, concluding that body model expressiveness has been a primary bottleneck.
Significance. If the ablation controls hold, the result is significant: it shows that a simpler, parameter-light pipeline can outperform increasingly complex deformation networks by improving the underlying body rig, shifting focus from architectural scaling to representational fidelity in Gaussian avatars. The empirical gains on standard datasets provide a falsifiable benchmark for future rig comparisons.
major comments (2)
- [Ablation experiments] Ablation experiments: the two translations (SAM-3D-Body meshes to SMPL-X; SMPL poses to MHR) are asserted to be controlled and lossless under identical retraining, yet no vertex-to-vertex or pose-error metrics (e.g., MPJPE or mesh-to-mesh distance) are reported on the translated data. Without these, approximation artifacts cannot be ruled out as a confound, undermining the isolation of representational capacity from translation fidelity.
- [Results] Results tables: the claim of 'highest reported PSNR' is load-bearing for the central thesis, but the manuscript does not include an exhaustive comparison table listing all cited prior methods with identical metrics, training iterations, and hardware; this prevents verification that the reported gains are not due to unstated differences in optimization schedule.
minor comments (2)
- [Method] Notation: MHR is introduced as an 'invented entity' without an explicit equation or parameter count in the main text; a short table comparing degrees of freedom (joints, blend shapes, etc.) versus SMPL/SMPL-X would clarify the expressiveness claim.
- [Figures] Figure clarity: the ablation diagrams should annotate the exact fitting/optimization steps used in each translation direction so readers can assess potential error sources.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the strength of our claims. We respond to each major point below and will incorporate revisions to address the concerns raised.
Point-by-point responses
- Referee: [Ablation experiments] Ablation experiments: the two translations (SAM-3D-Body meshes to SMPL-X; SMPL poses to MHR) are asserted to be controlled and lossless under identical retraining, yet no vertex-to-vertex or pose-error metrics (e.g., MPJPE or mesh-to-mesh distance) are reported on the translated data. Without these, approximation artifacts cannot be ruled out as a confound, undermining the isolation of representational capacity from translation fidelity.
Authors: We agree that explicit quantitative metrics on translation fidelity would strengthen the controlled nature of the ablations. In the revised manuscript we will add MPJPE for the SMPL-to-MHR pose translations and mean vertex-to-vertex Euclidean distances for the SAM-3D-Body-to-SMPL-X mesh translations, computed on the same subjects used in the main experiments. These numbers will be reported in a new supplementary table together with a brief discussion of any residual error. revision: yes
- Referee: [Results] Results tables: the claim of 'highest reported PSNR' is load-bearing for the central thesis, but the manuscript does not include an exhaustive comparison table listing all cited prior methods with identical metrics, training iterations, and hardware; this prevents verification that the reported gains are not due to unstated differences in optimization schedule.
Authors: We will expand the main results table to list every method cited in the paper, reporting the PSNR, LPIPS and SSIM values exactly as published by the original authors. Where training iteration counts or hardware details appear in the source papers we will include them; otherwise we will mark the entry as “not reported.” A revised table caption will explicitly note that cross-paper comparisons are subject to implementation and hardware differences and that our own runs used the same 300k-iteration schedule and hardware for all ablations. revision: partial
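The translation-fidelity metrics discussed in the first response (MPJPE and mean vertex-to-vertex Euclidean distance) can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, points are plain (x, y, z) tuples, and the two inputs are assumed to be in one-to-one correspondence. That correspondence assumption matters for the mesh case, because meshes with different topologies would first need a vertex mapping or a closest-point distance instead. No root alignment or Procrustes alignment is applied here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mean_point_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding 3D points."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def mpjpe(pred_joints: Sequence[Point], target_joints: Sequence[Point]) -> float:
    """Mean per-joint position error for a single frame (same units as input)."""
    return mean_point_distance(pred_joints, target_joints)


def mean_vertex_distance(src_vertices: Sequence[Point],
                         dst_vertices: Sequence[Point]) -> float:
    """Mean vertex-to-vertex distance, assuming vertex i maps to vertex i."""
    return mean_point_distance(src_vertices, dst_vertices)


# Toy example with two joints: one is displaced by 1 unit, one matches exactly.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
translated = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
frame_error = mpjpe(original, translated)  # (1 + 0) / 2 = 0.5
```

Per-frame values like `frame_error` would typically be averaged over all frames and subjects before being tabulated.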
Circularity Check
No significant circularity: empirical ablation study with externally falsifiable metrics
full rationale
The paper reports an empirical ablation comparing SMPL and MHR body models in Gaussian splatting avatars, using two controlled mesh/pose translations retrained identically on PeopleSnapshot and ZJU-MoCap. No derivation chain, equations, or predictions are present that reduce to fitted inputs by construction. Claims rest on reported PSNR/LPIPS/SSIM values, which are externally replicable and falsifiable. No self-citation load-bearing steps, self-definitional relations, or ansatz smuggling appear in the abstract or described pipeline. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: PeopleSnapshot and ZJU-MoCap are appropriate and representative benchmarks for evaluating avatar reconstruction quality.
invented entities (1)
- Momentum Human Rig (MHR): no independent evidence