GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos
Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3
The pith
GenLCA trains a 3D diffusion model on millions of in-the-wild videos to generate and edit photorealistic full-body avatars from text or images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GenLCA is a diffusion-based generative model for full-body avatars that trains natively in 3D on tokens extracted from millions of real-world videos. It repurposes a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer and introduces a visibility-aware strategy that replaces invalid regions with learnable tokens while computing losses only on valid regions. This combination enables scaling the training dataset while preserving the photorealism and animation properties of the original tokenizer.
What carries the argument
Visibility-aware diffusion training applied to structured 3D tokens produced by a pretrained feed-forward avatar reconstruction model, which handles partial observations in real videos.
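To make the mechanism concrete, here is a minimal PyTorch sketch of the token-replacement step, assuming per-token binary visibility; the names, shapes, and module structure are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch, assuming per-token binary visibility; names and shapes
# are illustrative, not GenLCA's actual implementation.
import torch
import torch.nn as nn

class VisibilityAwareTokens(nn.Module):
    """Swap tokens from unobserved body regions for learnable placeholders."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One learnable embedding per token slot, used wherever the pretrained
        # reconstructor had no valid observation to encode.
        self.invalid_fill = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, tokens: torch.Tensor, visibility: torch.Tensor) -> torch.Tensor:
        # tokens:     (B, N, D) structured 3D tokens from the tokenizer
        # visibility: (B, N) with 1 where the region was observed, 0 where invalid
        mask = visibility.unsqueeze(-1).to(tokens.dtype)  # (B, N, 1), broadcast over D
        return mask * tokens + (1.0 - mask) * self.invalid_fill
```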
If this is right
- Generated avatars stay faithful to text or image inputs while supporting high-fidelity facial and full-body animations.
- The model generalizes to new identities and poses because it trains on diverse real-world video data.
- Editing operations such as pose or appearance changes become possible within the same diffusion framework.
- Performance exceeds that of prior avatar generation methods on photorealism and input-fidelity metrics.
Where Pith is reading between the lines
- The same tokenizer-plus-visibility pattern could let other 3D diffusion models scale training on partial video observations beyond avatars.
- Continued growth in available in-the-wild video volume would directly improve model quality and generalization without changes to architecture or loss design.
- The method may reduce dependence on curated synthetic datasets for photorealistic 3D generative tasks.
Load-bearing premise
The pretrained feed-forward avatar reconstruction model produces reliable structured 3D tokens from partial 2D observations, and replacing invalid regions with learnable tokens while restricting losses to valid regions removes artifacts without adding new biases or quality loss.
What would settle it
Generate avatars from held-out in-the-wild videos with known ground-truth 3D reconstructions, then render them from novel viewpoints; if blurring, transparency, or loss of detail appears systematically in unobserved regions, the central claim fails.
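A hedged sketch of how that test could be scored: `pred` and `gt` are novel-view renders of the generated and ground-truth avatar, and `observed` marks pixels whose body regions were visible in the source video. All three inputs are hypothetical; only the masked-metric arithmetic is concrete.

```python
# A scoring sketch under assumed inputs; not the paper's evaluation code.
import torch
import lpips  # https://github.com/richzhang/PerceptualSimilarity

metric = lpips.LPIPS(net="alex", spatial=True)  # per-pixel distance map

def unobserved_region_error(pred, gt, observed):
    # pred, gt: (1, 3, H, W) in [-1, 1]; observed: (1, 1, H, W) in {0, 1}
    dist = metric(pred, gt)          # (1, 1, H, W) spatial LPIPS map
    hidden = 1.0 - observed          # regions the model had to hallucinate
    return (dist * hidden).sum() / hidden.sum().clamp(min=1.0)

# If this error stays systematically higher in unobserved regions across
# held-out identities and viewpoints, the central claim fails.
```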
original abstract
We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.
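The abstract's flow-based training on tokens, combined with the valid-region loss, might look roughly like the following rectified-flow sketch. The linear interpolation path and constant velocity target are standard flow-matching choices assumed here; `model`, shapes, and the conditioning interface are illustrative, not details confirmed by the paper.

```python
# A rectified-flow sketch under assumed conventions; not GenLCA's actual code.
import torch

def masked_flow_loss(model, x1, visibility, cond):
    """Flow-matching loss on 3D tokens, restricted to valid regions.

    x1:         (B, N, D) clean tokens (invalid slots already replaced)
    visibility: (B, N)    1 where the region was observed, else 0
    cond:       text/image conditioning, passed through to the model
    """
    x0 = torch.randn_like(x1)                         # pure-noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                      # linear interpolation path
    target = x1 - x0                                  # constant velocity target
    pred = model(xt, t.reshape(-1), cond)             # predicted velocity field
    err = (pred - target).pow(2).mean(dim=-1)         # (B, N) per-token error
    return (err * visibility).sum() / visibility.sum().clamp(min=1.0)
```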
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GenLCA, a diffusion-based generative model for photorealistic full-body avatars from text and image inputs. It repurposes a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer to encode in-the-wild video frames into structured 3D tokens, enabling training on millions of partially observable real-world videos. A visibility-aware diffusion training strategy replaces invalid regions with learnable tokens and computes losses only on valid regions to avoid blurring and transparency artifacts. A flow-based diffusion model is then trained on this token dataset to support high-fidelity generation, editing, and animation while preserving photorealism and animatability, with claims of outperforming prior methods by a large margin.
Significance. If the central claims hold, this work is significant for enabling scalable training of native 3D diffusion models on large-scale in-the-wild video data without full 3D supervision. The paradigm of using a pretrained reconstructor as tokenizer combined with visibility-aware masking directly addresses partial observability, a key bottleneck in avatar synthesis. Strengths include the architectural details for tokenization and loss formulation, the maintenance of animatability from the base model, and the potential for improved generalizability through dataset scale. This could influence future work on data-efficient 3D generative models in computer vision.
major comments (2)
- §3.2 (visibility-aware training): The strategy of replacing invalid regions with learnable tokens while restricting losses to valid regions is load-bearing for the claim that partial observations can be scaled without artifacts. However, the manuscript lacks an ablation quantifying whether these learnable tokens propagate biases into the diffusion prior (e.g., via distribution shift in occluded body parts), which is needed to confirm the strategy fully mitigates blurring/transparency without new quality loss.
- §4 (quantitative results): The claim of outperforming existing solutions by a large margin is central but rests on unspecified metrics and baselines. The evaluation should report concrete numbers (e.g., FID, LPIPS, or user-study percentages) with error analysis and exact comparison methods to substantiate the scalability benefit; absence of these details weakens the evidence for the core contribution.
minor comments (3)
- The abstract and introduction would benefit from explicit citations to the specific prior works being outperformed, to ground the 'large margin' claim.
- §2 (notation): Notation for the structured 3D tokens (e.g., how visibility masks are encoded) could be formalized more clearly to aid reproducibility; a candidate formalization is sketched after this list.
- Figure captions for qualitative results should include input conditions (text/image) and failure cases to improve clarity and balance the presentation.
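As flagged in the notation comment above, one candidate formalization (an assumption, not the paper's notation) is:

```latex
% Structured tokens t_i, binary visibility m_i, learnable fills e_i.
\begin{align}
  \tilde{\mathbf{t}}_i &= m_i\,\mathbf{t}_i + (1 - m_i)\,\mathbf{e}_i,
    \qquad \mathbf{t}_i, \mathbf{e}_i \in \mathbb{R}^{d},\; m_i \in \{0,1\}, \\
  \mathcal{L} &= \frac{\sum_{i=1}^{N} m_i
      \left\lVert v_\theta(\tilde{\mathbf{t}}, t)_i - u_i \right\rVert^2}
      {\sum_{i=1}^{N} m_i},
\end{align}
```

where $v_\theta$ is the model's predicted velocity field and $u_i$ the flow target on token $i$; the flow-matching framing is assumed from the abstract.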
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the constructive feedback highlighting areas where additional evidence can strengthen the claims regarding visibility-aware training and quantitative evaluations. We address each major comment below and will update the manuscript accordingly.
point-by-point responses
- Referee: §3.2 (visibility-aware training): The strategy of replacing invalid regions with learnable tokens while restricting losses to valid regions is load-bearing for the claim that partial observations can be scaled without artifacts. However, the manuscript lacks an ablation quantifying whether these learnable tokens propagate biases into the diffusion prior (e.g., via distribution shift in occluded body parts), which is needed to confirm the strategy fully mitigates blurring/transparency without new quality loss.
Authors: We agree that an ablation would provide valuable confirmation that the learnable tokens do not introduce unintended biases. In the revised manuscript, we will add an ablation study comparing the full visibility-aware training (learnable tokens for invalid regions with loss restricted to valid regions) against a baseline that masks or ignores invalid regions without learnable tokens. We will report quantitative metrics such as FID and LPIPS on generated full-body avatars, along with qualitative analysis of occluded body parts, to assess any distribution shift or quality degradation. This will demonstrate that the strategy enables scaling without new artifacts, as the tokens are optimized end-to-end to produce plausible content consistent with visible observations. (Revision: yes.)
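One concrete piece of such an ablation could be a Fréchet-distance probe on occluded-region features, as in FID; the feature extractor and the two sample sets (`feats_a` from the learnable-token variant, `feats_b` from the masked-only baseline) are assumptions, while the statistic itself is standard.

```python
# A distribution-shift probe: Fréchet distance between occluded-region
# feature sets from two training variants. Inputs are hypothetical (N, D)
# feature arrays; only the statistic is concrete.
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```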
- Referee: §4 (quantitative results): The claim of outperforming existing solutions by a large margin is central but rests on unspecified metrics and baselines. The evaluation should report concrete numbers (e.g., FID, LPIPS, or user-study percentages) with error analysis and exact comparison methods to substantiate the scalability benefit; absence of these details weakens the evidence for the core contribution.
Authors: We acknowledge the need for more explicit quantitative details to support the performance claims. In the revised Section 4, we will include concrete numerical results for metrics such as FID, LPIPS, and user-study preference percentages (e.g., from A/B tests on photorealism and animatability). We will also provide error analysis with standard deviations, specify the exact baselines (including prior methods and their configurations), detail the evaluation protocol (e.g., sample counts, rendering settings, and test set composition), and describe comparison methods to substantiate the advantages from large-scale in-the-wild training. (Revision: yes.)
Circularity Check
No significant circularity detected in the derivation chain.
full rationale
The paper's central derivation proceeds from an external pretrained feed-forward avatar reconstruction model (treated as a black-box 3D tokenizer) to a visibility-aware diffusion training procedure on the resulting tokens, followed by a flow-based generative model. This chain is self-contained: the visibility-aware masking and loss computation are explicitly formulated as a training strategy to handle partial observations, without reducing any claimed prediction or output to a fitted parameter or self-referential definition. No load-bearing step invokes a self-citation chain, uniqueness theorem from the same authors, or ansatz smuggled via prior work; the pretrained component is presented as an independent input whose outputs are then processed. The reported results (generation, editing, quantitative comparisons) are downstream evaluations rather than tautological restatements of the inputs. The derivation therefore does not collapse to its own assumptions by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable tokens for invalid regions
axioms (1)
- Domain assumption: The pretrained feed-forward avatar reconstruction model can serve as an effective animatable 3D tokenizer for 2D video frames.