GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos
Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3
The pith
GenLCA trains a 3D diffusion model on millions of in-the-wild videos to generate and edit photorealistic full-body avatars from text or images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GenLCA is a diffusion-based generative model for full-body avatars that trains natively in 3D on tokens extracted from millions of real-world videos. It repurposes a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer and introduces a visibility-aware strategy that replaces invalid regions with learnable tokens while computing losses only on valid regions. This combination enables scaling the training dataset while preserving the photorealism and animation properties of the original tokenizer.
What carries the argument
Visibility-aware diffusion training applied to structured 3D tokens produced by a pretrained feed-forward avatar reconstruction model, which handles partial observations in real videos.
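To make the mechanism concrete, here is a minimal PyTorch sketch of the token-replacement step, assuming per-token binary visibility; the names, shapes, and module structure are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch, assuming per-token binary visibility; names and shapes
# are illustrative, not GenLCA's actual implementation.
import torch
import torch.nn as nn

class VisibilityAwareTokens(nn.Module):
    """Swap tokens from unobserved body regions for learnable placeholders."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One learnable embedding per token slot, used wherever the pretrained
        # reconstructor had no valid observation to encode.
        self.invalid_fill = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, tokens: torch.Tensor, visibility: torch.Tensor) -> torch.Tensor:
        # tokens:     (B, N, D) structured 3D tokens from the tokenizer
        # visibility: (B, N) with 1 where the region was observed, 0 where invalid
        mask = visibility.unsqueeze(-1).to(tokens.dtype)  # (B, N, 1), broadcast over D
        return mask * tokens + (1.0 - mask) * self.invalid_fill
```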
If this is right
- Generated avatars stay faithful to text or image inputs while supporting high-fidelity facial and full-body animations.
- The model generalizes to new identities and poses because it trains on diverse real-world video data.
- Editing operations such as pose or appearance changes become possible within the same diffusion framework.
- Performance exceeds that of prior avatar generation methods on photorealism and input-fidelity metrics.
Where Pith is reading between the lines
- The same tokenizer-plus-visibility pattern could let other 3D diffusion models scale training on partial video observations beyond avatars.
- Continued growth in available in-the-wild video volume would directly improve model quality and generalization without changes to architecture or loss design.
- The method may reduce dependence on curated synthetic datasets for photorealistic 3D generative tasks.
Load-bearing premise
The pretrained feed-forward avatar reconstruction model produces reliable structured 3D tokens from partial 2D observations, and replacing invalid regions with learnable tokens while restricting losses to valid regions removes artifacts without adding new biases or quality loss.
What would settle it
Generate avatars from held-out in-the-wild videos with known ground-truth 3D reconstructions, then render them from novel viewpoints; if blurring, transparency, or loss of detail appears systematically in unobserved regions, the central claim fails.
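A hedged sketch of how that test could be scored: `pred` and `gt` are novel-view renders of the generated and ground-truth avatar, and `observed` marks pixels whose body regions were visible in the source video. All three inputs are hypothetical; only the masked-metric arithmetic is concrete.

```python
# A scoring sketch under assumed inputs; not the paper's evaluation code.
import torch
import lpips  # https://github.com/richzhang/PerceptualSimilarity

metric = lpips.LPIPS(net="alex", spatial=True)  # per-pixel distance map

def unobserved_region_error(pred, gt, observed):
    # pred, gt: (1, 3, H, W) in [-1, 1]; observed: (1, 1, H, W) in {0, 1}
    dist = metric(pred, gt)          # (1, 1, H, W) spatial LPIPS map
    hidden = 1.0 - observed          # regions the model had to hallucinate
    return (dist * hidden).sum() / hidden.sum().clamp(min=1.0)

# If this error stays systematically higher in unobserved regions across
# held-out identities and viewpoints, the central claim fails.
```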
original abstract
We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.
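The abstract's flow-based training on tokens, combined with the valid-region loss, might look roughly like the following rectified-flow sketch. The linear interpolation path and constant velocity target are standard flow-matching choices assumed here; `model`, shapes, and the conditioning interface are illustrative, not details confirmed by the paper.

```python
# A rectified-flow sketch under assumed conventions; not GenLCA's actual code.
import torch

def masked_flow_loss(model, x1, visibility, cond):
    """Flow-matching loss on 3D tokens, restricted to valid regions.

    x1:         (B, N, D) clean tokens (invalid slots already replaced)
    visibility: (B, N)    1 where the region was observed, else 0
    cond:       text/image conditioning, passed through to the model
    """
    x0 = torch.randn_like(x1)                         # pure-noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                      # linear interpolation path
    target = x1 - x0                                  # constant velocity target
    pred = model(xt, t.reshape(-1), cond)             # predicted velocity field
    err = (pred - target).pow(2).mean(dim=-1)         # (B, N) per-token error
    return (err * visibility).sum() / visibility.sum().clamp(min=1.0)
```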
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GenLCA, a diffusion-based generative model for photorealistic full-body avatars from text and image inputs. It repurposes a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer to encode in-the-wild video frames into structured 3D tokens, enabling training on millions of partially observable real-world videos. A visibility-aware diffusion training strategy replaces invalid regions with learnable tokens and computes losses only on valid regions to avoid blurring and transparency artifacts. A flow-based diffusion model is then trained on this token dataset to support high-fidelity generation, editing, and animation while preserving photorealism and animatability, with claims of outperforming prior methods by a large margin.
Significance. If the central claims hold, this work is significant for enabling scalable training of native 3D diffusion models on large-scale in-the-wild video data without full 3D supervision. The paradigm of using a pretrained reconstructor as tokenizer combined with visibility-aware masking directly addresses partial observability, a key bottleneck in avatar synthesis. Strengths include the architectural details for tokenization and loss formulation, the maintenance of animatability from the base model, and the potential for improved generalizability through dataset scale. This could influence future work on data-efficient 3D generative models in computer vision.
major comments (2)
- §3.2 (visibility-aware training): The strategy of replacing invalid regions with learnable tokens while restricting losses to valid regions is load-bearing for the claim that partial observations can be scaled without artifacts. However, the manuscript lacks an ablation quantifying whether these learnable tokens propagate biases into the diffusion prior (e.g., via distribution shift in occluded body parts), which is needed to confirm the strategy fully mitigates blurring/transparency without new quality loss.
- §4 (quantitative results): The claim of outperforming existing solutions by a large margin is central but rests on unspecified metrics and baselines. The evaluation should report concrete numbers (e.g., FID, LPIPS, or user-study percentages) with error analysis and exact comparison methods to substantiate the scalability benefit; absence of these details weakens the evidence for the core contribution.
minor comments (3)
- The abstract and introduction would benefit from explicit citations to the specific prior works being outperformed, to ground the 'large margin' claim.
- §2 (notation): Notation for the structured 3D tokens (e.g., how visibility masks are encoded) could be formalized more clearly to aid reproducibility; a candidate formalization is sketched after this list.
- Figure captions for qualitative results should include input conditions (text/image) and failure cases to improve clarity and balance the presentation.
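As flagged in the notation comment above, one candidate formalization (an assumption, not the paper's notation) is:

```latex
% Structured tokens t_i, binary visibility m_i, learnable fills e_i.
\begin{align}
  \tilde{\mathbf{t}}_i &= m_i\,\mathbf{t}_i + (1 - m_i)\,\mathbf{e}_i,
    \qquad \mathbf{t}_i, \mathbf{e}_i \in \mathbb{R}^{d},\; m_i \in \{0,1\}, \\
  \mathcal{L} &= \frac{\sum_{i=1}^{N} m_i
      \left\lVert v_\theta(\tilde{\mathbf{t}}, t)_i - u_i \right\rVert^2}
      {\sum_{i=1}^{N} m_i},
\end{align}
```

where $v_\theta$ is the model's predicted velocity field and $u_i$ the flow target on token $i$; the flow-matching framing is assumed from the abstract.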
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. We appreciate the constructive feedback highlighting areas where additional evidence can strengthen the claims regarding visibility-aware training and quantitative evaluations. We address each major comment below and will update the manuscript accordingly.
point-by-point responses
- Referee: §3.2 (visibility-aware training): The strategy of replacing invalid regions with learnable tokens while restricting losses to valid regions is load-bearing for the claim that partial observations can be scaled without artifacts. However, the manuscript lacks an ablation quantifying whether these learnable tokens propagate biases into the diffusion prior (e.g., via distribution shift in occluded body parts), which is needed to confirm the strategy fully mitigates blurring/transparency without new quality loss.
Authors: We agree that an ablation would provide valuable confirmation that the learnable tokens do not introduce unintended biases. In the revised manuscript, we will add an ablation study comparing the full visibility-aware training (learnable tokens for invalid regions with loss restricted to valid regions) against a baseline that masks or ignores invalid regions without learnable tokens. We will report quantitative metrics such as FID and LPIPS on generated full-body avatars, along with qualitative analysis of occluded body parts, to assess any distribution shift or quality degradation. This will demonstrate that the strategy enables scaling without new artifacts, as the tokens are optimized end-to-end to produce plausible content consistent with visible observations. (Revision: yes.)
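One concrete piece of such an ablation could be a Fréchet-distance probe on occluded-region features, as in FID; the feature extractor and the two sample sets (`feats_a` from the learnable-token variant, `feats_b` from the masked-only baseline) are assumptions, while the statistic itself is standard.

```python
# A distribution-shift probe: Fréchet distance between occluded-region
# feature sets from two training variants. Inputs are hypothetical (N, D)
# feature arrays; only the statistic is concrete.
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```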
- Referee: §4 (quantitative results): The claim of outperforming existing solutions by a large margin is central but rests on unspecified metrics and baselines. The evaluation should report concrete numbers (e.g., FID, LPIPS, or user-study percentages) with error analysis and exact comparison methods to substantiate the scalability benefit; absence of these details weakens the evidence for the core contribution.
Authors: We acknowledge the need for more explicit quantitative details to support the performance claims. In the revised Section 4, we will include concrete numerical results for metrics such as FID, LPIPS, and user-study preference percentages (e.g., from A/B tests on photorealism and animatability). We will also provide error analysis with standard deviations, specify the exact baselines (including prior methods and their configurations), detail the evaluation protocol (e.g., sample counts, rendering settings, and test set composition), and describe comparison methods to substantiate the advantages from large-scale in-the-wild training. (Revision: yes.)
Circularity Check
No significant circularity detected in the derivation chain.
full rationale
The paper's central derivation proceeds from an external pretrained feed-forward avatar reconstruction model (treated as a black-box 3D tokenizer) to a visibility-aware diffusion training procedure on the resulting tokens, followed by a flow-based generative model. This chain is self-contained: the visibility-aware masking and loss computation are explicitly formulated as a training strategy to handle partial observations, without reducing any claimed prediction or output to a fitted parameter or self-referential definition. No load-bearing step invokes a self-citation chain, uniqueness theorem from the same authors, or ansatz smuggled via prior work; the pretrained component is presented as an independent input whose outputs are then processed. The reported results (generation, editing, quantitative comparisons) are downstream evaluations rather than tautological restatements of the inputs. The derivation therefore does not collapse to its own assumptions by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable tokens for invalid regions
axioms (1)
- Domain assumption: The pretrained feed-forward avatar reconstruction model can serve as an effective animatable 3D tokenizer for 2D video frames.