arxiv: 2604.18583 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

MUA: Mobile Ultra-detailed Animatable Avatars

Heming Zhu , Guoxing Sun , Marc Habermann

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords animatable avatarsmobile VRmodel distillationwavelet decompositionblendshapesreal-time renderingdigital humansclothing dynamics

0 comments

The pith

Wavelet-guided blendshapes distill high-fidelity avatar details into a compact form that runs real-time on mobile VR headsets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a compact representation for full-body animatable avatars that maintains detailed clothing motion and appearance while slashing resource demands. It starts from an expensive high-quality teacher model and uses a distillation process to move the essential information into a lighter student model. The key step couples wavelet-based decomposition of textures at several scales with low-rank factorization to keep dynamics plausible. This matters to readers because it removes the need for server GPUs to experience detailed digital humans in VR or other immersive settings.

Core claim

By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details closely resembling those of the teacher model. The representation, called Wavelet-guided Multi-level Spatial Factorized Blendshapes, runs at over 180 FPS on desktop hardware and achieves native real-time performance at 24 FPS on a standalone Meta Quest 3.

What carries the argument

Wavelet-guided Multi-level Spatial Factorized Blendshapes, which applies multi-level wavelet decomposition to avatar textures and pairs it with low-rank factorization to encode dynamic geometry and appearance in a compact form.

If this is right

Outperforms existing methods designed for mobile platforms in rendering quality while matching or exceeding most server-only approaches.
Enables over 180 FPS on desktop PCs and native 24 FPS on standalone devices such as the Meta Quest 3.
Makes high-fidelity full-body avatars practical for immersive VR and AR applications without requiring server-class GPUs.
Reduces model size by roughly 10X while keeping visually plausible motion and appearance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation pattern could let other heavy 3D models run on phones or headsets by moving computation into a compact spectral form.
Real-time on-device performance removes the need for constant cloud streaming, which could expand personalized avatar use in consumer games and social VR.
If the wavelet levels can be adjusted dynamically, the method might support graceful quality scaling based on available battery or bandwidth.

Load-bearing premise

The distillation pipeline transfers the motion-aware clothing dynamics and fine appearance details from the teacher model without introducing noticeable artifacts or fidelity loss at the reduced resolution and compute budget.

What would settle it

A direct visual comparison on Meta Quest 3 hardware between the distilled model and the original teacher at 24 FPS that reveals missing clothing folds, blurred textures, or new artifacts.

Figures

Figures reproduced from arXiv: 2604.18583 by Guoxing Sun, Heming Zhu, Marc Habermann.

**Figure 2.** Figure 2: Overview. Given root-normalized skeletal motion θ¯ f as input, we first train a teacher model that models the coarse geometry with a template mesh V¯ f and fine geometry and appearance with 3D Gaussian splat textures T gs f . We further decompose T gs f with a wavelet transform to obtain multi-level supervision for distillation. To derive a compact, mobile-ready representation, we model the coarse geometry… view at source ↗

**Figure 3.** Figure 3: Intuition. The proposed Wavelet-guided Multi-level Factorized Gaussian Texture is based on the observation that different wavelet subbands of the Gaussian texture T gs f exhibit distinct structural properties. The coarsest lowfrequency subband TLL f contains most of the signal energy but has a low spatial resolution. The intermediate detail subbands Dl,f , l ∈ {2, 3}, are sparse and thus well suited for 1… view at source ↗

**Figure 4.** Figure 4: Qualitative Rendering Results. Given limited computation and memory budget, MUA produces detailed rendering with motion aware appearance and wrinkles. Please zoom-in to better observe the details. 5.2 Dataset We conduct all experiments on the dataset released by UMA [11]. The dataset comprises five subjects wearing garments with rich non-rigid clothing dynamics and intricate texture patterns. For each subj… view at source ↗

**Figure 5.** Figure 5: Qualitative Geometry Results. Given limited computation and memory budget, MUA and synthesize high-fidelity geometry with motion-aware wrinkles. 5.4 Quantitative Results Image Synthesis [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative Comparison. The qualitative comparison between our method and the state-of-the-art approaches. Specifically, Animatable Gaussians, ASH and UMA belongs to server-based approaches, while 3DGS-Avatar and Tao Avatar belongs to mobile-based approaches. Please zoom-in to better inspect the detailed clothing wrinkles and appearances [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Ablations. Qualitative comparison of our full model against the design alternatives. Our full model better preserves wrinkle and appearance details with lowest computational overhead. teacher design in Sec. 3.2. Specifically, the LL subband TLL f and the detail subbands {Ts l,f } are predicted by separate 2D convolutional networks, each taking the positional and normal textures at the corresponding resolut… view at source ↗

**Figure 8.** Figure 8: Standalone VR demo. Screenshot of our standalone VR demo running on Meta Quest 3. Users can inspect the dynamic avatar, detailed geometry, skeletal pose, and live shadows in the VR environment at a frame rate of 24 FPS. All computations are performed on-the-fly on the headset. local spatial awareness in the 1D factorized vectors does not improve over our full method. This further proves that the learned bl… view at source ↗

read the original abstract

Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: Ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper distills high-fidelity blendshape avatars to mobile sizes via wavelet decomposition plus low-rank texture factorization, but the abstract gives no quantitative proof that fine clothing dynamics survive the cut.

read the letter

The main thing here is a new representation called Wavelet-guided Multi-level Spatial Factorized Blendshapes, paired with a distillation pipeline that takes a heavy teacher model and compresses it for standalone VR hardware. They claim 2000X lower compute, 10X smaller size, and 24 FPS native on a Quest 3 while keeping appearance and motion close to the original. The technical move that stands out is the explicit use of multi-level wavelets in texture space to isolate frequency bands before the low-rank factorization step; that combination is not a direct copy of earlier avatar compression work I have seen referenced in the abstract framing. The practical payoff is real: most prior mobile avatars either drop dynamics hard or stay too heavy for on-device use, so hitting real-time with plausible clothing motion would matter for games and social VR. What the paper does cleanly is target that exact gap and ship a pipeline that avoids training everything from scratch. The soft spot is the validation of the dynamics claim. The abstract only offers the phrases “visually plausible dynamics” and “closely resemble,” with no per-region temporal error, frequency-band PSNR, or wrinkle-specific metrics reported. If the high-frequency wavelet coefficients that carry fold motion get lost or aliased during factorization, the student model could look static even when static frames match. The stress-test note on high-frequency temporal fidelity is on point given what is shown. Comparisons to other methods are asserted but lack protocol details or baseline numbers in the summary, so it is difficult to judge the size of the improvement. This is for people who need deployable avatars on consumer hardware rather than for pure theory readers. A practitioner working on real-time VR characters would get concrete ideas from the representation and distillation steps. It deserves a serious referee because the problem is timely and the architectural choice is a fresh combination, even though the current evidence on motion fidelity is thin.

Referee Report

2 major / 1 minor

Summary. The paper introduces Wavelet-guided Multi-level Spatial Factorized Blendshapes as a compact animatable avatar representation together with a distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance from a high-quality teacher model. It claims up to 2000X lower computational cost and 10X smaller model size than the teacher while preserving visually plausible dynamics, enabling >180 FPS on desktop and real-time 24 FPS native performance on Meta Quest 3, and outperforming prior mobile avatar methods.

Significance. If the efficiency and fidelity claims hold with rigorous validation, the work would meaningfully advance practical deployment of high-detail full-body avatars on consumer VR/AR hardware by combining spectral decomposition with low-rank factorization in a distillation setting. The approach addresses a clear gap between server-only ultra-fidelity models and lightweight but low-dynamic alternatives.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The headline claims of 2000X lower cost and 10X smaller size are stated without accompanying quantitative tables, baseline comparisons, error bars, or evaluation protocol details. No per-region or frequency-band metrics (e.g., high-frequency PSNR on clothing wrinkles, temporal coherence scores) are reported to verify that the distillation retains motion-dependent dynamics rather than smoothing them.
[§3] §3 (Method): The Wavelet-guided Multi-level Spatial Factorized Blendshapes representation depends on free parameters (number of wavelet levels, low-rank factorization rank) whose effect on preserving high-frequency temporal components of clothing deformation is not analyzed; truncation or aliasing in the wavelet bands during factorization could produce the very artifacts the method aims to avoid, yet no sensitivity study or spectral error analysis is provided.

minor comments (1)

[Abstract] Abstract: The sentence 'appearance details closely resemble those of the teacher model' is grammatically incomplete and should be rephrased for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the quantitative validation and analysis.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline claims of 2000X lower cost and 10X smaller size are stated without accompanying quantitative tables, baseline comparisons, error bars, or evaluation protocol details. No per-region or frequency-band metrics (e.g., high-frequency PSNR on clothing wrinkles, temporal coherence scores) are reported to verify that the distillation retains motion-dependent dynamics rather than smoothing them.

Authors: We agree that the efficiency claims require more explicit quantitative backing. In the revised manuscript we will add a dedicated table in §4 reporting measured computational cost (FLOPs and wall-clock inference time on desktop and Quest 3 hardware), model size (parameters and MB), and direct comparisons against the teacher model as well as prior mobile and server-based baselines. Error bars from repeated runs will be included where relevant, and the evaluation protocol (sequences, hardware, measurement methodology) will be fully specified. To confirm retention of motion-dependent dynamics we will additionally report per-region metrics on clothing areas together with frequency-band PSNR and temporal coherence scores computed over animation sequences. revision: yes
Referee: [§3] §3 (Method): The Wavelet-guided Multi-level Spatial Factorized Blendshapes representation depends on free parameters (number of wavelet levels, low-rank factorization rank) whose effect on preserving high-frequency temporal components of clothing deformation is not analyzed; truncation or aliasing in the wavelet bands during factorization could produce the very artifacts the method aims to avoid, yet no sensitivity study or spectral error analysis is provided.

Authors: The number of wavelet levels and factorization rank were chosen via preliminary experiments to balance compactness and fidelity. We acknowledge that a dedicated sensitivity study is missing. In the revision we will add an ablation study (in §4 or an appendix) that systematically varies both parameters and reports their effect on high-frequency detail preservation using spectral error metrics, temporal coherence, and visual comparisons. This analysis will also address potential truncation or aliasing artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation; novel representation and distillation are independent

full rationale

The paper proposes a new animatable avatar representation (Wavelet-guided Multi-level Spatial Factorized Blendshapes) together with a distillation pipeline from a pre-trained teacher model. Performance claims (2000X lower cost, 10X smaller size, real-time FPS) are presented as empirical outcomes of this architecture and transfer process rather than quantities defined by the same fitted parameters or reduced by construction to prior self-citations. No equations or steps in the provided text equate the claimed results to inputs via self-definition, fitted-input renaming, or load-bearing self-citation chains. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so exact parameter counts and assumptions cannot be audited in full. The method relies on standard neural avatar and knowledge-distillation assumptions plus new design choices whose tuning details are not stated.

free parameters (2)

number of wavelet decomposition levels
Multi-level spectral decomposition depth chosen to trade detail against efficiency; value not specified.
low-rank factorization rank
Rank of structural factorization in texture space selected for compression; value not specified.

axioms (1)

domain assumption Pre-trained ultra-high-quality teacher model supplies accurate motion-aware clothing dynamics and fine appearance details that can be distilled
Central transfer step assumes the teacher is a reliable source of ground-truth dynamics.

invented entities (1)

Wavelet-guided Multi-level Spatial Factorized Blendshapes no independent evidence
purpose: Compact efficient representation for high-fidelity animatable avatars
New proposed structure whose independent validation rests on the paper's own experiments.

pith-pipeline@v0.9.0 · 5580 in / 1414 out tokens · 44317 ms · 2026-05-10T04:41:30.943902+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

86 extracted references · 6 canonical work pages

[1]

Exploring the design space of immersive urban analytics,

Z. Chen, Y. Wang, T. Sun, X. Gao, W. Chen, Z. Pan, H. Qu, and Y. Wu, “Exploring the design space of immersive urban analytics,” Visual Informatics, vol. 1, no. 2, pp. 132–142, 2017

2017
[2]

Educational twin: the influence of artificial xr expert duplicates on future learning,

C. Sayffaerth, “Educational twin: the influence of artificial xr expert duplicates on future learning,”arXiv preprint arXiv:2504.13896, 2025

work page arXiv 2025
[3]

Nerf: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P . P . Srinivasan, M. Tancik, J. T. Barron, R. Ramamoor- thi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” inEur. Conf. Comput. Vis., 2020

2020
[4]

3d gaussian splatting for real-time radiance field rendering,

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,”ACM T rans. Graph., vol. 42, no. 4, pp. 1–14, 2023

2023
[5]

Neus2: Fast learning of neural implicit surfaces for multi- view reconstruction,

Y. Wang, Q. Han, M. Habermann, K. Daniilidis, C. Theobalt, and L. Liu, “Neus2: Fast learning of neural implicit surfaces for multi- view reconstruction,” inInt. Conf. Comput. Vis., 2023

2023
[6]

Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling,

Z. Li, Z. Zheng, L. Wang, and Y. Liu, “Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 19 711–19 722

2024
[7]

Hdhumans: A hybrid approach for high-fidelity digital humans,

M. Habermann, L. Liu, W. Xu, G. Pons-Moll, M. Zollhoefer, and C. Theobalt, “Hdhumans: A hybrid approach for high-fidelity digital humans,”Proceedings of the ACM on Computer Graphics and Interactive T echniques, vol. 6, no. 3, pp. 1–23, 2023. 14

2023
[8]

Tava: Template-free animatable volumetric actors,

R. Li, J. Tanke, M. Vo, M. Zollhofer, J. Gall, A. Kanazawa, and C. Lassner, “Tava: Template-free animatable volumetric actors,” 2022

2022
[9]

Arah: Animatable volume rendering of articulated human sdfs,

S. Wang, K. Schwarz, A. Geiger, and S. Tang, “Arah: Animatable volume rendering of articulated human sdfs,” inEur. Conf. Comput. Vis., 2022

2022
[10]

Ash: Animatable gaussian splats for efficient and photoreal human rendering,

H. Pang, H. Zhu, A. Kortylewski, C. Theobalt, and M. Habermann, “Ash: Animatable gaussian splats for efficient and photoreal human rendering,” inIEEE Conf. Comput. Vis. Pattern Recog., June 2024, pp. 1165–1175

2024
[11]

Uma: Ultra- detailed human avatars via multi-level surface alignment,

H. Zhu, G. Sun, C. Theobalt, and M. Habermann, “Uma: Ultra- detailed human avatars via multi-level surface alignment,”arXiv preprint arXiv:2506.01802, 2025

work page arXiv 2025
[12]

Cotracker: It is better to track together,

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker: It is better to track together,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 18–35

2024
[13]

Squeezeme: Mobile-ready distillation of gaussian full-body avatars,

F. Iandola, S. Pidhorskyi, I. Santesteban, D. Gupta, A. Pahuja, N. Bartolovic, F. Yu, E. Garbin, T. Simon, and S. Saito, “Squeezeme: Mobile-ready distillation of gaussian full-body avatars,” inProceed- ings of the Special Interest Group on Computer Graphics and Interactive T echniques Conference Conference Papers, 2025, pp. 1–11

2025
[14]

Taoa- vatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting,

J. Chen, J. Hu, G. Wang, Z. Jiang, T. Zhou, Z. Chen, and C. Lv, “Taoa- vatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 723–10 734

2025
[15]

Expressive body capture: 3d hands, face, and body from a single image,

G. Pavlakos, V . Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” inIEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 10 975–10 985

2019
[16]

Tensorf: Tensorial radiance fields,

A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” inEuropean conference on computer vision. Springer, 2022, pp. 333–350

2022
[17]

Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans,

S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou, “Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans,” inIEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 9054–9063

2021
[18]

Mixture of volumetric primitives for efficient neural rendering,

S. Lombardi, T. Simon, G. Schwartz, M. Zollhofer, Y. Sheikh, and J. M. Saragih, “Mixture of volumetric primitives for efficient neural rendering,”ACM T rans. Graph., vol. 40, no. 4, pp. 59:1–59:13, 2021

2021
[19]

Learning compositional radiance fields of dynamic human heads,

Z. Wang, T. Bagautdinov, S. Lombardi, T. Simon, J. Saragih, J. Hodgins, and M. Zollhofer, “Learning compositional radiance fields of dynamic human heads,” 2020

2020
[20]

HumanNeRF: Free-viewpoint ren- dering of moving people from monocular video,

C.-Y. Weng, B. Curless, P . P . Srinivasan, J. T. Barron, and I. Kemelmacher-Shlizerman, “HumanNeRF: Free-viewpoint ren- dering of moving people from monocular video,” inIEEE Conf. Comput. Vis. Pattern Recog., June 2022, pp. 16 210–16 220

2022
[21]

Humanrf: High-fidelity neural radiance fields for humans in motion,

M. I¸ sık, M. Runz, M. Georgopoulos, T. Khakhulin, J. Starck, L. Agapito, and M. Niessner, “Humanrf: High-fidelity neural radiance fields for humans in motion,”ACM T rans. Graph., vol. 42, no. 4, pp. 1–12, 2023

2023
[22]

Repre- senting long volumetric video with temporal gaussian hierarchy,

Z. Xu, Y. Xu, Z. Yu, S. Peng, J. Sun, H. Bao, and X. Zhou, “Repre- senting long volumetric video with temporal gaussian hierarchy,” ACM T ransactions on Graphics (TOG), vol. 43, no. 6, pp. 1–18, 2024

2024
[23]

Reperformer: Immersive human-centric volumetric videos from playback to photoreal reperformance,

Y. Jiang, Z. Shen, C. Guo, Y. Hong, Z. Su, Y. Zhang, M. Habermann, and L. Xu, “Reperformer: Immersive human-centric volumetric videos from playback to photoreal reperformance,”arXiv preprint arXiv:2503.12242, 2025

work page arXiv 2025
[24]

Modeling clothing as a separate layer for an animatable human avatar,

D. Xiang, F. Prada, T. Bagautdinov, W. Xu, Y. Dong, H. Wen, J. Hodgins, and C. Wu, “Modeling clothing as a separate layer for an animatable human avatar,”ACM T rans. Graph., vol. 40, no. 6, pp. 1–15, 2021

2021
[25]

Detailed human avatars from monocular video,

T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll, “Detailed human avatars from monocular video,” inInternational Conference on 3D Vision, Sep. 2018, pp. 98–109

2018
[26]

Learning to reconstruct people in clothing from a single RGB camera,

T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons- Moll, “Learning to reconstruct people in clothing from a single RGB camera,” inIEEE Conf. Comput. Vis. Pattern Recog., Jun. 2019, pp. 1175–1186

2019
[27]

Deepcap: Monocular human performance capture using weak supervision,

M. Habermann, W. Xu, M. Zollhofer, G. Pons-Moll, and C. Theobalt, “Deepcap: Monocular human performance capture using weak supervision,” inIEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 5052–5063

2020
[28]

Livecap: Real-time human performance capture from monocular video,

M. Habermann, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt, “Livecap: Real-time human performance capture from monocular video,”ACM T ransactions On Graphics (TOG), vol. 38, no. 2, pp. 1–17, 2019

2019
[29]

ECON: Explicit Clothed humans Optimized via Normal integration,

Y. Xiu, J. Yang, X. Cao, D. Tzionas, and M. J. Black, “ECON: Explicit Clothed humans Optimized via Normal integration,” inIEEE Conf. Comput. Vis. Pattern Recog., June 2023

2023
[30]

Sifu: Side-view conditioned im- plicit function for real-world usable clothed human reconstruction,

Z. Zhang, Z. Yang, and Y. Yang, “Sifu: Side-view conditioned im- plicit function for real-world usable clothed human reconstruction,” inIEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 9936–9947

2024
[31]

Gstar: Gaussian surface tracking and reconstruction,

C. Zheng, L. Xue, J. Zarate, and J. Song, “Gstar: Gaussian surface tracking and reconstruction,”arXiv preprint arXiv:2501.10283, 2025

work page arXiv 2025
[32]

Registering explicit to implicit: Towards high-fidelity garment mesh reconstruction from single images,

H. Zhu, L. Qiu, Y. Qiu, and X. Han, “Registering explicit to implicit: Towards high-fidelity garment mesh reconstruction from single images,” inIEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 3845– 3854

2022
[33]

Neural human performer: Learning generalizable radiance fields for human performance rendering,

Y. Kwon, D. Kim, D. Ceylan, and H. Fuchs, “Neural human performer: Learning generalizable radiance fields for human performance rendering,”Adv. Neural Inform. Process. Syst., 2021

2021
[34]

Ibrnet: Learning multi-view image-based rendering,

Q. Wang, Z. Wang, K. Genova, P . Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser, “Ibrnet: Learning multi-view image-based rendering,” inIEEE Conf. Comput. Vis. Pattern Recog., 2021

2021
[35]

Drivable volumetric avatars using texel-aligned features,

E. Remelli, T. M. Bagautdinov, S. Saito, C. Wu, T. Simon, S. Wei, K. Guo, Z. Cao, F. Prada, J. M. Saragih, and Y. Sheikh, “Drivable volumetric avatars using texel-aligned features,” inSIGGRAPH (Conference Paper T rack), 2022, pp. 56:1–56:9

2022
[36]

Holoported characters: Real-time free-viewpoint rendering of humans from sparse rgb cameras,

A. Shetty, M. Habermann, G. Sun, D. Luvizon, V . Golyanik, and C. Theobalt, “Holoported characters: Real-time free-viewpoint rendering of humans from sparse rgb cameras,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1206–1215

2024
[37]

Metacap: Meta-learning priors from multi-view imagery for sparse- view human performance capture and rendering,

G. Sun, R. Dabral, P . Fua, C. Theobalt, and M. Habermann, “Metacap: Meta-learning priors from multi-view imagery for sparse- view human performance capture and rendering,” inECCV, 2024

2024
[38]

Real-time free-view human rendering from sparse-view rgb videos using double unprojected textures,

G. Sun, R. Dabral, H. Zhu, P . Fua, C. Theobalt, and M. Habermann, “Real-time free-view human rendering from sparse-view rgb videos using double unprojected textures,” June 2025

2025
[39]

Giga: Generalizable sparse image-driven gaussian humans,

A. Zubekhin, H. Zhu, P . Gotardo, T. Beeler, M. Habermann, and C. Theobalt, “Giga: Generalizable sparse image-driven gaussian humans,”arXiv, 2025

2025
[40]

Blender,

Blender Foundation, “Blender,” 2025. [Online]. Available: https://www.blender.org

2025
[41]

Video- based reconstruction of animatable human characters,

C. Stoll, J. Gall, E. De Aguiar, S. Thrun, and C. Theobalt, “Video- based reconstruction of animatable human characters,”TOG, vol. 29, no. 6, pp. 1–10, 2010

2010
[42]

Drape: Dressing any person,

P . Guan, L. Reiss, D. A. Hirshberg, A. Weiss, and M. J. Black, “Drape: Dressing any person,”TOG, vol. 31, no. 4, pp. 1–10, 2012

2012
[43]

Video-based characters: creating new human performances from a multi-view video database,

F. Xu, Y. Liu, C. Stoll, J. Tompkin, G. Bharaj, Q. Dai, H.-P . Seidel, J. Kautz, and C. Theobalt, “Video-based characters: creating new human performances from a multi-view video database,” inACM SIGGRAPH 2011 papers, 2011, pp. 1–10

2011
[44]

4d video textures for interactive character appearance,

D. Casas, M. Volino, J. Collomosse, and A. Hilton, “4d video textures for interactive character appearance,”Comput. Graph. Forum, vol. 33, no. 2, p. 371–380, May 2014

2014
[45]

Textured neural avatars,

A. Shysheya, E. Zakharov, K.-A. Aliev, R. Bashirov, E. Burkov, K. Iskakov, A. Ivakhnenko, Y. Malkov, I. Pasechnik, D. Ulyanov et al., “Textured neural avatars,” inIEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 2387–2397

2019
[46]

Driving-signal aware full-body avatars,

T. Bagautdinov, C. Wu, T. Simon, F. Prada, T. Shiratori, S.-E. Wei, W. Xu, Y. Sheikh, and J. Saragih, “Driving-signal aware full-body avatars,”ACM T ransactions on Graphics (TOG), vol. 40, no. 4, pp. 1–17, 2021

2021
[47]

Dressing avatars: Deep photorealistic appearance for physically simulated clothing,

D. Xiang, T. Bagautdinov, T. Stuyck, F. Prada, J. Romero, W. Xu, S. Saito, J. Guo, B. Smith, T. Shiratoriet al., “Dressing avatars: Deep photorealistic appearance for physically simulated clothing,”ACM T rans. Graph., vol. 41, no. 6, pp. 1–15, 2022

2022
[48]

Real-time deep dynamic characters,

M. Habermann, L. Liu, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt, “Real-time deep dynamic characters,”ACM T rans. Graph., vol. 40, no. 4, aug 2021

2021
[49]

Embedded deformation for shape manipulation,

R. W. Sumner, J. Schmid, and M. Pauly, “Embedded deformation for shape manipulation,”ACM T rans. Graph., vol. 26, no. 3, p. 80–es, jul 2007

2007
[50]

Meshavatar: Learning high-quality triangular human avatars from multi-view videos,

Y. Chen, Z. Zheng, Z. Li, C. Xu, and Y. Liu, “Meshavatar: Learning high-quality triangular human avatars from multi-view videos,” in Eur. Conf. Comput. Vis.Springer, 2024, pp. 250–269

2024
[51]

Neus: learning neural implicit surfaces by volume rendering for multi-view reconstruction,

P . Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “Neus: learning neural implicit surfaces by volume rendering for multi-view reconstruction,” inProceedings of the 35th International Conference on Neural Information Processing Systems, 2021, pp. 27 171– 27 183. 15

2021
[52]

SMPL: A skinned multi-person linear model,

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: A skinned multi-person linear model,”ACM T rans. Graphics (Proc. SIGGRAPH Asia), vol. 34, no. 6, pp. 248:1–248:16, Oct 2015

2015
[53]

Star: Sparse trained articulated human body regressor,

A. Osman, T. Bolkart, and M. J. Black, “Star: Sparse trained articulated human body regressor,” inEur. Conf. Comput. Vis., 2020, pp. 598–613

2020
[54]

Total capture: A 3d deformation model for tracking faces, hands, and bodies,

H. Joo, T. Simon, and Y. Sheikh, “Total capture: A 3d deformation model for tracking faces, hands, and bodies,” inIEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 8320–8329

2018
[55]

Neural actor: Neural free-view synthesis of human actors with pose control,

L. Liu, M. Habermann, V . Rudnev, K. Sarkar, J. Gu, and C. Theobalt, “Neural actor: Neural free-view synthesis of human actors with pose control,”ACM T rans. Graph.(ACM SIGGRAPH Asia), 2021

2021
[56]

Animatable neural radiance fields for modeling dynamic human bodies,

S. Peng, J. Dong, Q. Wang, S. Zhang, Q. Shuai, X. Zhou, and H. Bao, “Animatable neural radiance fields for modeling dynamic human bodies,” inInt. Conf. Comput. Vis., 2021, pp. 14 314–14 323

2021
[57]

H-nerf: Neural radiance fields for rendering and temporal reconstruction of humans in motion,

H. Xu, T. Alldieck, and C. Sminchisescu, “H-nerf: Neural radiance fields for rendering and temporal reconstruction of humans in motion,”Adv. Neural Inform. Process. Syst., vol. 34, pp. 14 955–14 966, 2021

2021
[58]

Neural novel actor: Learning a generalized animatable neural representation for human actors,

Q. Gao, Y. Wang, L. Liu, L. Liu, C. Theobalt, and B. Chen, “Neural novel actor: Learning a generalized animatable neural representation for human actors,”IEEE T rans. Vis. Comput. Graph., 2023

2023
[59]

Avatarrex: Real- time expressive full-body avatars,

Z. Zheng, X. Zhao, H. Zhang, B. Liu, and Y. Liu, “Avatarrex: Real- time expressive full-body avatars,”ACM T rans. Graph., vol. 42, no. 4, 2023

2023
[60]

Deliffas: Deformable light fields for fast avatar synthesis,

Y. Kwon, L. Liu, H. Fuchs, M. Habermann, and C. Theobalt, “Deliffas: Deformable light fields for fast avatar synthesis,”Adv. Neural Inform. Process. Syst., 2023

2023
[61]

Trihuman: A real-time and controllable tri-plane representation for de- tailed human geometry and appearance synthesis,

H. Zhu, F. Zhan, C. Theobalt, and M. Habermann, “Trihuman: A real-time and controllable tri-plane representation for de- tailed human geometry and appearance synthesis,”arXiv preprint arXiv:2312.05161, 2023

work page arXiv 2023
[62]

Efficient geometry-aware 3D generative adversarial networks,

E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. D. Mello, O. Gallo, L. Guibas, J. Tremblay, S. Khamis, T. Karras, and G. Wetzstein, “Efficient geometry-aware 3D generative adversarial networks,” inCVPR, 2022

2022
[63]

Scale: Modeling clothed humans with a surface codec of articulated local elements,

Q. Ma, S. Saito, J. Yang, S. Tang, and M. J. Black, “Scale: Modeling clothed humans with a surface codec of articulated local elements,” inCVPR, 2021, pp. 16 082–16 093

2021
[64]

The power of points for modeling humans in clothing,

Q. Ma, J. Yang, S. Tang, and M. J. Black, “The power of points for modeling humans in clothing,” inICCV, 2021, pp. 10 974–10 984

2021
[65]

Learning implicit templates for point-based clothed human modeling,

S. Lin, H. Zhang, Z. Zheng, R. Shao, and Y. Liu, “Learning implicit templates for point-based clothed human modeling,” inECCV. Springer, 2022, pp. 210–228

2022
[66]

Smpl: A skinned multi-person linear model,

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “Smpl: A skinned multi-person linear model,”ACM T ransactions on Graphics, vol. 34, no. 6, 2015

2015
[67]

Gart: Gaussian articulated template models,

J. Lei, Y. Wang, G. Pavlakos, L. Liu, and K. Daniilidis, “Gart: Gaussian articulated template models,” inCVPR, 2024

2024
[68]

3dgs- avatar: Animatable avatars via deformable 3d gaussian splatting,

Z. Qian, S. Wang, M. Mihajlovic, A. Geiger, and S. Tang, “3dgs- avatar: Animatable avatars via deformable 3d gaussian splatting,” inCVPR, 2024

2024
[69]

Gauhuman: Articulated gaussian splatting from monocular human videos,

S. Hu and Z. Liu, “Gauhuman: Articulated gaussian splatting from monocular human videos,” inCVPR, 2024

2024
[70]

Hugs: Human gaussian splats,

M. Kocabas, J.-H. R. Chang, J. Gabriel, O. Tuzel, and A. Ranjan, “Hugs: Human gaussian splats,” inCVPR, 2024

2024
[71]

Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians,

L. Hu, H. Zhang, Y. Zhang, B. Zhou, B. Liu, S. Zhang, and L. Nie, “Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians,” inCVPR, 2024

2024
[72]

Pixel codec avatars,

S. Ma, T. Simon, J. Saragih, D. Wang, Y. Li, F. De la Torre, and Y. Sheikh, “Pixel codec avatars,” inCVPR, June 2021, pp. 64–73

2021
[73]

Morf: Mobile realistic fullbody avatars from a monocular video,

R. Bashirov, A. Larionov, E. Ustinova, M. Sidorenko, D. Svitov, I. Zakharkin, and V . Lempitsky, “Morf: Mobile realistic fullbody avatars from a monocular video,” inCVPR, 2024, pp. 3545–3555

2024
[74]

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,

Z. Shao, Z. Wang, Z. Li, D. Wang, X. Lin, Y. Zhang, M. Fan, and Z. Wang, “SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,” inComputer Vision and Pattern Recognition (CVPR), 2024

2024
[75]

The Captury,

TheCaptury, “The Captury,” http://www.thecaptury.com/, 2020

2020
[76]

Skinning with dual quaternions,

L. Kavan, S. Collins, J. Žára, and C. O’Sullivan, “Skinning with dual quaternions,” inProceedings of the 2007 symposium on Interactive 3D graphics and games, 2007, pp. 39–46

2007
[77]

U-net: Convolutional networks for biomedical image segmentation,

O. Ronneberger, P . Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inMedical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, Oct 5-9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241

2015
[78]

Image inpainting via generative multi-column convolutional neural networks,

Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, “Image inpainting via generative multi-column convolutional neural networks,”Advances in neural information processing systems, vol. 31, 2018

2018
[79]

Principal components analysis (pca),

A. Ma´ ckiewicz and W. Ratajczak, “Principal components analysis (pca),”Computers & Geosciences, vol. 19, no. 3, pp. 303–342, 1993

1993
[80]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y. Shen, P . Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.”Iclr, vol. 1, no. 2, p. 3, 2022

2022

Showing first 80 references.