PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction

Baoyuan Wang; Guanying Chen; Lingteng Qiu; Shuguang Cui; Xianggang Yu; Xiaoguang Han; Xiaohang Ren

arxiv: 2604.13918 · v1 · submitted 2026-04-15 · 💻 cs.CV

Xianggang Yu, Lingteng Qiu, Xiaohang Ren, Guanying Chen, Shuguang Cui, Xiaoguang Han, Baoyuan Wang

Pith reviewed 2026-05-10 13:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords neural radiance fields · facial avatar reconstruction · animatable avatars · part-based deformation · inverse skinning · monocular video · facial motion modeling · implicit representations

The pith

A part-based deformation field using multiple local MLPs allows neural radiance fields to reconstruct animatable facial avatars that generalize to unseen expressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build animatable 3D face models from ordinary video by first warping observed points into a standard shape using a head template and then applying a deformation model that treats each facial region separately. Instead of one network for the whole face, it uses several small networks whose outputs are blended, so each part can move in its own way. This lets the model handle new expressions it never saw during training and pick up small movements like skin wrinkles that global methods miss. A reader would care because it points to a practical way to create realistic moving digital faces without expensive equipment or lots of data.

Core claim

The authors establish that applying inverse skinning from a parametric head model to map points to canonical space, followed by a part-based deformation field composed of multiple local MLPs that partition the space adaptively and aggregate deformations via soft-weighting, enables the neural radiance field to model fine-scale facial motions and generalize to unseen expressions, outperforming prior methods.
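The first step of this claim, inverse skinning, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the skinning weights of the posed point are already known (in practice they would have to be looked up from the head model), it uses two hypothetical bones that only rotate about the z axis, and every function name below is illustrative.

```python
import math

def rigid_z(theta, t):
    """Hypothetical bone: 3x4 transform, rotation by theta about z plus translation t."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, t[0]],
            [s, c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]]]

def blend_transforms(transforms, weights):
    """Linear blend skinning: weighted sum of the per-bone 3x4 transforms."""
    return [[sum(w * T[r][c] for T, w in zip(transforms, weights))
             for c in range(4)] for r in range(3)]

def transform_point(T, x):
    """Forward skinning: canonical point -> posed point."""
    return [sum(T[r][c] * x[c] for c in range(3)) + T[r][3] for r in range(3)]

def inverse_3x3(A):
    """Invert a 3x3 matrix with the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = A
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("blended transform is singular")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def inverse_lbs(T, x_posed):
    """Inverse skinning: posed point -> canonical point, given the blended transform."""
    A_inv = inverse_3x3([row[:3] for row in T])
    d = [x_posed[r] - T[r][3] for r in range(3)]
    return [sum(A_inv[r][c] * d[c] for c in range(3)) for r in range(3)]

# Round trip with made-up bones and weights (assumptions, not FLAME values).
bones = [rigid_z(0.0, (0.0, 0.0, 0.0)), rigid_z(0.5, (0.1, 0.0, 0.0))]
T = blend_transforms(bones, [0.3, 0.7])
x_canonical = [1.0, 2.0, 3.0]
x_posed = transform_point(T, x_canonical)
x_recovered = inverse_lbs(T, x_posed)
```

The round trip is exact here only because the same weights are used in both directions. With weights estimated for an unseen expression, the recovered point can differ from the true canonical point, which is the residual the load-bearing premise is concerned with.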

What carries the argument

The part-based deformation field, which consists of multiple local MLPs that adaptively partition the canonical space into different facial parts, with the deformation of each point computed by soft-weighting the predictions from all local MLPs.
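The soft-weighting step can be sketched as follows. This is a toy illustration under stated assumptions: the part scores are passed in directly (in the real model a part assigner network would produce them), a softmax is assumed as the way scores become probabilities, and the three lambdas are hypothetical stand-ins for the local MLPs.

```python
import math

def softmax(scores):
    """Turn raw part scores into probabilities that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_deformation(x_canonical, part_scores, local_deformers):
    """Blend the offsets predicted by every local deformer using soft part weights."""
    probs = softmax(part_scores)
    offsets = [deform(x_canonical) for deform in local_deformers]
    delta = [sum(p * o[axis] for p, o in zip(probs, offsets)) for axis in range(3)]
    return probs, delta

# Hypothetical stand-ins for three local MLPs; each returns a 3D offset.
local_deformers = [
    lambda x: [0.01, 0.0, 0.0],
    lambda x: [0.0, 0.02 * x[1], 0.0],
    lambda x: [0.0, 0.0, -0.01],
]

probs, delta = aggregate_deformation([0.1, 0.5, 0.2], [2.0, 0.0, -1.0], local_deformers)
x_deformed = [c + d for c, d in zip([0.1, 0.5, 0.2], delta)]
```

Because every deformer contributes in proportion to its probability, a point near a part boundary receives a blend of neighbouring deformations rather than a hard switch between them.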

Load-bearing premise

The parametric head model must supply accurate inverse skinning that maps every observed point into a single canonical space without leaving residual errors.

What would settle it

Applying the method to a video with extreme unseen expressions or very fine motions such as eyelid wrinkles, and finding visible artifacts or a failure to match ground-truth geometry, would disprove the generalization claim.

Figures

Figures reproduced from arXiv: 2604.13918 by Baoyuan Wang, Guanying Chen, Lingteng Qiu, Shuguang Cui, Xianggang Yu, Xiaoguang Han, Xiaohang Ren.

**Figure 2.** Overview of PartNerFace. Our method represents the facial avatar as a neural radiance field in the canonical space. Given a 3D point in the posed space, it is first transformed to the canonical space through inverse LBS based on the FLAME model, then followed by a part assigner which predicts the probability of its part ascription. This probability is used to aggregate the output from a set of local defor…

**Figure 3.** Qualitative comparison on the reconstruction of Testset. Some regions are zoomed in for detailed comparisons. … cause capacity degradation in local deformation networks, i.e., the space partition produced by the part assigner is less meaningful, and the deformation field will be dominated by some of the local networks. To solve this problem, we adopt a two-stage optimization strategy. We first utilize a heur…

**Figure 4.** Visual comparison of animation results with unseen poses and expressions. … camera at a resolution of 1920×1080 pixels. The images are cropped and resized to 512×512. The videos include subjects engaging in normal conversations, such as smiling and head rotations. For each video, we use the last 1000 frames for testing and the remaining frames (roughly 5000) for training, following the practice of (Gafni e…

**Figure 5.** Qualitative comparison of ablation studies. Columns 1-5: visual comparison between different variants and the ground truth. Columns 6-8: the visualized face partitions on the learned surface. We use separate colors for each part and use the brightness to indicate the probability of surface points belonging to this part (for hard part assignment, its probability is all ones).
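The partition coloring described in the Figure 5 caption can be sketched as below. The palette values and the choice to color each point by its most probable part are assumptions made for illustration; the caption only states that each part has its own color and that brightness encodes probability.

```python
def part_color(probs, palette):
    """Color a surface point by its most likely part, dimmed by that part's probability."""
    if len(probs) != len(palette):
        raise ValueError("need one palette color per part")
    part = max(range(len(probs)), key=lambda k: probs[k])
    brightness = probs[part]
    color = tuple(int(channel * brightness) for channel in palette[part])
    return part, color

# Hypothetical three-part palette in 8-bit RGB.
palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]

soft_part, soft_color = part_color([0.1, 0.8, 0.1], palette)  # soft assignment
hard_part, hard_color = part_color([0.0, 1.0, 0.0], palette)  # hard assignment
```

With a hard assignment the probability is one, so the point is drawn at full brightness, which matches the caption's remark about the hard-assignment variant.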

Original abstract

We present PartNerFace, a part-based neural radiance fields approach, for reconstructing animatable facial avatar from monocular RGB videos. Existing solutions either simply condition the implicit network with the morphable model parameters or learn an imaginary canonical radiance field, making them fail to generalize to unseen facial expressions and capture fine-scale motion details. To address these challenges, we first apply inverse skinning based on a parametric head model to map an observed point to the canonical space, and then model fine-scale motions with a part-based deformation field. Our key insight is that the deformation of different facial parts should be modeled differently. Specifically, our part-based deformation field consists of multiple local MLPs to adaptively partition the canonical space into different parts, where the deformation of a 3D point is computed by aggregating the prediction of all local MLPs by a soft-weighting mechanism. Extensive experiments demonstrate that our method generalizes well to unseen expressions and is capable of modeling fine-scale facial motions, outperforming state-of-the-art methods both quantitatively and qualitatively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PartNerFace maps points to canonical space with parametric inverse skinning, then deforms them via a soft-weighted set of local MLPs, but the whole approach rests on the skinning step staying accurate for unseen expressions.

The main thing here is a NeRF face avatar that first applies inverse skinning from a parametric head model to send observed points into canonical space, then models deformation with several local MLPs whose outputs are blended by soft weights so each facial region can move on its own. The claim is that this handles new expressions and fine motions better than just conditioning one network on parameters or learning a single canonical field. That combination is the concrete step the abstract puts forward.

Earlier avatar NeRFs often blur small details because they treat the whole face uniformly, so breaking the deformation into parts and letting them aggregate softly is a direct response to that problem. The paper lays out the motivation cleanly and describes the architecture without unnecessary layers. The part-based field is a reasonable engineering choice once you accept the canonical-space premise.

The soft spot is exactly the one the stress-test note flags. If the parametric model leaves any misalignment when it skins points from expressions outside the training set, those errors land in the wrong canonical locations before the local MLPs ever see them, and soft weighting cannot undo geometric mistakes at that stage. The abstract gives no sign of residual correction, uncertainty on the skinning, or experiments that measure how much error the inverse skinning actually introduces on held-out expressions. Without those checks the generalization and fine-motion results sit on an assumption that is known to be fragile in practice. The stated outperformance over prior methods is asserted but not quantified in what is visible, so the size of any real gain is still open.

This is the sort of targeted follow-up that groups already working on video-driven face avatars will want to look at. Readers who know FLAME-style models and the standard NeRF avatar baselines will get the most from it and can decide quickly whether the local-MLP trick is worth porting. It deserves a serious referee because the problem is well-defined, the architecture is concrete, and the experiments can be checked directly once the full numbers and ablations are in front of people.

Referee Report

2 major / 2 minor

Summary. The paper introduces PartNerFace for reconstructing animatable facial avatars from monocular RGB videos. It first applies inverse skinning via a parametric head model (e.g., FLAME) to map observed points into a canonical space, then models fine-scale motions using a part-based deformation field composed of multiple local MLPs whose outputs are aggregated by a soft-weighting mechanism. The central claims are improved generalization to unseen expressions and superior capture of fine facial details relative to prior NeRF-based methods that either condition directly on morphable parameters or learn a single global canonical field.

Significance. If the quantitative and qualitative results hold, the part-based deformation approach would represent a useful incremental advance in animatable facial NeRF avatars by explicitly decomposing deformation modeling across facial regions. The soft-weighting aggregation is a natural and lightweight extension of existing local deformation techniques, and the method's reliance on established parametric skinning makes it relatively easy to reproduce.

major comments (2)

§3.1 (Inverse Skinning): The generalization claim to unseen expressions rests on the assumption that the parametric head model's inverse skinning maps every observed point to a single, consistent canonical space without residuals large enough to affect fine-motion modeling. No error analysis, uncertainty modeling, or residual-correction term is described; any misalignment introduced at this stage cannot be corrected by the subsequent soft-weighted local MLPs.
§4 (Experiments): The abstract asserts quantitative and qualitative outperformance on unseen expressions and fine-scale motions, yet the provided text does not include the specific metrics, ablation studies on the number of local MLPs, or comparisons isolating the contribution of the part-based field versus the skinning stage. Without these, it is impossible to verify that the claimed improvements are not artifacts of the parametric model alone.
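The error analysis requested for §3.1 could start from something as simple as the sketch below. It assumes that reference canonical positions exist for a set of tracked points on held-out frames (for example from a registered mesh); that availability and the function name are assumptions, and nothing here is taken from the paper.

```python
import math

def skinning_residuals(mapped_points, reference_points):
    """Mean and max Euclidean distance between inverse-skinned and reference points."""
    if len(mapped_points) != len(reference_points) or not mapped_points:
        raise ValueError("need two non-empty point lists of equal length")
    dists = [math.dist(p, q) for p, q in zip(mapped_points, reference_points)]
    return {"mean": sum(dists) / len(dists), "max": max(dists)}

# Made-up points: the second one is off by 3 mm along z (units assumed to be metres).
mapped = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.003)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
stats = skinning_residuals(mapped, reference)
```

Reporting these statistics separately for training and held-out expressions would show how much of the error budget is spent before the local MLPs see a point.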

minor comments (2)

§3.2: Notation for the soft-weighting coefficients (Eq. 3 or equivalent) should be defined explicitly before first use and related to the part-partitioning loss if one exists.
Figure 3 (qualitative results) would benefit from side-by-side error maps or zoomed insets highlighting the fine-scale motions claimed to be recovered.
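For the notation point on §3.2, one possible explicit form is given below. All symbols are assumptions of this report and not the paper's own notation: $x_c$ is a canonical-space point, $e$ the expression and pose code, $s_k$ the part assigner's score for part $k$, $D_k$ the $k$-th of $K$ local deformation MLPs, and the softmax normalization is likewise assumed.

```latex
w_k(x_c) = \frac{\exp\big(s_k(x_c)\big)}{\sum_{j=1}^{K} \exp\big(s_j(x_c)\big)},
\qquad
\Delta x(x_c, e) = \sum_{k=1}^{K} w_k(x_c)\, D_k(x_c, e),
\qquad
\sum_{k=1}^{K} w_k(x_c) = 1 .
```

Under this form, hard part assignment is the special case in which one $w_k$ equals one and the others are zero.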

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made.

Referee: §3.1 (Inverse Skinning): The generalization claim to unseen expressions rests on the assumption that the parametric head model's inverse skinning maps every observed point to a single, consistent canonical space without residuals large enough to affect fine-motion modeling. No error analysis, uncertainty modeling, or residual-correction term is described; any misalignment introduced at this stage cannot be corrected by the subsequent soft-weighted local MLPs.

Authors: We agree that the inverse skinning step from FLAME is a key assumption underlying generalization to unseen expressions. The part-based deformation field with local MLPs and soft-weighting is specifically designed to model and compensate for fine-scale residuals and deviations from the parametric mapping in canonical space. We will add a dedicated paragraph in Section 3.1 discussing potential skinning inaccuracies and how the subsequent deformation stage mitigates them, along with qualitative visualizations of residual corrections. A full quantitative error analysis of the skinning alone is not currently available from our experiments, so this is a partial revision focused on clarification and supporting evidence.

Revision: partial

Referee: §4 (Experiments): The abstract asserts quantitative and qualitative outperformance on unseen expressions and fine-scale motions, yet the provided text does not include the specific metrics, ablation studies on the number of local MLPs, or comparisons isolating the contribution of the part-based field versus the skinning stage. Without these, it is impossible to verify that the claimed improvements are not artifacts of the parametric model alone.

Authors: Section 4 of the full manuscript reports quantitative results using PSNR, SSIM, and LPIPS on unseen expressions, with qualitative comparisons to prior NeRF-based methods. We will expand this section to explicitly include an ablation table on the number of local MLPs (testing 4, 8, and 16 parts) and a direct comparison isolating the part-based deformation field against a baseline that uses only the inverse skinning without the local MLPs. These additions will clarify that the improvements stem from the part-based modeling rather than the parametric model alone.

Revision: yes
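Of the three metrics named in the response, PSNR is simple enough to sketch in a few lines. The version below assumes images are flat lists of floats in the range [0, 1]; SSIM and LPIPS are omitted because they need windowed statistics and a pretrained network respectively.

```python
import math

def psnr(image_a, image_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images given as flat float lists."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Made-up four-pixel images that differ by 0.1 everywhere (MSE = 0.01).
rendered = [0.1, 0.1, 0.1, 0.1]
ground_truth = [0.0, 0.0, 0.0, 0.0]
score = psnr(rendered, ground_truth)
```

Higher PSNR means a closer match. For the ablation the authors propose, the same function would be evaluated per held-out frame and averaged for each variant.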

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper applies an external parametric head model for inverse skinning to map points into canonical space, then introduces a novel part-based deformation field consisting of multiple local MLPs aggregated via soft-weighting. This structure is presented as an architectural choice rather than a self-referential definition or a fitted parameter renamed as a prediction. Generalization claims to unseen expressions rest on empirical experiments comparing against baselines, not on quantities defined solely by the paper's own fitted values or self-citations. No load-bearing self-citation chains, ansatz smuggling, or uniqueness theorems imported from prior author work are evident in the derivation. The central performance claims therefore remain independent of the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The method rests on standard NeRF and parametric head model assumptions plus the new part-based deformation component; no free parameters are explicitly fitted in the abstract description.

axioms (2)

domain assumption: A parametric head model supplies reliable inverse skinning that maps observed points into a consistent canonical space.
Invoked to enable the subsequent deformation modeling step.
standard math: Neural radiance fields can represent view-dependent facial appearance once geometry is correctly deformed.
Underlying representation used after deformation.

invented entities (1)

Part-based deformation field consisting of multiple local MLPs with soft-weighting (no independent evidence)
purpose: To adaptively model fine-scale motions differently across facial regions.
Introduced to overcome limitations of single canonical field or direct parameter conditioning.

pith-pipeline@v0.9.0 · 5505 in / 1433 out tokens · 40832 ms · 2026-05-10T13:50:52.482990+00:00 · methodology
