Generalizable Human Gaussian Splatting via Multi-view Semantic Consistency

Jingi Kim; Wonjun Kim

arxiv: 2604.25466 · v1 · submitted 2026-04-28 · 💻 cs.CV

Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3

keywords: generalizable human rendering · 3D Gaussian splatting · sparse-view inputs · multi-view consistency · cross-view attention · depth unprojection · body part alignment

The pith

Unprojecting multi-view latent embeddings into shared 3D space with cross-view attention improves 3D Gaussian localization for sparse-view human rendering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix inconsistent feature representations that arise when rendering humans from only a few camera views. Existing approaches use geometric constraints or fixed body models, yet still produce errors in textured areas and occluded limbs because features from different views do not align reliably. The proposed method encodes each view into latent embeddings, projects those embeddings into a common 3D volume using predicted depth, and then uses cross-view attention to group and recalibrate embeddings that belong to the same body part. If this alignment step succeeds, the resulting 3D Gaussians are placed more accurately, yielding higher-quality novel-view images without requiring dense input views or hand-crafted skeletons.
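The depth-based lifting step described above can be sketched as follows. This is a minimal illustration, assuming a pinhole camera, depth measured along the optical axis, and a 3x4 camera-to-world pose; the function names, the two cameras, and the pixel values are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]
Pose = Sequence[Sequence[float]]  # 3x4 [R | t], camera-to-world (assumed convention)


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float,
                    cam_to_world: Pose) -> Point:
    """Lift pixel (u, v) with z-depth `depth` to a world-space 3D point."""
    # Pinhole back-projection into camera coordinates.
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigid transform into the shared world frame: R @ cam + t.
    x, y, z = (sum(row[i] * cam[i] for i in range(3)) + row[3]
               for row in cam_to_world)
    return (x, y, z)


def unproject_view(depth_map, embeddings, fx, fy, cx, cy, cam_to_world):
    """Pair every pixel's embedding with its unprojected 3D position."""
    lifted = []
    for v, depth_row in enumerate(depth_map):
        for u, depth in enumerate(depth_row):
            point = unproject_pixel(u, v, depth, fx, fy, cx, cy, cam_to_world)
            lifted.append((point, embeddings[v][u]))
    return lifted


# Two hypothetical cameras: A at the origin, B shifted 1 m along world x.
CAM_A = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
CAM_B = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

# The same surface point, seen at different pixels, lands at one 3D location.
point_a = unproject_pixel(75, 50, 2.0, 100.0, 100.0, 50.0, 50.0, CAM_A)
point_b = unproject_pixel(25, 50, 2.0, 100.0, 100.0, 50.0, 50.0, CAM_B)
print(point_a, point_b)
```

With exact depth, both views place the point at the same world position, which is the property the shared 3D space relies on. Any error in the predicted depth breaks that agreement.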

Core claim

The central claim is that unprojecting the latent embeddings encoded from each viewpoint into a shared 3D space via predicted depth maps, and then recalibrating the embeddings that belong to the same body part with cross-view attention, resolves spatial ambiguity in highly textured regions and occluded body parts, thereby producing more accurate 3D Gaussian placements for generalizable human rendering from sparse inputs.

What carries the argument

Multi-view semantic consistency module that unprojects per-view latent embeddings via predicted depth maps into 3D and applies cross-view attention to re-align features of the same body part.
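As a rough illustration of the attention half of this module, the sketch below runs single-head scaled dot-product attention from the embeddings of one view (queries) to those of another view (keys and values) and adds the result back as a residual. It assumes no learned projections and no positional terms, so it is a generic stand-in and not the paper's architecture; all names and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(
    queries: Sequence[Vector], keys: Sequence[Vector], values: Sequence[Vector]
) -> Tuple[List[List[float]], List[List[float]]]:
    """Update each query embedding with a weighted mix of another view's values."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    updated, all_weights = [], []
    for q in queries:
        # Similarity between this embedding and every embedding in the other view.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended = [sum(w * val[c] for w, val in zip(weights, values))
                    for c in range(len(values[0]))]
        # Residual update: keep the original embedding, add cross-view context.
        updated.append([a + b for a, b in zip(q, attended)])
        all_weights.append(weights)
    return updated, all_weights


# Hypothetical 2-D embeddings: one from view A, two from view B.
view_a = [[1.0, 0.0]]
view_b = [[4.0, 0.0], [0.0, 4.0]]
recalibrated, weights = cross_view_attention(view_a, view_b, view_b)
print(weights[0])
```

The view-A embedding puts most of its weight on the view-B embedding that points in the same direction, which is the grouping behaviour the module depends on to match features of the same body part.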

If this is right

Accurate 3D Gaussian placement becomes possible without explicit skeleton fitting or dense geometric supervision.
Rendering quality on benchmark human datasets improves for novel views when only a few input images are available.
The same unprojection-plus-attention pattern can be inserted into other generalizable Gaussian splatting pipelines that currently rely on per-view features alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If depth prediction quality continues to improve, this style of semantic recalibration could extend to non-human dynamic scenes such as animals or deformable objects.
The method implicitly trades reliance on explicit geometry for reliance on learned attention; future work could measure how much depth accuracy is actually required before performance collapses.

Load-bearing premise

Predicted depth maps must be accurate enough for reliable unprojection and cross-view attention must correctly match features of the same body part even when articulations are complex and view overlap is limited.

What would settle it

A test set of sparse-view captures where an independent depth estimator produces large errors on textured clothing or self-occluded limbs; if Gaussian localization error rises sharply and rendering quality drops below baseline methods on those cases, the method's premise is falsified.
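To make the link between depth error and misplaced embeddings concrete, the sketch below computes how far an unprojected point moves for a given depth error. It assumes a pinhole camera and depth measured along the optical axis, so the point is depth times the back-projected ray and the shift is the depth error times the ray's length. The intrinsics and the 3 cm error are illustrative values, not measurements from the paper.

```python
import math


def displacement_from_depth_error(u: float, v: float, depth_error: float,
                                  fx: float, fy: float,
                                  cx: float, cy: float) -> float:
    """3D shift (same units as depth_error) caused by a z-depth error at pixel (u, v)."""
    # Back-projected ray with unit z component: K^-1 [u, v, 1]^T.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # X = depth * ray, so a depth error moves the point along the ray.
    return abs(depth_error) * math.sqrt(sum(c * c for c in ray))


# Hypothetical intrinsics: focal length 100 px, principal point (50, 50).
center = displacement_from_depth_error(50, 50, 0.03, 100.0, 100.0, 50.0, 50.0)
off_axis = displacement_from_depth_error(125, 50, 0.03, 100.0, 100.0, 50.0, 50.0)
print(center, off_axis)
```

Under these assumptions the shift is never smaller than the depth error itself and grows toward the image border, so limbs near the edge of the frame are displaced the most.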

Figures

Figures reproduced from arXiv: 2604.25466 by Jingi Kim, Wonjun Kim.

**Figure 1.** Examples of generalizable human Gaussian splatting.

**Figure 2.** Overall architecture of the proposed method. The VGGT […]

**Figure 3.** An example of cross-view attention to recalibrate latent …

**Figure 4.** Results of novel view synthesis via generalizable human Gaussian splatting on ZJU-Mocap […]

**Figure 5.** Results of novel view synthesis via generalizable human Gaussian splatting on the THuman2.0 […]

**Figure 6.** Results of novel view synthesis (top row) and the corre…

**Figure 8.** Novel-view rendering results on the THuman2.0 dataset.

Original abstract

Recently, generalizable human Gaussian splatting from sparse-view inputs has been actively studied for the photorealistic human rendering. Most existing methods rely on explicit geometric constraints or predefined structural representations to accurately position 3D Gaussians. Although these approaches have shown the remarkable progress in this field, they still suffer from inconsistent feature representations across multi-view inputs due to complex articulations of the human body and limited overlaps between different views. To address this problem, we propose a novel method to accurately localize 3D Gaussians and ultimately improve the quality of human rendering. The key idea is to unproject latent embeddings encoded from each viewpoint into a shared 3D space through predicted depth maps and recalibrate them belonging to the same body part based on cross-view attention. This helps the model resolve the spatial ambiguity occurring in highly textured regions as well as occluded body parts, thus leading to the accurate localization of 3D Gaussians. Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of generalizable human Gaussian splatting from sparse-view inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The depth-unprojection plus cross-view attention idea targets feature inconsistency across sparse human views, but the abstract gives no metrics, ablations, or error analysis, so the claims remain untested.

The core move here is to encode latents from each input view, lift them into a shared 3D space using predicted depth maps, and then run cross-view attention to pull features that belong to the same body part into better alignment before the Gaussians are placed. That is the concrete new piece relative to prior work that leans on explicit geometry or fixed body templates. It is a direct attempt to reduce the spatial ambiguity that shows up in textured skin or occluded limbs when views barely overlap.

The framing is clean and the mechanism is described at a level that makes sense on paper. It avoids adding more hand-crafted priors and instead tries to let attention do the recalibration. That is a reasonable direction for generalizable human splatting.

The soft spots are straightforward. The abstract states that benchmark experiments show improvement, yet it supplies no PSNR, SSIM, or LPIPS numbers, no ablation tables on the attention module or the depth step, and no analysis of how depth prediction noise affects the unprojected locations. The stress-test concern holds up from the description: depth errors of a few centimeters are common in sparse-view human depth estimation, and once the embeddings land in the wrong 3D spots, attention operating on mismatched positions has little chance of recovering the correct body-part matches. The paper would need to show either that the depths are reliable enough or that the attention is robust to realistic noise.

This is for researchers already working on human-specific Gaussian splatting or novel-view synthesis from limited inputs. A reader who is extending recent sparse-view methods might want to see the full pipeline and results to decide whether to build on it. I would send it to peer review because the problem is real, the proposed fix is focused, and the write-up is coherent enough for referees to evaluate once the missing quantitative evidence and robustness checks are added.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a method for generalizable human Gaussian splatting from sparse-view inputs. Latent embeddings are encoded per viewpoint, unprojected into a shared 3D space via predicted depth maps, and recalibrated for semantic consistency across body parts using cross-view attention. This is intended to resolve spatial ambiguities in textured regions and occluded parts, enabling more accurate 3D Gaussian localization and improved photorealistic rendering. The abstract states that experiments on benchmark datasets demonstrate performance improvements over prior approaches.

Significance. If the multi-view semantic consistency mechanism works as described, the approach could offer a useful alternative to explicit geometric constraints for handling complex human articulations and limited view overlaps in sparse-input rendering. This has potential value for applications in VR/AR and animation where high-quality human models must be generated from few cameras. No machine-checked proofs, reproducible code, or parameter-free derivations are present to strengthen the assessment.

major comments (2)

Abstract: The central claim that 'experimental results on benchmark datasets show that the proposed method efficiently improves the performance' is unsupported, as the manuscript text supplies no quantitative metrics, ablation results, implementation details, or error analysis. This is load-bearing for the paper's assertion of improvement in generalizable human Gaussian splatting.
Key idea (unprojection and cross-view attention paragraph): The construction unprojects 2D latent embeddings into 3D using predicted depth maps before applying cross-view attention for recalibration. No analysis of depth prediction error propagation or attention robustness under realistic depth noise is provided, despite depth errors of a few centimeters being common in sparse-view human depth estimation. This assumption is load-bearing because inaccurate initial 3D positions would prevent attention from correctly aligning features belonging to the same body part.

minor comments (1)

The abstract would be strengthened by briefly stating the specific benchmark datasets and the nature of the reported improvements (e.g., PSNR gains).
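For readers unfamiliar with the metric named in this comment, PSNR is computed from the mean squared error between a rendered image and its ground-truth reference. The sketch below is a minimal version, assuming pixel values are flattened into lists and scaled to [0, 1]; the sample values are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def psnr(rendered: Sequence[float], reference: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel images whose values differ by about 0.1 everywhere.
reference = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.4, 0.9, 0.35]
print(round(psnr(rendered, reference), 2))
```

A uniform error of 0.1 on a unit range gives a mean squared error of about 0.01, which corresponds to roughly 20 dB. Higher values indicate a closer match to the reference.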

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address each major comment point by point below, outlining honest revisions to strengthen the manuscript without overstating current content.

Referee: Abstract: The central claim that 'experimental results on benchmark datasets show that the proposed method efficiently improves the performance' is unsupported, as the manuscript text supplies no quantitative metrics, ablation results, implementation details, or error analysis. This is load-bearing for the paper's assertion of improvement in generalizable human Gaussian splatting.

Authors: We acknowledge that the abstract's claim is not directly supported by specific numbers or details in the current text. To resolve this, we will revise the abstract to incorporate key quantitative metrics from our experiments, such as PSNR, SSIM, and LPIPS improvements over baselines on the benchmark datasets. We will also ensure the results section explicitly presents the supporting metrics, ablations, and analysis so the claim is fully substantiated. Revision: yes.

Referee: Key idea (unprojection and cross-view attention paragraph): The construction unprojects 2D latent embeddings into 3D using predicted depth maps before applying cross-view attention for recalibration. No analysis of depth prediction error propagation or attention robustness under realistic depth noise is provided, despite depth errors of a few centimeters being common in sparse-view human depth estimation. This assumption is load-bearing because inaccurate initial 3D positions would prevent attention from correctly aligning features belonging to the same body part.

Authors: We agree this is a substantive gap, as the manuscript provides no dedicated analysis of depth error effects or robustness under noise. In the revision, we will add a new paragraph or subsection with sensitivity analysis, including experiments that inject realistic depth noise to evaluate how cross-view attention recalibrates features and maintains performance despite initial 3D localization inaccuracies. Revision: yes.

Circularity Check

0 steps flagged

No circularity: method is an independent architectural proposal

The paper presents a new pipeline that encodes per-view latents, unprojects them via predicted depth, and applies cross-view attention for semantic recalibration. No equation or claim reduces a target quantity to a fitted parameter or self-citation by construction. The central claim (improved 3D Gaussian localization) is justified by the proposed operations themselves rather than by re-deriving an input quantity or invoking an author-specific uniqueness theorem. The approach is therefore self-contained; any performance gain is an empirical question outside the logical chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, background axioms, or newly postulated entities; the method appears to extend existing neural rendering components without introducing new ones.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

[1] Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Wenjia Wang, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, et al. HuMMan: Multi-modal 4D human dataset for versatile sensing and modeling. In Proc. Eur. Conf. Comput. Vis., pages 557–577, 2022.
[2] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. In Proc. Eur. Conf. Comput. Vis., pages 333–350, 2022.
[3] Yushuo Chen, Zerong Zheng, Zhe Li, Chao Xu, and Yebin Liu. MeshAvatar: Learning high-quality triangular human avatars from multi-view videos. In Proc. Eur. Conf. Comput. Vis., pages 250–269, 2024.
[4] Zhaoxi Chen and Ziwei Liu. Relighting4d: Neural relightable human from videos. In Proc. Eur. Conf. Comput. Vis., pages 606–623, 2022.
[5] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 605–613, 2017.
[6] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5501–5510, 2022.
[7] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-Planes: Explicit radiance fields in space, time, and appearance. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 12479–12488, 2023.
[8] Xiangjun Gao, Jiaolong Yang, Jongyoo Kim, Sida Peng, Zicheng Liu, and Xin Tong. Mps-nerf: Generalizable 3d human rendering from multiview images. IEEE Trans. Pattern Anal. Mach. Intell., 2022.
[9] Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. GaussianAvatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 634–644,
[10] Shoukang Hu, Fangzhou Hong, Liang Pan, Haiyi Mei, Lei Yang, and Ziwei Liu. Sherf: Generalizable human nerf from a single image. In Proc. Int. Conf. Comput. Vis., pages 9352–9364, 2023.
[11] Yingdong Hu, Zhening Liu, Jiawei Shao, Zehong Lin, and Jun Zhang. EVA-Gaussian: 3d gaussian-based real-time human novel view synthesis under diverse camera settings. arXiv preprint arXiv:2410.01425, 2024.
[12] Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, and Katerina Fragkiadaki. Odin: A single model for 2d and 3d segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 3564–3574, 2024.
[13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
[14] Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. HUGS: Human gaussian splats. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 505–515, 2024.
[15] Youngjoong Kwon, Dahun Kim, Duygu Ceylan, and Henry Fuchs. Neural human performer: Learning generalizable radiance fields for human performance rendering. Adv. Neural Inform. Process. Syst., 34:24741–24752, 2021.
[16] Youngjoong Kwon, Baole Fang, Yixing Lu, Haoye Dong, Cheng Zhang, Francisco Vicente Carrasco, Albert Mosella-Montoro, Jianjin Xu, Shingo Takagi, Daeil Kim, Aayush Prakash, and Fernando De la Torre. Generalizable human gaussians for sparse view synthesis. In Proc. Eur. Conf. Comput. Vis., pages 451–468, 2024.
[17] Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis. GART: Gaussian articulated template models. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 19876–19887, 2024.
[18] Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhöfer, Jürgen Gall, Angjoo Kanazawa, and Christoph Lassner. TAVA: Template-free animatable volumetric actors. In Proc. Eur. Conf. Comput. Vis., pages 419–436, 2022.
[19] Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural Actor: Neural free-view synthesis of human actors with pose control. ACM Trans. Graph., 40(6):1–16, 2021.
[20] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Trans. Graph., 34(6):248:1–248:16, 2015.
[21] Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, and Shunsuke Saito. KeypointNeRF: Generalizing image-based volumetric avatars using relative spatial encoding of keypoints. In Proc. Eur. Conf. Comput. Vis., pages 179–197, 2022.
[22] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proc. Eur. Conf. Comput. Vis., pages 405–421,
[23] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):1–15,
[24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research,
[25] Xiao Pan, Zongxin Yang, Jianxin Ma, Chang Zhou, and Yi Yang. TransHuman: A transformer-based human representation for generalizable neural human rendering. In Proc. Int. Conf. Comput. Vis., pages 3544–3555, 2023.
[26] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5865–5874, 2021.
[27] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. HyperNeRF: a higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6):1–12, 2021.
[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[29] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural Body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 9054–9063, 2021.
[30] Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3d gaussian splatting. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5020–5030, 2024.
[31] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proc. Int. Conf. Comput. Vis., pages 12179–12188, 2021.
[32] Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. A-NeRF: Articulated neural radiance fields for learning human shape, appearance, and pose. Adv. Neural Inform. Process. Syst., 34:12278–12291, 2021.
[33] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5294–5306, 2025.
[34] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4690–4699, 2021.
[35] Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. ARAH: Animatable volume rendering of articulated human sdfs. In Proc. Eur. Conf. Comput. Vis., pages 1–19,
[36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004.
[37] Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G Schwing, and Shenlong Wang. GoMAvatar: Efficient animatable human modeling from monocular video using gaussians-on-mesh. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2059–2069, 2024.
[38] Jing Wen, Alex Schwing, and Shenlong Wang. LIFe-GoM: Generalizable human rendering with learned iterative feedback over multi-resolution gaussians-on-mesh. In Proc. Int. Conf. Learn. Represent., 2025.
[39] Junjin Xiao, Qing Zhang, Yonewei Nie, Lei Zhu, and Wei-Shi Zheng. RoGSplat: Learning robust generalizable human gaussian splatting from sparse multi-view images. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5980–5990,
[40] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. Pixelnerf: Neural radiance fields from one or few images. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4578–4587, 2021.
[41] Tao Yu, Zerong Zheng, Kaiwen Guo, Pengpeng Liu, Qionghai Dai, and Yebin Liu. Function4D: Real-time human volumetric capture from very sparse consumer rgbd sensors. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5746–5756, 2021.
[42] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 586–595, 2018.
[43] Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. GPS-Gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 19680–19690, 2024.

[1] [1]

HuMMan: Multi-modal 4D human dataset for versatile sensing and modeling

Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Wenjia Wang, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, et al. HuMMan: Multi-modal 4D human dataset for versatile sensing and modeling. InProc. Eur. Conf. Comput. Vis., pages 557–577, 2022. 5, 6

work page 2022

[2] [2]

TensoRF: Tensorial radiance fields

Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. InProc. Eur. Conf. Comput. Vis., pages 333–350, 2022. 2

work page 2022

[3] [3]

MeshAvatar: Learning high-quality triangular human avatars from multi-view videos

Yushuo Chen, Zerong Zheng, Zhe Li, Chao Xu, and Yebin Liu. MeshAvatar: Learning high-quality triangular human avatars from multi-view videos. InProc. Eur. Conf. Comput. Vis., pages 250–269, 2024. 2

work page 2024

[4] [4]

Relighting4d: Neural re- lightable human from videos

Zhaoxi Chen and Ziwei Liu. Relighting4d: Neural re- lightable human from videos. InProc. Eur. Conf. Comput. Vis., pages 606–623, 2022. 2

work page 2022

[5] [5]

A point set generation network for 3d object reconstruction from a single image

Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. InProc. IEEE Conf. Comput. Vis. Pattern Recog., pages 605–613, 2017. 5

work page 2017

[6] [6]

Plenoxels: Radiance fields without neural networks

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InProc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5501–5510, 2022. 2

work page 2022

[7] [7]

K- Planes: Explicit radiance fields in space, time, and appear- ance

Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K- Planes: Explicit radiance fields in space, time, and appear- ance. InProc. IEEE Conf. Comput. Vis. Pattern Recog., pages 12479–12488, 2023. 2

work page 2023

[8] [8]

Mps-nerf: Generalizable 3d hu- man rendering from multiview images.IEEE Trans

Xiangjun Gao, Jiaolong Yang, Jongyoo Kim, Sida Peng, Zicheng Liu, and Xin Tong. Mps-nerf: Generalizable 3d hu- man rendering from multiview images.IEEE Trans. Pattern Anal. Mach. Intell., 2022. 2, 7

work page 2022

[9] [9]

GaussianAvatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians

Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, and Liqiang Nie. GaussianAvatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians. InProc. IEEE Conf. Comput. Vis. Pattern Recog., pages 634–644,

work page

[10] [10]

Sherf: Generalizable human nerf from a single image

Shoukang Hu, Fangzhou Hong, Liang Pan, Haiyi Mei, Lei Yang, and Ziwei Liu. Sherf: Generalizable human nerf from a single image. InProc. Int. Conf. Comput. Vis., pages 9352– 9364, 2023. 2, 6, 7

work page 2023

[11] [11]

Eva-gaussian: 3d gaussian- based real-time human novel view synthesis under diverse multi-view camera settings,

Yingdong Hu, Zhening Liu, Jiawei Shao, Zehong Lin, and Jun Zhang. EV A-Gaussian: 3d gaussian-based real-time human novel view synthesis under diverse camera settings. arXiv preprint arXiv:2410.01425, 2024. 1, 2, 6, 7

work page arXiv 2024

[12] [12]

Odin: A single model for 2d and 3d segmentation

Ayush Jain, Pushkal Katara, Nikolaos Gkanatsios, Adam W Harley, Gabriel Sarch, Kriti Aggarwal, Vishrav Chaudhary, and Katerina Fragkiadaki. Odin: A single model for 2d and 3d segmentation. InProc. IEEE Conf. Comput. Vis. Pattern Recog., pages 3564–3574, 2024. 3

work page 2024

[13] [13]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

work page

[14] [14]

HUGS: Human gaussian splats

Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. HUGS: Human gaussian splats. InProc. IEEE Conf. Comput. Vis. Pattern Recog., pages 505–515, 2024. 1

work page 2024

[15] [15]

Neural human performer: Learning generalizable ra- diance fields for human performance rendering.Adv

Youngjoong Kwon, Dahun Kim, Duygu Ceylan, and Henry Fuchs. Neural human performer: Learning generalizable ra- diance fields for human performance rendering.Adv. Neural Inform. Process. Syst., 34:24741–24752, 2021. 2, 5, 7

work page 2021

[16]

Generalizable human gaussians for sparse view synthesis

Youngjoong Kwon, Baole Fang, Yixing Lu, Haoye Dong, Cheng Zhang, Francisco Vicente Carrasco, Albert Mosella-Montoro, Jianjin Xu, Shingo Takagi, Daeil Kim, Aayush Prakash, and Fernando De la Torre. Generalizable human gaussians for sparse view synthesis. In Proc. Eur. Conf. Comput. Vis., pages 451–468, 2024. 1, 2, 5, 6, 7

work page 2024

[17]

GART: Gaussian articulated template models

Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, and Kostas Daniilidis. GART: Gaussian articulated template models. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 19876–19887, 2024. 1

work page 2024

[18]

TAVA: Template-free animatable volumetric actors

Ruilong Li, Julian Tanke, Minh Vo, Michael Zollhöfer, Jürgen Gall, Angjoo Kanazawa, and Christoph Lassner. TAVA: Template-free animatable volumetric actors. In Proc. Eur. Conf. Comput. Vis., pages 419–436, 2022. 2

work page 2022

[19]

Neural Actor: Neural free-view synthesis of human actors with pose control

Lingjie Liu, Marc Habermann, Viktor Rudnev, Kripasindhu Sarkar, Jiatao Gu, and Christian Theobalt. Neural Actor: Neural free-view synthesis of human actors with pose control. ACM Trans. Graph., 40(6):1–16, 2021. 2

work page 2021

[20]

SMPL: A skinned multi-person linear model

Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Trans. Graph., 34(6):248:1–248:16, 2015. 1, 2

work page 2015

[21]

KeypointNeRF: Generalizing image-based volumetric avatars using relative spatial encoding of keypoints

Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, and Shunsuke Saito. KeypointNeRF: Generalizing image-based volumetric avatars using relative spatial encoding of keypoints. In Proc. Eur. Conf. Comput. Vis., pages 179–197, 2022. 2

work page 2022

[22]

NeRF: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proc. Eur. Conf. Comput. Vis., pages 405–421,

work page

[23]

Instant neural graphics primitives with a multiresolution hash encoding

Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):1–15,

work page

[24]

DINOv2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research,

work page

[25]

TransHuman: A transformer-based human representation for generalizable neural human rendering

Xiao Pan, Zongxin Yang, Jianxin Ma, Chang Zhou, and Yi Yang. TransHuman: A transformer-based human representation for generalizable neural human rendering. In Proc. Int. Conf. Comput. Vis., pages 3544–3555, 2023. 7

work page 2023

[26]

Nerfies: Deformable neural radiance fields

Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5865–5874, 2021. 2

work page 2021

[27]

HyperNeRF: a higher-dimensional representation for topologically varying neural radiance fields

Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. HyperNeRF: a higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6):1–12, 2021. 2

work page 2021

[28]

Automatic differentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. 5

work page 2017

[29]

Neural Body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans

Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural Body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 9054–9063, 2021. 5

work page 2021

[30]

3DGS-Avatar: Animatable avatars via deformable 3d gaussian splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3d gaussian splatting. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5020–5030, 2024. 1

work page 2024

[31]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proc. Int. Conf. Comput. Vis., pages 12179–12188, 2021. 3, 4

work page 2021

[32]

A-NeRF: Articulated neural radiance fields for learning human shape, appearance, and pose

Shih-Yang Su, Frank Yu, Michael Zollhöfer, and Helge Rhodin. A-NeRF: Articulated neural radiance fields for learning human shape, appearance, and pose. Adv. Neural Inform. Process. Syst., 34:12278–12291, 2021. 2

work page 2021

[33]

VGGT: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5294–5306, 2025. 3

work page 2025

[34]

IBRNet: Learning multi-view image-based rendering

Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4690–4699, 2021. 2

work page 2021

[35]

ARAH: Animatable volume rendering of articulated human sdfs

Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. ARAH: Animatable volume rendering of articulated human sdfs. In Proc. Eur. Conf. Comput. Vis., pages 1–19,

work page

[36]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004. 6

work page 2004

[37]

GoMAvatar: Efficient animatable human modeling from monocular video using gaussians-on-mesh

Jing Wen, Xiaoming Zhao, Zhongzheng Ren, Alexander G Schwing, and Shenlong Wang. GoMAvatar: Efficient animatable human modeling from monocular video using gaussians-on-mesh. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 2059–2069, 2024. 1

work page 2024

[38]

LIFe-GoM: Generalizable human rendering with learned iterative feedback over multi-resolution gaussians-on-mesh

Jing Wen, Alex Schwing, and Shenlong Wang. LIFe-GoM: Generalizable human rendering with learned iterative feedback over multi-resolution gaussians-on-mesh. In Proc. Int. Conf. Learn. Represent., 2025. 1, 2

work page 2025

[39]

RoGSplat: Learning robust generalizable human gaussian splatting from sparse multi-view images

Junjin Xiao, Qing Zhang, Yongwei Nie, Lei Zhu, and Wei-Shi Zheng. RoGSplat: Learning robust generalizable human gaussian splatting from sparse multi-view images. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5980–5990,

work page

[40]

Pixelnerf: Neural radiance fields from one or few images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. Pixelnerf: Neural radiance fields from one or few images. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 4578–4587, 2021. 2

work page 2021

[41]

Function4D: Real-time human volumetric capture from very sparse consumer rgbd sensors

Tao Yu, Zerong Zheng, Kaiwen Guo, Pengpeng Liu, Qionghai Dai, and Yebin Liu. Function4D: Real-time human volumetric capture from very sparse consumer rgbd sensors. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 5746–5756, 2021. 5, 6, 7

work page 2021

[42]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 586–595, 2018. 6

work page 2018

[43]

GPS-Gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis

Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. GPS-Gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proc. IEEE Conf. Comput. Vis. Pattern Recog., pages 19680–19690, 2024. 1, 2, 4, 6, 7

work page 2024