pith.

arxiv: 2503.09640 · v2 · submitted 2025-03-12 · 💻 cs.GR · cs.CV

Physically Plausible Human-Object Rendering from Sparse Views via 3D Gaussian Splatting

Pith reviewed 2026-05-23 01:02 UTC · model grok-4.3

classification 💻 cs.GR cs.CV
keywords 3D Gaussian Splatting · Human-Object Interaction · Sparse View Rendering · Physical Plausibility · Dynamic Gaussians · Pose Refinement · Contact Prediction · Geometric Constraints

The pith

HOGS renders physically plausible human-object interactions from sparse views by optimizing dynamic 3D Gaussians with contact constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HOGS, a framework that represents humans and objects as dynamic 3D Gaussians and optimizes them directly to enforce geometric consistency. This prevents inter-penetration or floating contacts while producing high-quality renderings from limited camera inputs. Two supporting pre-trained modules refine human poses under occlusion and predict contact regions to guide the optimization losses. Experiments on interaction datasets show the method reaches state-of-the-art visual quality at high speed.

Core claim

HOGS represents both humans and objects as dynamic 3D Gaussians. A novel optimization process operates directly on these Gaussians to enforce geometric consistency, preventing inter-penetration or floating contacts, thereby achieving physical plausibility. Two pre-trained modules—an optimization-guided Human Pose Refiner and a Human-Object Contact Predictor—supply accurate pose and contact estimates to support the optimization under sparse-view ambiguity.
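The core claim leans on the Human Pose Refiner to recover poses from a handful of cameras under occlusion. The paper's refiner is a pre-trained, optimization-guided module whose internals this page does not describe, so the sketch below illustrates only the underlying multi-view geometry with a classical stand-in: confidence-weighted DLT triangulation of a single joint, where views with low (hypothetical) 2D-detector confidence are treated as occluded and skipped. `triangulate_joint`, `min_conf`, and the camera setup are assumptions for illustration, not the authors' API.

```python
import numpy as np


def projection_matrix(K, R, t):
    """Build a 3x4 pinhole projection P = K [R | t]."""
    return K @ np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])


def project(P, X):
    """Project a 3D point X of shape (3,) to pixel coordinates of shape (2,)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_joint(projections, points_2d, confidences, min_conf=0.3):
    """Confidence-weighted DLT triangulation of one joint (hypothetical stand-in).

    projections: (V, 3, 4) camera matrices; points_2d: (V, 2) detections;
    confidences: (V,) detector scores. Views below min_conf are treated as
    occluded and ignored.
    """
    rows = []
    for P, (u, v), c in zip(projections, points_2d, confidences):
        if c < min_conf:
            continue
        # Each confident view adds two linear constraints on the homogeneous point.
        rows.append(c * (u * P[2] - P[0]))
        rows.append(c * (v * P[2] - P[1]))
    if len(rows) < 4:
        raise ValueError("need at least two confident views")
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]


def rot_y(deg):
    """Rotation about the y axis by deg degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


# Three sparse views; the third one is "occluded" and returns a junk detection.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cams = [projection_matrix(K, rot_y(a), [0.0, 0.0, 5.0]) for a in (0, 30, -30)]
X_true = np.array([0.1, -0.2, 0.3])
obs = [project(P, X_true) for P in cams]
obs[2] = np.array([0.0, 0.0])
X_hat = triangulate_joint(cams, obs, [0.9, 0.8, 0.05])
```

Dropping the low-confidence view keeps the junk detection from biasing the estimate; a learned refiner such as the one the paper describes would replace this hand-set threshold with something data-driven.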

What carries the argument

Dynamic 3D Gaussians optimized via contact and separation losses, guided by a Human Pose Refiner and Human-Object Contact Predictor.
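This page does not reproduce the paper's loss equations, so the following is only a minimal sketch of what contact and separation terms on Gaussian centers could look like, assuming a boolean per-Gaussian contact mask from a contact predictor. The contact term pulls flagged human Gaussians onto the nearest object Gaussian; the separation term is a hinge penalty on unflagged Gaussians closer than a hypothetical `margin`. NumPy is used for readability; an actual optimizer would need an autodiff framework, and center distances alone cannot distinguish "inside" from "outside" (a true penetration test needs the object surface).

```python
import numpy as np


def nearest_object_distance(human_xyz, object_xyz):
    """Distance from each human Gaussian center (N, 3) to its closest object center (M, 3)."""
    diff = human_xyz[:, None, :] - object_xyz[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)


def contact_separation_losses(human_xyz, object_xyz, contact_mask, margin=0.01):
    """Hypothetical contact and separation terms on Gaussian centers.

    contact_mask: (N,) bool, True where a contact predictor flags a human Gaussian.
    margin: minimum clearance (scene units) required for non-contact Gaussians.
    """
    nearest = nearest_object_distance(human_xyz, object_xyz)
    # Contact term: flagged Gaussians should touch the object (targets floating contact).
    contact = nearest[contact_mask].mean() if contact_mask.any() else 0.0
    # Separation term: unflagged Gaussians are penalized when closer than the margin.
    free = nearest[~contact_mask]
    separation = np.maximum(margin - free, 0.0).mean() if free.size else 0.0
    return float(contact), float(separation)


human = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.005, 0.0]])
obj = np.array([[0.0, 0.0, 0.02], [1.0, 0.0, 0.0]])
mask = np.array([True, False, False])
l_contact, l_sep = contact_separation_losses(human, obj, mask)
```

Here the flagged Gaussian hovers 0.02 units above the object (contact term 0.02), and one unflagged Gaussian coincides with an object center, so it pays the full margin in the separation term (averaged to 0.005 over the two unflagged Gaussians).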

If this is right

  • Enables rendering of human-object and hand-object scenes from sparse views while maintaining physical plausibility.
  • Achieves state-of-the-art rendering quality alongside high computational efficiency.
  • Optimizing the Gaussians directly prevents inter-penetration and enforces proper contact without post-processing.
  • The framework supports both full-body and hand-scale interactions on existing datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same Gaussian representation and loss structure could apply to other dynamic scenes requiring geometric constraints, such as multi-object stacking.
  • If the contact predictor generalizes, the method may reduce reliance on dense views in real-world capture setups.
  • Efficiency gains suggest possible use in interactive applications where both realism and speed matter.
  • Failure modes in the refiner module would likely appear first under heavy occlusion or unusual poses.

Load-bearing premise

The pre-trained pose refiner and contact predictor modules produce sufficiently accurate estimates from sparse views to guide the losses without introducing new errors.

What would settle it

The claim would be undermined if rendered outputs on the test datasets exhibit inter-penetration, or floating contacts where the ground-truth interactions show touching, or if rendering metrics fall below those of prior methods.
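The first half of this test can be checked mechanically when the object geometry is available as a signed distance function. In the sketch below an origin-centered axis-aligned box stands in for the object, and the tolerance `tol` is a hypothetical choice; this is an illustrative check, not the paper's evaluation protocol.

```python
import numpy as np


def box_sdf(points, half_extents):
    """Signed distance from points (N, 3) to an origin-centered axis-aligned box.

    Negative inside, zero on the surface, positive outside.
    """
    q = np.abs(points) - np.asarray(half_extents, dtype=float)
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=1)
    inside = np.minimum(q.max(axis=1), 0.0)
    return outside + inside


def plausibility_report(human_pts, gt_contact_mask, half_extents, tol=0.005):
    """Count penetrating points and ground-truth contacts that float off the object."""
    sdf = box_sdf(human_pts, half_extents)
    return {
        "max_penetration": float(np.maximum(-sdf, 0.0).max()),
        "penetrating": int(np.sum(sdf < -tol)),
        # Points that touch in ground truth but sit more than tol from the surface.
        "floating_contacts": int(np.sum(sdf[gt_contact_mask] > tol)),
    }


pts = np.array([[0.0, 0.0, 0.6], [0.0, 0.0, 0.5], [0.0, 0.0, 0.45]])
report = plausibility_report(pts, np.array([True, True, False]), [0.5, 0.5, 0.5])
```

The first point should touch but floats 0.1 units above the box, the second sits exactly on the surface, and the third is 0.05 units inside, so the report flags one floating contact and one penetrating point.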

Figures

Figures reproduced from arXiv: 2503.09640 by Jun Xiao, Long Chen, Weiquan Wang, Yi Yang, Yueting Zhuang.

Figure 1: Comparison of state-of-the-art sparse-view HOI rendering methods.
Figure 2: HOGS pipeline. Given some sparse views of a dynamic HOI scene, HOGS first deforms human and object representations using a Human-Object Deformation process, which includes LBS for humans and rigid transformations for objects, along with a Human Pose Refinement module to enhance target pose accuracy. Deformed human and object Gaussians are then composed into a unified 3D space to form the Composed Gaussian …
Figure 3: Illustration of sparse-view human pose refinement
Figure 4: The workflow of sparse-view contact prediction
Figure 5: Qualitative evaluation of novel view synthesis for HOI rendering on the HODome dataset.
Figure 6: Extensibility of HOGS
Figure 7: Effect of physical loss. (a) Without physical loss, the rendered human floats above the chair, exhibiting a lack of physical contact. (b) Incorporating physical loss results in plausible contact and a more physically consistent rendering.
…diction module leads to a slight decrease in rendering quality but a significant increase in rendering efficiency (Table 2). This phenomenon arises from the focused sc…
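The deformation stage in the Figure 2 caption (LBS for the human, rigid transformations for the object, then composition into one 3D space) can be sketched on Gaussian centers alone. This is a simplified illustration under stated assumptions: it moves only the means, whereas deforming real Gaussians would also rotate their covariances, and the function names, array shapes, and skinning weights are hypothetical.

```python
import numpy as np


def lbs(points, weights, bone_transforms):
    """Linear blend skinning of canonical points.

    points: (N, 3); weights: (N, J), rows summing to 1; bone_transforms: (J, 4, 4).
    """
    blended = np.einsum("nj,jab->nab", weights, bone_transforms)  # per-point 4x4 transform
    homo = np.hstack([points, np.ones((len(points), 1))])
    return np.einsum("nab,nb->na", blended, homo)[:, :3]


def rigid(points, R, t):
    """Apply one rigid transform x -> R x + t to every object point."""
    return points @ R.T + t


def compose(human_xyz, object_xyz):
    """Stack both sets into one scene and keep a per-Gaussian instance label (0 human, 1 object)."""
    xyz = np.vstack([human_xyz, object_xyz])
    labels = np.concatenate([np.zeros(len(human_xyz), dtype=int),
                             np.ones(len(object_xyz), dtype=int)])
    return xyz, labels


# Two bones: identity and a unit translation along +y.
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, :3, 3] = [0.0, 1.0, 0.0]
human = lbs(np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
            np.array([[0.5, 0.5], [0.0, 1.0]]), bones)
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # 90 degrees about z
obj = rigid(np.array([[1.0, 0.0, 0.0]]), Rz, np.array([0.0, 0.0, 1.0]))
scene_xyz, scene_labels = compose(human, obj)
```

The first human point is split evenly between the two bones and so moves halfway along +y; keeping instance labels after composition is what lets per-instance losses, such as contact and separation terms, be evaluated on the shared scene.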
Original abstract

Rendering realistic human-object interactions (HOIs) from sparse-view inputs is a challenging yet crucial task for various real-world applications. Existing methods often struggle to simultaneously achieve high rendering quality, physical plausibility, and computational efficiency. To address these limitations, we propose HOGS (Human-Object Rendering via 3D Gaussian Splatting), a novel framework for efficient HOI rendering with physically plausible geometric constraints from sparse views. HOGS represents both humans and objects as dynamic 3D Gaussians. Central to HOGS is a novel optimization process that operates directly on these Gaussians to enforce geometric consistency (i.e., preventing inter-penetration or floating contacts) to achieve physical plausibility. To support this core optimization under sparse-view ambiguity, our framework incorporates two pre-trained modules: an optimization-guided Human Pose Refiner for robust estimation under sparse-view occlusions, and a Human-Object Contact Predictor that efficiently identifies interaction regions to guide our novel contact and separation losses. Extensive experiments on both human-object and hand-object interaction datasets demonstrate that HOGS achieves state-of-the-art rendering quality and maintains high computational efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents HOGS, a framework for rendering human-object interactions (HOIs) from sparse views. It represents humans and objects as dynamic 3D Gaussians and performs optimization directly on these Gaussians to enforce physical plausibility via novel contact and separation losses that prevent inter-penetration and floating contacts. Two fixed pre-trained modules—an optimization-guided Human Pose Refiner and a Human-Object Contact Predictor—supply the contact regions and refined poses that guide the losses under sparse-view ambiguity. Experiments on human-object and hand-object interaction datasets are reported to achieve state-of-the-art rendering quality while maintaining computational efficiency.

Significance. If the pre-trained modules prove reliable under the targeted sparse-view occlusions, the approach could advance efficient, physically constrained rendering of interactions by combining dynamic 3D Gaussian Splatting with geometric losses. The direct optimization on Gaussians and the use of contact-aware terms address limitations in prior methods. The significance is limited, however, by the absence of independent validation for the modules that supply the physical constraints.

major comments (1)
  1. [Sections 3.3 and 3.4] The contact and separation losses are defined directly on the outputs of the fixed pre-trained Human Pose Refiner and Human-Object Contact Predictor. No ablation studies isolate the accuracy of these modules (pose error, contact precision/recall) on sparse-view data against ground truth. Because the modules remain frozen during Gaussian optimization, errors they produce under occlusion would propagate into the physical-plausibility constraints with no independent recovery mechanism, undermining the central claim that the optimization enforces geometric consistency.
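The module-level checks the referee asks for reduce to two standard computations. The sketch below assumes pose error is reported as mean per-joint position error (MPJPE) and contact quality as per-vertex precision and recall against ground-truth labels; the page does not specify the exact protocol, and returning 0.0 for an empty denominator is a convention chosen here.

```python
import numpy as np


def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error over (F, J, 3) arrays in the same units."""
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())


def contact_precision_recall(pred_mask, gt_mask):
    """Precision and recall of per-vertex contact labels (bool arrays of equal shape)."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = int(np.sum(pred & gt))
    # Convention: report 0.0 when a denominator is empty.
    precision = tp / pred.sum() if pred.any() else 0.0
    recall = tp / gt.sum() if gt.any() else 0.0
    return float(precision), float(recall)


pred = np.zeros((1, 2, 3))
gt = np.array([[[0.0, 0.0, 0.03], [0.04, 0.03, 0.0]]])
err = mpjpe(pred, gt)
p, r = contact_precision_recall([True, True, False, False], [True, False, True, False])
```

Reporting these numbers separately on occluded and unoccluded frames would directly test whether the frozen modules degrade where the referee expects them to.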

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below.

Point-by-point responses
  1. Referee: Sections 3.3 and 3.4: The contact and separation losses are defined directly on the outputs of the fixed pre-trained Human Pose Refiner and Human-Object Contact Predictor. No ablation studies isolate the accuracy of these modules (pose error, contact precision/recall) on sparse-view data against ground truth. Because the modules remain frozen during Gaussian optimization, errors they produce under occlusion would propagate into the physical-plausibility constraints with no independent recovery mechanism, undermining the central claim that the optimization enforces geometric consistency.

    Authors: We acknowledge that the current manuscript does not include isolated ablation studies evaluating the pose error or contact precision/recall of the fixed pre-trained Human Pose Refiner and Human-Object Contact Predictor specifically on sparse-view inputs against ground truth. The modules are indeed held fixed during the Gaussian optimization, as stated in Sections 3.3 and 3.4, so any inaccuracies under heavy occlusion would directly influence the contact and separation losses. Our defense of the central claim rests on the end-to-end experimental results: HOGS achieves state-of-the-art rendering quality and physical-plausibility metrics on both human-object and hand-object datasets, outperforming baselines that lack these geometric constraints. This indicates that the overall optimization produces plausible outputs in practice. To strengthen the presentation, we will add the requested module-level ablations (pose error and contact metrics on sparse-view test data) to the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: framework uses external pre-trained modules and novel losses

Full rationale

The paper presents HOGS as a forward proposal that represents humans and objects as dynamic 3D Gaussians and introduces a new optimization process with contact and separation losses. These losses are guided by two explicitly pre-trained modules (Human Pose Refiner and Human-Object Contact Predictor) described as fixed inputs. No derivation, equation, or claim in the provided text reduces a performance quantity to a fitted parameter from the same data, renames a known result, or relies on a load-bearing self-citation chain. The method is therefore self-contained against external benchmarks and the central rendering claims do not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all technical details remain opaque.

pith-pipeline@v0.9.0 · 5737 in / 1186 out tokens · 47308 ms · 2026-05-23T01:02:52.872122+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

    cs.CV 2026-04 unverdicted novelty 5.0

    MM-GS combines per-instance multi-view fusion with scene-level interaction modeling on 3D Gaussians to render high-fidelity multi-human multi-object scenes from sparse views.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 1 Pith paper

  1. [1]

    Differentiable rendering of neural SDFs through reparameterization

    Sai Praveen Bangaru, Michael Gharbi, Fujun Luan, Tzu-Mao Li, Kalyan Sunkavalli, Milos Hasan, Sai Bi, Zexiang Xu, Gilbert Bernstein, and Fredo Durand. Differentiable rendering of neural SDFs through reparameterization. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.

  2. [2]

    4D visualization of dynamic events from unconstrained multi-view videos

    Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, and Srinivasa Narasimhan. 4D visualization of dynamic events from unconstrained multi-view videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5366–5375, 2020.

  3. [3]

    Interaction networks for learning about objects, relations and physics

    Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. Advances in Neural Information Processing Systems, 29, 2016.

  4. [4]

    Method for registration of 3-D shapes

    Paul J Besl and Neil D McKay. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, pages 586–606. SPIE, 1992.

  5. [5]

    BEHAVE: Dataset and method for tracking human object interactions

    Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHAVE: Dataset and method for tracking human object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15935–15946, 2022.

  6. [6]

    Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image

    Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part V 14, pages 561–578. Springer,

  7. [7]

    Flashback: Immersive virtual reality on mobile devices via rendering memoization

    Kevin Boos, David Chu, and Eduardo Cuervo. Flashback: Immersive virtual reality on mobile devices via rendering memoization. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 291–304, 2016.

  8. [8]

    HexPlane: A fast representation for dynamic scenes

    Ang Cao and Justin Johnson. HexPlane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.

  9. [9]

    MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2025.

  10. [10]

    High-quality streamable free-viewpoint video

    Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (ToG), 34(4):1–13,

  11. [11]

    point diffusion implicit function for large-scale scene neural representation

    Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang Yu, and Tao Chen. point diffusion implicit function for large-scale scene neural representation. Advances in Neural Information Processing Systems, 36, 2024.

  12. [12]

    Motion2Fusion: Real-time volumetric performance capture

    Mingsong Dou, Philip Davidson, Sean Ryan Fanello, Sameh Khamis, Adarsh Kowdle, Christoph Rhemann, Vladimir Tankovich, and Shahram Izadi. Motion2Fusion: Real-time volumetric performance capture. ACM Transactions on Graphics (ToG), 36(6):1–16, 2017.

  13. [13]

    3D Gaussian splatting as new era: A survey

    Ben Fei, Jingyi Xu, Rui Zhang, Qingyuan Zhou, Weidong Yang, and Ying He. 3D Gaussian splatting as new era: A survey. IEEE Transactions on Visualization and Computer Graphics, 2024.

  14. [14]

    Associated reality: A cognitive human–machine layer for autonomous driving

    Felipe Fernandez, Angel Sanchez, Jose F Velez, and Belen Moreno. Associated reality: A cognitive human–machine layer for autonomous driving. Robotics and Autonomous Systems, 133:103624, 2020.

  15. [15]

    Miruoto: Sports event atmosphere visual rendering through real-time image and sound processing system

    Guillaume Gourmelen, Shutaro Toriya, Eiko Miya, Naohisa Shioura, and Hiroyasu Iwata. Miruoto: Sports event atmosphere visual rendering through real-time image and sound processing system. In ACM SIGGRAPH 2024 Emerging Technologies, pages 1–2, 2024.

  16. [16]

    Observing human-object interactions: Using spatial and functional compatibility for recognition

    Abhinav Gupta, Aniruddha Kembhavi, and Larry S Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1775–1789, 2009.

  17. [17]

    Resolving 3D human pose ambiguities with 3D scene constraints

    Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3D human pose ambiguities with 3D scene constraints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2282–2292, 2019.

  18. [18]

    Populating 3D scenes by learning human-scene interaction

    Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J Black. Populating 3D scenes by learning human-scene interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14708–14718, 2021.

  19. [19]

    Hand-object interaction controller (HOIC): Deep reinforcement learning for reconstructing interactions with physics

    Haoyu Hu, Xinyu Yi, Zhe Cao, Jun-Hai Yong, and Feng Xu. Hand-object interaction controller (HOIC): Deep reinforcement learning for reconstructing interactions with physics. In ACM SIGGRAPH 2024 Conference Papers, pages 1–10,

  20. [20]

    GauHuman: Articulated Gaussian splatting from monocular human videos

    Shoukang Hu, Tao Hu, and Ziwei Liu. GauHuman: Articulated Gaussian splatting from monocular human videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20418–20431, 2024.

  21. [21]

    Capturing and inferring dense full-body human-scene contact

    Chun-Hao P Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J Black. Capturing and inferring dense full-body human-scene contact. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13274–13285, 2022.

  22. [22]

    ARCH: Animatable reconstruction of clothed humans

    Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. ARCH: Animatable reconstruction of clothed humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3093–3102,

  23. [23]

    Interactive synthesis of human-object interaction

    Sumit Jain and C Karen Liu. Interactive synthesis of human-object interaction. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 47–53, 2009.

  24. [24]

    FlexNeRF: Photorealistic free-viewpoint rendering of moving humans from sparse views

    Vinoj Jayasundara, Amit Agrawal, Nicolas Heron, Abhinav Shrivastava, and Larry S Davis. FlexNeRF: Photorealistic free-viewpoint rendering of moving humans from sparse views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21118–21127, 2023.

  25. [25]

    NeuralHOFusion: Neural volumetric rendering under human-object interactions

    Yuheng Jiang, Suyi Jiang, Guoxing Sun, Zhuo Su, Kaiwen Guo, Minye Wu, Jingyi Yu, and Lan Xu. NeuralHOFusion: Neural volumetric rendering under human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6155–6165, 2022.

  26. [26]

    Instant-NVR: Instant neural volumetric rendering for human-object interactions from monocular RGBD stream

    Yuheng Jiang, Kaixin Yao, Zhuo Su, Zhehao Shen, Haimin Luo, and Lan Xu. Instant-NVR: Instant neural volumetric rendering for human-object interactions from monocular RGBD stream. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 595–605,

  27. [27]

    End-to-end recovery of human shape and pose

    Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7122–7131, 2018.

  28. [28]

    3D Gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  29. [29]

    HUGS: Human Gaussian splats

    Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. HUGS: Human Gaussian splats. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 505–515, 2024.

  30. [30]

    Human action recognition and prediction: A survey

    Yu Kong and Yun Fu. Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5):1366–1401, 2022.

  31. [31]

    Generalizable human Gaussians for sparse view synthesis

    Youngjoong Kwon, Baole Fang, Yixing Lu, Haoye Dong, Cheng Zhang, Francisco Vicente Carrasco, Albert Mosella-Montoro, Jianjin Xu, Shingo Takagi, Daeil Kim, et al. Generalizable human Gaussians for sparse view synthesis. In European Conference on Computer Vision, pages 451–468. Springer, 2025.

  32. [32]

    GP-NeRF: Generalized perception NeRF for context-aware 3D scene understanding

    Hao Li, Dingwen Zhang, Yalun Dai, Nian Liu, Lechao Cheng, Jingfeng Li, Jingdong Wang, and Junwei Han. GP-NeRF: Generalized perception NeRF for context-aware 3D scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21708–21718, 2024.

  33. [33]

    Task-oriented human-object interactions generation with implicit neural representations

    Quanzhou Li, Jingbo Wang, Chen Change Loy, and Bo Dai. Task-oriented human-object interactions generation with implicit neural representations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3035–3044, 2024.

  34. [34]

    Parametric model-based 3D human shape and pose estimation from multiple views

    Zhongguo Li, Anders Heyden, and Magnus Oskarsson. Parametric model-based 3D human shape and pose estimation from multiple views. In Image Analysis: 21st Scandinavian Conference, SCIA 2019, Norrköping, Sweden, June 11–13, 2019, Proceedings 21, pages 336–347. Springer, 2019.

  35. [35]

    Learning implicit templates for point-based clothed human modeling

    Siyou Lin, Hongwen Zhang, Zerong Zheng, Ruizhi Shao, and Yebin Liu. Learning implicit templates for point-based clothed human modeling. In European Conference on Computer Vision, pages 210–228. Springer, 2022.

  36. [36]

    HOSNeRF: Dynamic human-object-scene neural radiance fields from a single video

    Jia-Wei Liu, Yan-Pei Cao, Tianyuan Yang, Zhongcong Xu, Jussi Keppo, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. HOSNeRF: Dynamic human-object-scene neural radiance fields from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18483–18494, 2023.

  37. [37]

    HumanGaussian: Text-driven 3D human generation with Gaussian splatting

    Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, and Ziwei Liu. HumanGaussian: Text-driven 3D human generation with Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6646–6657,

  38. [38]

    Revisit human-scene interaction via space occupancy

    Xinpeng Liu, Haowen Hou, Yanchao Yang, Yong-Lu Li, and Cewu Lu. Revisit human-scene interaction via space occupancy. In European Conference on Computer Vision, pages 1–19. Springer, 2025.

  39. [39]

    Neural rays for occlusion-aware image-based rendering

    Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7824–7833,

  40. [40]

    CityGaussian: Real-time high-quality large-scale scene rendering with Gaussians

    Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Junran Peng, and Zhaoxiang Zhang. CityGaussian: Real-time high-quality large-scale scene rendering with Gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2025.

  41. [41]

    SMPL: A skinned multi-person linear model

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015.

  42. [42]

    SplatFields: Neural Gaussian splats for sparse 3D and 4D reconstruction

    Marko Mihajlovic, Sergey Prokudin, Siyu Tang, Robert Maier, Federica Bogo, Tony Tung, and Edmond Boyer. SplatFields: Neural Gaussian splats for sparse 3D and 4D reconstruction. In European Conference on Computer Vision, pages 313–332. Springer, 2025.

  43. [43]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  44. [44]

    Human Gaussian splatting: Real-time rendering of animatable avatars

    Arthur Moreau, Jifei Song, Helisa Dhamo, Richard Shaw, Yiren Zhou, and Eduardo Pérez-Pellitero. Human Gaussian splatting: Real-time rendering of animatable avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 788–798, 2024.

  45. [45]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 41(4):1–15, 2022.

  46. [46]

    CoherentGS: Sparse novel view synthesis with coherent 3D Gaussians

    Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, and Nima Khademi Kalantari. CoherentGS: Sparse novel view synthesis with coherent 3D Gaussians. In European Conference on Computer Vision, pages 19–37. Springer, 2025.

  47. [47]

    Sparse multi-view hand-object reconstruction for unseen environments

    Yik Lung Pang, Changjae Oh, and Andrea Cavallaro. Sparse multi-view hand-object reconstruction for unseen environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–810, 2024.

  48. [48]

    Animatable neural radiance fields for modeling dynamic human bodies

    Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14314–14323, 2021.

  49. [49]

    GenDR: A generalized differentiable renderer

    Felix Petersen, Bastian Goldluecke, Christian Borgelt, and Oliver Deussen. GenDR: A generalized differentiable renderer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4002–4011,

  50. [50]

    MANUS: Markerless grasp capture using articulated 3D Gaussians

    Chandradeep Pokhariya, Ishaan Nikhil Shah, Angela Xing, Zekun Li, Kefan Chen, Avinash Sharma, and Srinath Sridhar. MANUS: Markerless grasp capture using articulated 3D Gaussians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2197–2208, 2024.

  51. [51]

    3DGS-Avatar: Animatable avatars via deformable 3D Gaussian splatting

    Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3D Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5020–5030, 2024.

  52. [52]

    Web AR: A promising future for mobile augmented reality—state of the art, challenges, and insights

    Xiuquan Qiao, Pei Ren, Schahram Dustdar, Ling Liu, Huadong Ma, and Junliang Chen. Web AR: A promising future for mobile augmented reality—state of the art, challenges, and insights. Proceedings of the IEEE, 107(4):651–666, 2019.

  53. [53]

    Embodied hands: Modeling and capturing hands and bodies together

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (TOG), 36(6):1–17,

  54. [54]

    Embodied hands: Modeling and capturing hands and bodies together

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610, 2022.

  55. [55]

    Image quality assessment through FSIM, SSIM, MSE and PSNR—a comparative study

    Umme Sara, Morium Akter, and Mohammad Shorif Uddin. Image quality assessment through FSIM, SSIM, MSE and PSNR—a comparative study. Journal of Computer and Communications, 7(3):8–18, 2019.

  56. [56]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

  57. [57]

    SWinGS: Sliding windows for dynamic 3D Gaussian splatting

    Richard Shaw, Michal Nazarczuk, Jifei Song, Arthur Moreau, Sibi Catley-Chandar, Helisa Dhamo, and Eduardo Pérez-Pellitero. SWinGS: Sliding windows for dynamic 3D Gaussian splatting. In European Conference on Computer Vision, pages 37–54. Springer, 2025.

  58. [58]

    Holoported characters: Real-time free-viewpoint rendering of humans from sparse RGB cameras

    Ashwath Shetty, Marc Habermann, Guoxing Sun, Diogo Luvizon, Vladislav Golyanik, and Christian Theobalt. Holoported characters: Real-time free-viewpoint rendering of humans from sparse RGB cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1206–1215, 2024.

  59. [59]

    Review of image-based rendering techniques

    Harry Shum and Sing Bing Kang. Review of image-based rendering techniques. Visual Communications and Image Processing 2000, 4067:2–13, 2000.

  60. [60]

    Free viewpoint video extraction, representation, coding, and rendering

    Aljoscha Smolic, Karsten Mueller, Philipp Merkle, Tobias Rein, Matthias Kautzner, Peter Eisert, and Thomas Wiegand. Free viewpoint video extraction, representation, coding, and rendering. In 2004 International Conference on Image Processing, 2004. ICIP’04., pages 3287–3290. IEEE, 2004.

  61. [61]

    Shih-Yang Su, Timur Bagautdinov, and Helge Rhodin. NPC: Neural point characters from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14795–14805, 2023.

  62. [62]

    Guoxing Sun, Xin Chen, Yizhang Chen, Anqi Pang, Pei Lin, Yuheng Jiang, Lan Xu, Jingyi Yu, and Jingya Wang. Neural free-viewpoint performance rendering under complex human-object interactions. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4651–4660, 2021.

  63. [63]

    Xin Suo, Yuheng Jiang, Pei Lin, Yingliang Zhang, Minye Wu, Kaiwen Guo, and Lan Xu. NeuralHumanFVV: Real-time neural volumetric human performance rendering using RGB cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6226–6237, 2021.

  64. [64]

    Omid Taheri, Nima Ghorbani, Michael J Black, and Dimitrios Tzionas. GRAB: A dataset of whole-body human grasping of objects. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 581–600. Springer, 2020.

  65. [65]

    Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. NeuRAD: Neural rendering for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14895–14904, 2024.

  66. [66]

    Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, and Michael J Black. DECO: Dense estimation of 3D human-scene contact in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8001–8013, 2023.

  67. [67]

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021.

  68. [68]

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

  69. [69]

    Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. HumanNeRF: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16210–16220, 2022.

  70. [70]

    Markus Worchel and Marc Alexa. Differentiable rendering of parametric geometry. ACM Transactions on Graphics (TOG), 42(6):1–18, 2023.

  71. [71]

    Qianyi Wu, Xian Liu, Yuedong Chen, Kejie Li, Chuanxia Zheng, Jianfei Cai, and Jianmin Zheng. Object-compositional neural implicit surfaces. In European Conference on Computer Vision, pages 197–213. Springer, 2022.

  72. [72]

    Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9421–9431, 2021.

  73. [73]

    Zhen Xu, Sida Peng, Chen Geng, Linzhan Mou, Zihan Yan, Jiaming Sun, Hujun Bao, and Xiaowei Zhou. Relightable and animatable neural avatar from sparse-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 990–1000, 2024.

  74. [74]

    Bangbang Yang, Yinda Zhang, Yijin Li, Zhaopeng Cui, Sean Fanello, Hujun Bao, and Guofeng Zhang. Neural rendering in a room: amodal 3D understanding and free-viewpoint rendering for the closed scene composed of pre-captured objects. ACM Transactions on Graphics (TOG), 41(4):1–10, 2022.

  75. [75]

    Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. NeuralDome: A neural modeling pipeline on multi-view human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8834–8845, 2023.

  76. [76]

    Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. HOI-M³: Capture multiple humans and objects interaction within contextual environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 516–526, 2024.

  77. [77]

    Jiawei Zhang, Jiahe Li, Xiaohan Yu, Lei Huang, Lin Gu, Jin Zheng, and Xiao Bai. CoR-GS: sparse-view 3D Gaussian splatting via co-regularization. In European Conference on Computer Vision, pages 335–352. Springer, 2025.

  78. [78]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

  79. [79]

    Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I'M HOI: Inertia-aware monocular capture of 3D human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 729–741, 2024.

  80. [80]

    Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J Davison. In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15838–15847, 2021.