pith. sign in

arxiv: 2605.17014 · v1 · pith:Z43EXVQMnew · submitted 2026-05-16 · 💻 cs.CV

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

Pith reviewed 2026-05-19 20:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstructionhuman-object interactionmonocular videonovel objectsneural fieldscontact priorsmotion separation4D capture
0
0 comments X

The pith

A three-step method reconstructs 3D humans, novel objects, and scenes together from monocular videos of interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a person, a never-before-seen object they are using, and the room around them can all be turned into accurate 3D models from a single video taken with a moving camera. This matters because most videos of real actions mix the camera's shake with the object's movement and hide depths, making separate reconstruction of each part difficult and their interactions unrealistic. The work demonstrates this by breaking the problem into getting initial 3D hints from modern AI models, pulling apart the motions to line things up, and then polishing the models with a neural system that respects how things should touch.

Core claim

The paper establishes that by first using cues from 3D foundation models to get rough shapes and motions for both the object and scene despite poor texture, then removing the camera's movement to find the object's real motion and align it with the estimated human pose in a shared space, and finally optimizing a joint neural shape model with rules that pull surfaces together at contacts while keeping them from overlapping, one can obtain coherent 3D reconstructions of the full interaction.

What carries the argument

The compositional neural field built from separate signed-distance functions for each part, which supports adding contact attraction and interpenetration penalties during optimization.

Load-bearing premise

The approach depends on 3D foundation models providing reliable cues to build initial shapes from motion cues even on smooth object surfaces, and on the subtraction of camera movement from apparent motion leaving a clean estimate of the object's true movement.

What would settle it

A clear falsifier would be if the final 3D object model deviates significantly from the actual shape measured by the volumetric capture system, or if the human and object models show overlapping volumes at their contact points despite the priors.

Figures

Figures reproduced from arXiv: 2605.17014 by Chen Guo, Chengwei Zheng, Dimitrios Tzionas, Georgios Paschalidis, Juan Zarate, Lixin Xue, Manuel Kaufmann.

Figure 1
Figure 1. Figure 1: We develop RHINO, a novel framework that reconstructs detailed (dynamic) 3D human-object interactions (HOI) and the [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Camera & Object motion (Sec. 3.1, Sec. 3.2). When both the camera and object move, their motion entangles into “apparent” motion. To disentangle them, we (a) estimate camera motion in the world frame via SfM on scene-only pixels, (b) estimate apparent motion via SfM on object-only pixels, (c–d) estimate object motion in the world frame by “removing” the camera motion from the apparent one. pose, so their p… view at source ↗
Figure 4
Figure 4. Figure 4: Method Overview. Starting with initialized global human and decoupled object poses (Sec. 3.1, Sec. 3.2), we sample points along the camera ray for the human, object and static scene. To enable consistent representation, sampled points are warped into canonical space using inverse LBS for the human and the estimated rigid transformation for the object. All components are then rendered holistically via compo… view at source ↗
Figure 5
Figure 5. Figure 5: Evaluation on shape reconstruction (Sec. 4.3) on our BenchRHINO dataset (Sec. 4.1). HOLD [14] struggles with noisy object poses (rows 2, 4) and fails to model interaction. InterTrack [69] recovers reasonable object shape but fails to model the interaction due to large human and object pose errors. Our method (RHINO) faithfully recovers interactions, which lie closer to the ground truth (GT). Input InterTra… view at source ↗
Figure 6
Figure 6. Figure 6: Evaluation on shape reconstruction on WildRHINO. InterTrack [69] fails on these out-of-distribution objects. While HOLD∗ (vanilla HOLD [14] using our method’s object poses) reconstructs good object shape, it fails to model interactions. RHINO yields reasonable reconstruction reflecting the interaction. 4.5. Ablation Study Object Pose Estimation. We compare our object pose es￾timation, which uses MASt3R [33… view at source ↗
Figure 8
Figure 8. Figure 8: Effects of contact. We show reconstructions of our framework with (“RHINO (Ours)”) and without (“w/o Contact Opt”) the physical losses of Eq. (12) and Eq. (13). For a bigger version with zoom-in impressions, see Supp. Mat. non-repeatable keypoint detection, leading to degenerate re￾construction. Similarly, LoFTR [53] is not accurate enough to find correspondences for objects under complex motions ( [PITH_… view at source ↗
read the original abstract

Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. Code and data are available at https://lxxue.github.io/RHINO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents RHINO, a three-stage pipeline for reconstructing 3D humans, novel manipulated objects, and static scenes in a shared world frame from monocular RGB video. Stage 1 uses 3D-aware foundation models to stabilize SfM, yielding coarse object shape/motion from foreground pixels and scene shape/camera motion from background pixels. Stage 2 estimates the human in the camera frame via an off-the-shelf method and subtracts camera motion to register all components. Stage 3 refines the shapes via a compositional neural field with per-component SDFs and differentiable contact priors that encourage surface attraction while penalizing interpenetration. Evaluation on a newly captured synchronized dataset with volumetric ground truth shows outperformance on novel-view synthesis and 4D reconstruction; ablations indicate each stage contributes.

Significance. If the results hold, the work is significant for addressing the ill-posed problem of monocular 3D reconstruction of human-object interactions with unseen objects, without assuming known shapes or cameras. The composition of foundation-model cues, motion subtraction, and contact-aware neural fields is a practical advance with potential applications in AR/VR and robotics. Release of code and data is a clear strength that supports reproducibility.

major comments (2)
  1. [§3.1] §3.1 (SfM stabilization): The central claim that foundation-model depth and motion cues suffice to stabilize SfM on low-texture novel-object regions is load-bearing, yet the manuscript provides no quantitative per-object error analysis or failure-case breakdown on the new dataset; systematic bias here would directly corrupt the coarse object trajectory before registration.
  2. [§3.2] §3.2 (motion subtraction): The subtraction of estimated camera motion from apparent foreground motion to isolate true object motion assumes residuals are negligible; any correlated error between background SfM and foreground cues would misalign the independently estimated human (transformed by the same trajectory) with the object and scene in the world frame, undermining the subsequent compositional SDF stage.
minor comments (2)
  1. [Abstract] The abstract states outperformance on 'novel-view synthesis and 4D reconstruction' but does not name the exact metrics or list the baselines; adding this would improve clarity.
  2. [Table 1] Table 1 (quantitative results): the reported numbers for the new dataset would benefit from standard deviations across sequences to indicate variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point by point below. Revisions have been made where they strengthen the paper without altering its core contributions.

read point-by-point responses
  1. Referee: [§3.1] §3.1 (SfM stabilization): The central claim that foundation-model depth and motion cues suffice to stabilize SfM on low-texture novel-object regions is load-bearing, yet the manuscript provides no quantitative per-object error analysis or failure-case breakdown on the new dataset; systematic bias here would directly corrupt the coarse object trajectory before registration.

    Authors: We appreciate this observation. The overall novel-view synthesis and 4D reconstruction metrics on the new dataset already indicate that the stabilized SfM produces usable coarse trajectories, but we agree a targeted per-object analysis would make the claim more robust. In the revised manuscript we have added a supplementary table reporting per-object SfM reprojection and depth errors on the captured sequences, together with selected failure-case visualizations. These results show that foundation-model cues reduce drift on low-texture regions without introducing systematic bias that would propagate to later stages. revision: yes

  2. Referee: [§3.2] §3.2 (motion subtraction): The subtraction of estimated camera motion from apparent foreground motion to isolate true object motion assumes residuals are negligible; any correlated error between background SfM and foreground cues would misalign the independently estimated human (transformed by the same trajectory) with the object and scene in the world frame, undermining the subsequent compositional SDF stage.

    Authors: We acknowledge the possibility of residual correlation between background SfM and foreground cues. The design separates foreground and background processing in Stage 1 precisely to limit such correlation, and Stage 3’s compositional neural SDF with differentiable contact priors is explicitly intended to correct small registration residuals. Ablation results already demonstrate that omitting motion subtraction degrades final accuracy. To address the concern directly we have added a short sensitivity analysis in the revision (and supplementary material) quantifying how moderate residuals affect final contact and reconstruction metrics; the analysis confirms that the refinement stage compensates for the level of error observed in our data. revision: partial

Circularity Check

0 steps flagged

RHINO pipeline composes external foundation models and off-the-shelf estimators with novel registration and refinement stages

full rationale

The paper's three-step framework first applies 3D-aware foundation models to stabilize SfM on foreground and background pixels, then uses an independent off-the-shelf human estimator followed by camera-motion subtraction for registration into a common frame, and finally refines via a new compositional neural field with SDFs and differentiable contact priors. No step reduces the output to a fitted parameter by construction, renames a known result, or relies on a load-bearing self-citation chain; each stage introduces independent content or external components whose outputs are not equivalent to the inputs by definition. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

With only the abstract available, the ledger is necessarily incomplete. The method relies on the correctness of external foundation models and human estimators whose internal assumptions are not re-derived here.

axioms (2)
  • domain assumption 3D-aware foundation models produce usable coarse shape and motion cues on low-texture foreground regions
    Invoked in the first stage to stabilize SfM
  • domain assumption Off-the-shelf human estimation and SfM camera motion can be subtracted to isolate object motion without large residual errors
    Central to the second registration step

pith-pipeline@v0.9.0 · 5893 in / 1364 out tokens · 38481 ms · 2026-05-19T20:12:38.487550+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages

  1. [1]

    SDFit: 3D object pose and shape by fitting a morphable SDF to a single image

    Dimitrije Anti ´c, Georgios Paschalidis, Shashank Tripathi, Theo Gevers, Sai Kumar Dwivedi, and Dimitrios Tzionas. SDFit: 3D object pose and shape by fitting a morphable SDF to a single image. InInternational Conference on Computer Vision (ICCV), 2025. 3

  2. [2]

    BEHA VE: Dataset and method for tracking human object interactions

    Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHA VE: Dataset and method for tracking human object interactions. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3, 6

  3. [3]

    FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models

    Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models. InEuropean Conference on Computer Vision (ECCV), 2024. 3

  4. [4]

    Back on track: Bundle adjustment for dynamic scene recon- struction

    Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, and Daniel Cremers. Back on track: Bundle adjustment for dynamic scene recon- struction. InInternational Conference on Computer Vision (ICCV), 2025. 2

  5. [5]

    Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose es- timation with human-object interaction and physical com- monsense

    Yixin Chen, Siyuan Huang, Tao Yuan, Yixin Zhu, Siyuan Qi, and Song-Chun Zhu. Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose es- timation with human-object interaction and physical com- monsense. InInternational Conference on Computer Vision (ICCV), 2019. 3

  6. [6]

    Human3r: Everyone everywhere all at once

    Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, and Pons-Moll Gerard. Human3R: Everyone every- where all at once. arXiv:2510.06219, 2025. 3

  7. [7]

    High-quality streamable free-viewpoint video.Transactions on Graphics (TOG), 34(4), 2015

    Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Den- nis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video.Transactions on Graphics (TOG), 34(4), 2015. 6

  8. [8]

    Black, and Dimitrios Tzionas

    Alp ´ar Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Ar- jun Lakshmipathy, Agniv Chatterjee, Michael J. Black, and Dimitrios Tzionas. PICO: Reconstructing 3D people in con- tact with objects. InComputer Vision and Pattern Recogni- tion (CVPR), 2025. 3

  9. [9]

    HSC4D: Human-centered 4D scene capture in large-scale indoor-outdoor space using wearable imus and lidar

    Yudi Dai, Yitai Lin, Chenglu Wen, Siqi Shen, Lan Xu, Jingyi Yu, Yuexin Ma, and Cheng Wang. HSC4D: Human-centered 4D scene capture in large-scale indoor-outdoor space using wearable imus and lidar. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

  10. [10]

    SuperPoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. SuperPoint: Self-supervised interest point detection and description. InComputer Vision and Pattern Recognition Workshops (CVPRw), 2018. 2, 3, 7, 8, 13

  11. [11]

    SO-Pose: Exploiting self- occlusion for direct 6D pose estimation

    Yan Di, Fabian Manhardt, Gu Wang, Xiangyang Ji, Nassir Navab, and Federico Tombari. SO-Pose: Exploiting self- occlusion for direct 6D pose estimation. InInternational Conference on Computer Vision (ICCV), 2021. 3

  12. [12]

    Black, and Dimitrios Tzionas

    Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, and Dimitrios Tzionas. POCO: 3D pose and shape estimation using confidence. InInternational Con- ference on 3D Vision (3DV), 2024. 2

  13. [13]

    Black, and Dim- itrios Tzionas

    Sai Kumar Dwivedi, Dimitrije Anti ´c, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, and Dim- itrios Tzionas. InteractVLM: 3D interaction reasoning from 2D foundational models. InComputer Vision and Pattern Recognition (CVPR), 2025. 3, 5, 13

  14. [14]

    HOLD: Category-agnostic 3D reconstruction of interacting hands and objects from video

    Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J Black, and Ot- mar Hilliges. HOLD: Category-agnostic 3D reconstruction of interacting hands and objects from video. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6, 7, 8, 14, 15, 16

  15. [15]

    Betsu-Betsu: Multi-view separable 3D reconstruction of two interacting objects

    Suhas Gopal, Rishabh Dabral, Vladislav Golyanik, and Christian Theobalt. Betsu-Betsu: Multi-view separable 3D reconstruction of two interacting objects. InInternational Conference on 3D Vision (3DV), 2025. 3

  16. [16]

    Implicit geometric regularization for learning shapes

    Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. InInternational Conference on Machine Learning (ICML), 2020. 14

  17. [17]

    Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition

    Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition. InCom- puter Vision and Pattern Recognition (CVPR), 2023. 2, 4, 5

  18. [18]

    Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior

    Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior. InComputer Vision and Pattern Recognition (CVPR), 2025. 2

  19. [19]

    Single path one- shot neural architecture search with uniform sampling

    Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one- shot neural architecture search with uniform sampling. In European Conference on Computer Vision (ECCV), 2020. 3

  20. [20]

    From 3D scene geometry to human workspace

    Abhinav Gupta, Scott Satkin, Alexei A Efros, and Martial Hebert. From 3D scene geometry to human workspace. In Computer Vision and Pattern Recognition (CVPR), 2011. 3

  21. [21]

    Human POSEitioning system (HPS): 3D hu- man pose estimation and self-localization in large scenes from body-mounted sensors

    Vladimir Guzov, Aymen Mir, Torsten Sattler, and Gerard Pons-Moll. Human POSEitioning system (HPS): 3D hu- man pose estimation and self-localization in large scenes from body-mounted sensors. InComputer Vision and Pat- tern Recognition (CVPR), 2021. 2, 3 9

  22. [22]

    Resolving 3D human pose ambigui- ties with 3D scene constraints

    Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3D human pose ambigui- ties with 3D scene constraints. InInternational Conference on Computer Vision (ICCV), 2019. 2, 3

  23. [23]

    OnePose++: Keypoint-free one- shot object pose estimation without CAD models

    Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-free one- shot object pose estimation without CAD models. InNeural Information Processing Systems (NeurIPS), 2022. 2, 3, 7

  24. [24]

    Capturing and inferring dense full-body human-scene contact

    Chun-Hao P Huang, Hongwei Yi, Markus H ¨oschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J Black. Capturing and inferring dense full-body human-scene contact. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3

  25. [25]

    Black, and Dim- itrios Tzionas

    Yinghao Huang, Omid Taheri, Michael J. Black, and Dim- itrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction from multi-view RGB-D images.International Journal of Computer Vision (IJCV),

  26. [26]

    NeuMan: Neural human radiance field from a single video

    Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural human radiance field from a single video. InEuropean Conference on Computer Vision (ECCV), 2022. 4

  27. [27]

    NeuralHOFu- sion: Neural volumetric rendering under human-object in- teractions

    Yuheng Jiang, Suyi Jiang, Guoxing Sun, Zhuo Su, Kaiwen Guo, Minye Wu, Jingyi Yu, and Lan Xu. NeuralHOFu- sion: Neural volumetric rendering under human-object in- teractions. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

  28. [28]

    MultiPly: Re- construction of multiple people from monocular video in the wild

    Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, and Jie Song. MultiPly: Re- construction of multiple people from monocular video in the wild. InComputer Vision and Pattern Recognition (CVPR),

  29. [29]

    EMDB: The electromagnetic database of global 3D human pose and shape in the wild

    Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tian- jian Jiang, Chengcheng Tang, Juan Jos ´e Z ´arate, and Otmar Hilliges. EMDB: The electromagnetic database of global 3D human pose and shape in the wild. InInternational Confer- ence on Computer Vision (ICCV), 2023. 3

  30. [30]

    Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito

    Rawal Khirodkar, Timur M. Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vi- sion models. InEuropean Conference on Computer Vision (ECCV), 2024. 4

  31. [31]

    Hedvig Kjellstr ¨om, Danica Kragic, and Michael J. Black. Tracking people interacting with objects. InComputer Vi- sion and Pattern Recognition (CVPR), 2010. 3

  32. [32]

    Any6D: Model-free 6D pose estimation of novel objects

    Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, and Kuk-Jin Yoon. Any6D: Model-free 6D pose estimation of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2025. 3

  33. [33]

    Ground- ing image matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing image matching in 3D with MASt3R. InEuropean Con- ference on Computer Vision (ECCV), 2024. 2, 3, 7, 13

  34. [34]

    MoCapDeform: Monocular 3D hu- man motion capture in deformable scenes

    Zhi Li, Soshi Shimada, Bernt Schiele, Christian Theobalt, and Vladislav Golyanik. MoCapDeform: Monocular 3D hu- man motion capture in deformable scenes. InInternational Conference on 3D Vision (3DV), 2022. 3

  35. [35]

    SAM-6D: Segment anything model meets zero-shot 6D object pose estimation

    Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. SAM-6D: Segment anything model meets zero-shot 6D object pose estimation. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

  36. [36]

    Joint optimization for 4D human-scene reconstruction in the wild

    Zhizheng Liu, Joe Lin, Wayne Wu, and Bolei Zhou. Joint optimization for 4D human-scene reconstruction in the wild. arXiv:2501.02158, 2025. 3

  37. [37]

    Luvizon, Marc Habermann, Vladislav Golyanik, Adam Kortylewski, and Christian Theobalt

    Diogo C. Luvizon, Marc Habermann, Vladislav Golyanik, Adam Kortylewski, and Christian Theobalt. Scene-aware 3D multi-human motion capture from a single camera. In Computer Graphics Forum (CGF), 2023. 3

  38. [38]

    iMapper: interaction-guided scene mapping from monocular videos.Transactions on Graphics (TOG), 38(4):92:1–92:15, 2019

    Aron Monszpart, Paul Guerrero, Duygu Ceylan, Ersin Yumer, and Niloy J Mitra. iMapper: interaction-guided scene mapping from monocular videos.Transactions on Graphics (TOG), 38(4):92:1–92:15, 2019. 3

  39. [39]

    GenFlow: Generalizable recurrent flow for 6D pose refinement of novel objects

    Sungphill Moon, Hyeontae Son, Dongcheol Hur, and Sang- wook Kim. GenFlow: Generalizable recurrent flow for 6D pose refinement of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

  40. [40]

    Reconstructing people, places, and cameras

    Lea M ¨uller, Hongsuk Choi, Anthony Zhang, Brent Yi, Jiten- dra Malik, and Angjoo Kanazawa. Reconstructing people, places, and cameras. InComputer Vision and Pattern Recog- nition (CVPR), 2025. 3

  41. [41]

    Joint reconstruction of 3D human and ob- ject via contact-based refinement transformer

    Hyeongjin Nam, Daniel Sungho Jung, Gyeongsik Moon, and Kyoung Mu Lee. Joint reconstruction of 3D human and ob- ject via contact-based refinement transformer. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

  42. [42]

    AprilTag: A robust and flexible visual fidu- cial system

    Edwin Olson. AprilTag: A robust and flexible visual fidu- cial system. InInternational Conference on Robotics and Automation (ICRA), 2011. 6

  43. [43]

    Global structure-from-motion revisited

    Linfei Pan, D ´aniel Bar ´ath, Marc Pollefeys, and Jo- hannes Lutz Sch ¨onberger. Global structure-from-motion revisited. InEuropean Conference on Computer Vision (ECCV), 2024. 3

  44. [44]

    Ar- gyros

    Pantelis Panteleris, Nikolaos Kyriazis, and Antonis A. Ar- gyros. 3D tracking of human hands in interaction with unknown objects. InBritish Machine Vision Conference (BMVC), 2015. 3

  45. [45]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. InComputer Vision and Pat- tern Recognition (CVPR), 2019. 3, 5, 14

  46. [46]

    3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting

    Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting. InComputer Vision and Pattern Recognition (CVPR), 2024. 2

  47. [47]

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, and Ross Girshick et. el. SAM 2: Segment any- thing in images and videos. InInternational Conference on Learning Representations...

  48. [48]

    Grounding dino 1.5: Advance the” edge” of open-set object detection

    Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wen- long Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xi- aoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, and Lei Zhanget. Grounding DINO 1.5: Advance the “edge” of open-set object detection. arXiv:2405.10300, 2024. 13 10

  49. [49]

    HAMSt3R: Human-aware multi-view stereo 3D reconstruction

    Sara Rojas, Matthieu Armando, Bernard Ghamen, Philippe Weinzaepfel, Vincent Leroy, and Gregory Ro- gez. HAMSt3R: Human-aware multi-view stereo 3D reconstruction. InInternational Conference on Computer Vision (ICCV), 2025. 3

  50. [50]

    SuperGlue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. InComputer Vision and Pattern Recognition (CVPR), 2020. 7, 8, 13

  51. [51]

    PiGraphs: Learning interac- tion snapshots from observations.Transactions on Graphics (TOG), 35(4):139:1–139:12, 2016

    Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. PiGraphs: Learning interac- tion snapshots from observations.Transactions on Graphics (TOG), 35(4):139:1–139:12, 2016. 3

  52. [52]

    Total-Recon: Deformable scene reconstruction for embodied view synthesis

    Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, and Deva Ramanan. Total-Recon: Deformable scene reconstruction for embodied view synthesis. InInternational Conference on Computer Vision (ICCV), 2023. 3

  53. [53]

    LoFTR: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. InComputer Vision and Pattern Recogni- tion (CVPR), 2021. 2, 3, 7, 8, 13

  54. [54]

    OnePose: One-shot object pose estimation without CAD models

    Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, and Xiaowei Zhou. OnePose: One-shot object pose estimation without CAD models. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3, 7

  55. [55]

    AiOS: All-in-one-stage ex- pressive human pose and shape estimation

    Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi-Sing Leung, Ziwei Liu, Lei Yang, and Zhongang Cai. AiOS: All-in-one-stage ex- pressive human pose and shape estimation. InComputer Vi- sion and Pattern Recognition (CVPR), 2024. 2, 3, 13

  56. [56]

    DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras

    Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. InNeu- ral Information Processing Systems (NeurIPS), 2021. 2

  57. [57]

    Huang, Taheri Omid, Michael J

    Shashank Tripathi, Lea M ¨uller, Chun-Hao P. Huang, Taheri Omid, Michael J. Black, and Dimitrios Tzionas. 3D human pose estimation via intuitive physics. InComputer Vision and Pattern Recognition (CVPR), pages 4713–4725, 2023. 5

  58. [58]

    3D object reconstruc- tion from hand-object interactions

    Dimitrios Tzionas and Juergen Gall. 3D object reconstruc- tion from hand-object interactions. InInternational Confer- ence on Computer Vision (ICCV), 2015. 3

  59. [59]

    Least-squares estimation of transforma- tion parameters between two point patterns.Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1991

    Shinji Umeyama. Least-squares estimation of transforma- tion parameters between two point patterns.Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1991. 4

  60. [60]

    He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. InComputer Vision and Pattern Recognition (CVPR), 2019. 3

  61. [61]

    VGGT: Visual Geometry Grounded Transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. InComputer Vi- sion and Pattern Recognition (CVPR), 2025. 2

  62. [62]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InComputer Vision and Pattern Recogni- tion (CVPR), 2024. 2

  63. [63]

    Mul- tiscale structural similarity for image quality assessment

    Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Mul- tiscale structural similarity for image quality assessment. InAsilomar Conference on Signals, Systems & Computers,

  64. [64]

    BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects

    Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas M ¨uller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects. InComputer Vision and Pattern Recognition (CVPR), 2023. 7

  65. [65]

    FoundationPose: Unified 6D pose estimation and tracking of novel objects

    Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

  66. [66]

    Holistic 3D human and scene mesh estimation from single view images

    Zhenzhen Weng and Serena Yeung. Holistic 3D human and scene mesh estimation from single view images. InCom- puter Vision and Pattern Recognition (CVPR), 2021. 3

  67. [67]

    CHORE: Contact, human and object reconstruction from a single RGB image

    Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. CHORE: Contact, human and object reconstruction from a single RGB image. InEuropean Conference on Computer Vision (ECCV), 2022. 3

  68. [68]

    Template free reconstruction of human- object interaction with procedural interaction generation

    Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. Template free reconstruction of human- object interaction with procedural interaction generation. In Computer Vision and Pattern Recognition (CVPR), 2024. 3

  69. [69]

    In- terTrack: Tracking human object interaction without object templates

    Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. In- terTrack: Tracking human object interaction without object templates. InInternational Conference on 3D Vision (3DV),

  70. [70]

    HSR: Holistic 3D human-scene recon- struction from monocular videos

    Lixin Xue, Chen Guo, Chengwei Zheng, Fangjinhua Wang, Tianjian Jiang, Hsuan-I Ho, Manuel Kaufmann, Jie Song, and Hilliges Otmar. HSR: Holistic 3D human-scene recon- struction from monocular videos. InEuropean Conference on Computer Vision (ECCV), 2024. 2, 3, 4, 5, 6, 13, 14, 15, 16

  71. [71]

    PPR: Physically plausible recon- struction from monocular videos

    Gengshan Yang, Shuo Yang, John Z Zhang, Zachary Manch- ester, and Deva Ramanan. PPR: Physically plausible recon- struction from monocular videos. InInternational Confer- ence on Computer Vision (ICCV), 2023. 3

  72. [72]

    Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, and Michael J

    Hongwei Yi, Chun-Hao P. Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, and Michael J. Black. Human-aware object placement for visual environment reconstruction. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

  73. [73]

    Neural- Dome: A neural modeling pipeline on multi-view human- object interactions

    Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Neural- Dome: A neural modeling pipeline on multi-view human- object interactions. InComputer Vision and Pattern Recog- nition (CVPR), 2023. 3, 6

  74. [74]

    HOI-M3: Capture multiple humans and objects inter- action within contextual environment

    Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. HOI-M3: Capture multiple humans and objects inter- action within contextual environment. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6

  75. [75]

    Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa

    Jason Y . Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3D human-object spatial arrangements from a single image in the wild. InEuropean Conference on Computer Vision (ECCV), 2020. 3 11

  76. [76]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018. 6

  77. [77]

    Ego- Body: Human body shape and motion of interacting peo- ple from head-mounted devices

    Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo, and Siyu Tang. Ego- Body: Human body shape and motion of interacting peo- ple from head-mounted devices. InEuropean Conference on Computer Vision (ECCV), 2022. 3

  78. [78]

    Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, and Martin R. Oswald. ODHSR: Online dense 3D reconstruction of humans and scenes from monocular videos. InComputer Vision and Pattern Recognition (CVPR), 2025. 2, 3

  79. [79]

    contact frame

    Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’M HOI: Inertia-aware monocular capture of 3D human-object interactions. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6 12 RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos Supplementary Material We discuss dat...