RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

Chen Guo; Chengwei Zheng; Dimitrios Tzionas; Georgios Paschalidis; Juan Zarate; Lixin Xue; Manuel Kaufmann

arxiv: 2605.17014 · v1 · pith:Z43EXVQMnew · submitted 2026-05-16 · 💻 cs.CV

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

Lixin Xue , Chengwei Zheng , Georgios Paschalidis , Chen Guo , Manuel Kaufmann , Juan Zarate , Dimitrios Tzionas This is my paper

Pith reviewed 2026-05-19 20:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D reconstructionhuman-object interactionmonocular videonovel objectsneural fieldscontact priorsmotion separation4D capture

0 comments

The pith

A three-step method reconstructs 3D humans, novel objects, and scenes together from monocular videos of interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that a person, a never-before-seen object they are using, and the room around them can all be turned into accurate 3D models from a single video taken with a moving camera. This matters because most videos of real actions mix the camera's shake with the object's movement and hide depths, making separate reconstruction of each part difficult and their interactions unrealistic. The work demonstrates this by breaking the problem into getting initial 3D hints from modern AI models, pulling apart the motions to line things up, and then polishing the models with a neural system that respects how things should touch.

Core claim

The paper establishes that by first using cues from 3D foundation models to get rough shapes and motions for both the object and scene despite poor texture, then removing the camera's movement to find the object's real motion and align it with the estimated human pose in a shared space, and finally optimizing a joint neural shape model with rules that pull surfaces together at contacts while keeping them from overlapping, one can obtain coherent 3D reconstructions of the full interaction.

What carries the argument

The compositional neural field built from separate signed-distance functions for each part, which supports adding contact attraction and interpenetration penalties during optimization.

Load-bearing premise

The approach depends on 3D foundation models providing reliable cues to build initial shapes from motion cues even on smooth object surfaces, and on the subtraction of camera movement from apparent motion leaving a clean estimate of the object's true movement.

What would settle it

A clear falsifier would be if the final 3D object model deviates significantly from the actual shape measured by the volumetric capture system, or if the human and object models show overlapping volumes at their contact points despite the priors.

Figures

Figures reproduced from arXiv: 2605.17014 by Chen Guo, Chengwei Zheng, Dimitrios Tzionas, Georgios Paschalidis, Juan Zarate, Lixin Xue, Manuel Kaufmann.

**Figure 1.** Figure 1: We develop RHINO, a novel framework that reconstructs detailed (dynamic) 3D human-object interactions (HOI) and the [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2 [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Camera & Object motion (Sec. 3.1, Sec. 3.2). When both the camera and object move, their motion entangles into “apparent” motion. To disentangle them, we (a) estimate camera motion in the world frame via SfM on scene-only pixels, (b) estimate apparent motion via SfM on object-only pixels, (c–d) estimate object motion in the world frame by “removing” the camera motion from the apparent one. pose, so their p… view at source ↗

**Figure 4.** Figure 4: Method Overview. Starting with initialized global human and decoupled object poses (Sec. 3.1, Sec. 3.2), we sample points along the camera ray for the human, object and static scene. To enable consistent representation, sampled points are warped into canonical space using inverse LBS for the human and the estimated rigid transformation for the object. All components are then rendered holistically via compo… view at source ↗

**Figure 5.** Figure 5: Evaluation on shape reconstruction (Sec. 4.3) on our BenchRHINO dataset (Sec. 4.1). HOLD [14] struggles with noisy object poses (rows 2, 4) and fails to model interaction. InterTrack [69] recovers reasonable object shape but fails to model the interaction due to large human and object pose errors. Our method (RHINO) faithfully recovers interactions, which lie closer to the ground truth (GT). Input InterTra… view at source ↗

**Figure 6.** Figure 6: Evaluation on shape reconstruction on WildRHINO. InterTrack [69] fails on these out-of-distribution objects. While HOLD∗ (vanilla HOLD [14] using our method’s object poses) reconstructs good object shape, it fails to model interactions. RHINO yields reasonable reconstruction reflecting the interaction. 4.5. Ablation Study Object Pose Estimation. We compare our object pose estimation, which uses MASt3R [33… view at source ↗

**Figure 8.** Figure 8: Effects of contact. We show reconstructions of our framework with (“RHINO (Ours)”) and without (“w/o Contact Opt”) the physical losses of Eq. (12) and Eq. (13). For a bigger version with zoom-in impressions, see Supp. Mat. non-repeatable keypoint detection, leading to degenerate reconstruction. Similarly, LoFTR [53] is not accurate enough to find correspondences for objects under complex motions ( [PITH_… view at source ↗

read the original abstract

Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. Code and data are available at https://lxxue.github.io/RHINO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RHINO gives a practical three-stage pipeline for 3D human-novel object-scene reconstruction from monocular video by stabilizing SfM with foundation models then refining with compositional SDFs, and the new synchronized dataset plus released code make it worth checking even if the core assumptions need more stress-testing.

read the letter

RHINO combines foundation-model cues to stabilize SfM on low-texture novel objects, subtracts estimated camera motion to register the human and object into a shared frame, and then refines everything with compositional SDFs plus differentiable contact priors. The result is a full pipeline that outputs 3D shapes and motions for the person, the unseen manipulated object, and the static background from ordinary RGB video.

Referee Report

2 major / 2 minor

Summary. The manuscript presents RHINO, a three-stage pipeline for reconstructing 3D humans, novel manipulated objects, and static scenes in a shared world frame from monocular RGB video. Stage 1 uses 3D-aware foundation models to stabilize SfM, yielding coarse object shape/motion from foreground pixels and scene shape/camera motion from background pixels. Stage 2 estimates the human in the camera frame via an off-the-shelf method and subtracts camera motion to register all components. Stage 3 refines the shapes via a compositional neural field with per-component SDFs and differentiable contact priors that encourage surface attraction while penalizing interpenetration. Evaluation on a newly captured synchronized dataset with volumetric ground truth shows outperformance on novel-view synthesis and 4D reconstruction; ablations indicate each stage contributes.

Significance. If the results hold, the work is significant for addressing the ill-posed problem of monocular 3D reconstruction of human-object interactions with unseen objects, without assuming known shapes or cameras. The composition of foundation-model cues, motion subtraction, and contact-aware neural fields is a practical advance with potential applications in AR/VR and robotics. Release of code and data is a clear strength that supports reproducibility.

major comments (2)

[§3.1] §3.1 (SfM stabilization): The central claim that foundation-model depth and motion cues suffice to stabilize SfM on low-texture novel-object regions is load-bearing, yet the manuscript provides no quantitative per-object error analysis or failure-case breakdown on the new dataset; systematic bias here would directly corrupt the coarse object trajectory before registration.
[§3.2] §3.2 (motion subtraction): The subtraction of estimated camera motion from apparent foreground motion to isolate true object motion assumes residuals are negligible; any correlated error between background SfM and foreground cues would misalign the independently estimated human (transformed by the same trajectory) with the object and scene in the world frame, undermining the subsequent compositional SDF stage.

minor comments (2)

[Abstract] The abstract states outperformance on 'novel-view synthesis and 4D reconstruction' but does not name the exact metrics or list the baselines; adding this would improve clarity.
[Table 1] Table 1 (quantitative results): the reported numbers for the new dataset would benefit from standard deviations across sequences to indicate variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major comment point by point below. Revisions have been made where they strengthen the paper without altering its core contributions.

read point-by-point responses

Referee: [§3.1] §3.1 (SfM stabilization): The central claim that foundation-model depth and motion cues suffice to stabilize SfM on low-texture novel-object regions is load-bearing, yet the manuscript provides no quantitative per-object error analysis or failure-case breakdown on the new dataset; systematic bias here would directly corrupt the coarse object trajectory before registration.

Authors: We appreciate this observation. The overall novel-view synthesis and 4D reconstruction metrics on the new dataset already indicate that the stabilized SfM produces usable coarse trajectories, but we agree a targeted per-object analysis would make the claim more robust. In the revised manuscript we have added a supplementary table reporting per-object SfM reprojection and depth errors on the captured sequences, together with selected failure-case visualizations. These results show that foundation-model cues reduce drift on low-texture regions without introducing systematic bias that would propagate to later stages. revision: yes
Referee: [§3.2] §3.2 (motion subtraction): The subtraction of estimated camera motion from apparent foreground motion to isolate true object motion assumes residuals are negligible; any correlated error between background SfM and foreground cues would misalign the independently estimated human (transformed by the same trajectory) with the object and scene in the world frame, undermining the subsequent compositional SDF stage.

Authors: We acknowledge the possibility of residual correlation between background SfM and foreground cues. The design separates foreground and background processing in Stage 1 precisely to limit such correlation, and Stage 3’s compositional neural SDF with differentiable contact priors is explicitly intended to correct small registration residuals. Ablation results already demonstrate that omitting motion subtraction degrades final accuracy. To address the concern directly we have added a short sensitivity analysis in the revision (and supplementary material) quantifying how moderate residuals affect final contact and reconstruction metrics; the analysis confirms that the refinement stage compensates for the level of error observed in our data. revision: partial

Circularity Check

0 steps flagged

RHINO pipeline composes external foundation models and off-the-shelf estimators with novel registration and refinement stages

full rationale

The paper's three-step framework first applies 3D-aware foundation models to stabilize SfM on foreground and background pixels, then uses an independent off-the-shelf human estimator followed by camera-motion subtraction for registration into a common frame, and finally refines via a new compositional neural field with SDFs and differentiable contact priors. No step reduces the output to a fitted parameter by construction, renames a known result, or relies on a load-bearing self-citation chain; each stage introduces independent content or external components whose outputs are not equivalent to the inputs by definition. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

With only the abstract available, the ledger is necessarily incomplete. The method relies on the correctness of external foundation models and human estimators whose internal assumptions are not re-derived here.

axioms (2)

domain assumption 3D-aware foundation models produce usable coarse shape and motion cues on low-texture foreground regions
Invoked in the first stage to stabilize SfM
domain assumption Off-the-shelf human estimation and SfM camera motion can be subtracted to isolate object motion without large residual errors
Central to the second registration step

pith-pipeline@v0.9.0 · 5893 in / 1364 out tokens · 38481 ms · 2026-05-19T20:12:38.487550+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages

[1]

SDFit: 3D object pose and shape by fitting a morphable SDF to a single image

Dimitrije Anti ´c, Georgios Paschalidis, Shashank Tripathi, Theo Gevers, Sai Kumar Dwivedi, and Dimitrios Tzionas. SDFit: 3D object pose and shape by fitting a morphable SDF to a single image. InInternational Conference on Computer Vision (ICCV), 2025. 3

work page 2025
[2]

BEHA VE: Dataset and method for tracking human object interactions

Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHA VE: Dataset and method for tracking human object interactions. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3, 6

work page 2022
[3]

FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models

Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models. InEuropean Conference on Computer Vision (ECCV), 2024. 3

work page 2024
[4]

Back on track: Bundle adjustment for dynamic scene recon- struction

Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, and Daniel Cremers. Back on track: Bundle adjustment for dynamic scene recon- struction. InInternational Conference on Computer Vision (ICCV), 2025. 2

work page 2025
[5]

Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose es- timation with human-object interaction and physical com- monsense

Yixin Chen, Siyuan Huang, Tao Yuan, Yixin Zhu, Siyuan Qi, and Song-Chun Zhu. Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose es- timation with human-object interaction and physical com- monsense. InInternational Conference on Computer Vision (ICCV), 2019. 3

work page 2019
[6]

Human3r: Everyone everywhere all at once

Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, and Pons-Moll Gerard. Human3R: Everyone every- where all at once. arXiv:2510.06219, 2025. 3

work page arXiv 2025
[7]

High-quality streamable free-viewpoint video.Transactions on Graphics (TOG), 34(4), 2015

Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Den- nis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video.Transactions on Graphics (TOG), 34(4), 2015. 6

work page 2015
[8]

Black, and Dimitrios Tzionas

Alp ´ar Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Ar- jun Lakshmipathy, Agniv Chatterjee, Michael J. Black, and Dimitrios Tzionas. PICO: Reconstructing 3D people in con- tact with objects. InComputer Vision and Pattern Recogni- tion (CVPR), 2025. 3

work page 2025
[9]

HSC4D: Human-centered 4D scene capture in large-scale indoor-outdoor space using wearable imus and lidar

Yudi Dai, Yitai Lin, Chenglu Wen, Siqi Shen, Lan Xu, Jingyi Yu, Yuexin Ma, and Cheng Wang. HSC4D: Human-centered 4D scene capture in large-scale indoor-outdoor space using wearable imus and lidar. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

work page 2022
[10]

SuperPoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. SuperPoint: Self-supervised interest point detection and description. InComputer Vision and Pattern Recognition Workshops (CVPRw), 2018. 2, 3, 7, 8, 13

work page 2018
[11]

SO-Pose: Exploiting self- occlusion for direct 6D pose estimation

Yan Di, Fabian Manhardt, Gu Wang, Xiangyang Ji, Nassir Navab, and Federico Tombari. SO-Pose: Exploiting self- occlusion for direct 6D pose estimation. InInternational Conference on Computer Vision (ICCV), 2021. 3

work page 2021
[12]

Black, and Dimitrios Tzionas

Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, and Dimitrios Tzionas. POCO: 3D pose and shape estimation using confidence. InInternational Con- ference on 3D Vision (3DV), 2024. 2

work page 2024
[13]

Black, and Dim- itrios Tzionas

Sai Kumar Dwivedi, Dimitrije Anti ´c, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, and Dim- itrios Tzionas. InteractVLM: 3D interaction reasoning from 2D foundational models. InComputer Vision and Pattern Recognition (CVPR), 2025. 3, 5, 13

work page 2025
[14]

HOLD: Category-agnostic 3D reconstruction of interacting hands and objects from video

Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J Black, and Ot- mar Hilliges. HOLD: Category-agnostic 3D reconstruction of interacting hands and objects from video. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6, 7, 8, 14, 15, 16

work page 2024
[15]

Betsu-Betsu: Multi-view separable 3D reconstruction of two interacting objects

Suhas Gopal, Rishabh Dabral, Vladislav Golyanik, and Christian Theobalt. Betsu-Betsu: Multi-view separable 3D reconstruction of two interacting objects. InInternational Conference on 3D Vision (3DV), 2025. 3

work page 2025
[16]

Implicit geometric regularization for learning shapes

Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. InInternational Conference on Machine Learning (ICML), 2020. 14

work page 2020
[17]

Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition

Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition. InCom- puter Vision and Pattern Recognition (CVPR), 2023. 2, 4, 5

work page 2023
[18]

Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior

Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior. InComputer Vision and Pattern Recognition (CVPR), 2025. 2

work page 2025
[19]

Single path one- shot neural architecture search with uniform sampling

Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one- shot neural architecture search with uniform sampling. In European Conference on Computer Vision (ECCV), 2020. 3

work page 2020
[20]

From 3D scene geometry to human workspace

Abhinav Gupta, Scott Satkin, Alexei A Efros, and Martial Hebert. From 3D scene geometry to human workspace. In Computer Vision and Pattern Recognition (CVPR), 2011. 3

work page 2011
[21]

Human POSEitioning system (HPS): 3D hu- man pose estimation and self-localization in large scenes from body-mounted sensors

Vladimir Guzov, Aymen Mir, Torsten Sattler, and Gerard Pons-Moll. Human POSEitioning system (HPS): 3D hu- man pose estimation and self-localization in large scenes from body-mounted sensors. InComputer Vision and Pat- tern Recognition (CVPR), 2021. 2, 3 9

work page 2021
[22]

Resolving 3D human pose ambigui- ties with 3D scene constraints

Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3D human pose ambigui- ties with 3D scene constraints. InInternational Conference on Computer Vision (ICCV), 2019. 2, 3

work page 2019
[23]

OnePose++: Keypoint-free one- shot object pose estimation without CAD models

Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-free one- shot object pose estimation without CAD models. InNeural Information Processing Systems (NeurIPS), 2022. 2, 3, 7

work page 2022
[24]

Capturing and inferring dense full-body human-scene contact

Chun-Hao P Huang, Hongwei Yi, Markus H ¨oschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J Black. Capturing and inferring dense full-body human-scene contact. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3

work page 2022
[25]

Black, and Dim- itrios Tzionas

Yinghao Huang, Omid Taheri, Michael J. Black, and Dim- itrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction from multi-view RGB-D images.International Journal of Computer Vision (IJCV),

work page
[26]

NeuMan: Neural human radiance field from a single video

Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural human radiance field from a single video. InEuropean Conference on Computer Vision (ECCV), 2022. 4

work page 2022
[27]

NeuralHOFu- sion: Neural volumetric rendering under human-object in- teractions

Yuheng Jiang, Suyi Jiang, Guoxing Sun, Zhuo Su, Kaiwen Guo, Minye Wu, Jingyi Yu, and Lan Xu. NeuralHOFu- sion: Neural volumetric rendering under human-object in- teractions. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

work page 2022
[28]

MultiPly: Re- construction of multiple people from monocular video in the wild

Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, and Jie Song. MultiPly: Re- construction of multiple people from monocular video in the wild. InComputer Vision and Pattern Recognition (CVPR),

work page
[29]

EMDB: The electromagnetic database of global 3D human pose and shape in the wild

Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tian- jian Jiang, Chengcheng Tang, Juan Jos ´e Z ´arate, and Otmar Hilliges. EMDB: The electromagnetic database of global 3D human pose and shape in the wild. InInternational Confer- ence on Computer Vision (ICCV), 2023. 3

work page 2023
[30]

Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito

Rawal Khirodkar, Timur M. Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vi- sion models. InEuropean Conference on Computer Vision (ECCV), 2024. 4

work page 2024
[31]

Hedvig Kjellstr ¨om, Danica Kragic, and Michael J. Black. Tracking people interacting with objects. InComputer Vi- sion and Pattern Recognition (CVPR), 2010. 3

work page 2010
[32]

Any6D: Model-free 6D pose estimation of novel objects

Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, and Kuk-Jin Yoon. Any6D: Model-free 6D pose estimation of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2025. 3

work page 2025
[33]

Ground- ing image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing image matching in 3D with MASt3R. InEuropean Con- ference on Computer Vision (ECCV), 2024. 2, 3, 7, 13

work page 2024
[34]

MoCapDeform: Monocular 3D hu- man motion capture in deformable scenes

Zhi Li, Soshi Shimada, Bernt Schiele, Christian Theobalt, and Vladislav Golyanik. MoCapDeform: Monocular 3D hu- man motion capture in deformable scenes. InInternational Conference on 3D Vision (3DV), 2022. 3

work page 2022
[35]

SAM-6D: Segment anything model meets zero-shot 6D object pose estimation

Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. SAM-6D: Segment anything model meets zero-shot 6D object pose estimation. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024
[36]

Joint optimization for 4D human-scene reconstruction in the wild

Zhizheng Liu, Joe Lin, Wayne Wu, and Bolei Zhou. Joint optimization for 4D human-scene reconstruction in the wild. arXiv:2501.02158, 2025. 3

work page arXiv 2025
[37]

Luvizon, Marc Habermann, Vladislav Golyanik, Adam Kortylewski, and Christian Theobalt

Diogo C. Luvizon, Marc Habermann, Vladislav Golyanik, Adam Kortylewski, and Christian Theobalt. Scene-aware 3D multi-human motion capture from a single camera. In Computer Graphics Forum (CGF), 2023. 3

work page 2023
[38]

iMapper: interaction-guided scene mapping from monocular videos.Transactions on Graphics (TOG), 38(4):92:1–92:15, 2019

Aron Monszpart, Paul Guerrero, Duygu Ceylan, Ersin Yumer, and Niloy J Mitra. iMapper: interaction-guided scene mapping from monocular videos.Transactions on Graphics (TOG), 38(4):92:1–92:15, 2019. 3

work page 2019
[39]

GenFlow: Generalizable recurrent flow for 6D pose refinement of novel objects

Sungphill Moon, Hyeontae Son, Dongcheol Hur, and Sang- wook Kim. GenFlow: Generalizable recurrent flow for 6D pose refinement of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024
[40]

Reconstructing people, places, and cameras

Lea M ¨uller, Hongsuk Choi, Anthony Zhang, Brent Yi, Jiten- dra Malik, and Angjoo Kanazawa. Reconstructing people, places, and cameras. InComputer Vision and Pattern Recog- nition (CVPR), 2025. 3

work page 2025
[41]

Joint reconstruction of 3D human and ob- ject via contact-based refinement transformer

Hyeongjin Nam, Daniel Sungho Jung, Gyeongsik Moon, and Kyoung Mu Lee. Joint reconstruction of 3D human and ob- ject via contact-based refinement transformer. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024
[42]

AprilTag: A robust and flexible visual fidu- cial system

Edwin Olson. AprilTag: A robust and flexible visual fidu- cial system. InInternational Conference on Robotics and Automation (ICRA), 2011. 6

work page 2011
[43]

Global structure-from-motion revisited

Linfei Pan, D ´aniel Bar ´ath, Marc Pollefeys, and Jo- hannes Lutz Sch ¨onberger. Global structure-from-motion revisited. InEuropean Conference on Computer Vision (ECCV), 2024. 3

work page 2024
[44]

Ar- gyros

Pantelis Panteleris, Nikolaos Kyriazis, and Antonis A. Ar- gyros. 3D tracking of human hands in interaction with unknown objects. InBritish Machine Vision Conference (BMVC), 2015. 3

work page 2015
[45]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. InComputer Vision and Pat- tern Recognition (CVPR), 2019. 3, 5, 14

work page 2019
[46]

3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting. InComputer Vision and Pattern Recognition (CVPR), 2024. 2

work page 2024
[47]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, and Ross Girshick et. el. SAM 2: Segment any- thing in images and videos. InInternational Conference on Learning Representations...

work page 2025
[48]

Grounding dino 1.5: Advance the” edge” of open-set object detection

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wen- long Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xi- aoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, and Lei Zhanget. Grounding DINO 1.5: Advance the “edge” of open-set object detection. arXiv:2405.10300, 2024. 13 10

work page arXiv 2024
[49]

HAMSt3R: Human-aware multi-view stereo 3D reconstruction

Sara Rojas, Matthieu Armando, Bernard Ghamen, Philippe Weinzaepfel, Vincent Leroy, and Gregory Ro- gez. HAMSt3R: Human-aware multi-view stereo 3D reconstruction. InInternational Conference on Computer Vision (ICCV), 2025. 3

work page 2025
[50]

SuperGlue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. InComputer Vision and Pattern Recognition (CVPR), 2020. 7, 8, 13

work page 2020
[51]

PiGraphs: Learning interac- tion snapshots from observations.Transactions on Graphics (TOG), 35(4):139:1–139:12, 2016

Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. PiGraphs: Learning interac- tion snapshots from observations.Transactions on Graphics (TOG), 35(4):139:1–139:12, 2016. 3

work page 2016
[52]

Total-Recon: Deformable scene reconstruction for embodied view synthesis

Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, and Deva Ramanan. Total-Recon: Deformable scene reconstruction for embodied view synthesis. InInternational Conference on Computer Vision (ICCV), 2023. 3

work page 2023
[53]

LoFTR: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. InComputer Vision and Pattern Recogni- tion (CVPR), 2021. 2, 3, 7, 8, 13

work page 2021
[54]

OnePose: One-shot object pose estimation without CAD models

Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, and Xiaowei Zhou. OnePose: One-shot object pose estimation without CAD models. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3, 7

work page 2022
[55]

AiOS: All-in-one-stage ex- pressive human pose and shape estimation

Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi-Sing Leung, Ziwei Liu, Lei Yang, and Zhongang Cai. AiOS: All-in-one-stage ex- pressive human pose and shape estimation. InComputer Vi- sion and Pattern Recognition (CVPR), 2024. 2, 3, 13

work page 2024
[56]

DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras

Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. InNeu- ral Information Processing Systems (NeurIPS), 2021. 2

work page 2021
[57]

Huang, Taheri Omid, Michael J

Shashank Tripathi, Lea M ¨uller, Chun-Hao P. Huang, Taheri Omid, Michael J. Black, and Dimitrios Tzionas. 3D human pose estimation via intuitive physics. InComputer Vision and Pattern Recognition (CVPR), pages 4713–4725, 2023. 5

work page 2023
[58]

3D object reconstruc- tion from hand-object interactions

Dimitrios Tzionas and Juergen Gall. 3D object reconstruc- tion from hand-object interactions. InInternational Confer- ence on Computer Vision (ICCV), 2015. 3

work page 2015
[59]

Least-squares estimation of transforma- tion parameters between two point patterns.Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1991

Shinji Umeyama. Least-squares estimation of transforma- tion parameters between two point patterns.Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1991. 4

work page 1991
[60]

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. InComputer Vision and Pattern Recognition (CVPR), 2019. 3

work page 2019
[61]

VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. InComputer Vi- sion and Pattern Recognition (CVPR), 2025. 2

work page 2025
[62]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InComputer Vision and Pattern Recogni- tion (CVPR), 2024. 2

work page 2024
[63]

Mul- tiscale structural similarity for image quality assessment

Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Mul- tiscale structural similarity for image quality assessment. InAsilomar Conference on Signals, Systems & Computers,

work page
[64]

BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects

Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas M ¨uller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects. InComputer Vision and Pattern Recognition (CVPR), 2023. 7

work page 2023
[65]

FoundationPose: Unified 6D pose estimation and tracking of novel objects

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024
[66]

Holistic 3D human and scene mesh estimation from single view images

Zhenzhen Weng and Serena Yeung. Holistic 3D human and scene mesh estimation from single view images. InCom- puter Vision and Pattern Recognition (CVPR), 2021. 3

work page 2021
[67]

CHORE: Contact, human and object reconstruction from a single RGB image

Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. CHORE: Contact, human and object reconstruction from a single RGB image. InEuropean Conference on Computer Vision (ECCV), 2022. 3

work page 2022
[68]

Template free reconstruction of human- object interaction with procedural interaction generation

Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. Template free reconstruction of human- object interaction with procedural interaction generation. In Computer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024
[69]

In- terTrack: Tracking human object interaction without object templates

Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. In- terTrack: Tracking human object interaction without object templates. InInternational Conference on 3D Vision (3DV),

work page
[70]

HSR: Holistic 3D human-scene recon- struction from monocular videos

Lixin Xue, Chen Guo, Chengwei Zheng, Fangjinhua Wang, Tianjian Jiang, Hsuan-I Ho, Manuel Kaufmann, Jie Song, and Hilliges Otmar. HSR: Holistic 3D human-scene recon- struction from monocular videos. InEuropean Conference on Computer Vision (ECCV), 2024. 2, 3, 4, 5, 6, 13, 14, 15, 16

work page 2024
[71]

PPR: Physically plausible recon- struction from monocular videos

Gengshan Yang, Shuo Yang, John Z Zhang, Zachary Manch- ester, and Deva Ramanan. PPR: Physically plausible recon- struction from monocular videos. InInternational Confer- ence on Computer Vision (ICCV), 2023. 3

work page 2023
[72]

Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, and Michael J

Hongwei Yi, Chun-Hao P. Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, and Michael J. Black. Human-aware object placement for visual environment reconstruction. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

work page 2022
[73]

Neural- Dome: A neural modeling pipeline on multi-view human- object interactions

Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Neural- Dome: A neural modeling pipeline on multi-view human- object interactions. InComputer Vision and Pattern Recog- nition (CVPR), 2023. 3, 6

work page 2023
[74]

HOI-M3: Capture multiple humans and objects inter- action within contextual environment

Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. HOI-M3: Capture multiple humans and objects inter- action within contextual environment. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6

work page 2024
[75]

Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa

Jason Y . Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3D human-object spatial arrangements from a single image in the wild. InEuropean Conference on Computer Vision (ECCV), 2020. 3 11

work page 2020
[76]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018. 6

work page 2018
[77]

Ego- Body: Human body shape and motion of interacting peo- ple from head-mounted devices

Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo, and Siyu Tang. Ego- Body: Human body shape and motion of interacting peo- ple from head-mounted devices. InEuropean Conference on Computer Vision (ECCV), 2022. 3

work page 2022
[78]

Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, and Martin R. Oswald. ODHSR: Online dense 3D reconstruction of humans and scenes from monocular videos. InComputer Vision and Pattern Recognition (CVPR), 2025. 2, 3

work page 2025
[79]

contact frame

Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’M HOI: Inertia-aware monocular capture of 3D human-object interactions. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6 12 RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos Supplementary Material We discuss dat...

work page 2024

[1] [1]

SDFit: 3D object pose and shape by fitting a morphable SDF to a single image

Dimitrije Anti ´c, Georgios Paschalidis, Shashank Tripathi, Theo Gevers, Sai Kumar Dwivedi, and Dimitrios Tzionas. SDFit: 3D object pose and shape by fitting a morphable SDF to a single image. InInternational Conference on Computer Vision (ICCV), 2025. 3

work page 2025

[2] [2]

BEHA VE: Dataset and method for tracking human object interactions

Bharat Lal Bhatnagar, Xianghui Xie, Ilya A Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. BEHA VE: Dataset and method for tracking human object interactions. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3, 6

work page 2022

[3] [3]

FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models

Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. FreeZe: Training-free zero-shot 6D pose estimation with geometric and vision foundation models. InEuropean Conference on Computer Vision (ECCV), 2024. 3

work page 2024

[4] [4]

Back on track: Bundle adjustment for dynamic scene recon- struction

Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, and Daniel Cremers. Back on track: Bundle adjustment for dynamic scene recon- struction. InInternational Conference on Computer Vision (ICCV), 2025. 2

work page 2025

[5] [5]

Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose es- timation with human-object interaction and physical com- monsense

Yixin Chen, Siyuan Huang, Tao Yuan, Yixin Zhu, Siyuan Qi, and Song-Chun Zhu. Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose es- timation with human-object interaction and physical com- monsense. InInternational Conference on Computer Vision (ICCV), 2019. 3

work page 2019

[6] [6]

Human3r: Everyone everywhere all at once

Yue Chen, Xingyu Chen, Yuxuan Xue, Anpei Chen, Yuliang Xiu, and Pons-Moll Gerard. Human3R: Everyone every- where all at once. arXiv:2510.06219, 2025. 3

work page arXiv 2025

[7] [7]

High-quality streamable free-viewpoint video.Transactions on Graphics (TOG), 34(4), 2015

Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Den- nis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video.Transactions on Graphics (TOG), 34(4), 2015. 6

work page 2015

[8] [8]

Black, and Dimitrios Tzionas

Alp ´ar Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Ar- jun Lakshmipathy, Agniv Chatterjee, Michael J. Black, and Dimitrios Tzionas. PICO: Reconstructing 3D people in con- tact with objects. InComputer Vision and Pattern Recogni- tion (CVPR), 2025. 3

work page 2025

[9] [9]

HSC4D: Human-centered 4D scene capture in large-scale indoor-outdoor space using wearable imus and lidar

Yudi Dai, Yitai Lin, Chenglu Wen, Siqi Shen, Lan Xu, Jingyi Yu, Yuexin Ma, and Cheng Wang. HSC4D: Human-centered 4D scene capture in large-scale indoor-outdoor space using wearable imus and lidar. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

work page 2022

[10] [10]

SuperPoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. SuperPoint: Self-supervised interest point detection and description. InComputer Vision and Pattern Recognition Workshops (CVPRw), 2018. 2, 3, 7, 8, 13

work page 2018

[11] [11]

SO-Pose: Exploiting self- occlusion for direct 6D pose estimation

Yan Di, Fabian Manhardt, Gu Wang, Xiangyang Ji, Nassir Navab, and Federico Tombari. SO-Pose: Exploiting self- occlusion for direct 6D pose estimation. InInternational Conference on Computer Vision (ICCV), 2021. 3

work page 2021

[12] [12]

Black, and Dimitrios Tzionas

Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, and Dimitrios Tzionas. POCO: 3D pose and shape estimation using confidence. InInternational Con- ference on 3D Vision (3DV), 2024. 2

work page 2024

[13] [13]

Black, and Dim- itrios Tzionas

Sai Kumar Dwivedi, Dimitrije Anti ´c, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, and Dim- itrios Tzionas. InteractVLM: 3D interaction reasoning from 2D foundational models. InComputer Vision and Pattern Recognition (CVPR), 2025. 3, 5, 13

work page 2025

[14] [14]

HOLD: Category-agnostic 3D reconstruction of interacting hands and objects from video

Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J Black, and Ot- mar Hilliges. HOLD: Category-agnostic 3D reconstruction of interacting hands and objects from video. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6, 7, 8, 14, 15, 16

work page 2024

[15] [15]

Betsu-Betsu: Multi-view separable 3D reconstruction of two interacting objects

Suhas Gopal, Rishabh Dabral, Vladislav Golyanik, and Christian Theobalt. Betsu-Betsu: Multi-view separable 3D reconstruction of two interacting objects. InInternational Conference on 3D Vision (3DV), 2025. 3

work page 2025

[16] [16]

Implicit geometric regularization for learning shapes

Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. InInternational Conference on Machine Learning (ICML), 2020. 14

work page 2020

[17] [17]

Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition

Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition. InCom- puter Vision and Pattern Recognition (CVPR), 2023. 2, 4, 5

work page 2023

[18] [18]

Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior

Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior. InComputer Vision and Pattern Recognition (CVPR), 2025. 2

work page 2025

[19] [19]

Single path one- shot neural architecture search with uniform sampling

Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one- shot neural architecture search with uniform sampling. In European Conference on Computer Vision (ECCV), 2020. 3

work page 2020

[20] [20]

From 3D scene geometry to human workspace

Abhinav Gupta, Scott Satkin, Alexei A Efros, and Martial Hebert. From 3D scene geometry to human workspace. In Computer Vision and Pattern Recognition (CVPR), 2011. 3

work page 2011

[21] [21]

Human POSEitioning system (HPS): 3D hu- man pose estimation and self-localization in large scenes from body-mounted sensors

Vladimir Guzov, Aymen Mir, Torsten Sattler, and Gerard Pons-Moll. Human POSEitioning system (HPS): 3D hu- man pose estimation and self-localization in large scenes from body-mounted sensors. InComputer Vision and Pat- tern Recognition (CVPR), 2021. 2, 3 9

work page 2021

[22] [22]

Resolving 3D human pose ambigui- ties with 3D scene constraints

Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3D human pose ambigui- ties with 3D scene constraints. InInternational Conference on Computer Vision (ICCV), 2019. 2, 3

work page 2019

[23] [23]

OnePose++: Keypoint-free one- shot object pose estimation without CAD models

Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. OnePose++: Keypoint-free one- shot object pose estimation without CAD models. InNeural Information Processing Systems (NeurIPS), 2022. 2, 3, 7

work page 2022

[24] [24]

Capturing and inferring dense full-body human-scene contact

Chun-Hao P Huang, Hongwei Yi, Markus H ¨oschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, and Michael J Black. Capturing and inferring dense full-body human-scene contact. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3

work page 2022

[25] [25]

Black, and Dim- itrios Tzionas

Yinghao Huang, Omid Taheri, Michael J. Black, and Dim- itrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction from multi-view RGB-D images.International Journal of Computer Vision (IJCV),

work page

[26] [26]

NeuMan: Neural human radiance field from a single video

Wei Jiang, Kwang Moo Yi, Golnoosh Samei, Oncel Tuzel, and Anurag Ranjan. NeuMan: Neural human radiance field from a single video. InEuropean Conference on Computer Vision (ECCV), 2022. 4

work page 2022

[27] [27]

NeuralHOFu- sion: Neural volumetric rendering under human-object in- teractions

Yuheng Jiang, Suyi Jiang, Guoxing Sun, Zhuo Su, Kaiwen Guo, Minye Wu, Jingyi Yu, and Lan Xu. NeuralHOFu- sion: Neural volumetric rendering under human-object in- teractions. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

work page 2022

[28] [28]

MultiPly: Re- construction of multiple people from monocular video in the wild

Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, and Jie Song. MultiPly: Re- construction of multiple people from monocular video in the wild. InComputer Vision and Pattern Recognition (CVPR),

work page

[29] [29]

EMDB: The electromagnetic database of global 3D human pose and shape in the wild

Manuel Kaufmann, Jie Song, Chen Guo, Kaiyue Shen, Tian- jian Jiang, Chengcheng Tang, Juan Jos ´e Z ´arate, and Otmar Hilliges. EMDB: The electromagnetic database of global 3D human pose and shape in the wild. InInternational Confer- ence on Computer Vision (ICCV), 2023. 3

work page 2023

[30] [30]

Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito

Rawal Khirodkar, Timur M. Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vi- sion models. InEuropean Conference on Computer Vision (ECCV), 2024. 4

work page 2024

[31] [31]

Hedvig Kjellstr ¨om, Danica Kragic, and Michael J. Black. Tracking people interacting with objects. InComputer Vi- sion and Pattern Recognition (CVPR), 2010. 3

work page 2010

[32] [32]

Any6D: Model-free 6D pose estimation of novel objects

Taeyeop Lee, Bowen Wen, Minjun Kang, Gyuree Kang, In So Kweon, and Kuk-Jin Yoon. Any6D: Model-free 6D pose estimation of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2025. 3

work page 2025

[33] [33]

Ground- ing image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing image matching in 3D with MASt3R. InEuropean Con- ference on Computer Vision (ECCV), 2024. 2, 3, 7, 13

work page 2024

[34] [34]

MoCapDeform: Monocular 3D hu- man motion capture in deformable scenes

Zhi Li, Soshi Shimada, Bernt Schiele, Christian Theobalt, and Vladislav Golyanik. MoCapDeform: Monocular 3D hu- man motion capture in deformable scenes. InInternational Conference on 3D Vision (3DV), 2022. 3

work page 2022

[35] [35]

SAM-6D: Segment anything model meets zero-shot 6D object pose estimation

Jiehong Lin, Lihua Liu, Dekun Lu, and Kui Jia. SAM-6D: Segment anything model meets zero-shot 6D object pose estimation. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024

[36] [36]

Joint optimization for 4D human-scene reconstruction in the wild

Zhizheng Liu, Joe Lin, Wayne Wu, and Bolei Zhou. Joint optimization for 4D human-scene reconstruction in the wild. arXiv:2501.02158, 2025. 3

work page arXiv 2025

[37] [37]

Luvizon, Marc Habermann, Vladislav Golyanik, Adam Kortylewski, and Christian Theobalt

Diogo C. Luvizon, Marc Habermann, Vladislav Golyanik, Adam Kortylewski, and Christian Theobalt. Scene-aware 3D multi-human motion capture from a single camera. In Computer Graphics Forum (CGF), 2023. 3

work page 2023

[38] [38]

iMapper: interaction-guided scene mapping from monocular videos.Transactions on Graphics (TOG), 38(4):92:1–92:15, 2019

Aron Monszpart, Paul Guerrero, Duygu Ceylan, Ersin Yumer, and Niloy J Mitra. iMapper: interaction-guided scene mapping from monocular videos.Transactions on Graphics (TOG), 38(4):92:1–92:15, 2019. 3

work page 2019

[39] [39]

GenFlow: Generalizable recurrent flow for 6D pose refinement of novel objects

Sungphill Moon, Hyeontae Son, Dongcheol Hur, and Sang- wook Kim. GenFlow: Generalizable recurrent flow for 6D pose refinement of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024

[40] [40]

Reconstructing people, places, and cameras

Lea M ¨uller, Hongsuk Choi, Anthony Zhang, Brent Yi, Jiten- dra Malik, and Angjoo Kanazawa. Reconstructing people, places, and cameras. InComputer Vision and Pattern Recog- nition (CVPR), 2025. 3

work page 2025

[41] [41]

Joint reconstruction of 3D human and ob- ject via contact-based refinement transformer

Hyeongjin Nam, Daniel Sungho Jung, Gyeongsik Moon, and Kyoung Mu Lee. Joint reconstruction of 3D human and ob- ject via contact-based refinement transformer. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024

[42] [42]

AprilTag: A robust and flexible visual fidu- cial system

Edwin Olson. AprilTag: A robust and flexible visual fidu- cial system. InInternational Conference on Robotics and Automation (ICRA), 2011. 6

work page 2011

[43] [43]

Global structure-from-motion revisited

Linfei Pan, D ´aniel Bar ´ath, Marc Pollefeys, and Jo- hannes Lutz Sch ¨onberger. Global structure-from-motion revisited. InEuropean Conference on Computer Vision (ECCV), 2024. 3

work page 2024

[44] [44]

Ar- gyros

Pantelis Panteleris, Nikolaos Kyriazis, and Antonis A. Ar- gyros. 3D tracking of human hands in interaction with unknown objects. InBritish Machine Vision Conference (BMVC), 2015. 3

work page 2015

[45] [45]

Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. InComputer Vision and Pat- tern Recognition (CVPR), 2019. 3, 5, 14

work page 2019

[46] [46]

3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting

Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting. InComputer Vision and Pattern Recognition (CVPR), 2024. 2

work page 2024

[47] [47]

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, and Ross Girshick et. el. SAM 2: Segment any- thing in images and videos. InInternational Conference on Learning Representations...

work page 2025

[48] [48]

Grounding dino 1.5: Advance the” edge” of open-set object detection

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wen- long Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xi- aoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, and Lei Zhanget. Grounding DINO 1.5: Advance the “edge” of open-set object detection. arXiv:2405.10300, 2024. 13 10

work page arXiv 2024

[49] [49]

HAMSt3R: Human-aware multi-view stereo 3D reconstruction

Sara Rojas, Matthieu Armando, Bernard Ghamen, Philippe Weinzaepfel, Vincent Leroy, and Gregory Ro- gez. HAMSt3R: Human-aware multi-view stereo 3D reconstruction. InInternational Conference on Computer Vision (ICCV), 2025. 3

work page 2025

[50] [50]

SuperGlue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. InComputer Vision and Pattern Recognition (CVPR), 2020. 7, 8, 13

work page 2020

[51] [51]

PiGraphs: Learning interac- tion snapshots from observations.Transactions on Graphics (TOG), 35(4):139:1–139:12, 2016

Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. PiGraphs: Learning interac- tion snapshots from observations.Transactions on Graphics (TOG), 35(4):139:1–139:12, 2016. 3

work page 2016

[52] [52]

Total-Recon: Deformable scene reconstruction for embodied view synthesis

Chonghyuk Song, Gengshan Yang, Kangle Deng, Jun-Yan Zhu, and Deva Ramanan. Total-Recon: Deformable scene reconstruction for embodied view synthesis. InInternational Conference on Computer Vision (ICCV), 2023. 3

work page 2023

[53] [53]

LoFTR: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. InComputer Vision and Pattern Recogni- tion (CVPR), 2021. 2, 3, 7, 8, 13

work page 2021

[54] [54]

OnePose: One-shot object pose estimation without CAD models

Jiaming Sun, Zihao Wang, Siyu Zhang, Xingyi He, Hongcheng Zhao, Guofeng Zhang, and Xiaowei Zhou. OnePose: One-shot object pose estimation without CAD models. InComputer Vision and Pattern Recognition (CVPR), 2022. 2, 3, 7

work page 2022

[55] [55]

AiOS: All-in-one-stage ex- pressive human pose and shape estimation

Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi-Sing Leung, Ziwei Liu, Lei Yang, and Zhongang Cai. AiOS: All-in-one-stage ex- pressive human pose and shape estimation. InComputer Vi- sion and Pattern Recognition (CVPR), 2024. 2, 3, 13

work page 2024

[56] [56]

DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras

Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. InNeu- ral Information Processing Systems (NeurIPS), 2021. 2

work page 2021

[57] [57]

Huang, Taheri Omid, Michael J

Shashank Tripathi, Lea M ¨uller, Chun-Hao P. Huang, Taheri Omid, Michael J. Black, and Dimitrios Tzionas. 3D human pose estimation via intuitive physics. InComputer Vision and Pattern Recognition (CVPR), pages 4713–4725, 2023. 5

work page 2023

[58] [58]

3D object reconstruc- tion from hand-object interactions

Dimitrios Tzionas and Juergen Gall. 3D object reconstruc- tion from hand-object interactions. InInternational Confer- ence on Computer Vision (ICCV), 2015. 3

work page 2015

[59] [59]

Least-squares estimation of transforma- tion parameters between two point patterns.Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1991

Shinji Umeyama. Least-squares estimation of transforma- tion parameters between two point patterns.Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 1991. 4

work page 1991

[60] [60]

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized object coordinate space for category-level 6D object pose and size estimation. InComputer Vision and Pattern Recognition (CVPR), 2019. 3

work page 2019

[61] [61]

VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. InComputer Vi- sion and Pattern Recognition (CVPR), 2025. 2

work page 2025

[62] [62]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InComputer Vision and Pattern Recogni- tion (CVPR), 2024. 2

work page 2024

[63] [63]

Mul- tiscale structural similarity for image quality assessment

Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Mul- tiscale structural similarity for image quality assessment. InAsilomar Conference on Signals, Systems & Computers,

work page

[64] [64]

BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects

Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas M ¨uller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects. InComputer Vision and Pattern Recognition (CVPR), 2023. 7

work page 2023

[65] [65]

FoundationPose: Unified 6D pose estimation and tracking of novel objects

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. InComputer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024

[66] [66]

Holistic 3D human and scene mesh estimation from single view images

Zhenzhen Weng and Serena Yeung. Holistic 3D human and scene mesh estimation from single view images. InCom- puter Vision and Pattern Recognition (CVPR), 2021. 3

work page 2021

[67] [67]

CHORE: Contact, human and object reconstruction from a single RGB image

Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. CHORE: Contact, human and object reconstruction from a single RGB image. InEuropean Conference on Computer Vision (ECCV), 2022. 3

work page 2022

[68] [68]

Template free reconstruction of human- object interaction with procedural interaction generation

Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. Template free reconstruction of human- object interaction with procedural interaction generation. In Computer Vision and Pattern Recognition (CVPR), 2024. 3

work page 2024

[69] [69]

In- terTrack: Tracking human object interaction without object templates

Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. In- terTrack: Tracking human object interaction without object templates. InInternational Conference on 3D Vision (3DV),

work page

[70] [70]

HSR: Holistic 3D human-scene recon- struction from monocular videos

Lixin Xue, Chen Guo, Chengwei Zheng, Fangjinhua Wang, Tianjian Jiang, Hsuan-I Ho, Manuel Kaufmann, Jie Song, and Hilliges Otmar. HSR: Holistic 3D human-scene recon- struction from monocular videos. InEuropean Conference on Computer Vision (ECCV), 2024. 2, 3, 4, 5, 6, 13, 14, 15, 16

work page 2024

[71] [71]

PPR: Physically plausible recon- struction from monocular videos

Gengshan Yang, Shuo Yang, John Z Zhang, Zachary Manch- ester, and Deva Ramanan. PPR: Physically plausible recon- struction from monocular videos. InInternational Confer- ence on Computer Vision (ICCV), 2023. 3

work page 2023

[72] [72]

Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, and Michael J

Hongwei Yi, Chun-Hao P. Huang, Dimitrios Tzionas, Muhammed Kocabas, Mohamed Hassan, Siyu Tang, Justus Thies, and Michael J. Black. Human-aware object placement for visual environment reconstruction. InComputer Vision and Pattern Recognition (CVPR), 2022. 3

work page 2022

[73] [73]

Neural- Dome: A neural modeling pipeline on multi-view human- object interactions

Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Neural- Dome: A neural modeling pipeline on multi-view human- object interactions. InComputer Vision and Pattern Recog- nition (CVPR), 2023. 3, 6

work page 2023

[74] [74]

HOI-M3: Capture multiple humans and objects inter- action within contextual environment

Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. HOI-M3: Capture multiple humans and objects inter- action within contextual environment. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6

work page 2024

[75] [75]

Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa

Jason Y . Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3D human-object spatial arrangements from a single image in the wild. InEuropean Conference on Computer Vision (ECCV), 2020. 3 11

work page 2020

[76] [76]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018. 6

work page 2018

[77] [77]

Ego- Body: Human body shape and motion of interacting peo- ple from head-mounted devices

Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo, and Siyu Tang. Ego- Body: Human body shape and motion of interacting peo- ple from head-mounted devices. InEuropean Conference on Computer Vision (ECCV), 2022. 3

work page 2022

[78] [78]

Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, and Martin R. Oswald. ODHSR: Online dense 3D reconstruction of humans and scenes from monocular videos. InComputer Vision and Pattern Recognition (CVPR), 2025. 2, 3

work page 2025

[79] [79]

contact frame

Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’M HOI: Inertia-aware monocular capture of 3D human-object interactions. InComputer Vision and Pattern Recognition (CVPR), 2024. 3, 6 12 RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos Supplementary Material We discuss dat...

work page 2024