pith. machine review for the scientific record.

arxiv: 2512.11988 · v3 · submitted 2025-12-12 · 💻 cs.CV

CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

Pith reviewed 2026-05-16 22:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstruction · human-object interaction · category-agnostic · monocular video · render-and-compare · metric scale · contact reasoning · foundation models

The pith

CARI4D reconstructs spatially and temporally consistent 4D human-object interactions at metric scale from monocular RGB videos without assuming any object category.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARI4D as the first category-agnostic pipeline that turns ordinary single-camera video into full 4D reconstructions of a person and an arbitrary object moving together. It works by taking separate predictions from existing foundation models, selecting the most consistent pose hypotheses, and then iteratively refining the entire scene through a learned render-and-compare loop that enforces spatial, temporal, and pixel-level agreement. A final contact-reasoning stage adds physical constraints so that hands and objects touch plausibly. The result is metric-scale output that stays coherent across frames even when the object is completely novel. This matters because most real-world interaction footage comes from ordinary cameras, and previous reconstruction systems either needed a known 3D template or were restricted to a handful of object classes.

Core claim

CARI4D is the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. It achieves this by proposing a pose hypothesis selection algorithm that robustly integrates individual predictions from foundation models, jointly refining them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement that satisfies physical constraints.

What carries the argument

A pose-hypothesis selection step followed by a learned render-and-compare refinement loop that jointly optimizes human and object poses while enforcing temporal consistency and contact constraints.
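
One way to picture the selection step is as a shortest-path problem over per-frame candidates. The sketch below is an illustration under stated assumptions, not the authors' algorithm: it assumes every frame comes with a few candidate poses, a per-candidate data cost (for instance one minus mask overlap), and a pairwise cost that penalizes large jumps between consecutive frames. The names `select_hypotheses`, `unary`, `pairwise`, and `smooth_weight` are hypothetical.

```python
from typing import Callable, List, Sequence


def select_hypotheses(
    unary: Sequence[Sequence[float]],
    pairwise: Callable[[int, int, int], float],
    smooth_weight: float = 1.0,
) -> List[int]:
    """Pick one candidate per frame by dynamic programming (Viterbi).

    unary[t][k]       -- hypothetical data cost of candidate k at frame t
    pairwise(t, j, k) -- hypothetical jump cost from candidate j at frame
                         t-1 to candidate k at frame t
    Returns the index of the chosen candidate for every frame.
    """
    if not unary:
        return []
    # best[k] = cheapest total cost of any path ending in candidate k.
    best = list(unary[0])
    back: List[List[int]] = []
    for t in range(1, len(unary)):
        new_best, pointers = [], []
        for k, data_cost in enumerate(unary[t]):
            # Cheapest predecessor for candidate k at frame t.
            costs = [best[j] + smooth_weight * pairwise(t, j, k)
                     for j in range(len(best))]
            j_star = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_star] + data_cost)
            pointers.append(j_star)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path backwards from the best final candidate.
    k = min(range(len(best)), key=best.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    return path[::-1]


# Toy example: candidates are 1D "poses". At frame 1 the first candidate
# (10.0) is a flipped outlier that looks slightly better on data cost alone.
poses = [[0.0, 9.0], [10.0, 0.5], [1.0, 11.0]]
unary = [[0.1, 0.9], [0.2, 0.3], [0.1, 0.8]]


def jump(t: int, j: int, k: int) -> float:
    return abs(poses[t][k] - poses[t - 1][j])


chosen = select_hypotheses(unary, jump, smooth_weight=0.5)
print(chosen)
```

In the toy example a per-frame greedy choice would take the outlier at frame 1, while the temporally aware selection skips it. The real pipeline would additionally refine the chosen poses; that learned stage is not modelled here.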

If this is right

  • Reconstruction error drops 38 percent on in-distribution data and 36 percent on unseen object categories compared with prior methods.
  • The same trained model runs zero-shot on arbitrary internet videos containing novel objects.
  • Output meshes satisfy metric scale and remain coherent across long sequences without drifting.
  • Contact reasoning produces physically plausible hand-object touch points that earlier template-free approaches could not guarantee.
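
To make the first bullet concrete, the arithmetic of a relative error reduction can be written out as below. This is a minimal sketch: the baseline value of 10.0 is hypothetical and chosen only for illustration, and the summary does not say which metric the 38 and 36 percent figures refer to. The function names are likewise illustrative.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which `new_error` undercuts `baseline_error`."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


def error_after_reduction(baseline_error: float, reduction: float) -> float:
    """Error that remains after a relative reduction (0.38 means 38 percent)."""
    return baseline_error * (1.0 - reduction)


# Hypothetical baseline of 10.0 error units, not a number from the paper.
baseline = 10.0
for label, reduction in [("in-distribution", 0.38), ("unseen", 0.36)]:
    remaining = error_after_reduction(baseline, reduction)
    print(f"{label}: {baseline:.1f} -> {remaining:.1f}")
```

Read this way, a 38 percent reduction leaves 62 percent of the prior method's error, and a 36 percent reduction leaves 64 percent.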

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large existing video corpora could be turned into training data for robotics and animation without any manual 3D labeling.
  • The same refinement loop might be extended to multi-person scenes or to objects that deform during interaction.
  • Metric-scale output makes the reconstructions directly usable for measuring grasp forces or planning robot motions in real units.

Load-bearing premise

That initial predictions from off-the-shelf foundation models can be reliably combined, and that the learned refinement step will generalize to completely unseen object shapes while still producing physically valid contacts.

What would settle it

Quantitative failure on a held-out object category where contact points between hand and object are visibly violated or where temporal jitter exceeds the error reported on the in-distribution test set.
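
A test like this needs a number for "temporal jitter". The sketch below uses a generic smoothness proxy, the mean magnitude of the second finite difference of a tracked 3D point; it is an assumption for illustration and may differ from the temporal metric the paper actually reports. The function name and the `fps` parameter are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mean_acceleration(track: Sequence[Point], fps: float = 30.0) -> float:
    """Mean norm of the discrete acceleration along a 3D trajectory.

    track -- one 3D position per frame (metres, if the output is metric)
    fps   -- frame rate used to convert frame steps into seconds
    """
    if len(track) < 3:
        raise ValueError("need at least three frames")
    dt = 1.0 / fps
    total = 0.0
    for prev, cur, nxt in zip(track, track[1:], track[2:]):
        # Central second difference: (x[t+1] - 2 x[t] + x[t-1]) / dt^2
        acc = [(n - 2.0 * c + p) / (dt * dt)
               for p, c, n in zip(prev, cur, nxt)]
        total += math.sqrt(sum(a * a for a in acc))
    return total / (len(track) - 2)


# Constant-velocity motion has zero acceleration; a back-and-forth track
# does not.
smooth = [(float(t), 0.0, 0.0) for t in range(5)]
jittery = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mean_acceleration(smooth, fps=1.0), mean_acceleration(jittery, fps=1.0))
```

Comparing this value between the held-out category and the in-distribution test set, per tracked point or averaged over vertices, would give the comparison described above a concrete form.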

Figures

Figures reproduced from arXiv: 2512.11988 by Bowen Wen, Gerard Pons-Moll, Hesam Rabeti, Jiefeng Li, Stan Birchfield, Xianghui Xie, Yan Chang, Ye Yuan.

Figure 1: Results on in-the-wild internet videos. Given a monocular RGB video, CARI4D reconstructs the human and object at metric scale, and tracks the 4D human-object interaction consistently across the video. Our method is category agnostic and generalizes zero-shot.
Figure 2: CARI4D method overview. Given a monocular RGB video, we reconstruct the 4D human and object at metric scale with consistent contacts. We start by estimating the metric-scale object mesh (Sec. 3.1), followed by initialization of human and object poses using dynamic pose hypothesis selection (Sec. 3.2). We then train a category agnostic contact reasoning model (CoCoNet) to refine the interaction poses and es…
Figure 3: Qualitative comparison on BEHAVE dataset [3]. InterTrack [57] reconstructs human and object as point clouds only, and the shapes are noisy. VisTracker [55] requires known object templates, hence we augment it with our reconstructed objects (denoted as †). Our method reconstructs the objects and tracks the poses accurately. (Purple balls indicate contact predictions.)
Figure 4: Zero-shot generalization to unseen InterCap dataset [18]. † Uses our reconstructed object meshes. Our method reconstructs the metric-scale object accurately and generalizes to unseen objects. (Purple balls indicate contact predictions.)
Figure 5: Generalization to in-the-wild videos. Prior methods predict noisy shape (InterTrack [57]), flipped object pose (VisTracker [55], † with our object reconstruction) or wrong contacts and object position (PICO [10]). Our method generalizes better overall. (Purple balls indicate contact predictions.)
Table fragment (truncated in source). Columns: Method, CD-h↓, CD-o↓, CD-c↓, Acc-h↓, Acc-o↓. Row a, Raw NLF + FP tracking: 11.53, 1565.42, 405.13, 3.06, 4.34.
Figure 6: Importance of contacts. Without our contact-aware optimization, the model does not properly handle the fine-grained hand-object interaction, leading to floating object or penetration errors. (Purple balls indicate contact predictions.)
Figure 7: CoCoNet architecture. Here b, t, h, w denote batch size, temporal window size, image height and width respectively. We follow a render-and-compare paradigm, hence RGB a and RGB b denote the image from input observation and rendering respectively, same for xyz map and mask (human and object stacked together).
Figure 8: Failure case examples. Our method focuses on full body interaction and the detailed hand poses are not handled, which can be important for fine-grained object manipulation task (top row). Our method thus failed to reconstruct realistic finger poses for holding the plate. Under highly dynamic motion and extreme occlusion (bottom row), FoundationPose predicts flipped object pose for initialization. Such larg…
Original abstract

Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reason about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38% on an in-distribution dataset and 36% on an unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents CARI4D, which it claims is the first category-agnostic method for reconstructing spatially and temporally consistent 4D human-object interactions at metric scale from monocular RGB videos. It integrates predictions from off-the-shelf foundation models via a pose hypothesis selection algorithm, applies a learned render-and-compare refinement stage to enforce spatial, temporal, and pixel alignment, and adds contact reasoning to satisfy physical constraints. Experiments report 38% and 36% reductions in reconstruction error on in-distribution and unseen datasets respectively, and the paper claims zero-shot generalization to in-the-wild videos.

Significance. If the generalization to arbitrary unseen object categories holds, the work would be significant for enabling practical 4D HOI capture without category-specific templates or priors, benefiting applications in robotics, AR/VR, and human motion analysis. The composition of foundation models with a learned refinement is a pragmatic direction, and the metric-scale output is a strength, but significance is tempered by the need for stronger evidence that the refinement truly corrects errors for novel geometries outside the training distribution.

major comments (2)
  1. [§4.2] §4.2 (unseen dataset evaluation): The reported 36% error reduction on the unseen set is presented as evidence of category-agnostic generalization, yet the manuscript provides no details on the specific object categories, their topological differences (e.g., thin vs. bulky structures), appearance variations, or quantitative comparison of foundation-model error rates between seen and unseen splits. This directly affects the central claim that the learned render-and-compare stage robustly generalizes without category-specific shape priors.
  2. [§3.1–3.3] §3.1–3.3 (pipeline integration): The pose hypothesis selection and refinement stages assume that initial predictions from foundation models (pose, depth, segmentation) can be reliably fused and corrected. No ablation or error-propagation analysis quantifies how inaccuracies in these off-the-shelf components affect final metric-scale consistency or contact satisfaction, which is load-bearing for the 4D reconstruction claims.
minor comments (2)
  1. [Abstract] Abstract: The claimed error reductions (38%/36%) are not accompanied by the specific metric (e.g., MPJPE, object Chamfer distance) or list of baselines, reducing interpretability.
  2. [§5] §5 (experiments): Qualitative results on in-the-wild videos are shown but lack corresponding quantitative metrics or failure-case enumeration for heavy occlusion or rapid motion.
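
For readers unfamiliar with the Chamfer distance that the first minor comment names as a candidate metric, a minimal sketch follows. It uses one common convention (symmetric, unsquared Euclidean distances, averaged per direction); other conventions use squared distances or sums, and the summary does not say which one the paper adopts. The brute-force nearest-neighbour search is only suitable for small point sets.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average distance from each point in `src` to its nearest point in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two point sets (assumed convention)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


# Two tiny point sets that differ in one point by one unit along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

Stating the convention alongside the reported percentage would resolve the interpretability concern, since squared and unsquared variants can differ substantially.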

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of CARI4D. We address each major comment below and will revise the manuscript to incorporate additional details and analyses as outlined.

Point-by-point responses
  1. Referee: [§4.2] §4.2 (unseen dataset evaluation): The reported 36% error reduction on the unseen set is presented as evidence of category-agnostic generalization, yet the manuscript provides no details on the specific object categories, their topological differences (e.g., thin vs. bulky structures), appearance variations, or quantitative comparison of foundation-model error rates between seen and unseen splits. This directly affects the central claim that the learned render-and-compare stage robustly generalizes without category-specific shape priors.

    Authors: We agree that additional details on the unseen categories would strengthen the generalization evidence. In the revised manuscript, we will expand §4.2 with a description of the specific object categories used in the unseen split, including their topological characteristics (e.g., thin vs. bulky structures) and appearance variations. We will also add a quantitative comparison of foundation-model baseline error rates on seen versus unseen splits to directly illustrate the contribution of the render-and-compare refinement. revision: yes

  2. Referee: [§3.1–3.3] §3.1–3.3 (pipeline integration): The pose hypothesis selection and refinement stages assume that initial predictions from foundation models (pose, depth, segmentation) can be reliably fused and corrected. No ablation or error-propagation analysis quantifies how inaccuracies in these off-the-shelf components affect final metric-scale consistency or contact satisfaction, which is load-bearing for the 4D reconstruction claims.

    Authors: We acknowledge that an explicit error-propagation analysis would provide stronger support for the pipeline's robustness. While our end-to-end results demonstrate consistent metric-scale and contact improvements, the current manuscript lacks a dedicated ablation isolating the effects of input inaccuracies. In the revision, we will add such an analysis in §3 (e.g., by injecting controlled noise into foundation-model outputs and measuring downstream effects on spatial/temporal consistency and contact satisfaction). revision: yes
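
The noise-injection analysis proposed in the second response could start from a perturbation routine like the one sketched here. This is an assumption-laden illustration, not the authors' protocol: it perturbs a rigid pose with a random-axis rotation whose angle is Gaussian and with Gaussian translation noise. The function names, the noise model, and the metre units are all hypothetical choices.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def axis_angle_to_matrix(axis: Sequence[float], angle: float) -> Matrix:
    """Rodrigues' formula: rotation by `angle` radians about `axis`."""
    norm = math.sqrt(sum(c * c for c in axis))
    x, y, z = (c / norm for c in axis)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def perturb_pose(
    rotation: Matrix,
    translation: Sequence[float],
    rot_sigma_deg: float,
    trans_sigma_m: float,
    rng: random.Random,
) -> Tuple[Matrix, Tuple[float, ...]]:
    """Apply controlled noise to a rigid pose (hypothetical noise model)."""
    # A Gaussian vector gives a uniformly distributed rotation axis.
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
    if not any(axis):
        axis = [1.0, 0.0, 0.0]
    angle = rng.gauss(0.0, math.radians(rot_sigma_deg))
    noisy_rotation = matmul(axis_angle_to_matrix(axis, angle), rotation)
    noisy_translation = tuple(v + rng.gauss(0.0, trans_sigma_m)
                              for v in translation)
    return noisy_rotation, noisy_translation


def rotation_angle_deg(rotation: Matrix) -> float:
    """Angle of a rotation matrix, recovered from its trace."""
    trace = rotation[0][0] + rotation[1][1] + rotation[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rng = random.Random(0)
for sigma in (0.0, 5.0, 15.0):
    rot, trans = perturb_pose(identity, (0.0, 0.0, 2.0), sigma,
                              0.01 * sigma, rng)
    print(sigma, round(rotation_angle_deg(rot), 2),
          [round(v, 3) for v in trans])
```

Sweeping the two noise levels and re-running the downstream stages on the perturbed poses would show how much initial error the refinement can absorb before metric-scale consistency or contact satisfaction degrades.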

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper describes a pipeline that integrates predictions from existing foundation models via a pose hypothesis selection algorithm, followed by a learned render-and-compare refinement for alignment and contact reasoning. No equations, self-definitional steps, or fitted parameters renamed as predictions are present in the abstract or described method. The central claim of category-agnostic performance rests on empirical results (38% and 36% gains) rather than reducing to inputs by construction. Self-citations, if any, are not load-bearing for uniqueness theorems or ansatzes. The derivation is self-contained against external benchmarks and does not match any enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that foundation-model outputs provide usable initial hypotheses for unseen objects and that a learned render-and-compare loss can enforce spatial-temporal-pixel alignment without category-specific supervision.

axioms (2)
  • domain assumption: Foundation models supply sufficiently accurate initial pose hypotheses for arbitrary objects.
    The pose hypothesis selection step depends on this premise.
  • domain assumption: Render-and-compare refinement generalizes across object categories.
    The learned refinement is presented as category-agnostic.

pith-pipeline@v0.9.0 · 5541 in / 1253 out tokens · 49747 ms · 2026-05-16T22:34:51.415352+00:00 · methodology



Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 1 internal anchor

  1. [1]

    https://github.com/cmu-perceptual-computing-lab/openpose.

  2. [2]

    Stability AI. Stable video diffusion: A novel approach to image-to-video generation. arXiv preprint arXiv:2308.09592, 2023. Available at https://github.com/Stability-AI/generative-models.

  3. [3]

    Bharat Lal Bhatnagar, Xianghui Xie, Ilya Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  4. [4]

    Michael J. Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8726–8737, 2023.

  5. [5]

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In International Conference on Learning Representations (ICLR), 2025.

  6. [6]

    Derek Bradley, Tiberiu Popa, Alla Sheffer, Wolfgang Heidrich, and Tamy Boubekeur. Markerless garment capture. ACM Trans. Graph., 27(3):1–9, 2008.

  7. [7]

    Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. SMPLer-X: Scaling up expressive human pose and shape estimation. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.

  8. [8]

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. arXiv preprint arXiv:2501.12375, 2025.

  9. [9]

    Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans. Graph., 34(4), 2015.

  10. [10]

    Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, and Dimitrios Tzionas. PICO: Reconstructing 3D people in contact with objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1783–1794, 2025.

  11. [11]

    Sai Kumar Dwivedi, Dimitrije Antić, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, and Dimitrios Tzionas. InteractVLM: 3D interaction reasoning from 2D foundational models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  12. [12]

    Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J Black, and Otmar Hilliges. HOLD: Category-agnostic 3d reconstruction of interacting hands and objects from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 494–504, 2024.

  13. [13]

    Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023.

  14. [14]

    Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, and Otmar Hilliges. Reloo: Reconstructing humans dressed in loose garments from monocular video in the wild. In European conference on computer vision (ECCV), 2024.

  15. [15]

    Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2avatar-pro: Authentic avatar from videos in the wild via universal prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  16. [16]

    Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In CVPR, 2019.

  17. [17]

    Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, and Cordelia Schmid. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In CVPR, 2020.

  18. [18]

    Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction. In German Conference on Pattern Recognition (GCPR), pages 281–299. Springer,

  19. [19]

    Chaofan Huo, Ye Shi, and Jingya Wang. Monocular human-object reconstruction in the wild. In Proceedings of the 32nd ACM International Conference on Multimedia, page 5547–5555, New York, NY, USA, 2024. Association for Computing Machinery.

  20. [20]

    Hsuan I Ho, Jie Song, and Otmar Hilliges. Sith: Single-view textured human reconstruction with image-conditioned diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 538–549, 2024.

  21. [21]

    Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.

  22. [22]

    Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8320–8329, 2018.

  23. [23]

    Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: Generative models for human motion synthesis. arXiv preprint arXiv:2505.01425, 2025.

  24. [24]

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, 2022.

  25. [25]

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. In ACM Transactions on Graphics. ACM, 2015.

  26. [26]

    Marko Mihajlovic, Siwei Zhang, Gen Li, Kaifeng Zhao, Lea Müller, and Siyu Tang. VolumetricSMPL: A neural volumetric body model for efficient interactions, contacts, and collisions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

  27. [27]

    Hyeongjin Nam, Daniel Sungho Jung, Gyeongsik Moon, and Kyoung Mu Lee. Joint reconstruction of 3d human and object via contact-based refinement transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  28. [28]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ...

  29. [29]

    Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, and Michael J. Black. AGORA: Avatars in geography optimized for regression analysis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 13463–13473,

  30. [30]

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024.

  31. [31]

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  32. [32]

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025.

  33. [33]

    Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild, 2024.

  34. [34]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022.

  35. [35]

    Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), 2017.

  36. [36]

    István Sárándi and Gerard Pons-Moll. Neural localizer fields for continuous 3d human pose and shape estimation. 2024.

  37. [37]

    Konstantin Sofiiuk, Ilia Petrov, Olga Barinova, and Anton Konushin. f-brs: Rethinking backpropagating refinement for interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8623–8632, 2020.

  38. [38]

    Zhuo Su, Lan Xu, Dawei Zhong, Zhong Li, Fan Deng, Shuxue Quan, and Lu Fang. Robustfusion: Robust volumetric performance reconstruction under human-object interactions from monocular RGBD stream. CoRR, abs/2104.14837, 2021.

  39. [39]

    Guoxing Sun, Xin Chen, Yizhang Chen, Anqi Pang, Pei Lin, Yuheng Jiang, Lan Xu, Jingya Wang, and Jingyi Yu. Neural free-viewpoint performance rendering under complex human-object interactions. In Proceedings of the 29th ACM International Conference on Multimedia, 2021.

  40. [40]

    Tencent Hunyuan3D Team. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,

  41. [41]

    TripoSR: Fast 3D Object Reconstruction from a Single Image

    Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, , Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image.arXiv preprint arXiv:2403.02151, 2024. 2

  42. [42]

    Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, and Michael J. Black. DECO: Dense estimation of 3D human-scene contact in the wild. InProceedings of the IEEE/CVF International Confer- ence on Computer Vision (ICCV), pages 8001–8013, 2023. 3, 7

  43. [43]

    SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion

    Vikram V oleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitrii Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. InProceedings of the European Conference on Computer Vision (ECCV), 2024. 2, 1

  44. [44]

    Normalized object coordinate space for category-level 6d object pose and size estimation

    He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2642–2651,

  45. [45]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2

  46. [46]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5261–5271, 2025. Oral. 2

  47. [47]

    Moge-2: Accurate monocular geometry with metric scale and sharp details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint, 2025. 2

  48. [48]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024. 2

  49. [49]

    BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects

    Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Müller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects. In CVPR, 2023. 2

  50. [50]

    FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects

    Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. In CVPR, 2024. 2, 3, 5, 6, 7, 8

  51. [51]

    Reconstructing in-the-wild open-vocabulary human-object interactions

    Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, and Yong-Lu Li. Reconstructing in-the-wild open-vocabulary human-object interactions, 2025. 3

  52. [52]

    Holistic 3d human and scene mesh estimation from single view images

    Zhenzhen Weng and Serena Yeung. Holistic 3d human and scene mesh estimation from single view images. arXiv preprint arXiv:2012.01591, 2020. 3

  53. [53]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Spotlight. 2

  54. [54]

    Chore: Contact, human and object reconstruction from a single rgb image

    Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Chore: Contact, human and object reconstruction from a single rgb image. In European Conference on Computer Vision (ECCV). Springer, 2022. 3, 5, 7

  55. [55]

    Visibility aware human-object interaction tracking from single rgb camera

    Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Visibility aware human-object interaction tracking from single rgb camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2, 3, 6, 7, 8

  56. [56]

    Template free reconstruction of human-object interaction with procedural interaction generation

    Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. Template free reconstruction of human-object interaction with procedural interaction generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 3, 7

  57. [57]

    Intertrack: Tracking human object interaction without object templates

    Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. Intertrack: Tracking human object interaction without object templates. 2024. 2, 3, 6, 7, 8

  58. [58]

    Xianghui Xie, Xi Wang, Nikos Athanasiou, Bharat Lal Bhatnagar, Chun-Hao P. Huang, Kaichun Mo, Hao Chen, Xia Jia, Zerui Zhang, Liangxian Cui, Xiao Lin, Bingqiao Qian, Jie Xiao, Wenfei Yang, Hyeongjin Nam, Daniel Sungho Jung, Kihoon Kim, Kyoung Mu Lee, Otmar Hilliges, and Gerard Pons-Moll. RHOBIN Challenge: Reconstruction of human object interaction. arXiv...

  59. [59]

    Mvgbench: Comprehensive benchmark for multi-view generation models

    Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, and Gerard Pons-Moll. Mvgbench: Comprehensive benchmark for multi-view generation models,

  60. [60]

    Gen-3diffusion: Realistic image-to-3d generation via 2d & 3d diffusion synergy

    Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Gen-3diffusion: Realistic image-to-3d generation via 2d & 3d diffusion synergy, 2024. 2

  61. [61]

    Human 3diffusion: Realistic avatar creation via explicit 3d consistent diffusion models

    Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Human 3diffusion: Realistic avatar creation via explicit 3d consistent diffusion models. In arXiv, 2024. 2

  62. [62]

    Infinihuman: Infinite 3d human creation with precise control

    Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Infinihuman: Infinite 3d human creation with precise control. 2025. 2

  63. [63]

    Physic: Physically plausible 3d human-scene interaction and contact from a single image

    Pradyumna Yalandur-Muralidhar, Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Physic: Physically plausible 3d human-scene interaction and contact from a single image. In ACM SIGGRAPH Asia, 2025. 3

  64. [64]

    CPF: Learning a contact potential field to model the hand-object interaction

    Lixin Yang, Xinyu Zhan, Kailin Li, Wenqiang Xu, Jiefeng Li, and Cewu Lu. CPF: Learning a contact potential field to model the hand-object interaction. In ICCV, 2021. 3

  65. [65]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv preprint arXiv:2401.10891. 2

  66. [66]

    Diffusion-guided reconstruction of everyday hand-object interaction clips

    Yufei Ye, Poorvi Hebbar, Abhinav Gupta, and Shubham Tulsiani. Diffusion-guided reconstruction of everyday hand-object interaction clips. In ICCV, 2023. 3

  67. [67]

    G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis

    Yufei Ye, Abhinav Gupta, Kris Kitani, and Shubham Tulsiani. G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis. In CVPR, 2024. 3

  68. [68]

    Glamr: Global occlusion-aware human mesh recovery with dynamic cameras

    Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2

  69. [69]

    Smoothnet: A plug-and-play network for refining human poses in videos

    Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, and Qiang Xu. Smoothnet: A plug-and-play network for refining human poses in videos. In European Conference on Computer Vision. Springer, 2022. 7

  70. [70]

    Neuraldome: A neural modeling pipeline on multi-view human-object interactions

    Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Neuraldome: A neural modeling pipeline on multi-view human-object interactions. In CVPR, 2023. 3, 5, 1

  71. [71]

    Hoi-m3: Capture multiple humans and objects interaction within contextual environment

    Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Hoi-m3: Capture multiple humans and objects interaction within contextual environment. In CVPR, 2024. 3

  72. [72]

    Hawor: World-space hand motion reconstruction from egocentric videos

    Jinglei Zhang, Jiankang Deng, Chao Ma, and Rolandos Alexandros Potamias. Hawor: World-space hand motion reconstruction from egocentric videos. arXiv preprint arXiv:2501.02973, 2025. 2

  73. [73]

    Perceiving 3d human-object spatial arrangements from a single image in the wild

    Jason Y. Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3d human-object spatial arrangements from a single image in the wild. In European Conference on Computer Vision (ECCV), 2020. 3, 5, 1

  74. [74]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets

    Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024. 2

  75. [75]

    I’m hoi: Inertia-aware monocular capture of 3d human-object interactions

    Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’m hoi: Inertia-aware monocular capture of 3d human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 729–741, 2024. 3

  76. [76]

    Note that our code and pretrained models will be fully released with detailed documentation to enable reproduction of our results

    Implementation Details. We detail our implementations in this section. Note that our code and pretrained models will be fully released with detailed documentation to enable reproduction of our results. 6.1. CoCoNet Details. Network architecture. We plot the architecture diagram of our CoCoNet in Fig. 7. We adopt DINOv2 [28] as our image encoder and keep th...

  77. [77]

    We show two typical failure cases in Fig

    Limitation and Future. As the first step towards category-agnostic 4D interaction reconstruction, our method shows strong generalization performance to in-the-wild videos, yet there are still some limitations. We show two typical failure cases in Fig. 8. First, our method primarily targets full-body human-object interaction; consequently, it does not ...