arxiv: 2512.11988 · v3 · submitted 2025-12-12 · 💻 cs.CV

CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction

Xianghui Xie, Bowen Wen, Yan Chang, Hesam Rabeti, Jiefeng Li, Ye Yuan, Gerard Pons-Moll, Stan Birchfield

Pith reviewed 2026-05-16 22:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D reconstruction · human-object interaction · category-agnostic · monocular video · render-and-compare · metric scale · contact reasoning · foundation models


The pith

CARI4D reconstructs spatially and temporally consistent 4D human-object interactions at metric scale from monocular RGB videos without assuming any object category.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARI4D as the first category-agnostic pipeline that turns ordinary single-camera video into full 4D reconstructions of a person and an arbitrary object moving together. It works by taking separate predictions from existing foundation models, selecting the most consistent pose hypotheses, and then iteratively refining the entire scene through a learned render-and-compare loop that enforces spatial, temporal, and pixel-level agreement. A final contact-reasoning stage adds physical constraints so that hands and objects touch plausibly. The result is metric-scale output that stays coherent across frames even when the object is completely novel. This matters because most real-world interaction footage comes from ordinary cameras, and previous reconstruction systems either needed a known 3D template or were restricted to a handful of object classes.

Core claim

CARI4D is the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. It achieves this by proposing a pose hypothesis selection algorithm that robustly integrates individual predictions from foundation models, jointly refining them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement that satisfies physical constraints.

What carries the argument

A pose-hypothesis selection step followed by a learned render-and-compare refinement loop that jointly optimizes human and object poses while enforcing temporal consistency and contact constraints.
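One way to picture the selection step is as a shortest-path problem over per-frame candidates. The sketch below is not the authors' algorithm, which this page does not specify. It assumes each frame has a few candidate object translations, each with a hypothetical per-candidate cost (for example a render-and-compare residual), and picks the sequence that minimizes that cost plus a temporal jump penalty. The names `select_hypotheses`, `unary`, and `smooth_weight` are illustrative.

```python
import math


def select_hypotheses(candidates, unary, smooth_weight=1.0):
    """Pick one candidate per frame by dynamic programming (Viterbi-style).

    candidates[t][k]: 3D translation (x, y, z) of hypothesis k at frame t.
    unary[t][k]: hypothetical per-hypothesis cost, lower is better.
    Returns the chosen hypothesis index for every frame.
    """
    # best[t][k]: minimal total cost of any path ending at hypothesis k of frame t
    best = [list(unary[0])]
    back = []
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for k, cand in enumerate(candidates[t]):
            # Transition cost: Euclidean jump from each previous candidate.
            options = [
                best[t - 1][j] + smooth_weight * math.dist(prev, cand)
                for j, prev in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[j_best] + unary[t][k])
            row_back.append(j_best)
        best.append(row_cost)
        back.append(row_back)

    # Backtrack from the cheapest final state.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for row_back in reversed(back):
        k = row_back[k]
        path.append(k)
    return path[::-1]


# Toy data: hypothesis 1 is slightly cheaper per frame in later frames
# but implies implausible jumps in depth.
candidates = [
    [(0.0, 0.0, 1.0), (0.0, 0.0, 5.0)],
    [(0.0, 0.0, 1.1), (0.0, 0.0, 5.0)],
    [(0.0, 0.0, 1.2), (0.0, 0.0, -3.0)],
]
unary = [[0.1, 0.5], [0.2, 0.1], [0.3, 0.2]]
print(select_hypotheses(candidates, unary))
```

With the smoothness term switched off (`smooth_weight=0.0`) the same call follows the cheapest per-frame candidate instead, which is the failure mode a temporal term is meant to prevent.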

If this is right

Reconstruction error drops 38 percent on in-distribution data and 36 percent on unseen object categories compared with prior methods.
The same trained model runs zero-shot on arbitrary internet videos containing novel objects.
Output meshes satisfy metric scale and remain coherent across long sequences without drifting.
Contact reasoning produces physically plausible hand-object touch points that earlier template-free approaches could not guarantee.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Large existing video corpora could be turned into training data for robotics and animation without any manual 3D labeling.
The same refinement loop might be extended to multi-person scenes or to objects that deform during interaction.
Metric-scale output makes the reconstructions directly usable for measuring grasp forces or planning robot motions in real units.

Load-bearing premise

That initial predictions from off-the-shelf foundation models can be reliably combined, and that the learned refinement step will generalize to completely unseen object shapes while still producing physically valid contacts.

What would settle it

Quantitative failure on a held-out object category where contact points between hand and object are visibly violated or where temporal jitter exceeds the error reported on the in-distribution test set.
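Temporal jitter needs a concrete definition before it can be compared across test sets. The page does not give the paper's smoothness metric, so the sketch below uses a common stand-in as an assumption: the mean magnitude of the second finite difference of a 3D trajectory. The function name and the 30 fps default are illustrative.

```python
import math


def mean_acceleration(positions, fps=30.0):
    """Mean magnitude of the second finite difference of a 3D trajectory.

    positions: list of (x, y, z) in metres, one per frame.
    Returns metres per second squared; higher means more jitter.
    """
    if len(positions) < 3:
        raise ValueError("need at least three frames")
    dt = 1.0 / fps
    total = 0.0
    for p0, p1, p2 in zip(positions, positions[1:], positions[2:]):
        # Second difference per axis: (p2 - 2*p1 + p0) / dt^2
        acc = [(c2 - 2.0 * c1 + c0) / (dt * dt) for c0, c1, c2 in zip(p0, p1, p2)]
        total += math.sqrt(sum(a * a for a in acc))
    return total / (len(positions) - 2)


# Constant-velocity motion versus a track whose depth flickers by 1 cm.
smooth = [(0.01 * t, 0.0, 1.0) for t in range(10)]
jittery = [(0.0, 0.0, 1.0 + 0.01 * (t % 2)) for t in range(10)]
print(mean_acceleration(smooth), mean_acceleration(jittery))
```

Computing the same quantity on a held-out category and on the in-distribution test set would give the comparison described above in a single number per split.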

Figures

Figures reproduced from arXiv: 2512.11988 by Bowen Wen, Gerard Pons-Moll, Hesam Rabeti, Jiefeng Li, Stan Birchfield, Xianghui Xie, Yan Chang, Ye Yuan.

**Figure 1.** Results on in-the-wild internet videos. Given a monocular RGB video, CARI4D reconstructs the human and object at metric scale, and tracks the 4D human-object interaction consistently across the video. Our method is category agnostic and generalizes zero-shot.

**Figure 2.** CARI4D method overview. Given a monocular RGB video, we reconstruct the 4D human and object at metric scale with consistent contacts. We start by estimating the metric-scale object mesh (Sec. 3.1), followed by initialization of human and object poses using dynamic pose hypothesis selection (Sec. 3.2). We then train a category agnostic contact reasoning model (CoCoNet) to refine the interaction poses and es…

**Figure 3.** Qualitative comparison on BEHAVE dataset [3]. InterTrack [57] reconstructs human and object as point clouds only, and the shapes are noisy. VisTracker [55] requires known object templates, hence we augment it with our reconstructed objects (denoted as †). Our method reconstructs the objects and tracks the poses accurately. (Purple balls indicate contact predictions.) Panels: Input, PICO, VisTracker†, Ours.

**Figure 4.** Zero-shot generalization to unseen InterCap dataset [18]. †Uses our reconstructed object meshes. Our method reconstructs the metric-scale object accurately and generalizes to unseen objects. (Purple balls indicate contact predictions.)

… data following Sec. 3.3 to align UniDepth [32], NLF [36], and FoundationPose [50] predictions to ground truth depth. We test the models on the BEHAVE test set and unseen Int…

**Figure 6.** Importance of contacts. Without our contact-aware optimization, the model does not properly handle the fine-grained hand-object interaction, leading to floating object or penetration errors. (Purple balls indicate contact predictions.)

Compared to the initial prediction from CoCoNet (Tab. 3 e), our joint optimization improves the motion smoothness and coherency of the contacts. We show two examples in [PI…

**Figure 5.** Generalization to in-the-wild videos. Prior methods predict noisy shape (InterTrack [57]), flipped object pose (VisTracker [55], † with our object reconstruction) or wrong contacts and object position (PICO [10]). Our method generalizes better overall. (Purple balls indicate contact predictions.)

Method | CD-h↓ | CD-o↓ | CD-c↓ | Acc-h↓ | Acc-o↓
a. Raw NLF + FP tracking | 11.53 | 1565.42 | 405.13 | 3.06 | 4.34
b. Raw NLF + FP…

**Figure 7.** CoCoNet architecture. Here b, t, h, w denote batch size, temporal window size, image height and width respectively. We follow a render-and-compare paradigm, hence RGB a and RGB b denote the image from input observation and rendering respectively, same for xyz map and mask (human and object stacked together).

… one A100@80GB GPU in Tab. 5. Image based approach PICO [10] requires the longest time as it optimiz…

**Figure 8.** Failure case examples. Our method focuses on full body interaction and the detailed hand poses are not handled, which can be important for fine-grained object manipulation task (top row). Our method thus failed to reconstruct realistic finger poses for holding the plate. Under highly dynamic motion and extreme occlusion (bottom row), FoundationPose predicts flipped object pose for initialization. Such larg…

Original abstract

Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reason about intricate contacts for further refinement satisfying physical constraints. Experiments show that our method outperforms prior art by 38% on the in-distribution dataset and 36% on the unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CARI4D removes template and category assumptions from monocular 4D HOI reconstruction and shows measurable gains on unseen data, but the learned refinement's robustness to truly arbitrary objects remains the open question.


The paper's core advance is dropping the usual requirements for ground-truth object templates or narrow category lists. It combines foundation-model predictions for initial human and object poses, then runs them through a learned render-and-compare stage that enforces spatial, temporal, and pixel alignment plus contact constraints. That produces the reported 38% and 36% error reductions on in-distribution and unseen test sets, and the authors say the model works zero-shot on internet videos outside the training categories. The pipeline is straightforward to describe and the metric-scale claim is useful for downstream robotics or AR work.

Referee Report

2 major / 2 minor

Summary. The paper presents CARI4D, which the authors claim is the first category-agnostic method for reconstructing spatially and temporally consistent 4D human-object interactions at metric scale from monocular RGB videos. It integrates predictions from off-the-shelf foundation models via a pose hypothesis selection algorithm, applies a learned render-and-compare refinement stage to enforce spatial/temporal/pixel alignment, and adds contact reasoning to satisfy physical constraints. Experiments report 38% and 36% reductions in reconstruction error on in-distribution and unseen datasets respectively, with claims of zero-shot generalization to in-the-wild videos.

Significance. If the generalization to arbitrary unseen object categories holds, the work would be significant for enabling practical 4D HOI capture without category-specific templates or priors, benefiting applications in robotics, AR/VR, and human motion analysis. The composition of foundation models with a learned refinement is a pragmatic direction, and the metric-scale output is a strength, but significance is tempered by the need for stronger evidence that the refinement truly corrects errors for novel geometries outside the training distribution.

major comments (2)

§4.2 (unseen dataset evaluation): The reported 36% error reduction on the unseen set is presented as evidence of category-agnostic generalization, yet the manuscript provides no details on the specific object categories, their topological differences (e.g., thin vs. bulky structures), appearance variations, or quantitative comparison of foundation-model error rates between seen and unseen splits. This directly affects the central claim that the learned render-and-compare stage robustly generalizes without category-specific shape priors.
§3.1–3.3 (pipeline integration): The pose hypothesis selection and refinement stages assume that initial predictions from foundation models (pose, depth, segmentation) can be reliably fused and corrected. No ablation or error-propagation analysis quantifies how inaccuracies in these off-the-shelf components affect final metric-scale consistency or contact satisfaction, which is load-bearing for the 4D reconstruction claims.

minor comments (2)

Abstract: The claimed error reductions (38%/36%) are not accompanied by the specific metric (e.g., MPJPE, object Chamfer distance) or list of baselines, reducing interpretability.
§5 (experiments): Qualitative results on in-the-wild videos are shown but lack corresponding quantitative metrics or failure-case enumeration for heavy occlusion or rapid motion.
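To make the first minor comment concrete: a percentage reduction only means something once the underlying metric is fixed. The sketch below shows one of the candidates the referee names, a symmetric Chamfer distance, together with the relative-reduction arithmetic. It is an assumption that the paper's metric resembles this. Conventions differ (squared versus unsquared distances, sum versus mean), and the numbers in the usage lines are made up for illustration.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty 3D point lists.

    Uses the mean nearest-neighbour Euclidean distance in each direction
    and averages the two directions. Brute force, O(len(a) * len(b)).
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(a, b) + one_way(b, a))


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction, e.g. 0.38 for a 38% drop."""
    return (baseline_error - new_error) / baseline_error


# Illustrative values only: a reconstruction offset by 10 cm along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
print(chamfer_distance(reference, predicted))
print(relative_reduction(10.0, 6.2))
```

Reporting the metric name alongside the baseline and new error values would let a reader redo this arithmetic for the 38% and 36% figures.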

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of CARI4D. We address each major comment below and will revise the manuscript to incorporate additional details and analyses as outlined.


Referee: §4.2 (unseen dataset evaluation): The reported 36% error reduction on the unseen set is presented as evidence of category-agnostic generalization, yet the manuscript provides no details on the specific object categories, their topological differences (e.g., thin vs. bulky structures), appearance variations, or quantitative comparison of foundation-model error rates between seen and unseen splits. This directly affects the central claim that the learned render-and-compare stage robustly generalizes without category-specific shape priors.

Authors: We agree that additional details on the unseen categories would strengthen the generalization evidence. In the revised manuscript, we will expand §4.2 with a description of the specific object categories used in the unseen split, including their topological characteristics (e.g., thin vs. bulky structures) and appearance variations. We will also add a quantitative comparison of foundation-model baseline error rates on seen versus unseen splits to directly illustrate the contribution of the render-and-compare refinement. revision: yes
Referee: §3.1–3.3 (pipeline integration): The pose hypothesis selection and refinement stages assume that initial predictions from foundation models (pose, depth, segmentation) can be reliably fused and corrected. No ablation or error-propagation analysis quantifies how inaccuracies in these off-the-shelf components affect final metric-scale consistency or contact satisfaction, which is load-bearing for the 4D reconstruction claims.

Authors: We acknowledge that an explicit error-propagation analysis would provide stronger support for the pipeline's robustness. While our end-to-end results demonstrate consistent metric-scale and contact improvements, the current manuscript lacks a dedicated ablation isolating the effects of input inaccuracies. In the revision, we will add such an analysis in §3 (e.g., by injecting controlled noise into foundation-model outputs and measuring downstream effects on spatial/temporal consistency and contact satisfaction). revision: yes
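The controlled-noise experiment proposed in this response could be organized as a sweep like the one below. This is a sketch under stated assumptions, not the authors' protocol. The refinement stage is replaced by a hypothetical stand-in (`moving_average` over a 1D depth track), the noise is Gaussian, and the error measure is mean absolute deviation from the clean track.

```python
import random


def moving_average(track, radius=2):
    """Stand-in for a refinement stage: smooths a 1D depth track."""
    out = []
    for i in range(len(track)):
        window = track[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def noise_sweep(clean, pipeline, sigmas, trials=20, seed=0):
    """Mean absolute output error of `pipeline` for each input noise level."""
    rng = random.Random(seed)  # seeded so the sweep is repeatable
    results = {}
    for sigma in sigmas:
        err = 0.0
        for _ in range(trials):
            # Perturb the clean input with zero-mean Gaussian noise.
            noisy = [z + rng.gauss(0.0, sigma) for z in clean]
            refined = pipeline(noisy)
            err += sum(abs(r - z) for r, z in zip(refined, clean)) / len(clean)
        results[sigma] = err / trials
    return results


# Constant 2 m depth track, perturbed at three noise levels (metres).
clean = [2.0] * 50
results = noise_sweep(clean, moving_average, [0.0, 0.05, 0.2])
for sigma, err in results.items():
    print(f"sigma={sigma:.2f}  mean abs error={err:.4f}")
```

Swapping `moving_average` for the real pipeline and the scalar track for pose, depth, and segmentation inputs would give the error-propagation curve the referee asks for, one line per perturbed component.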

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper describes a pipeline that integrates predictions from existing foundation models via a pose hypothesis selection algorithm, followed by a learned render-and-compare refinement for alignment and contact reasoning. No equations, self-definitional steps, or fitted parameters renamed as predictions are present in the abstract or described method. The central claim of category-agnostic performance rests on empirical results (38% and 36% gains) rather than reducing to inputs by construction. Self-citations, if any, are not load-bearing for uniqueness theorems or ansatzes. The derivation is self-contained against external benchmarks and does not match any enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that foundation-model outputs provide usable initial hypotheses for unseen objects and that a learned render-and-compare loss can enforce spatial-temporal-pixel alignment without category-specific supervision.

axioms (2)

domain assumption: Foundation models supply sufficiently accurate initial pose hypotheses for arbitrary objects. The pose hypothesis selection step depends on this premise.
domain assumption: Render-and-compare refinement generalizes across object categories. The learned refinement is presented as category-agnostic.

pith-pipeline@v0.9.0 · 5541 in / 1253 out tokens · 49747 ms · 2026-05-16T22:34:51.415352+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: We propose a pose hypothesis selection algorithm... CoCoNet... render-and-compare paradigm... contact-aware joint optimization framework.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: reconstructs spatially and temporally consistent 4D human-object interaction at metric scale

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 1 internal anchor

[1] https://github.com/cmu-perceptual-computing-lab/openpose.
[2] Stability AI. Stable video diffusion: A novel approach to image-to-video generation. arXiv preprint arXiv:2308.09592, 2023. Available at https://github.com/Stability-AI/generative-models.
[3] Bharat Lal Bhatnagar, Xianghui Xie, Ilya Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[4] Michael J. Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8726–8737, 2023.
[5] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In International Conference on Learning Representations (ICLR), 2025.
[6] Derek Bradley, Tiberiu Popa, Alla Sheffer, Wolfgang Heidrich, and Tamy Boubekeur. Markerless garment capture. ACM Trans. Graph., 27(3):1–9, 2008.
[7] Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. SMPLer-X: Scaling up expressive human pose and shape estimation. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.
[8] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. arXiv preprint arXiv:2501.12375, 2025.
[9] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans. Graph., 34(4), 2015.
[10] Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, and Dimitrios Tzionas. PICO: Reconstructing 3D people in contact with objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1783–1794, 2025.
[11] Sai Kumar Dwivedi, Dimitrije Antić, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, and Dimitrios Tzionas. InteractVLM: 3D interaction reasoning from 2D foundational models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[12] Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J Black, and Otmar Hilliges. HOLD: Category-agnostic 3d reconstruction of interacting hands and objects from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 494–504, 2024.
[13] Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023.
[14] Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, and Otmar Hilliges. Reloo: Reconstructing humans dressed in loose garments from monocular video in the wild. In European conference on computer vision (ECCV), 2024.
[15] Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2avatar-pro: Authentic avatar from videos in the wild via universal prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[16] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In CVPR, 2019.
[17] Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, and Cordelia Schmid. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In CVPR, 2020.
[18] Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction. In German Conference on Pattern Recognition (GCPR), pages 281–299. Springer,
[19] Chaofan Huo, Ye Shi, and Jingya Wang. Monocular human-object reconstruction in the wild. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 5547–5555, New York, NY, USA, 2024. Association for Computing Machinery.
[20] Hsuan I Ho, Jie Song, and Otmar Hilliges. Sith: Single-view textured human reconstruction with image-conditioned diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 538–549, 2024.
[21] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
[22] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8320–8329, 2018.
[23] Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: Generative models for human motion synthesis. arXiv preprint arXiv:2505.01425, 2025.
[24] Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, 2022.
[25] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. In ACM Transactions on Graphics. ACM, 2015.
[26] Marko Mihajlovic, Siwei Zhang, Gen Li, Kaifeng Zhao, Lea Müller, and Siyu Tang. VolumetricSMPL: A neural volumetric body model for efficient interactions, contacts, and collisions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[27] Hyeongjin Nam, Daniel Sungho Jung, Gyeongsik Moon, and Kyoung Mu Lee. Joint reconstruction of 3d human and object via contact-based refinement transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... Dinov2: Learning robust visual features without supervision, 2024.
[29] Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, and Michael J. Black. AGORA: Avatars in geography optimized for regression analysis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13463–13473,
[30] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024.
[31] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[32] Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025.
[33] Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild, 2024.
[34] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022.
[35] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), 2017.
[36] István Sárándi and Gerard Pons-Moll. Neural localizer fields for continuous 3d human pose and shape estimation. 2024.
[37] Konstantin Sofiiuk, Ilia Petrov, Olga Barinova, and Anton Konushin. f-brs: Rethinking backpropagating refinement for interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8623–8632, 2020.
[38] Zhuo Su, Lan Xu, Dawei Zhong, Zhong Li, Fan Deng, Shuxue Quan, and Lu Fang. Robustfusion: Robust volumetric performance reconstruction under human-object interactions from monocular RGBD stream. CoRR, abs/2104.14837, 2021.
[39] Guoxing Sun, Xin Chen, Yizhang Chen, Anqi Pang, Pei Lin, Yuheng Jiang, Lan Xu, Jingya Wang, and Jingyi Yu. Neural free-viewpoint performance rendering under complex human-object interactions. In Proceedings of the 29th ACM International Conference on Multimedia, 2021.
[40] Tencent Hunyuan3D Team. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,
[41] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.
[42] Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, and Michael J. Black. DECO: Dense estimation of 3D human-scene contact in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8001–8013, 2023.
[43]

Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitrii Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. In Proceedings of the European Conference on Computer Vision (ECCV), 2024. 2, 1
[44]

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2642–2651,
[45]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2
[46]

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5261–5271, 2025. Oral. 2
[47]

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint, 2025. 2
[48]

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024. 2
[49]

Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Müller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects. In CVPR, 2023. 2
[50]

Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. In CVPR, 2024. 2, 3, 5, 6, 7, 8
[51]

Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, and Yong-Lu Li. Reconstructing in-the-wild open-vocabulary human-object interactions, 2025. 3
[52]

Zhenzhen Weng and Serena Yeung. Holistic 3d human and scene mesh estimation from single view images. arXiv preprint arXiv:2012.01591, 2020. 3
[53]

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Spotlight. 2
[54]

Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Chore: Contact, human and object reconstruction from a single rgb image. In European Conference on Computer Vision (ECCV). Springer, 2022. 3, 5, 7
[55]

Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Visibility aware human-object interaction tracking from single rgb camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2, 3, 6, 7, 8
[56]

Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. Template free reconstruction of human-object interaction with procedural interaction generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 3, 7
[57]

Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. Intertrack: Tracking human object interaction without object templates. 2024. 2, 3, 6, 7, 8
[58]

Xianghui Xie, Xi Wang, Nikos Athanasiou, Bharat Lal Bhatnagar, Chun-Hao P. Huang, Kaichun Mo, Hao Chen, Xia Jia, Zerui Zhang, Liangxian Cui, Xiao Lin, Bingqiao Qian, Jie Xiao, Wenfei Yang, Hyeongjin Nam, Daniel Sungho Jung, Kihoon Kim, Kyoung Mu Lee, Otmar Hilliges, and Gerard Pons-Moll. RHOBIN Challenge: Reconstruction of human object interaction. arXiv...
[59]

Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, and Gerard Pons-Moll. Mvgbench: Comprehensive benchmark for multi-view generation models,
[60]

Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Gen-3diffusion: Realistic image-to-3d generation via 2d & 3d diffusion synergy, 2024. 2
[61]

Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Human 3diffusion: Realistic avatar creation via explicit 3d consistent diffusion models. In arXiv, 2024. 2
[62]

Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Infinihuman: Infinite 3d human creation with precise control. 2025. 2
[63]

Pradyumna Yalandur-Muralidhar, Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Physic: Physically plausible 3d human-scene interaction and contact from a single image. In ACM SIGGRAPH Asia, 2025. 3
[64]

Lixin Yang, Xinyu Zhan, Kailin Li, Wenqiang Xu, Jiefeng Li, and Cewu Lu. CPF: Learning a contact potential field to model the hand-object interaction. In ICCV, 2021. 3
[65]

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv preprint arXiv:2401.10891. 2
[66]

Yufei Ye, Poorvi Hebbar, Abhinav Gupta, and Shubham Tulsiani. Diffusion-guided reconstruction of everyday hand-object interaction clips. In ICCV, 2023. 3
[67]

Yufei Ye, Abhinav Gupta, Kris Kitani, and Shubham Tulsiani. G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis. In CVPR, 2024. 3
[68]

Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2
[69]

Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, and Qiang Xu. Smoothnet: A plug-and-play network for refining human poses in videos. In European Conference on Computer Vision. Springer, 2022. 7
[70]

Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Neuraldome: A neural modeling pipeline on multi-view human-object interactions. In CVPR, 2023. 3, 5, 1
[71]

Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Hoi-m3: Capture multiple humans and objects interaction within contextual environment. In CVPR, 2024. 3
[72]

Jinglei Zhang, Jiankang Deng, Chao Ma, and Rolandos Alexandros Potamias. Hawor: World-space hand motion reconstruction from egocentric videos. arXiv preprint arXiv:2501.02973, 2025. 2
[73]

Jason Y. Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3d human-object spatial arrangements from a single image in the wild. In European Conference on Computer Vision (ECCV), 2020. 3, 5, 1
[74]

Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024. 2
[75]

Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’m hoi: Inertia-aware monocular capture of 3d human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 729–741, 2024. 3
[76]

Implementation Details

We detail our implementations in this section. Note that our code and pretrained models will be fully released with detailed documentation to enable reproduction of our results.

6.1. CoCoNet Details

Network architecture. We plot the architecture diagram of our CoCoNet in Fig. 7. We adopt DINOv2 [28] as our image encoder and keep th...
[77]

Limitation and Future

As the first step towards category-agnostic 4D interaction reconstruction, our method shows strong generalization performance to in-the-wild videos, yet there are still some limitations. We show two typical failure cases in Fig. 8. First, our method primarily targets full-body human-object interaction; consequently, it does not ...