CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
Pith reviewed 2026-05-16 22:34 UTC · model grok-4.3
The pith
CARI4D reconstructs spatially and temporally consistent 4D human-object interactions at metric scale from monocular RGB videos without assuming any object category.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CARI4D is the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. It achieves this by proposing a pose hypothesis selection algorithm that robustly integrates individual predictions from foundation models, jointly refining them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reasoning about intricate contacts for further refinement that satisfies physical constraints.
What carries the argument
A pose-hypothesis selection step followed by a learned render-and-compare refinement loop that jointly optimizes human and object poses while enforcing temporal consistency and contact constraints.
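A minimal sketch of what a hypothesis-selection step could look like, assuming each candidate pose has already been rendered to a binary mask and that overlap with an observed segmentation mask serves as the score. The scoring rule, the names `mask_iou` and `select_hypothesis`, and the pixel-set representation are illustrative assumptions; this page does not specify the paper's actual selection criterion.

```python
from typing import Sequence, Set, Tuple

Pixel = Tuple[int, int]  # (row, col) of a foreground pixel


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection-over-union of two pixel sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_hypothesis(rendered_masks: Sequence[Set[Pixel]],
                      observed_mask: Set[Pixel]) -> int:
    """Return the index of the rendered hypothesis that best overlaps the observation.

    Assumption: one rendered mask per candidate pose. Ties go to the earliest index.
    """
    if not rendered_masks:
        raise ValueError("need at least one hypothesis")
    scores = [mask_iou(m, observed_mask) for m in rendered_masks]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: three candidate renderings scored against a 2x2 observed mask.
observed = {(0, 0), (0, 1), (1, 0), (1, 1)}
candidates = [{(0, 0)}, {(0, 0), (0, 1), (1, 0)}, {(5, 5)}]
best = select_hypothesis(candidates, observed)  # index 1 (IoU 0.75)
```

A learned render-and-compare stage would replace the fixed IoU score with a network that predicts a pose correction from the rendered and observed images; the sketch only illustrates the select-by-agreement idea.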
If this is right
- Reconstruction error drops 38 percent on in-distribution data and 36 percent on unseen object categories compared with prior methods.
- The same trained model runs zero-shot on arbitrary internet videos containing novel objects.
- Output meshes satisfy metric scale and remain coherent across long sequences without drifting.
- Contact reasoning produces physically plausible hand-object touch points that earlier template-free approaches could not guarantee.
Where Pith is reading between the lines
- Large existing video corpora could be turned into training data for robotics and animation without any manual 3D labeling.
- The same refinement loop might be extended to multi-person scenes or to objects that deform during interaction.
- Metric-scale output makes the reconstructions directly usable for measuring grasp forces or planning robot motions in real units.
Load-bearing premise
Initial predictions from off-the-shelf foundation models can be reliably combined, and the learned refinement step generalizes to completely unseen object shapes while still producing physically valid contacts.
What would settle it
Quantitative failure on a held-out object category where contact points between hand and object are visibly violated or where temporal jitter exceeds the error reported on the in-distribution test set.
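One way to make "temporal jitter" measurable is sketched below, assuming jitter is taken as the mean magnitude of the second finite difference of a 3D trajectory (for example an object centroid or a joint position). This definition and the name `mean_jitter` are assumptions for illustration; the page does not state which jitter measure the paper reports.

```python
import math
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


def mean_jitter(track: Sequence[Point3]) -> float:
    """Mean magnitude of the second finite difference p[t+1] - 2*p[t] + p[t-1].

    Units follow the input (metres per frame squared for metric-scale tracks).
    """
    if len(track) < 3:
        raise ValueError("need at least three frames")
    total = 0.0
    for prev, cur, nxt in zip(track, track[1:], track[2:]):
        acc = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
        total += math.sqrt(sum(a * a for a in acc))
    return total / (len(track) - 2)


# Constant-velocity motion has zero jitter; a back-and-forth flip does not.
smooth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
shaky = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
```

Comparing this value between the in-distribution test set and a held-out category would turn the falsification condition above into a single number per split.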
Original abstract
Accurate capture of human-object interaction from ubiquitous sensors like RGB cameras is important for applications in human understanding, gaming, and robot learning. However, inferring 4D interactions from a single RGB view is highly challenging due to the unknown object and human information, depth ambiguity, occlusion, and complex motion, which hinder consistent 3D and temporal reconstruction. Previous methods simplify the setup by assuming a ground-truth object template or constraining to a limited set of object categories. We present CARI4D, the first category-agnostic method that reconstructs spatially and temporally consistent 4D human-object interaction at metric scale from monocular RGB videos. To this end, we propose a pose hypothesis selection algorithm that robustly integrates the individual predictions from foundation models, jointly refine them through a learned render-and-compare paradigm to ensure spatial, temporal and pixel alignment, and finally reason about intricate contacts for further refinement that satisfies physical constraints. Experiments show that our method outperforms prior art by 38% on an in-distribution dataset and 36% on an unseen dataset in terms of reconstruction error. Our model generalizes beyond the training categories and thus can be applied zero-shot to in-the-wild internet videos. Our code and pretrained models will be publicly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CARI4D, which it claims is the first category-agnostic method for reconstructing spatially and temporally consistent 4D human-object interactions at metric scale from monocular RGB videos. It integrates predictions from off-the-shelf foundation models via a pose hypothesis selection algorithm, applies a learned render-and-compare refinement stage to enforce spatial, temporal, and pixel alignment, and adds contact reasoning to satisfy physical constraints. Experiments report 38% and 36% reductions in reconstruction error on in-distribution and unseen datasets respectively, and the authors claim zero-shot generalization to in-the-wild videos.
Significance. If the generalization to arbitrary unseen object categories holds, the work would be significant for enabling practical 4D HOI capture without category-specific templates or priors, benefiting applications in robotics, AR/VR, and human motion analysis. The composition of foundation models with a learned refinement is a pragmatic direction, and the metric-scale output is a strength, but significance is tempered by the need for stronger evidence that the refinement truly corrects errors for novel geometries outside the training distribution.
major comments (2)
- [§4.2] §4.2 (unseen dataset evaluation): The reported 36% error reduction on the unseen set is presented as evidence of category-agnostic generalization, yet the manuscript provides no details on the specific object categories, their topological differences (e.g., thin vs. bulky structures), appearance variations, or quantitative comparison of foundation-model error rates between seen and unseen splits. This directly affects the central claim that the learned render-and-compare stage robustly generalizes without category-specific shape priors.
- [§3.1–3.3] §3.1–3.3 (pipeline integration): The pose hypothesis selection and refinement stages assume that initial predictions from foundation models (pose, depth, segmentation) can be reliably fused and corrected. No ablation or error-propagation analysis quantifies how inaccuracies in these off-the-shelf components affect final metric-scale consistency or contact satisfaction, which is load-bearing for the 4D reconstruction claims.
minor comments (2)
- [Abstract] Abstract: The claimed error reductions (38%/36%) are not accompanied by the specific metric (e.g., MPJPE, object Chamfer distance) or list of baselines, reducing interpretability.
- [§5] §5 (experiments): Qualitative results on in-the-wild videos are shown but lack corresponding quantitative metrics or failure-case enumeration for heavy occlusion or rapid motion.
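For reference, the two metrics the referee names as examples, plus the relative-reduction arithmetic behind a figure like "38%", can be sketched as follows. This assumes plain Euclidean distances and a symmetric, unsquared Chamfer distance; conventions differ between papers (squared distances, halved sums), and the page does not say which metric or convention CARI4D uses. The numbers in the usage lines are hypothetical, not results from the paper.

```python
import math
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]


def mpjpe(pred: Sequence[Point3], gt: Sequence[Point3]) -> float:
    """Mean per-joint position error: average Euclidean distance over paired joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def chamfer(a: Sequence[Point3], b: Sequence[Point3]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance a->b plus b->a.

    Brute force, O(len(a) * len(b)); fine for small point sets only.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fractional error reduction relative to the baseline, e.g. 0.38 for 38%."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical values chosen only to illustrate the arithmetic.
joint_err = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 0)])  # 0.5
shape_err = chamfer([(0, 0, 0)], [(1, 0, 0)])                      # 2.0
gain = relative_reduction(10.0, 6.2)                               # about 0.38
```

Stating which of these (or which other metric) the 38% and 36% figures refer to is exactly what the first minor comment asks for.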
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential impact of CARI4D. We address each major comment below and will revise the manuscript to incorporate additional details and analyses as outlined.
-
Referee: [§4.2] §4.2 (unseen dataset evaluation): The reported 36% error reduction on the unseen set is presented as evidence of category-agnostic generalization, yet the manuscript provides no details on the specific object categories, their topological differences (e.g., thin vs. bulky structures), appearance variations, or quantitative comparison of foundation-model error rates between seen and unseen splits. This directly affects the central claim that the learned render-and-compare stage robustly generalizes without category-specific shape priors.
Authors: We agree that additional details on the unseen categories would strengthen the generalization evidence. In the revised manuscript, we will expand §4.2 with a description of the specific object categories used in the unseen split, including their topological characteristics (e.g., thin vs. bulky structures) and appearance variations. We will also add a quantitative comparison of foundation-model baseline error rates on seen versus unseen splits to directly illustrate the contribution of the render-and-compare refinement. revision: yes
-
Referee: [§3.1–3.3] §3.1–3.3 (pipeline integration): The pose hypothesis selection and refinement stages assume that initial predictions from foundation models (pose, depth, segmentation) can be reliably fused and corrected. No ablation or error-propagation analysis quantifies how inaccuracies in these off-the-shelf components affect final metric-scale consistency or contact satisfaction, which is load-bearing for the 4D reconstruction claims.
Authors: We acknowledge that an explicit error-propagation analysis would provide stronger support for the pipeline's robustness. While our end-to-end results demonstrate consistent metric-scale and contact improvements, the current manuscript lacks a dedicated ablation isolating the effects of input inaccuracies. In the revision, we will add such an analysis in §3 (e.g., by injecting controlled noise into foundation-model outputs and measuring downstream effects on spatial/temporal consistency and contact satisfaction). revision: yes
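The noise-injection analysis the authors propose could be organized along the lines below. This is a sketch under stated assumptions: `toy_pipeline` is a hypothetical stand-in (a median over depth values) for the real system, the perturbation is zero-mean Gaussian noise, and the reported quantity is deviation from the clean-input output rather than error against ground truth.

```python
import random
import statistics
from typing import Callable, Dict, Sequence


def noise_sweep(clean_inputs: Sequence[float],
                pipeline: Callable[[Sequence[float]], float],
                sigmas: Sequence[float],
                trials: int = 20,
                seed: int = 0) -> Dict[float, float]:
    """Map each noise level to the mean absolute change in the pipeline output.

    For every sigma, Gaussian noise is added to each input value, the pipeline
    is re-run, and the output is compared with the clean-input output.
    """
    rng = random.Random(seed)  # fixed seed so the sweep is repeatable
    reference = pipeline(clean_inputs)
    results: Dict[float, float] = {}
    for sigma in sigmas:
        total = 0.0
        for _ in range(trials):
            noisy = [x + rng.gauss(0.0, sigma) for x in clean_inputs]
            total += abs(pipeline(noisy) - reference)
        results[sigma] = total / trials
    return results


def toy_pipeline(depths: Sequence[float]) -> float:
    """Hypothetical stand-in for the full system: a robust scale estimate."""
    return statistics.median(depths)


# Hypothetical depth samples in metres; sigma is in the same units.
sweep = noise_sweep([2.0, 2.1, 1.9, 2.05], toy_pipeline, sigmas=[0.0, 0.1])
```

In a real ablation the callable would wrap the full reconstruction pipeline and return a consistency or contact metric, with one sweep per perturbed component (pose, depth, segmentation).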
Circularity Check
No circularity in derivation chain
The paper describes a pipeline that integrates predictions from existing foundation models via a pose hypothesis selection algorithm, followed by a learned render-and-compare refinement for alignment and contact reasoning. No equations, self-definitional steps, or fitted parameters renamed as predictions are present in the abstract or described method. The central claim of category-agnostic performance rests on empirical results (38% and 36% gains) rather than reducing to inputs by construction. Self-citations, if any, are not load-bearing for uniqueness theorems or ansatzes. The derivation is self-contained against external benchmarks and does not match any enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Foundation models supply sufficiently accurate initial pose hypotheses for arbitrary objects
- domain assumption Render-and-compare refinement generalizes across object categories
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · J_uniquely_calibrated_via_higher_derivative · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We propose a pose hypothesis selection algorithm... CoCoNet... render-and-compare paradigm... contact-aware joint optimization framework."
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "reconstructs spatially and temporally consistent 4D human-object interaction at metric scale"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
https://github.com/cmu-perceptual-computing-lab/openpose
-
[2]
Stability AI. Stable video diffusion: A novel approach to image-to-video generation. arXiv preprint arXiv:2308.09592, 2023. Available at https://github.com/Stability-AI/generative-models.
-
[3]
Bharat Lal Bhatnagar, Xianghui Xie, Ilya Petrov, Cristian Sminchisescu, Christian Theobalt, and Gerard Pons-Moll. Behave: Dataset and method for tracking human object interactions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
-
[4]
Michael J. Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8726–8737, 2023.
-
[5]
Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In International Conference on Learning Representations (ICLR), 2025.
-
[6]
Derek Bradley, Tiberiu Popa, Alla Sheffer, Wolfgang Heidrich, and Tamy Boubekeur. Markerless garment capture. ACM Trans. Graph., 27(3):1–9, 2008.
-
[7]
Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. SMPLer-X: Scaling up expressive human pose and shape estimation. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023.
-
[8]
Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. arXiv preprint arXiv:2501.12375, 2025.
-
[9]
Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Trans. Graph., 34(4), 2015.
-
[10]
Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, and Dimitrios Tzionas. PICO: Reconstructing 3D people in contact with objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1783–1794, 2025.
-
[11]
Sai Kumar Dwivedi, Dimitrije Antić, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, and Dimitrios Tzionas. InteractVLM: 3D interaction reasoning from 2D foundational models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
-
[12]
Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J Black, and Otmar Hilliges. HOLD: Category-agnostic 3d reconstruction of interacting hands and objects from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 494–504, 2024.
-
[13]
Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Humans in 4D: Reconstructing and tracking humans with transformers. In ICCV, 2023.
-
[14]
Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, and Otmar Hilliges. Reloo: Reconstructing humans dressed in loose garments from monocular video in the wild. In European conference on computer vision (ECCV), 2024.
-
[15]
Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2avatar-pro: Authentic avatar from videos in the wild via universal prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
-
[16]
Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In CVPR, 2019.
-
[17]
Yana Hasson, Bugra Tekin, Federica Bogo, Ivan Laptev, Marc Pollefeys, and Cordelia Schmid. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In CVPR, 2020.
-
[18]
Yinghao Huang, Omid Taheri, Michael J. Black, and Dimitrios Tzionas. InterCap: Joint markerless 3D tracking of humans and objects in interaction. In German Conference on Pattern Recognition (GCPR), pages 281–299. Springer,
-
[19]
Chaofan Huo, Ye Shi, and Jingya Wang. Monocular human-object reconstruction in the wild. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 5547–5555, New York, NY, USA, 2024. Association for Computing Machinery.
-
[20]
Hsuan I Ho, Jie Song, and Otmar Hilliges. Sith: Single-view textured human reconstruction with image-conditioned diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 538–549, 2024.
-
[21]
Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
-
[22]
Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8320–8329, 2018.
-
[23]
Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, and Ye Yuan. Genmo: Generative models for human motion synthesis. arXiv preprint arXiv:2505.01425, 2025.
-
[24]
Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21013–21022, 2022.
-
[25]
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. In ACM Transactions on Graphics. ACM, 2015.
-
[26]
Marko Mihajlovic, Siwei Zhang, Gen Li, Kaifeng Zhao, Lea Müller, and Siyu Tang. VolumetricSMPL: A neural volumetric body model for efficient interactions, contacts, and collisions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
-
[27]
Hyeongjin Nam, Daniel Sungho Jung, Gyeongsik Moon, and Kyoung Mu Lee. Joint reconstruction of 3d human and object via contact-based refinement transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
-
[28]
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... Dinov2: Learning robust visual features without supervision, 2024.
-
[29]
Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, and Michael J. Black. AGORA: Avatars in geography optimized for regression analysis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13463–13473,
-
[30]
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In CVPR, 2024.
-
[31]
Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
-
[32]
Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler, 2025.
-
[33]
Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild, 2024.
-
[34]
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3), 2022.
-
[35]
Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), 2017.
-
[36]
István Sárándi and Gerard Pons-Moll. Neural localizer fields for continuous 3d human pose and shape estimation. 2024.
-
[37]
Konstantin Sofiiuk, Ilia Petrov, Olga Barinova, and Anton Konushin. f-brs: Rethinking backpropagating refinement for interactive segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8623–8632, 2020.
-
[38]
Zhuo Su, Lan Xu, Dawei Zhong, Zhong Li, Fan Deng, Shuxue Quan, and Lu Fang. Robustfusion: Robust volumetric performance reconstruction under human-object interactions from monocular RGBD stream. CoRR, abs/2104.14837, 2021.
-
[39]
Guoxing Sun, Xin Chen, Yizhang Chen, Anqi Pang, Pei Lin, Yuheng Jiang, Lan Xu, Jingya Wang, and Jingyi Yu. Neural free-viewpoint performance rendering under complex human-object interactions. In Proceedings of the 29th ACM International Conference on Multimedia, 2021.
-
[40]
Tencent Hunyuan3D Team. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation,
-
[41]
Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. Triposr: Fast 3d object reconstruction from a single image. arXiv preprint arXiv:2403.02151, 2024.
-
[42]
Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, and Michael J. Black. DECO: Dense estimation of 3D human-scene contact in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8001–8013, 2023.
-
[43]
Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitrii Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. SV3D: Novel multi-view synthesis and 3D generation from a single image using latent video diffusion. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
-
[44]
He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2642–2651,
-
[45]
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
-
[46]
Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5261–5271, 2025. Oral.
-
[47]
Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint, 2025.
-
[48]
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.
-
[49]
Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Müller, Alex Evans, Dieter Fox, Jan Kautz, and Stan Birchfield. BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects. In CVPR, 2023.
-
[50]
Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. In CVPR, 2024.
-
[51]
Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, and Yong-Lu Li. Reconstructing in-the-wild open-vocabulary human-object interactions, 2025.
-
[52]
Zhenzhen Weng and Serena Yeung. Holistic 3d human and scene mesh estimation from single view images. arXiv preprint arXiv:2012.01591, 2020.
-
[53]
Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Spotlight.
-
[54]
Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Chore: Contact, human and object reconstruction from a single rgb image. In European Conference on Computer Vision (ECCV). Springer, 2022.
-
[55]
Xianghui Xie, Bharat Lal Bhatnagar, and Gerard Pons-Moll. Visibility aware human-object interaction tracking from single rgb camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
-
[56]
Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen, and Gerard Pons-Moll. Template free reconstruction of human-object interaction with procedural interaction generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
-
[57]
Xianghui Xie, Jan Eric Lenssen, and Gerard Pons-Moll. Intertrack: Tracking human object interaction without object templates. 2024.
-
[58]
Xianghui Xie, Xi Wang, Nikos Athanasiou, Bharat Lal Bhatnagar, Chun-Hao P. Huang, Kaichun Mo, Hao Chen, Xia Jia, Zerui Zhang, Liangxian Cui, Xiao Lin, Bingqiao Qian, Jie Xiao, Wenfei Yang, Hyeongjin Nam, Daniel Sungho Jung, Kihoon Kim, Kyoung Mu Lee, Otmar Hilliges, and Gerard Pons-Moll. RHOBIN Challenge: Reconstruction of human object interaction. arXiv...
-
[59]
Xianghui Xie, Chuhang Zou, Meher Gitika Karumuri, Jan Eric Lenssen, and Gerard Pons-Moll. Mvgbench: Comprehensive benchmark for multi-view generation models,
-
[60]
Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Gen-3diffusion: Realistic image-to-3d generation via 2d & 3d diffusion synergy, 2024.
-
[61]
Yuxuan Xue, Xianghui Xie, Riccardo Marin, and Gerard Pons-Moll. Human 3diffusion: Realistic avatar creation via explicit 3d consistent diffusion models. In arXiv, 2024.
-
[62]
Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Infinihuman: Infinite 3d human creation with precise control. 2025.
-
[63]
Pradyumna Yalandur-Muralidhar, Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, and Gerard Pons-Moll. Physic: Physically plausible 3d human-scene interaction and contact from a single image. In ACM SIGGRAPH Asia, 2025.
-
[64]
Lixin Yang, Xinyu Zhan, Kailin Li, Wenqiang Xu, Jiefeng Li, and Cewu Lu. CPF: Learning a contact potential field to model the hand-object interaction. In ICCV, 2021.
-
[65]
Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv preprint arXiv:2401.10891.
-
[66]
Yufei Ye, Poorvi Hebbar, Abhinav Gupta, and Shubham Tulsiani. Diffusion-guided reconstruction of everyday hand-object interaction clips. In ICCV, 2023.
-
[67]
Yufei Ye, Abhinav Gupta, Kris Kitani, and Shubham Tulsiani. G-hop: Generative hand-object prior for interaction reconstruction and grasp synthesis. In CVPR, 2024.
-
[68]
Ye Yuan, Umar Iqbal, Pavlo Molchanov, Kris Kitani, and Jan Kautz. Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
-
[69]
Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, and Qiang Xu. Smoothnet: A plug-and-play network for refining human poses in videos. In European Conference on Computer Vision. Springer, 2022.
-
[70]
Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Neuraldome: A neural modeling pipeline on multi-view human-object interactions. In CVPR, 2023.
-
[71]
Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, and Jingya Wang. Hoi-m3: Capture multiple humans and objects interaction within contextual environment. In CVPR, 2024.
-
[72]
Jinglei Zhang, Jiankang Deng, Chao Ma, and Rolandos Alexandros Potamias. Hawor: World-space hand motion reconstruction from egocentric videos. arXiv preprint arXiv:2501.02973, 2025.
-
[73]
Jason Y. Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa. Perceiving 3d human-object spatial arrangements from a single image in the wild. In European Conference on Computer Vision (ECCV), 2020.
-
[74]
Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024.
-
[75]
Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, and Lan Xu. I’m hoi: Inertia-aware monocular capture of 3d human-object interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 729–741, 2024.
-
[76]
Implementation Details. We detail our implementations in this section. Note that our code and pretrained models will be fully released with detailed documentation to enable reproduction of our results. 6.1. CoCoNet Details. Network architecture. We plot the architecture diagram of our CoCoNet in Fig. 7. We adopt DINOv2 [28] as our image encoder and keep th...
-
[77]
Limitation and Future. As the first step towards category agnostic 4D interaction reconstruction, our method shows strong generalization performance to in-the-wild videos, yet there are still some limitations. We show two typical failure cases in Fig. 8. First, our method primarily targets full-body human-object interaction; consequently, it does not ...