
arxiv: 2605.17354 · v1 · pith:TY4FXWG6new · submitted 2026-05-17 · 💻 cs.CV

GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction

Pith reviewed 2026-05-20 13:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D hand reconstruction · monocular geometry estimation · geometric priors · hand-object interaction · occlusion handling · cross-modal fusion · iterative refinement · computer vision

The pith

GeoHand adapts geometric priors from a frozen scene estimator to resolve ambiguities in monocular 3D hand reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that RGB appearance features alone leave severe ambiguities in 3D hand shape and pose when hands are self-occluded or interacting with objects. It shows these ambiguities can be reduced by borrowing spatial cues from a general-purpose monocular geometry model without retraining that model. The method adds a lightweight adapter to adjust the borrowed features for hands, fuses them selectively with image tokens, and applies an iterative local refiner driven by projected keypoints. If the approach works, hand reconstruction becomes more reliable in exactly the cluttered, occluded settings where current systems fail most often.

Core claim

GeoHand unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). It introduces a map-level GeoAdapter to recalibrate the spatial features for detailed hand reconstruction, employs a gated cross-modal token fusion strategy to integrate the adapted priors without overwhelming intrinsic RGB appearance cues, and designs a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction.
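The page does not reproduce the paper's formulation of the gated fusion step, so the following is only a minimal sketch of one common way to gate a second modality into a token stream. The function name `gated_fusion`, the weight shapes, and the residual form are all assumptions made for illustration, not the authors' design. The point it shows is how a gate near zero leaves the RGB tokens untouched, which is the sense in which gating keeps geometric priors from overwhelming appearance cues.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(rgb_tokens, geo_tokens, w_gate, b_gate, w_proj):
    """Hypothetical gated residual fusion (not the paper's exact module).

    rgb_tokens: (N, D) appearance tokens
    geo_tokens: (N, D) adapted geometry tokens for the same N patches
    w_gate: (2D, D), b_gate: (D,), w_proj: (D, D)
    """
    joint = np.concatenate([rgb_tokens, geo_tokens], axis=-1)  # (N, 2D)
    gate = sigmoid(joint @ w_gate + b_gate)                    # (N, D), values in (0, 1)
    # The gate scales how much projected geometry is added to each RGB channel.
    return rgb_tokens + gate * (geo_tokens @ w_proj)

rng = np.random.default_rng(0)
N, D = 4, 8
rgb = rng.normal(size=(N, D))
geo = rng.normal(size=(N, D))
w_gate = rng.normal(size=(2 * D, D)) * 0.1
w_proj = rng.normal(size=(D, D)) * 0.1

# A strongly negative bias closes the gate, so the output stays at the RGB tokens.
closed = gated_fusion(rgb, geo, w_gate, np.full(D, -50.0), w_proj)
# A zero bias leaves the gate partly open, so geometry is mixed in.
opened = gated_fusion(rgb, geo, w_gate, np.zeros(D), w_proj)
```

In a trained model the gate weights would be learned, so the network itself decides per token and per channel how much geometry to admit.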

What carries the argument

The map-level GeoAdapter that recalibrates spatial features from the general-scene estimator for hand-specific use.

If this is right

  • Global geometric disambiguation combined with local keypoint-driven refinement produces more accurate 3D hands on FreiHAND, DexYCB, and HO3Dv3.
  • The same pipeline improves results particularly when self-occlusions or hand-object contacts are present.
  • Gated fusion prevents the added geometric priors from overwriting useful appearance information.
  • Iterative querying of image features at projected joint locations corrects local articulation errors.
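The keypoint-queried refinement described above can be sketched roughly as follows. This is an illustrative reading, not the paper's KQIR: the pinhole camera model, the bilinear sampler, the linear update head `w_head`, the 21-joint layout, and the number of steps are all assumptions introduced here.

```python
import numpy as np

def project(joints_3d, focal, center):
    """Pinhole projection of (J, 3) camera-space joints to (J, 2) pixel coordinates."""
    return focal * joints_3d[:, :2] / joints_3d[:, 2:3] + center

def bilinear_sample(feat, uv):
    """Sample a (H, W, C) feature map at (J, 2) continuous (x, y) locations."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    top = feat[v0, u0] * (1 - du) + feat[v0, u1] * du
    bot = feat[v1, u0] * (1 - du) + feat[v1, u1] * du
    return top * (1 - dv) + bot * dv

def refine(joints_3d, feat, w_head, focal, center, steps=3):
    """Hypothetical loop: project joints, query features there, add a residual."""
    for _ in range(steps):
        uv = project(joints_3d, focal, center)     # where each joint lands in the image
        queried = bilinear_sample(feat, uv)        # (J, C) features at those locations
        joints_3d = joints_3d + queried @ w_head   # (J, 3) correction from a linear head
    return joints_3d

rng = np.random.default_rng(0)
feat = rng.normal(size=(32, 32, 16))               # placeholder feature map
joints = np.column_stack([rng.uniform(-0.05, 0.05, size=(21, 2)),
                          rng.uniform(0.4, 0.6, size=21)])
w_head = rng.normal(size=(16, 3)) * 1e-3           # placeholder, untrained weights
refined = refine(joints, feat, w_head, focal=100.0, center=np.array([16.0, 16.0]))
```

Because each step re-projects the updated joints, later queries read features from the corrected locations, which is what makes the refinement iterative rather than a single lookup.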

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adapter-and-fusion pattern could transfer geometric priors to other articulated objects such as bodies or tools.
  • Making the adapter learnable end-to-end rather than map-level might further reduce domain gap between scene geometry and hand geometry.
  • The method implies that frozen foundation geometry models can be specialized for narrow tasks with far less data than training a hand-specific estimator from scratch.

Load-bearing premise

The spatial features from the frozen general-scene estimator contain information that can be recalibrated for fine-grained hand geometry without destroying useful cues.

What would settle it

An ablation that removes the GeoAdapter and the geometric priors, leaving an RGB-only baseline. A clear drop in accuracy on occluded or hand-object test cases would support the claim; no measurable drop would mean the priors contribute nothing and the claim fails.

Figures

Figures reproduced from arXiv: 2605.17354 by Liangchen Dai, Weiquan Lin, Xingyu Chen, Xu Tang, Yaoqing Hu.

Figure 1: Comparison of 3D reconstruction pipelines. Unlike pure RGB or RGB-D methods, GeoHand extracts
Figure 2: Overall pipeline of GeoHand. The network extracts parallel appearance tokens and prior geometry features
Figure 3: Qualitative comparisons on DexYCB (top four rows) and HO3Dv3 (bottom four rows). Compared to WiLoR
Figure 4: Qualitative evaluation on in-the-wild multi-hand scenarios with crowded layouts and complex occlusions.
Figure 6: Attention heatmaps on FreiHAND. Guided by geometric priors, GeoHand focuses entirely on hand structures, avoiding the background distractions seen in the RGB-only baseline
Figure 7: Qualitative evaluation of GeoHand in in-the-wild single-hand object interaction scenarios.
Figure 8: Visualization of the FreiHAND evaluation set. Samples in each triplet are the input image, the ground-truth
Original abstract

Monocular 3D hand reconstruction is intrinsically a geometric problem, yet RGB appearance features alone often struggle to resolve severe ambiguities caused by self-occlusions and hand-object interactions. While introducing depth can explicitly provide spatial cues, raw sensor-captured depth maps are extensively noisy and incomplete, limiting their usefulness for fine-grained hand reconstruction. To bridge this gap, we propose GeoHand, a novel framework that unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). Recognizing that these priors are oriented toward general scenes, we introduce a map-level GeoAdapter to recalibrate the spatial features, specifically adapting them for detailed hand reconstruction. Furthermore, to systematically integrate these adapted priors without overwhelming intrinsic RGB appearance cues, we employ a gated cross-modal token fusion strategy. Finally, to secure precise local articulation, we design a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction. By combining global geometric disambiguation with local refinement in a unified pipeline, GeoHand achieves state-of-the-art performance on FreiHAND, DexYCB, and HO3Dv3, especially under severe occlusions and hand-object interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes GeoHand, a framework for monocular 3D hand reconstruction that extracts geometric priors from a frozen general-scene estimator (MoGe2), adapts them via a map-level GeoAdapter for hand-specific use, integrates them with RGB cues through gated cross-modal token fusion, and refines local articulation with a Keypoint-Queried Iterative Refiner (KQIR). The central claim is that this unified pipeline achieves state-of-the-art results on FreiHAND, DexYCB, and HO3Dv3, with particular gains under severe self-occlusions and hand-object interactions.

Significance. If the empirical claims hold after verification, the work would be moderately significant for computer vision: it offers a concrete demonstration of adapting foundational monocular geometry models to a specialized fine-grained task without sensor depth, potentially improving robustness in occluded and interactive settings. The approach is conceptually straightforward and could inspire similar adapter-based transfers in related reconstruction problems.

major comments (2)
  1. [Section 3.2] Section 3.2 (GeoAdapter): The claim that the map-level GeoAdapter recalibrates MoGe2 spatial features for detailed hand geometry without eroding occlusion-disambiguation cues is load-bearing for the global disambiguation step, yet the manuscript supplies no feature-map visualizations, correlation analysis between adapted depth maps and ground-truth hand surfaces, or ablation isolating adapter performance specifically on occlusion-heavy frames. Without such evidence the assumption that the adapted priors remain usable remains unverified.
  2. [Section 4] Section 4 (Experiments): The SOTA claims on FreiHAND, DexYCB, and HO3Dv3, especially on occluded subsets, are presented without error bars, full ablation tables isolating each component (GeoAdapter, gated fusion, KQIR), or implementation details such as training schedules and hyper-parameters. This makes it impossible to confirm that reported gains are robust and attributable to the proposed geometry adaptation rather than other factors.
minor comments (2)
  1. [Section 3.3] Notation for the gated fusion module is introduced without an explicit equation; adding a compact formulation would improve clarity.
  2. [Figure 5] Figure captions for qualitative results should explicitly label the occlusion and interaction severity levels shown in each row.
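Major comment 2 asks for error bars on the reported results. One common way to obtain them from per-sample errors is a percentile bootstrap, sketched below. The helper name `bootstrap_ci`, the resample count, and the synthetic error values are assumptions for illustration only; nothing here reproduces numbers from the paper.

```python
import numpy as np

def bootstrap_ci(per_sample_errors, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-sample errors."""
    errors = np.asarray(per_sample_errors, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the test set with replacement and recompute the mean each time.
    idx = rng.integers(0, len(errors), size=(n_resamples, len(errors)))
    means = errors[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return errors.mean(), lo, hi

# Synthetic placeholder errors, NOT results from the paper.
fake_errors = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=200)
mean, lo, hi = bootstrap_ci(fake_errors)
print(f"mean error {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Running the same procedure separately on an occluded subset and on the full test set would show whether a reported gain exceeds the sampling noise of each split.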

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments and suggestions. We provide detailed responses to each major comment and have made revisions to the manuscript to address the raised concerns.

Point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (GeoAdapter): The claim that the map-level GeoAdapter recalibrates MoGe2 spatial features for detailed hand geometry without eroding occlusion-disambiguation cues is load-bearing for the global disambiguation step, yet the manuscript supplies no feature-map visualizations, correlation analysis between adapted depth maps and ground-truth hand surfaces, or ablation isolating adapter performance specifically on occlusion-heavy frames. Without such evidence the assumption that the adapted priors remain usable remains unverified.

    Authors: We appreciate the referee's observation regarding the need for more direct evidence supporting the GeoAdapter. Although the overall performance improvements on occlusion-heavy scenarios in our experiments suggest the effectiveness of the adapted priors, we acknowledge that explicit visualizations and targeted ablations would provide stronger verification. In the revised manuscript, we have added feature-map visualizations of the original and adapted depth maps, a correlation analysis with ground-truth hand surfaces, and an ablation isolating the GeoAdapter's performance on occlusion-heavy frames. These additions confirm that the occlusion-disambiguation cues are preserved and enhanced for hand-specific reconstruction. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): The SOTA claims on FreiHAND, DexYCB, and HO3Dv3, especially on occluded subsets, are presented without error bars, full ablation tables isolating each component (GeoAdapter, gated fusion, KQIR), or implementation details such as training schedules and hyper-parameters. This makes it impossible to confirm that reported gains are robust and attributable to the proposed geometry adaptation rather than other factors.

    Authors: We agree that including error bars, comprehensive ablations, and implementation details is essential for demonstrating the robustness of our results. We have updated the experimental section to include error bars for all quantitative results, expanded the ablation tables to isolate the contributions of the GeoAdapter, gated fusion, and KQIR individually, and provided full details on training schedules, hyper-parameters, and other implementation aspects in the revised manuscript and supplementary material. This allows readers to better assess the attribution of the performance gains to our proposed components. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes a pipeline that takes outputs from an external frozen model (MoGe2) as input, applies a map-level GeoAdapter for recalibration, uses gated fusion to integrate with RGB features, and adds a KQIR module for local refinement. No equations, self-citations, or fitted parameters are shown in the provided text that reduce the claimed SOTA performance or geometric disambiguation to a quantity defined by the target metrics or by the authors' own prior results. The central claims rest on the independent utility of the external priors plus standard benchmark evaluation, making the derivation self-contained against external references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The framework rests on the assumption that general-scene geometry features are transferable to hands after lightweight adaptation. No free parameters are explicitly fitted in the abstract description. Two new modules are introduced as architectural inventions.

axioms (1)
  • domain assumption: Geometric priors from a frozen general-scene monocular estimator (MoGe2) contain transferable spatial information that can be recalibrated for detailed hand reconstruction.
    This premise is invoked to justify the entire GeoAdapter and fusion strategy.
invented entities (2)
  • map-level GeoAdapter (no independent evidence)
    purpose: Recalibrate spatial features from MoGe2 specifically for hand geometry.
    New component introduced to bridge general-scene priors to the hand domain.
  • Keypoint-Queried Iterative Refiner (KQIR) (no independent evidence)
    purpose: Use projected joint locations to query and correct local geometry-aware features.
    New iterative module for precise local articulation.

pith-pipeline@v0.9.0 · 5753 in / 1609 out tokens · 52461 ms · 2026-05-20T13:44:08.265026+00:00 · methodology



Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 4 internal anchors

  1. [1]

    Frei- hand: A dataset for markerless capture of hand pose and shape from single rgb images

    Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. Frei- hand: A dataset for markerless capture of hand pose and shape from single rgb images. InProceedings of the IEEE/CVF international conference on computer vision, pages 813–822, 2019. 10 GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction

  2. [2]

    Mobrecon: Mobile-friendly hand mesh reconstruction from monocular image

    Xingyu Chen, Yufeng Liu, Yajiao Dong, Xiong Zhang, Chongyang Ma, Yanmin Xiong, Yuan Zhang, and Xi- aoyan Guo. Mobrecon: Mobile-friendly hand mesh reconstruction from monocular image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20544–20554, 2022

  3. [3]

    Reconstructing hands in 3d with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3d with transformers. InPro- ceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 9826–9836, 2024

  4. [4]

    Expressive body capture: 3d hands, face, and body from a single image

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019

  5. [5]

    Em- bodied hands: Modeling and capturing hands and bodies to- gether

    Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together.arXiv preprint arXiv:2201.02610, 2022

  6. [6]

    Keep it smpl: Automatic estimation of 3d human pose and shape from a single image

    Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. InEuropean conference on computer vision, pages 561–578. Springer, 2016

  7. [7]

    Learning to reconstruct 3d hu- man pose and shape via model-fitting in the loop

    Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d hu- man pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2252–2261, 2019

  8. [8]

    Hands Deep in Deep Learning for Hand Pose Estimation

    Markus Oberweger, Paul Wohlhart, and Vincent Lep- etit. Hands deep in deep learning for hand pose estima- tion.arXiv preprint arXiv:1502.06807, 2015

  9. [9]

    3d hand shape and pose from images in the wild

    Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3d hand shape and pose from images in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10843– 10852, 2019

  10. [10]

    End-to-end hand mesh recovery from a monocular rgb image

    Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular rgb image. InProceedings of the IEEE/CVF international conference on computer vision, pages 2354–2364, 2019

  11. [11]

    Awr: Adaptive weighting regression for 3d hand pose estimation

    Weiting Huang, Pengfei Ren, Jingyu Wang, Qi Qi, and Haifeng Sun. Awr: Adaptive weighting regression for 3d hand pose estimation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11061–11068, 2020

  12. [12]

    Ganerated hands for real-time 3d hand tracking from monocular rgb

    Franziska Mueller, Florian Bernard, Oleksandr Sotny- chenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. Ganerated hands for real-time 3d hand tracking from monocular rgb. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 49–59, 2018

  13. [13]

    Weakly- supervised mesh-convolutional hand reconstruction in the wild

    Dominik Kulon, Riza Alp Guler, Iasonas Kokkinos, Michael M Bronstein, and Stefanos Zafeiriou. Weakly- supervised mesh-convolutional hand reconstruction in the wild. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 4990–5000, 2020

  14. [14]

    Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering

    Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1067–1076, 2019

  15. [15]

    3d hand shape and pose estimation from a single rgb image

    Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingy- ing Wang, Jianfei Cai, and Junsong Yuan. 3d hand shape and pose estimation from a single rgb image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10833–10842, 2019

  16. [16]

    Convolutional mesh regression for single-image human shape reconstruction

    Nikos Kolotouros, Georgios Pavlakos, and Kostas Dani- ilidis. Convolutional mesh regression for single-image human shape reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4501–4510, 2019

  17. [17]

    Camera-space hand mesh recovery via semantic aggregation and adaptive 2d-1d registration

    Xingyu Chen, Yufeng Liu, Chongyang Ma, Jianlong Chang, Huayan Wang, Tian Chen, Xiaoyan Guo, Pengfei Wan, and Wen Zheng. Camera-space hand mesh recovery via semantic aggregation and adaptive 2d-1d registration. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13274–13283, 2021

  18. [18]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 11 GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction 16x16 words: Transformers for image recognition at scale.arXiv preprint ar...

  19. [19]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  20. [20]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  21. [21]

    Wilor: End-to- end 3d hand localization and reconstruction in-the-wild

    Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng, and Stefanos Zafeiriou. Wilor: End-to- end 3d hand localization and reconstruction in-the-wild. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12242–12254, 2025

  22. [22]

    End-to-end human pose and mesh reconstruction with transformers

    Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 1954–1963, 2021

  23. [23]

    Mesh graphormer

    Kevin Lin, Lijuan Wang, and Zicheng Liu. Mesh graphormer. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 12939– 12948, 2021

  24. [24]

    Cross-attention of disentangled modalities for 3d hu- man mesh recovery with transformers

    Junhyeong Cho, Kim Youwang, and Tae-Hyun Oh. Cross-attention of disentangled modalities for 3d hu- man mesh recovery with transformers. InEuro- pean Conference on Computer Vision, pages 342–359. Springer, 2022

  25. [25]

    A simple baseline for efficient hand mesh reconstruction

    Zhishan Zhou, Shihao Zhou, Zhi Lv, Minqiang Zou, Yao Tang, and Jiajun Liang. A simple baseline for efficient hand mesh reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1367–1376, 2024

  26. [26]

    V2v-posenet: V oxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map

    Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. V2v-posenet: V oxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. InProceedings of the IEEE con- ference on computer vision and pattern Recognition, pages 5079–5088, 2018

  27. [27]

    Dexycb: A benchmark for capturing hand grasping of objects

    Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, et al. Dexycb: A benchmark for capturing hand grasping of objects. InProceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition, pages 9044–9053, 2021

  28. [28]

    Honnotate: A method for 3d annota- tion of hand and object poses

    Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Honnotate: A method for 3d annota- tion of hand and object poses. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3196–3206, 2020

  29. [29]

    Grab: A dataset of whole-body human grasping of objects

    Omid Taheri, Nima Ghorbani, Michael J Black, and Dimitrios Tzionas. Grab: A dataset of whole-body human grasping of objects. InEuropean conference on computer vision, pages 581–600. Springer, 2020

  30. [30]

    Learning joint reconstruction of hands and manipulated objects

    Yana Hasson, Gul Varol, Dimitrios Tzionas, Igor Kale- vatykh, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11807–11816, 2019

  31. [31]

    Contactpose: A dataset of grasps with object contact and hand pose

    Samarth Brahmbhatt, Chengcheng Tang, Christopher D Twigg, Charles C Kemp, and James Hays. Contactpose: A dataset of grasps with object contact and hand pose. InEuropean Conference on Computer Vision, pages 361–378. Springer, 2020

  32. [32]

    Keypoint fusion for rgb-d based 3d hand pose esti- mation

    Xingyu Liu, Pengfei Ren, Yuanyuan Gao, Jingyu Wang, Haifeng Sun, Qi Qi, Zirui Zhuang, and Jianxin Liao. Keypoint fusion for rgb-d based 3d hand pose esti- mation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 3756–3764, 2024

  33. [33]

    Pointnet: Deep learning on point sets for 3d classification and segmentation

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017

  34. [34]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

    Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

  35. [35]

    Depth anything: 12 GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: 12 GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10371–10381, 2024

  36. [36]

    MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocu- lar geometry with metric scale and sharp details.arXiv preprint arXiv:2507.02546, 2025

  37. [37]

    Learning to estimate 3d hand pose from single rgb images

    Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. In Proceedings of the IEEE international conference on computer vision, pages 4903–4911, 2017

  38. [38]

    Hand pose estimation via latent 2.5 d heatmap regression

    Umar Iqbal, Pavlo Molchanov, Thomas Breuel Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5 d heatmap regression. InProceedings of the Eu- ropean conference on computer vision (ECCV), pages 118–134, 2018

  39. [39]

    Weakly-supervised 3d hand pose estimation from monocular rgb images

    Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. InProceedings of the Eu- ropean conference on computer vision (ECCV), pages 666–682, 2018

  40. [40]

    Hand pointnet: 3d hand pose estimation using point sets

    Liuhao Ge, Yujun Cai, Junwu Weng, and Junsong Yuan. Hand pointnet: 3d hand pose estimation using point sets. InProceedings of the IEEE conference on com- puter vision and pattern recognition, pages 8417–8426, 2018

  41. [41]

    End-to-end recovery of human shape and pose

    Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE confer- ence on computer vision and pattern recognition, pages 7122–7131, 2018

  42. [42]

    Frankmocap: A monocular 3d whole-body pose es- timation system via regression and integration

    Yu Rong, Takaaki Shiratori, and Hanbyul Joo. Frankmocap: A monocular 3d whole-body pose es- timation system via regression and integration. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1759, 2021

  43. [43]

    Collaborative learning for hand and object reconstruction with attention-guided graph convolution

    Tze Ho Elden Tse, Kwang In Kim, Ales Leonardis, and Hyung Jin Chang. Collaborative learning for hand and object reconstruction with attention-guided graph convolution. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 1664–1674, 2022

  44. [44]

    Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose

    Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. InEuropean Conference on Computer Vision, pages 769–787. Springer, 2020

  45. [45]

    Disentangling latent hands for image synthesis and pose estimation

    Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. InPro- ceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 9877–9886, 2019

  46. [46]

    Densepose: Dense human pose estimation in the wild

    Rıza Alp G¨uler, Natalia Neverova, and Iasonas Kokki- nos. Densepose: Dense human pose estimation in the wild. InProceedings of the IEEE conference on com- puter vision and pattern recognition, pages 7297–7306, 2018

  47. [47]

    I2l-meshnet: Image-to-lixel prediction network for accurate 3d hu- man pose and mesh estimation from a single rgb image

    Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d hu- man pose and mesh estimation from a single rgb image. InEuropean Conference on Computer Vision, pages 752–768. Springer, 2020

  48. [48]

    Probabilistic modeling for human mesh recovery

    Nikos Kolotouros, Georgios Pavlakos, Dinesh Jayara- man, and Kostas Daniilidis. Probabilistic modeling for human mesh recovery. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 11605–11614, 2021

  49. [49]

    Handoccnet: Occlusion-robust 3d hand mesh estimation network

    JoonKyu Park, Yeonguk Oh, Gyeongsik Moon, Hong- suk Choi, and Kyoung Mu Lee. Handoccnet: Occlusion-robust 3d hand mesh estimation network. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 1496–1505, 2022

  50. [50]

    H2onet: Hand-occlusion-and-orientation-aware network for real-time 3d hand mesh reconstruction

    Hao Xu, Tianyu Wang, Xiao Tang, and Chi-Wing Fu. H2onet: Hand-occlusion-and-orientation-aware network for real-time 3d hand mesh reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17048–17058, 2023

  51. [51]

    First-person hand action benchmark with rgb-d videos and 3d hand pose an- notations

    Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, and Tae-Kyun Kim. First-person hand action benchmark with rgb-d videos and 3d hand pose an- notations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 409– 419, 2018

  52. [52]

    Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image

    Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In European Conference on Computer Vision, pages 548–564. Springer, 2020

  53. [53]

    H2o: Two hands manipulating objects for first person interaction recognition

    Taein Kwon, Bugra Tekin, Jan Stühmer, Federica Bogo, and Marc Pollefeys. H2o: Two hands manipulating objects for first person interaction recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10138–10148, 2021

  54. [54]

    Oakink: A large-scale knowledge repository for understanding hand-object interaction

    Lixin Yang, Kailin Li, Xinyu Zhan, Fei Wu, Anran Xu, Liu Liu, and Cewu Lu. Oakink: A large-scale knowledge repository for understanding hand-object interaction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20953–20962, 2022

  55. [55]

    Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation

    Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. arXiv preprint arXiv:2210.02697, 2022

  56. [56]

    Contactopt: Optimizing contact to improve grasps

    Patrick Grady, Chengcheng Tang, Christopher D Twigg, Minh Vo, Samarth Brahmbhatt, and Charles C Kemp. Contactopt: Optimizing contact to improve grasps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1471–1481, 2021

  57. [57]

    Grasping field: Learning implicit representations for human grasps

    Korrawe Karunratanakul, Adrian Spurr, Zicong Muigai, Otmar Hilliges, and Siyu Tang. Grasping field: Learning implicit representations for human grasps. In International Conference on 3D Vision (3DV), 2020

  58. [58]

    Interacting attention graph for single image two-hand reconstruction

    Mengcheng Li, Liang An, Hongwen Zhang, Lianpeng Wu, Feng Chen, Tao Yu, and Yebin Liu. Interacting attention graph for single image two-hand reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2761–2770, 2022

  59. [59]

    Complementing event streams and rgb frames for hand mesh reconstruction

    Jianping Jiang, Xinyu Zhou, Bingxuan Wang, Xiaoming Deng, Chao Xu, and Boxin Shi. Complementing event streams and rgb frames for hand mesh reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24944–24954, 2024

  60. [60]

    Cross-modal deep variational hand pose estimation

    Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges. Cross-modal deep variational hand pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 89–98, 2018

  61. [61]

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  62. [62]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9492–9502, 2024

  63. [63]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5261–5271, 2025

  64. [64]

    Contrastive representation learning for hand shape estimation

    Christian Zimmermann, Max Argus, and Thomas Brox. Contrastive representation learning for hand shape estimation. In Proceedings of the DAGM German Conference on Pattern Recognition, pages 250–264, 2021

  65. [65]

    Bihand: Recovering hand mesh with multi-stage bisected hourglass networks

    Lixin Yang, Jiasen Li, Wenqiang Xu, Yiqun Diao, and Cewu Lu. Bihand: Recovering hand mesh with multi-stage bisected hourglass networks. arXiv preprint arXiv:2008.05079, 2020

  66. [66]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  67. [67]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  68. [68]

    Hamba: Single-view 3d hand reconstruction with graph-guided bi-scanning mamba

    Haoye Dong, Aviral Chharia, Wenbo Gou, Francisco Vicente Carrasco, and Fernando De la Torre. Hamba: Single-view 3d hand reconstruction with graph-guided bi-scanning mamba. arXiv preprint arXiv:2407.09646, 2024

  69. [69]

    Handos: 3d hand reconstruction in one stage

    Xingyu Chen, Zhuheng Song, Xiaoke Jiang, Yaoqing Hu, Junzhi Yu, and Lei Zhang. Handos: 3d hand reconstruction in one stage. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17304–17314, 2025

  70. [70]

    Maskhand: Generative masked modeling for robust hand mesh reconstruction in the wild

    Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Mayur Jagdishbhai Patel, Hongfei Xue, Ahmed Helmy, Srijan Das, and Pu Wang. Maskhand: Generative masked modeling for robust hand mesh reconstruction in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8372–8383, 2025

  71. [71]

    Spmhand: Segmentation-guided progressive multi-path 3d hand pose and shape estimation

    Haofan Lu, Shuiping Gou, and Ruimin Li. Spmhand: Segmentation-guided progressive multi-path 3d hand pose and shape estimation. IEEE Transactions on Multimedia, 26:6822–6833, 2024

    Appendix A.1 Detailed Training Objectives

    GeoHand is optimized using a physically motivated ...

  72. [72]

    Exploration of Cross-Modal Extraction Limits. Although the simple token concatenation and map-level GeoAdapter mechanism have proven highly effective against domain gaps without complex volume-rendering overheads, explicitly exploring more sophisticated multimodal Transformer frameworks with cross-attention between appearance tokens and coordinate-boun...

  73. [73]

    Bounding-Box Reliance vs. End-to-End Synergy

    Bounding-Box Reliance vs. End-to-End Synergy. GeoHand functions under a top-down paradigm dependent on an independent 2D hand detector’s bounding-box cropping stage. Consequently, final 3D reconstruction quality remains partially bounded by this preprocessing stage. Tightly unifying high-quality object detection within the same ViT-based pipeline would ...

  74. [74]

    Temporal & Multi-View Contextual Recovery for Occlusions. While geometry priors substantially aid occluded interactions, recovering entirely invisible topological boundaries remains inherently ill-posed from a single static viewpoint. Utilizing sequential historical motion context (temporal kinematic pipelines) or complemen...