AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Bo Ai; Chen Si; Chuanxia Zheng; Hao Su; Jianwen Xie; Rolandos Alexandros Potamias; Yulin Liu


T0 review · 2 major / 5 minor · reviewed 2026-07-13 · grok-4.5


Pith's one-line read A large synthetic RGB-D hand dataset improves 3D pose models without changing their architecture.

desk verdict Solid data paper: multi-million RGB-D hands with arm context and GraspXL interactions lift fixed HaMeR/WiLoR recipes; pose-prior overlap is real but does not erase the gains.

arxiv 2603.25726 v3 pith:ZIDSSHZG submitted 2026-03-26 cs.CV

Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

classification cs.CV

keywords 3D hand pose estimation, synthetic data, RGB-D, hand-object interaction, sim-to-real, MANO, depth fusion

verification ladder T0 review T1 audit T2 compute T3 formal

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The reading

This paper argues that 3D hand pose estimation from images is limited less by model design than by how much diverse, clean training data we can give those models. Real hand datasets cover only narrow poses, views, and subjects, and earlier synthetic sets rarely combine arm context, object occlusions, and aligned depth at scale. AnyHand supplies millions of rendered single-hand and hand-object RGB-D images with perfect 3D labels, varied skin and sleeve appearance, and heavy occlusions. Adding this data to the original training mixes of strong RGB baselines measurably lowers error on standard benchmarks and helps on out-of-domain images, without changing the networks. The same data also trains a simple depth-fusion add-on that beats prior RGB-D methods, showing that better data and depth together matter as much as new architectures.

What carries the argument

AnyHand: a simulation pipeline that samples realistic MANO shapes and diffusion-prior poses, high-fidelity hand and forearm textures, HDR/indoor backgrounds, and GraspXL hand-object contacts, then renders paired RGB and aligned depth with arm context and guaranteed 3D labels for co-training.

What would settle it

Retrain the same fixed baselines on equal-scale real data or on ablated AnyHand variants that remove arm texture, interaction occlusions, or depth, and check whether FreiHAND, HO-3D, and out-of-domain HO-Cap errors still fall by the reported margins.


Extended reading notes

Core claim

Co-training existing RGB hand-pose models on AnyHand—2.5M single-hand and 4.1M hand-object synthetic RGB-D images with rich geometric annotations—improves accuracy on FreiHAND and HO-3D while architectures and training schemes stay fixed, and a lightweight depth-fusion module trained the same way outperforms prior RGB-D methods.

Load-bearing premise

The synthetic images are close enough to real photos that the measured gains come from true diversity and clean labels, not leftover mismatches in lighting, background depth, or pose distribution.


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, and a circularity audit.

Desk Editor's Note


The useful takeaway is simple: they ship a large synthetic RGB-D hand corpus (2.5M single + 4.1M interact) with arm context, HDR/dynamic lighting, and GraspXL-derived occlusions, then show that co-training the official HaMeR and WiLoR recipes with it improves FreiHAND and HO-3D without architecture changes. A lightweight dual-token cross-attention depth fusion on top of WiLoR also beats prior RGB-D numbers on HO-3D.

What is actually new is the combination at scale, not any single ingredient. Prior synthetic sets miss the joint package of aligned depth, realistic arms, and large interaction occlusions. They isolate the data variable cleanly, keep hyper-parameters fixed, and give scaling curves plus ablations (Single vs Interact, arm texture, diffusion vs interpolated poses, cross-attention). Out-of-domain HO-Cap gains without fine-tuning are the more interesting signal. The generation pipeline is released, which matters for reuse.

The stress-test concern about DPoser-Hand (trained on FreiHAND/HO-3D/etc.) and shapes drawn from the same real distributions is fair and should be stated more bluntly in the paper. Some of the in-domain PA-MPJPE drop (HaMeR 6.0 → 5.545 mm) can be densification of already-seen manifolds plus perfect labels rather than pure novel diversity. The interpolated-pose ablation and HO-Cap results mitigate this but do not erase it. Background depth is MoGe-2 fused with rendered foreground, so the RGB-D supervision is approximate; the fact that estimated depth can beat quantized HO-3D sensor depth is honest but also a warning about domain match. None of this is a load-bearing failure; the fixed-architecture gains and ablations still stand.
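The fused-depth caveat is easier to see in code. The following is a minimal sketch, assuming the fusion is a per-pixel composite in which rendered hand/object depth overrides an estimated background depth wherever a foreground mask is set. The function name, array shapes, and mask convention are illustrative assumptions, not the paper's pipeline, which may also align scale or intrinsics.

```python
import numpy as np

def fuse_depth(fg_depth, fg_mask, bg_depth):
    """Composite rendered foreground depth over estimated background depth.

    fg_depth: (H, W) rendered depth in metres, valid where fg_mask is True.
    fg_mask:  (H, W) boolean mask of hand/arm/object pixels (assumed layout).
    bg_depth: (H, W) monocular depth estimate for the background image.
    """
    return np.where(fg_mask, fg_depth, bg_depth)

# Toy example: a 2x3 image with three foreground pixels.
fg = np.full((2, 3), 0.6)
bg = np.full((2, 3), 2.0)
mask = np.array([[True, False, False],
                 [True, True, False]])
fused = fuse_depth(fg, mask, bg)
```

Only the masked pixels carry exact rendered depth; everything else inherits whatever error the background estimate has, which is why the supervision is approximate.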

Math is standard supervised mesh recovery; citations cover the right synthetic and foundation baselines. This is for people who train hand models or need RGB-D supervision at scale. It deserves a serious referee. I would engage with the data and the co-training recipe.

Referee Report

2 major / 5 minor

Summary. The paper introduces AnyHand, a large-scale synthetic RGB-D hand dataset (2.5M single-hand + 4.1M hand-object images) with aligned depth, arm context, and rich geometric annotations, generated via SAPIEN with MANO shapes, DPoser-Hand poses, Handy/SMPLitex textures, GraspXL interactions, and HDR/indoor backgrounds. Co-training HaMeR and WiLoR on their original real corpora plus AnyHand, with architectures and training schemes held fixed, yields consistent gains on FreiHAND and HO-3D (e.g., HaMeR PA-MPJPE 6.0 → 5.545 mm on FreiHAND) and stronger out-of-domain results on HO-Cap without fine-tuning. A lightweight RGB-D depth-fusion module (AnyHandNet-D) co-trained with AnyHand further surpasses prior RGB-D methods on HO-3D. Ablations address scale, Single vs Interact branches, arm texture, and diffusion vs interpolated poses.

Significance. If the gains hold under fixed architectures, the work provides strong evidence that data quality, diversity, and modality coverage remain first-order levers for 3D hand pose estimation, comparable to or larger than recent architectural increments. The released generation pipeline, scale (millions of RGB-D frames with arm and object occlusion), and modular depth-fusion design are concrete, reusable contributions. Strengths include controlled co-training that isolates the data effect, multi-benchmark evaluation (in-domain FreiHAND/HO-3D, out-of-domain HO-Cap), scaling curves, and component ablations. The finding that estimated depth can outperform quantized sensor depth is practically useful and well documented.

major comments (2)

Sec. 3.1 and Tab. 5: The central claim that AnyHand’s diversity (not merely densification of already-seen manifolds plus perfect labels) drives the fixed-architecture gains is only partially supported. Poses are sampled from DPoser-Hand trained on FreiHAND, HO-3D, DexYCB, H2O, and Re:InterHand; shapes are drawn from FreiHAND/InterHand2.6M empirical distributions. The “w/ interp. pose” ablation and HO-Cap gains mitigate but do not fully isolate novel coverage from densification of the same support. A clearer quantification of pose/shape novelty relative to the real co-training corpora (e.g., coverage statistics or a held-out pose prior) would strengthen the diversity claim.
Sec. 5 and Tab. 6: The RGB-D evaluation is limited to HO-3D v2, and the surprising result that MoGe-2 estimated depth outperforms sensor GT is attributed to quantization/missing values and distribution match with synthetic backgrounds. This is interesting but leaves open whether AnyHandNet-D generalizes under true multi-sensor depth noise or on other RGB-D benchmarks. At minimum, report variance across seeds or an additional real RGB-D set if available; otherwise qualify the claim that the module “surpasses prior RGB-D methods” more carefully.
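One way to make the coverage statistic requested in the first major comment concrete is a nearest-neighbour novelty rate. The sketch below assumes poses are flattened parameter vectors (for example MANO joint angles) and that Euclidean distance with a hand-picked threshold is a meaningful novelty test. `novelty_rate`, the threshold, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def novelty_rate(synthetic, real, threshold):
    """Fraction of synthetic poses farther than `threshold` from every real pose.

    synthetic: (N, D) array of synthetic pose vectors.
    real:      (M, D) array of real training pose vectors.
    """
    # Pairwise Euclidean distances, shape (N, M).
    diffs = synthetic[:, None, :] - real[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance from each synthetic pose to its closest real pose.
    nearest = dists.min(axis=1)
    return float((nearest > threshold).mean())

# Toy 2-D "poses": two synthetic points sit on the real data, two do not.
real = np.array([[0.0, 0.0], [1.0, 0.0]])
synthetic = np.array([[0.0, 0.1], [1.0, 0.05], [3.0, 3.0], [0.5, 2.0]])
rate = novelty_rate(synthetic, real, threshold=0.5)
```

Under these assumptions, a rate near zero would be consistent with the densification reading, while a high rate would be consistent with novel coverage. For millions of poses the dense pairwise matrix would need to be replaced by a batched or tree-based search.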

minor comments (5)

Abstract vs main text: Abstract says depth utility is examined “in the appendix,” while Sec. 5 presents the full RGB-D architecture and HO-3D results. Align the abstract wording with the body.
Tab. 1 and Sec. 3.2: Image counts are stated as 2.5M/4.1M in the abstract and 2.1M/4.2M in Sec. 3.2; reconcile the Single/Interact image totals.
Fig. 5 and Appendix B.3: Scaling and mix-ratio experiments are useful; briefly state whether real-data sampling is with replacement or re-weighted when the synthetic budget changes, so the co-training recipe is fully reproducible.
Sec. 3.1 (Aligned Depth Maps): The admission that fused background depth is approximate (MoGe-2 + intrinsic mismatch) is honest; a short note on whether models are trained with a foreground mask loss or full-image depth would help readers reuse the data.
Typos/clarity: “de factoarchitecture,” “foundationaltraining,” and occasional missing spaces after commas appear in the introduction; a light copy-edit pass would improve polish.
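The sampling question raised in the Fig. 5 / Appendix B.3 comment above can be illustrated with a fixed-budget mixer. This is a sketch under assumptions: the budget is split by a single synthetic fraction, and real samples are drawn with replacement only when the real pool is smaller than its share. Neither choice is confirmed by the paper, and `mix_counts` and `draw_indices` are hypothetical helpers.

```python
import random

def mix_counts(budget, synthetic_fraction):
    """Split a fixed sample budget into (n_real, n_synthetic)."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    n_synthetic = round(budget * synthetic_fraction)
    return budget - n_synthetic, n_synthetic

def draw_indices(pool_size, n, rng):
    """Draw n indices from a pool of pool_size items.

    Sampling is without replacement when the pool is large enough,
    and with replacement otherwise (an assumed policy).
    """
    if n <= pool_size:
        return rng.sample(range(pool_size), n)
    return [rng.randrange(pool_size) for _ in range(n)]

rng = random.Random(0)
n_real, n_syn = mix_counts(budget=1000, synthetic_fraction=0.25)
real_idx = draw_indices(pool_size=500, n=n_real, rng=rng)   # pool too small: repeats
syn_idx = draw_indices(pool_size=5000, n=n_syn, rng=rng)    # pool large: unique
```

Stating which of these two branches the authors actually use, and whether it changes with the synthetic budget, is the detail the comment asks for.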

Circularity Check

0 steps flagged · score 1.0 of 10

No derivation-by-construction circularity; only mild distributional dependence via pose/shape priors trained on overlapping real corpora.

full rationale

This is an empirical dataset + co-training paper, not a first-principles derivation. The central claims (PA-MPJPE/PA-MPVPE reductions on FreiHAND and HO-3D when HaMeR/WiLoR are co-trained with AnyHand under fixed architectures and protocols; RGB-D fusion gains on HO-3D) are measured by independent evaluation on external real benchmarks whose test splits are not used to construct the synthetic labels. Perfect synthetic GT, arm textures, backgrounds, and GraspXL interactions supply genuine additional supervision. The only mild concern is that hand poses are sampled on-the-fly from DPoser-Hand (trained on FreiHAND/HO-3D/DexYCB/etc.) and shapes from the empirical distributions of FreiHAND/InterHand2.6M; this densifies already-seen manifolds rather than inventing wholly novel pose support. That is a validity caveat for the 'diversity' narrative, not a circular reduction of the reported metrics (which remain Procrustes-aligned joint/vertex errors on real images). Ablations (interp. pose, Single vs Interact, scale curves, HO-Cap OOD) and the fact that estimated MoGe depth can outperform quantized sensor depth further show the results are not forced by construction. No self-definitional equations, fitted-parameter-as-prediction, load-bearing uniqueness theorems, or ansatz smuggling appear. The score of 1 reflects only the acknowledged prior overlap; no truly circular steps were found.

Assumptions & free parameters 4 free parameters · 4 assumptions · 2 invented entities

The central empirical claim rests on standard MANO/SMPL parametric models, public real datasets for shape/pose priors and evaluation, off-the-shelf texture and monocular depth estimators, and the modeling choice that synthetic RGB-D co-training transfers. Free parameters are mostly generation design knobs (camera ranges, light counts, texture counts) rather than fitted constants that define the reported metrics. No new physical entities are postulated.

free parameters (4)

camera FOV range = 30–40 deg
Uniformly sampled 30°–40°; design choice that affects viewpoint distribution and therefore co-training diversity.
hand–camera distance mixture = means 0.6/0.7/1.0 m, std 0.1 m
Three Gaussians (means 0.6/0.7/1.0 m, std 0.1 m) chosen to mimic real capture statistics; controls scale distribution in images.
number of scene lights = 1–5
Random 1–5 lights with randomized type/color/shadow; ad-hoc illumination diversity knob.
synthetic co-training budget / mix ratio = up to 6.6M; mixes 0–100%
Up to 6.6M AnyHand samples added to original real corpora; mixing-ratio ablation under fixed 2.7M budget shows sensitivity but no unique optimum.
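The first three knobs above can be read as a small sampling routine. This is a sketch, assuming the three distance Gaussians are equally weighted and that non-positive distances are resampled. The review states the means, standard deviation, and ranges, but not the mixture weights or any truncation, so those parts and the name `sample_scene_params` are assumptions.

```python
import random

DISTANCE_MEANS_M = (0.6, 0.7, 1.0)  # component means from the review
DISTANCE_STD_M = 0.1                # shared standard deviation from the review

def sample_scene_params(rng):
    """Sample one set of camera and lighting parameters."""
    fov_deg = rng.uniform(30.0, 40.0)
    # Assumed: each Gaussian component is chosen with equal probability.
    mean = rng.choice(DISTANCE_MEANS_M)
    distance_m = rng.gauss(mean, DISTANCE_STD_M)
    # Assumed: reject physically meaningless non-positive distances.
    while distance_m <= 0.0:
        distance_m = rng.gauss(mean, DISTANCE_STD_M)
    num_lights = rng.randint(1, 5)  # inclusive on both ends
    return {"fov_deg": fov_deg, "distance_m": distance_m, "num_lights": num_lights}

rng = random.Random(0)
params = sample_scene_params(rng)
```

Light type, colour, and shadow settings are also randomized according to the review, but their distributions are not given here, so the sketch leaves them out.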

assumptions (4)

domain assumption MANO parametric hand model plus attached SMPL-family forearm adequately represents real hand geometry and articulation for supervised training.
Used throughout generation and as the regression target of HaMeR/WiLoR (Sec. 3, Sec. 4).
domain assumption Synthetic images rendered in SAPIEN with Handy/SMPLitex textures, HDR/indoor backgrounds, and MoGe-2 background depths are close enough to real photographs that co-training improves real-benchmark metrics.
Core sim-to-real transfer premise of the entire evaluation (Sec. 3.1, Sec. 4–5).
domain assumption DPoser-Hand diffusion prior trained on real hand datasets yields more useful pose diversity than interpolation of real poses.
Ablation in Tab. 5 shows performance drop when replaced by interpolated poses.
standard math Procrustes-aligned and scale-translation-aligned MPJPE/MPVPE and F-scores are valid proxies for hand pose quality.
Standard metrics inherited from FreiHAND/HO-3D protocols (Sec. 4.1).
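For readers unfamiliar with the Procrustes-aligned metric in the last assumption, here is a minimal sketch of PA-MPJPE using the standard similarity alignment (rotation, uniform scale, translation) solved by SVD. It illustrates the general technique and is not the benchmarks' official evaluation code; the function name and the toy data are assumptions.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error.

    pred, gt: (J, 3) arrays of joint positions in the same units.
    pred is aligned to gt with the best similarity transform before
    the mean Euclidean joint error is taken.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0
    correction = np.diag([1.0, 1.0, d])
    rotation = u @ correction @ vt
    scale = (s[0] + s[1] + d * s[2]) / (p ** 2).sum()
    aligned = scale * p @ rotation + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())

# Toy check: a rotated, scaled, shifted copy of gt should align almost exactly.
rng = np.random.default_rng(0)
gt = rng.normal(size=(21, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
pred = 2.0 * gt @ rot_z + np.array([0.1, -0.2, 0.3])
error = pa_mpjpe(pred, gt)
```

Because the alignment removes global rotation, scale, and translation, the metric measures articulation and shape error only; it says nothing about absolute placement, which is one reason it is described as a proxy.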

invented entities (2)

AnyHand dataset (Single + Interact branches) independent evidence
purpose: Provide large-scale RGB-D supervision with arm context, occlusions, and perfect geometric labels for co-training.
New corpus constructed by the authors; independent evidence is the public generation pipeline and reported transfer gains on external benchmarks.
AnyHandNet-D depth fusion module
purpose: Lightweight bidirectional cross-attention between RGB and depth tokens so existing ViT hand models can consume RGB-D.
Architectural add-on introduced in Sec. 5 / Fig. 6; evidence is HO-3D numbers vs prior RGB-D methods, but the module itself is not independently validated outside this paper.
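The phrase "bidirectional cross-attention between RGB and depth tokens" can be illustrated with a single-head, projection-free sketch. This is not AnyHandNet-D: the learned query/key/value projections, multiple heads, normalization layers, and the dual-token layout mentioned in the desk note are all omitted, and every name below is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head scaled dot-product attention of queries over context.

    queries: (Nq, D) tokens that ask; context: (Nc, D) tokens that answer.
    Keys and values are the raw context tokens (no learned projections).
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T * scale, axis=-1)
    return weights @ context

def bidirectional_fusion(rgb_tokens, depth_tokens):
    """Residually update each modality with information from the other."""
    rgb_out = rgb_tokens + cross_attend(rgb_tokens, depth_tokens)
    depth_out = depth_tokens + cross_attend(depth_tokens, rgb_tokens)
    return rgb_out, depth_out

rng = np.random.default_rng(0)
rgb = rng.normal(size=(16, 32))    # 16 RGB tokens of width 32
depth = rng.normal(size=(8, 32))   # 8 depth tokens of width 32
rgb_fused, depth_fused = bidirectional_fusion(rgb, depth)
```

The residual form keeps token counts and widths unchanged, which is what lets an add-on of this kind sit on top of an existing ViT backbone without altering the rest of the network.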


Cite this review

Pith. "Pith review of AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation." pith.science (2026). https://pith.science/paper/ZIDSSHZG

@misc{pith2026260325726,
  author       = {Pith},
  title        = {Pith review of: AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation},
  year         = {2026},
  howpublished = {\url{https://pith.science/paper/ZIDSSHZG}},
  note         = {Machine review of arXiv:2603.25726}
}

Original abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation. While recent works with foundation approaches have shown that scaling training data markedly improves hand pose estimation, existing real-world datasets are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our proposed AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. We show that extending the original training data recipes of existing RGB baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architectures and training schemes fixed. Together with extensive ablations on the scale and composition of the training data setups, these results suggest that training data diversity and quality are as critical as scale for advancing hand pose estimation. We further examine the utility of AnyHand's aligned depth maps in the appendix, showing that scaling RGB-D supervision with AnyHand allows a lightweight depth-fusion variant of existing RGB baselines to outperform prior RGB-D methods.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ego2Robot: Scalable Robot Data Synthesis from Egocentric Human Data
cs.RO 2026-08 conditional novelty 6.0 of 10

Pretraining a VLA model on 18,561 hours of robot-synthesized egocentric human video mixed with robot data improves out-of-distribution manipulation success in simulation and on a real dual-arm robot.

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
cs.CV 2026-06 unverdicted novelty 4.0 of 10

Multi-stage training that first mixes real and inpainted synthetic hand images then fine-tunes on real data improves mAP on glove-wearing test images over real-only baselines.

