Dual Grid Net: hand mesh vertex regression from single depth maps
Pith reviewed 2026-05-24 16:46 UTC · model grok-4.3
The pith
A two-stage network recovers 3D hand mesh vertices from a single depth map by regressing coordinates on a mesh grid after estimating correspondences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that regressing hand mesh vertex coordinates from a single depth map is possible with a two-stage 2D CNN: the first stage estimates a dense correspondence field for every pixel on the depth map to the mesh grid; the second stage uses a differentiable operator to map features from the previous stage and regress a 3D coordinate map on the mesh grid; vertices are then sampled and fitted to an articulated template mesh in closed form, allowing single-pass prediction of vertices, transformations, and joints.
What carries the argument
The dual-grid network with an image-to-mesh correspondence stage followed by a differentiable feature-to-coordinate mapping operator on the mesh grid.
If this is right
- The method predicts all mesh vertices, joint transformation matrices, and joint coordinates in a single forward pass.
- Self-supervision is possible by minimizing data fitting and kinematic prior terms without human annotation.
- With multi-camera rig training to resolve self-occlusion, performance is competitive with strongly supervised methods.
- It recovers mesh vertices and a dense correspondence map alongside keypoint localization.
- State-of-the-art accuracy on NYU keypoint localization when supervised on sparse keypoints.
Where Pith is reading between the lines
- The correspondence-based mapping could be applied to other body parts or objects if a suitable mesh template is available.
- Self-supervision might reduce the need for 3D annotations in related 3D reconstruction tasks.
- The closed-form template fitting may constrain the method to hand shapes similar to the template used.
- Extending the approach to RGB images could broaden its applicability beyond depth sensors.
Load-bearing premise
The differentiable operator maps 2D image-grid features to the mesh grid without introducing large systematic errors in 3D vertex placement, and the closed-form fit to the articulated template remains accurate across real hand variations.
What would settle it
Depth images with accurate ground-truth 3D mesh vertex positions where the network's predicted vertices deviate significantly from ground truth after the template fit, especially on hand shapes or poses outside the training distribution.
Figures
read the original abstract
We present a method for recovering the dense 3D surface of the hand by regressing the vertex coordinates of a mesh model from a single depth map. To this end, we use a two-stage 2D fully convolutional network architecture. In the first stage, the network estimates a dense correspondence field for every pixel on the depth map or image grid to the mesh grid. In the second stage, we design a differentiable operator to map features learned from the previous stage and regress a 3D coordinate map on the mesh grid. Finally, we sample from the mesh grid to recover the mesh vertices, and fit it an articulated template mesh in closed form. During inference, the network can predict all the mesh vertices, transformation matrices for every joint and the joint coordinates in a single forward pass. When given supervision on the sparse key-point coordinates, our method achieves state-of-the-art accuracy on NYU dataset for key point localization while recovering mesh vertices and a dense correspondence map. Our framework can also be learned through self-supervision by minimizing a set of data fitting and kinematic prior terms. With multi-camera rig during training to resolve self-occlusion, it can perform competitively with strongly supervised methods Without any human annotation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dual Grid Net, a two-stage 2D fully convolutional architecture for regressing dense 3D hand mesh vertices from a single depth map. Stage 1 predicts a dense correspondence field mapping every image-grid pixel to the mesh grid. Stage 2 applies a differentiable operator to learned features to regress a 3D coordinate map on the mesh grid; vertices are then sampled and an articulated template is fit in closed form. The network outputs mesh vertices, per-joint transformation matrices, and joint coordinates in one forward pass. With keypoint supervision it claims SOTA accuracy on NYU keypoint localization while recovering the mesh and correspondence; it also supports self-supervision via data-fitting and kinematic priors and, with multi-camera training, competitive performance without annotations.
Significance. If the quantitative claims and implementation details hold, the work would offer a practical single-pass pipeline for dense hand surface recovery that supports both fully supervised SOTA keypoint performance and annotation-free self-supervised training. The closed-form template fit and explicit prediction of transformation matrices are potentially useful for downstream tracking and animation tasks.
major comments (3)
- [Abstract] Abstract: the SOTA keypoint accuracy claim on NYU and the competitive self-supervised performance are asserted without any tables, error metrics, baselines, error bars, or ablation studies in the manuscript. This directly undermines verification of the central empirical claims.
- [Abstract] Abstract (second-stage operator): no description, equation, or pseudocode is supplied for the differentiable operator that maps image-grid features onto the mesh-grid 3D coordinate map. Because this operator is the load-bearing step that converts 2D features into 3D vertex coordinates before the closed-form fit, its absence prevents assessment of whether systematic placement errors are introduced.
- [Abstract] Abstract (template fit): the closed-form fit of the sampled vertices to an articulated template is stated without specification of the template's degrees of freedom, the fitting objective, or any validation that the fit remains accurate across the range of hand shapes and articulations in the NYU test set. This is required for the mesh-recovery claim to hold.
minor comments (2)
- [Abstract] Abstract contains a capitalization error: 'Without any human annotation' should be lowercase.
- [Abstract] Abstract: 'fit it an articulated template mesh' appears to be missing the preposition 'to'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the clarity and verifiability of the work without altering its core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the SOTA keypoint accuracy claim on NYU and the competitive self-supervised performance are asserted without any tables, error metrics, baselines, error bars, or ablation studies in the manuscript. This directly undermines verification of the central empirical claims.
Authors: We agree that the abstract presents high-level claims and that the manuscript must make the supporting evidence immediately verifiable. The experiments section contains quantitative tables on NYU keypoint localization with baselines and metrics, plus self-supervision results; however, error bars and explicit cross-references from the abstract were not included. We will revise by adding error bars to all reported results, inserting a short results summary paragraph with pointers to the tables, and ensuring ablation studies are clearly labeled. This addresses the verification concern directly. revision: yes
-
Referee: [Abstract] Abstract (second-stage operator): no description, equation, or pseudocode is supplied for the differentiable operator that maps image-grid features onto the mesh-grid 3D coordinate map. Because this operator is the load-bearing step that converts 2D features into 3D vertex coordinates before the closed-form fit, its absence prevents assessment of whether systematic placement errors are introduced.
Authors: The comment is correct: the abstract mentions the operator but supplies no equation or pseudocode, and the method section description is insufficient for full assessment. We will add a dedicated subsection with the mathematical formulation of the differentiable mapping, a pseudocode listing of the feature transfer and coordinate regression steps, and a brief analysis of potential placement error sources. This will allow readers to evaluate the operator's properties. revision: yes
-
Referee: [Abstract] Abstract (template fit): the closed-form fit of the sampled vertices to an articulated template is stated without specification of the template's degrees of freedom, the fitting objective, or any validation that the fit remains accurate across the range of hand shapes and articulations in the NYU test set. This is required for the mesh-recovery claim to hold.
Authors: We accept that the current description is incomplete. The manuscript states the closed-form fit but does not detail the template's degrees of freedom, the exact objective minimized, or quantitative validation on NYU shape/articulation variation. We will expand the relevant section to specify the template parameterization, the fitting objective, and add a validation table or plot demonstrating fit accuracy on the NYU test set. This directly supports the mesh-recovery claim. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's derivation consists of a two-stage FCN that first predicts a dense correspondence field from image grid to mesh grid, then applies a differentiable operator to regress 3D coordinates on the mesh grid before closed-form template fitting. Self-supervision minimizes external data-fitting and kinematic prior terms. None of these steps reduce a claimed prediction to an input quantity by definition or construction, nor do they rely on load-bearing self-citations or imported uniqueness theorems. The method is presented as a self-contained architecture whose accuracy claims rest on empirical evaluation rather than tautological re-use of fitted quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- loss weights for self-supervision terms
axioms (1)
- domain assumption An articulated template mesh can be fitted in closed form to the regressed vertices and will accurately represent observed hand shapes.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
two-stage 2D fully convolutional network... differentiable operator to map features... regress a 3D coordinate map on the mesh grid... fit an articulated template mesh in closed form
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
self-supervision by minimizing a set of data fitting and kinematic prior terms
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
https://en.wikipedia.org/wiki/UV_ mapping
-
[2]
R. Alp Guler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. Densereg: Fully convolutional dense shape regression in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2017
work page 2017
- [3]
-
[4]
F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision , pages 561–578. Springer, 2016
work page 2016
-
[5]
I. Borg and P. Groenen. Modern Multidimensional Scaling. Springer Series in Statistics. Springer New York, 1997
work page 1997
-
[6]
A. Boukhayma, R. de Bem, and P. H. Torr. 3d hand shape and pose from images in the wild. In CVPR, 2019
work page 2019
-
[7]
Y . Cai, L. Ge, J. Cai, and J. Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. ECCV , Springer, 12, 2018
work page 2018
-
[8]
X. Chen, G. Wang, H. Guo, and C. Zhang. Pose guided structured region ensemble network for cascaded hand pose estimation. arXiv preprint arXiv:1708.03416, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[9]
X. Chen, G. Wang, C. Zhang, T.-K. Kim, and X. Ji. Shpr- net: Deep semantic hand pose regression from point clouds. IEEE Access, 2018
work page 2018
-
[10]
M. Defferrard, X. Bresson, and P. Vandergheynst. Convolu- tional neural networks on graphs with fast localized spectral filtering. InAdvances in Neural Information Processing Sys- tems, 2016
work page 2016
- [11]
-
[12]
L. Ge, Y . Cai, J. Weng, and J. Yuan. Hand pointnet: 3d hand pose estimation using point sets. In CVPR, 2018
work page 2018
-
[13]
L. Ge, H. Liang, J. Yuan, and D. Thalmann. 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition , vol- ume 1, page 5, 2017
work page 2017
-
[14]
L. Ge, Z. Ren, Y . Li, Z. Xue, Y . Wang, J. Cai, and J. Yuan. 3d hand shape and pose estimation from a single rgb image. In CVPR, 2019
work page 2019
-
[15]
L. Ge, Z. Ren, and J. Yuan. Point-to-point regression point- net for 3d hand pose estimation. ECCV, 2018
work page 2018
-
[16]
H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region ensemble network: Improving convolutional net- work for hand pose estimation. In Image Processing (ICIP), 2017
work page 2017
-
[17]
H. Joo, T. Simon, and Y . Sheikh. Total capture: A 3d defor- mation model for tracking faces, hands, and bodies. In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8320–8329, 2018
work page 2018
-
[18]
D. Joseph Tan, T. Cashman, J. Taylor, A. Fitzgibbon, D. Tar- low, S. Khamis, S. Izadi, and J. Shotton. Fits like a glove: Rapid and reliable hand shape personalization. In CvPR, 2016
work page 2016
-
[19]
A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End- to-end recovery of human shape and pose. In Computer Vi- sion and Pattern Regognition (CVPR), 2018
work page 2018
-
[20]
I. Kostrikov, Z. Jiang, D. Panozzo, D. Zorin, and B. Joan. Surface networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, 2018
work page 2018
-
[21]
S. Lombardi, J. Saragih, T. Simon, and Y . Sheikh. Deep appearance models for face rendering. ACM Transactions on Graphics (TOG), 2018
work page 2018
- [22]
-
[23]
G. Moon, J. Y . Chang, and K. M. Lee. V2v-posenet: V oxel- to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In CVPR, 2018
work page 2018
- [24]
-
[25]
M. Oberweger and V . Lepetit. Deepprior++: Improving fast and accurate 3d hand pose estimation. In ICCV workshop, 2017
work page 2017
-
[26]
M. Oberweger, G. Riegler, P. Wohlhart, and V . Lepetit. Ef- ficiently creating 3d training data for fine hand pose estima- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4957–4965, 2016
work page 2016
-
[27]
M. Oberweger, P. Wohlhart, and V . Lepetit. Training a feed- back loop for hand pose estimation. In ICCV, 2015
work page 2015
- [28]
-
[29]
G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In CVPR, 2018
work page 2018
- [30]
- [31]
-
[32]
C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[33]
C. Qian, X. Sun, Y . Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In CVPR, 2014
work page 2014
-
[34]
M. Rad, M. Oberweger, and V . Lepetit. Feature mapping for learning fast and accurate 3d pose inference from synthetic images. In CVPR, 2018
work page 2018
- [35]
- [36]
-
[37]
W. Shi, J. Caballero, F. Husz ´ar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single im- age and video super-resolution using an efficient sub-pixel convolutional neural network. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
work page 2016
-
[38]
A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017
work page 2017
- [39]
-
[40]
O. Sorkine. Least-squares rigid motion using svd. Technical notes, 2009
work page 2009
-
[41]
O. Sorkine and M. Alexa. As-rigid-as-possible surface mod- eling. In Proceedings of the Fifth Eurographics Symposium on Geometry Processing, 2007
work page 2007
-
[42]
S. Sridhar, F. Mueller, M. Zollhoefer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint track- ing of a hand manipulating an object from rgb-d input. In Proceedings of European Conference on Computer Vision (ECCV), 2016
work page 2016
-
[43]
H. Su, V . Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 2530–2539, 2018
work page 2018
-
[44]
J. S. Supancic, G. Rogez, Y . Yang, J. Shotton, and D. Ra- manan. Depth-based hand pose estimation: data, methods, and challenges. In ICCV, 2015
work page 2015
-
[45]
A. Tagliasacchi, M. Schroeder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly. Robust articulated-icp for real- time hand tracking. Computer Graphics Forum (Symposium on Geometry Processing), 34(5), 2015
work page 2015
-
[46]
J. Tan, I. Budvytis, and R. Cipolla. Indirect deep structured learning for 3d human body shape and pose prediction. Pro- ceedings of the BMVC, London, UK, pages 4–7, 2017
work page 2017
-
[47]
D. Tang, J. Taylor, P. Kohli, C. Keskin, T.-K. Kim, and J. Shotton. Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In ICCV, 2015
work page 2015
- [48]
- [49]
- [50]
-
[51]
J. Tompson, M. Stein, Y . Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolu- tional networks. ACM Transactions on Graphics (ToG)
-
[52]
H.-Y . Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self- supervised learning of motion capture. In Advances in Neu- ral Information Processing Systems (NIPS), 2017
work page 2017
- [53]
-
[54]
C. Wan, T. Probst, L. Van Gool, and A. Yao. Crossing nets: Combining gans and vaes with a shared latent space for hand pose estimation. In CVPR, 2017
work page 2017
-
[55]
C. Wan, T. Probst, L. Van Gool, and A. Yao. Dense 3d re- gression for hand pose estimation. In CVPR, 2018
work page 2018
-
[56]
L. Wei, Q. Huang, D. Ceylan, E. V ouga, and H. Li. Dense human body correspondences using convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
work page 2016
-
[57]
C. Xu, L. N. Govindarajan, Y . Zhang, and L. Cheng. Lie-x: Depth image based articulated object pose estimation, track- ing, and action recognition on lie groups.International Jour- nal of Computer Vision, 2017
work page 2017
-
[58]
R. Yu, S. Saito, H. Li, D. Ceylan, and H. Li. Learning dense facial correspondences in unconstrained images. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
work page 2018
-
[59]
S. Yuan, Q. Ye, B. Stenger, S. Jain, and T.-K. Kim. Big- hand2. 2m benchmark: Hand pose dataset and state of the art analysis. In CVPR, 2017
work page 2017
- [60]
-
[61]
X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y . Wei. Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854, 2016. Supplemental Materials
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[62]
Qualitative results We show more qualitative results on the testing set of NYU dataset in Fig. 10, 11 and 12. Left column shows re- sults trained by the sparse key point supervision. Right col- umn shows results trained by the proposed self-supervision method. Readers may also refer to the attached video to check qualitative results on more frames
-
[63]
self-supervised(test on training set)
Self-supervision training error We investigate how well the proposed self-supervision method can fit to the training set itself, i.e. , the training er- ror, as “self-supervised(test on training set)” in Tab. 3 and Fig. 9. Since our self-supervision method can be potentially applied for automatic annotation of depth frames and ac- companied RGBs, its train...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.