Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

Chanyoung Kim; Donghyun Kim; Dong-Hyun Sim; Seong Jae Hwang; Youngjoong Kwon

arxiv: 2605.13604 · v1 · pith:LNV5F5DAnew · submitted 2026-05-13 · 💻 cs.CV

Pith reviewed 2026-05-14 19:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords: hand pose estimation · 2D-to-3D lifting · graph convolutional networks · self-attention · graph attention · skeleton topology · FPHA benchmark

The pith

Adaptive attention outperforms fixed graph convolution for lifting 2D hand poses to 3D.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions the standard use of fixed-adjacency graph convolutional networks for turning 2D hand keypoints into 3D poses. Controlled experiments on the FPHA benchmark, with parameter counts matched, show that plain multi-head self-attention lowers mean per-joint position error from 12.36 mm to 10.09 mm. A graph-constrained attention variant captures most of the gain, while fully connected attention adds the rest. Hand skeleton structure helps most when supplied as a soft prior through graph-distance positional encodings rather than as a rigid adjacency matrix. These findings indicate that input-dependent aggregation supplies a stronger inductive bias than static graph convolution for this task.

Core claim

Standard multi-head self-attention outperforms GCN baselines and even multi-hop strengthened GCNs on the FPHA benchmark, cutting MPJPE from 12.36 mm to 10.09 mm. Skeleton-constrained graph attention recovers most of the improvement, showing that input-dependent aggregation drives the advance, while fully connected attention supplies additional gains. Hand topology is most effective when added softly via graph-distance positional encoding instead of as a hard adjacency constraint.
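
A minimal sketch of the contrast drawn here, in plain Python on a toy four-joint chain. The function names, the chain, and the two poses are illustrative assumptions, and raw joint coordinates stand in for the learned query and key projections a real model would use. The GCN weights use the symmetric normalization of Kipf and Welling; the skeleton mask is one plausible reading of "skeleton-constrained" attention, not the paper's confirmed formulation.

```python
import math

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def attention_weights(queries, keys, mask=None):
    """Row-wise softmax of scaled dot products; mask[i][j] == 0 forbids i attending to j."""
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if mask is not None:
            logits = [z if mask[i][j] else -math.inf for j, z in enumerate(logits)]
        peak = max(logits)  # finite as long as each joint may attend to itself
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def aggregate(weights, features):
    """Weighted sum of joint features: out_i = sum_j w_ij * x_j."""
    dim = len(features[0])
    return [[sum(w[j] * features[j][c] for j in range(len(features))) for c in range(dim)]
            for w in weights]

# Toy chain 0-1-2-3 standing in for one finger (hypothetical).
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
skeleton_mask = [[1 if i == j or chain[i][j] else 0 for j in range(4)] for i in range(4)]

pose_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # straight finger
pose_b = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # curled finger

gcn_w = normalized_adjacency(chain)          # the same weights for every input pose
full_a = attention_weights(pose_a, pose_a)   # weights depend on the input pose
full_b = attention_weights(pose_b, pose_b)
masked_b = attention_weights(pose_b, pose_b, mask=skeleton_mask)
lifted_b = aggregate(masked_b, pose_b)
```

The fixed-adjacency weights never change between `pose_a` and `pose_b`, while both attention variants recompute their weights per pose. The masked variant keeps that input dependence but limits each joint to itself and its skeleton neighbours.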

What carries the argument

Multi-head self-attention for input-dependent spatial aggregation, optionally guided by graph-distance positional encodings as soft priors on hand topology.
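
One way to picture a graph-distance positional encoding is as a bias added to attention logits, indexed by the hop count between joints. The sketch below assumes a 21-joint layout (wrist at index 0, four joints per finger, thumb first) and placeholder bias values; both are illustrative assumptions, and the paper's exact encoding is not given in this summary.

```python
import math
from collections import deque

NUM_JOINTS = 21  # assumed layout: 0 = wrist, then 4 joints per finger, thumb first

def hand_edges():
    """Edges of the assumed skeleton: wrist to each finger base, then a chain to the tip."""
    edges = []
    for finger in range(5):
        base = 1 + 4 * finger
        edges.append((0, base))
        edges.extend((base + k, base + k + 1) for k in range(3))
    return edges

def graph_distances(num_nodes, edges):
    """All-pairs hop counts using one breadth-first search per source joint."""
    neighbours = [[] for _ in range(num_nodes)]
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if dist[src][nxt] < 0:
                    dist[src][nxt] = dist[src][node] + 1
                    queue.append(nxt)
    return dist

def biased_softmax(logits_row, dist_row, bias_per_hop):
    """Softmax over logits after adding a bias looked up by hop distance."""
    shifted = [z + bias_per_hop[d] for z, d in zip(logits_row, dist_row)]
    peak = max(shifted)
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

dist = graph_distances(NUM_JOINTS, hand_edges())
max_hops = max(max(row) for row in dist)

# Placeholder biases (a trained model would learn one scalar per hop count):
# nearer joints are favoured, but no joint pair is masked out.
bias_per_hop = [-0.5 * h for h in range(max_hops + 1)]

thumb_tip, index_tip = 4, 8
weights = biased_softmax([0.0] * NUM_JOINTS, dist[thumb_tip], bias_per_hop)
```

Unlike a hard adjacency mask, every entry of `weights` stays positive, so the thumb tip can still attend to the index tip eight hops away if the content logits favour it.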

If this is right

Attention layers can replace GCN layers in hand pose lifters to reduce 3D reconstruction error under matched parameter budgets.
Soft positional encodings of skeleton structure outperform hard adjacency constraints.
Input-dependent aggregation accounts for the largest share of the observed accuracy gains.
Fully connected attention yields further improvement beyond skeleton-constrained attention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adaptive attention may similarly improve performance on other skeleton-based lifting problems such as full-body or face pose estimation.
Designers of graph-based models may need fewer hand-crafted adjacency rules if attention can learn connections on the fly.
Repeating the ablation protocol on additional public hand-pose datasets would test whether the FPHA results generalize.

Load-bearing premise

That the superiority of attention over GCNs shown in parameter-matched ablations on the FPHA benchmark alone establishes the general advantage of adaptive attention across hand pose lifting tasks and datasets.

What would settle it

A new experiment in which a parameter-matched GCN achieves equal or lower MPJPE than attention on a second hand-pose benchmark would contradict the central claim.

Figures

Figures reproduced from arXiv: 2605.13604 by Chanyoung Kim, Donghyun Kim, Dong-Hyun Sim, Seong Jae Hwang, Youngjoong Kwon.

**Figure 2.** Fixed skeleton adjacency versus learned attention maps

**Figure 4.** Qualitative comparison: GCN multi-hop (red) vs. GT (green) on the left two columns, attention (blue) vs. GT (green) on the right

**Figure 5.** Noise robustness: oracle-trained models evaluated under ...

Original abstract

Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Attention beats parameter-matched GCNs on FPHA hand pose lifting, but single-benchmark results limit how far the inductive-bias claim travels.

The paper's core finding is that multi-head self-attention lowers MPJPE from 12.36 mm to 10.09 mm versus a strengthened multi-hop GCN on FPHA, with skeleton-constrained attention recovering most of the difference and graph-distance positional encoding working better as a soft prior than hard adjacency. The controlled, parameter-matched ablations are the useful part; they isolate input-dependent aggregation as the main driver and give a clear, quantitative comparison that prior GCN papers did not run in exactly this form. That targeted evidence is what the work actually contributes. The experiments look internally consistent on the reported numbers and avoid obvious circularity. The main limitation is the single-benchmark scope. FPHA is a reasonable starting point, but without the same protocol on HO3D, FreiHAND, or InterHand2.6M it is difficult to treat the result as general evidence that adaptive attention is a stronger inductive bias for hand pose lifting overall. No error bars or run-to-run variance are mentioned, and the abstract leaves implementation details thin. A reader working on hand pose or graph-based lifting methods would find the ablations worth checking, but the paper is not yet positioned as a broad replacement for GCNs. It is worth sending to peer review because the question is well-posed and the comparisons are concrete; the referees can ask for the missing datasets and variance numbers.

Referee Report

2 major / 2 minor

Summary. The paper claims that for 2D-to-3D hand pose lifting, adaptive spatial attention (multi-head self-attention) supplies a stronger inductive bias than fixed graph convolution. This is supported by parameter-matched ablations on the FPHA benchmark showing MPJPE reduction from 12.36 mm (strengthened multi-hop GCN) to 10.09 mm (self-attention), with a skeleton-constrained GAT recovering most of the gain and graph-distance positional encoding providing an effective soft prior for topology.

Significance. If the superiority of attention holds under broader validation, the result would meaningfully shift design choices in hand pose estimation away from GCNs toward attention-based aggregation, while still allowing soft structural priors. The controlled parameter-matched experiments and explicit comparison of hard adjacency versus input-dependent mechanisms are strengths that make the empirical case on FPHA concrete and falsifiable.
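
Parameter matching can be made concrete by counting weights per layer. The sketch below is an assumption about how such a match might be set up, not the paper's procedure: it treats a multi-hop GCN layer as one weight matrix per hop, treats a self-attention layer as four square projections, and uses hypothetical widths.

```python
def gcn_layer_params(d_in, d_out, hops=1, bias=True):
    """One weight matrix per hop (assumed multi-hop form) plus an optional bias vector."""
    return hops * d_in * d_out + (d_out if bias else 0)

def mhsa_layer_params(d_model, bias=True):
    """Query, key, value, and output projections, each d_model x d_model."""
    return 4 * (d_model * d_model + (d_model if bias else 0))

def matched_gcn_width(target_params, hops=1):
    """Smallest square GCN layer width whose parameter count reaches the target."""
    width = 1
    while gcn_layer_params(width, width, hops) < target_params:
        width += 1
    return width

attn_budget = mhsa_layer_params(64)                 # hypothetical attention width
gcn_width = matched_gcn_width(attn_budget, hops=3)  # hypothetical 3-hop GCN
```

When heads split `d_model` evenly, the attention count does not depend on the number of heads, so the budget is set by the width alone.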

major comments (2)

[Experiments / Abstract] The central claim of general superiority of adaptive attention over fixed GCNs for hand pose lifting is load-bearing on the assumption that FPHA differences generalize. All reported results (including the 12.36 mm to 10.09 mm MPJPE comparison and the GAT recovery) are confined to FPHA; no experiments appear on other standard benchmarks (HO3D, FreiHAND, InterHand2.6M) under matched 2D-to-3D lifting protocols.
[Experiments] The manuscript lacks error bars, multiple random seeds, or statistical significance tests for the reported MPJPE deltas. This weakens the reliability of the 2.27 mm gap between the strengthened GCN and self-attention baselines.
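
For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints. The sketch below computes it for a single made-up two-joint pose (not FPHA data) and then reproduces the arithmetic behind the 2.27 mm gap from the two values quoted in the report.

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance between matching 3D joints, in the input units."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)

# Two-joint toy pose in millimetres (made-up coordinates).
target = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 0.0, 12.0)]
toy_error = mpjpe(pred, target)  # joint errors of 5 mm and 12 mm average to 8.5 mm

# Gap between the two MPJPE values quoted in the report.
gcn_mm, attn_mm = 12.36, 10.09
gap_mm = round(gcn_mm - attn_mm, 2)
relative_pct = round(100 * (gcn_mm - attn_mm) / gcn_mm, 1)
```

The gap works out to 2.27 mm, or roughly an 18.4% relative reduction. Whether a gap of that size is reliable depends on the run-to-run variance the referee asks for.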

minor comments (2)

[Experiments] Implementation details (exact layer widths, optimizer settings, training schedule, and how parameter counts were matched) are referenced but not fully enumerated, making exact reproduction difficult.
[Method] Notation for the graph-distance positional encoding and the precise form of the skeleton constraint in the GAT variant could be clarified with an explicit equation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of our controlled ablations. We respond point-by-point to the major comments below and will revise the manuscript to address the concerns where feasible.

Referee: [Experiments / Abstract] The central claim of general superiority of adaptive attention over fixed GCNs for hand pose lifting is load-bearing on the assumption that FPHA differences generalize. All reported results (including the 12.36 mm to 10.09 mm MPJPE comparison and the GAT recovery) are confined to FPHA; no experiments appear on other standard benchmarks (HO3D, FreiHAND, InterHand2.6M) under matched 2D-to-3D lifting protocols.

Authors: We agree that broader evaluation would strengthen claims of general superiority. Our study deliberately focused on in-depth, parameter-matched ablations on FPHA to isolate the contribution of input-dependent aggregation versus fixed graph structure, as this benchmark enables precise control and is standard for hand pose lifting. In revision we will add an explicit limitations paragraph discussing single-benchmark scope, avoid over-generalizing in the abstract, and include a forward-looking statement on planned multi-dataset validation. This provides a balanced presentation without requiring new large-scale experiments at this stage.
Revision: partial

Referee: [Experiments] The manuscript lacks error bars, multiple random seeds, or statistical significance tests for the reported MPJPE deltas. This weakens the reliability of the 2.27 mm gap between the strengthened GCN and self-attention baselines.

Authors: We acknowledge the importance of statistical rigor. In the revised manuscript we will rerun the key models with at least five random seeds, report mean MPJPE with standard deviations, add error bars to tables and figures, and include paired statistical tests (e.g., Wilcoxon signed-rank) to establish significance of the observed differences. These changes directly address the reliability concern.
Revision: yes
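
A sketch of what the promised seed-level reporting could look like, using only the Python standard library. The per-seed values are placeholders chosen for illustration and are not results from the paper; the exact Wilcoxon helper assumes no zero differences and no tied magnitudes, and pairing across seeds is only one of several possible pairings.

```python
from itertools import product
from statistics import mean, stdev

def wilcoxon_exact(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value for a small paired sample.

    Assumes no zero differences and no tied magnitudes, so ranks are 1..n.
    """
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = {idx: r + 1 for r, idx in enumerate(order)}
    w_plus = sum(rank[i] for i in range(n) if diffs[i] > 0)
    total = n * (n + 1) // 2
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis all 2^n sign patterns are equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** n

# Placeholder per-seed errors in mm for two hypothetical models (NOT paper results).
model_a = [5.0, 5.3, 4.9, 5.6, 5.2]
model_b = [4.1, 4.0, 4.4, 4.2, 4.5]

diffs = [a - b for a, b in zip(model_a, model_b)]
summary_b = (mean(model_b), stdev(model_b))  # mean and sample standard deviation
p_value = wilcoxon_exact(diffs)
```

With five paired runs, the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625, even when one model wins on every seed. Clearing a 0.05 threshold would therefore need more seeds or a different pairing unit, such as per-frame errors.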

Circularity Check

0 steps flagged

No circularity: empirical ablations rest on direct comparisons

The paper reports parameter-matched architecture ablations on FPHA, with attention reducing MPJPE from 12.36 mm to 10.09 mm versus strengthened GCN baselines. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims derive from experimental measurements rather than any reduction to inputs by construction, satisfying the self-contained empirical criterion for score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard domain assumptions in pose estimation rather than introducing new free parameters or entities.

axioms (2)

domain assumption: Hand skeleton topology can be encoded as a graph with fixed adjacency for baseline comparisons.
Invoked when defining GCN baselines and multi-hop variants.
domain assumption: MPJPE on FPHA is a sufficient proxy for model quality in 2D-to-3D lifting.
Used to quantify all reported improvements.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "standard multi-head self-attention consistently outperforms GCN baselines... input-dependent aggregation is a major source of improvement"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
