A Spatio-Temporal Graph Convolutional Network for Gesture Recognition from High-Density Electromyography

Mingming Zhang; Peiwen Fu; Wenjuan Zhong; Wenxuan Xiong; Yuyang Zhang

arxiv: 2312.00553 · v2 · submitted 2023-12-01 · 💻 cs.HC · eess.SP

A Spatio-Temporal Graph Convolutional Network for Gesture Recognition from High-Density Electromyography

Wenjuan Zhong , Yuyang Zhang , Peiwen Fu , Wenxuan Xiong , Mingming Zhang This is my paper

Pith reviewed 2026-05-24 05:32 UTC · model grok-4.3

classification 💻 cs.HC eess.SP

keywords gesture recognitionhigh-density surface electromyographyspatio-temporal graph convolutional networkhuman-machine interfacesprosthetic controlfunctional connectivitydeep learning

0 comments

The pith

A spatio-temporal graph network built on muscle channel connectivity reaches 91.07 percent accuracy for 65 hand gestures from high-density EMG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STGCN-GR, a method that first builds graphs from functional connectivity among high-density surface electromyography channels to represent muscle activity. It then applies separate temporal convolution to track signal changes over time and spatial graph convolution to learn how activity relates across different muscle locations. Tested on a public dataset containing 65 gestures, the approach reaches 91.07 percent accuracy and exceeds prior deep learning results on the identical data. A reader would care because more accurate recognition of many gestures could support finer control of upper-limb prosthetics. The work focuses on overcoming the inability of standard networks to use both the spatial layout and time patterns present in the recordings.

Core claim

The STGCN-GR method constructs muscle networks based on functional connectivity between channels to create a graph representation of HD-sEMG recordings. A temporal convolution module captures temporal dependencies in the HD-sEMG series while a spatial graph convolution module learns the intrinsic spatial topology information among distinct HD-sEMG channels. On a public dataset with 65 gestures the model achieves 91.07 percent accuracy and surpasses state-of-the-art deep learning methods applied to the same dataset.

What carries the argument

The STGCN-GR model, which converts HD-sEMG channels into graphs via functional connectivity and then combines temporal convolution for time-series dependencies with spatial graph convolution for topology learning.

Load-bearing premise

The construction of muscle networks based on functional connectivity between channels creates a graph representation that accurately captures the intrinsic spatial topology information among distinct HD-sEMG channels.

What would settle it

Retraining the model on the same 65-gesture dataset but replacing the functional-connectivity graphs with random channel connections and measuring whether accuracy falls substantially below 91.07 percent would test whether the specific graph topology is required.

Figures

Figures reproduced from arXiv: 2312.00553 by Mingming Zhang, Peiwen Fu, Wenjuan Zhong, Wenxuan Xiong, Yuyang Zhang.

**Figure 4.** Figure 4: The corresponding model accuracy for parameter k over a range from 2 to 6 with an interval of 1. 2 3 4 5 6 0.80 0.85 0.90 0.95 1.00 Index K Accuracy（*100%） Sub 11 Sub 17 Sub 19 Sub 15 Sub 20 Average [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

read the original abstract

Accurate hand gesture prediction is crucial for effective upper-limb prosthetic limbs control. As the high flexibility and multiple degrees of freedom exhibited by human hands, there has been a growing interest in integrating deep networks with high-density surface electromyography (HD-sEMG) grids to enhance gesture recognition capabilities. However, many existing methods fall short in fully exploit the specific spatial topology and temporal dependencies present in HD-sEMG data. Additionally, these studies are often limited number of gestures and lack generality. Hence, this study introduces a novel gesture recognition method, named STGCN-GR, which leverages spatio-temporal graph convolution networks for HD-sEMG-based human-machine interfaces. Firstly, we construct muscle networks based on functional connectivity between channels, creating a graph representation of HD-sEMG recordings. Subsequently, a temporal convolution module is applied to capture the temporal dependences in the HD-sEMG series and a spatial graph convolution module is employed to effectively learn the intrinsic spatial topology information among distinct HD-sEMG channels. We evaluate our proposed model on a public HD-sEMG dataset comprising a substantial number of gestures (i.e., 65). Our results demonstrate the remarkable capability of the STGCN-GR method, achieving an impressive accuracy of 91.07% in predicting gestures, which surpasses state-of-the-art deep learning methods applied to the same dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This applies a standard spatio-temporal GCN to HD-sEMG on a 65-gesture dataset and claims 91% accuracy, but missing details on graph construction and splits make the gain hard to trust.

read the letter

The paper's core move is to build a graph from functional connectivity across HD-sEMG channels, then run temporal convolution followed by spatial graph convolution to recognize 65 gestures. It reports 91.07% accuracy on a public dataset and states that this beats prior deep learning baselines on the same data. That combination on a larger gesture set is the actual new piece; the architecture itself follows existing STGCN patterns rather than deriving something from scratch. The practical framing for upper-limb prosthetics is clear and the two-module design is straightforward to follow. The main weakness is the lack of any description of how the functional connectivity matrix is computed, whether it is built only on training data, or what similarity metric and threshold are used. Without that, it is impossible to rule out leakage into the spatial convolution. The abstract also gives no train/test protocol, cross-validation scheme, or baseline implementation details, so the superiority claim cannot be checked. The stress-test point about the graph possibly capturing dataset-wide correlations instead of fixed spatial topology therefore lands as a real open question rather than a minor quibble. This is the sort of applied signal-processing paper that people working on EMG interfaces or prosthetic control might want to read if the methods section turns out clean. I would bring the full version to a reading group to walk through the connectivity step. I would not cite it yet. It is worth sending for peer review so the experimental setup can be examined directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes STGCN-GR, a spatio-temporal graph convolutional network for recognizing 65 hand gestures from high-density sEMG recordings. It first builds a graph of muscle networks from functional connectivity between electrode channels, then applies a temporal convolution module followed by a spatial graph convolution module to capture temporal dependencies and intrinsic spatial topology. On a public dataset the method is reported to reach 91.07% accuracy, exceeding prior deep-learning baselines applied to the same data.

Significance. If the performance gain is shown to arise from genuine topology-aware modeling rather than graph-construction artifacts, the approach could meaningfully advance HD-sEMG-based prosthetic interfaces by explicitly encoding both spatial electrode relationships and temporal dynamics on a large gesture vocabulary. The emphasis on a 65-class task and graph construction from functional connectivity are positive features that distinguish the work from many smaller-scale sEMG studies.

major comments (2)

[Abstract, method-description paragraph] Abstract, method-description paragraph: the construction of the muscle network 'based on functional connectivity between channels' supplies neither an equation nor an algorithm for the similarity metric or threshold. Because the adjacency matrix directly determines the support of the subsequent spatial graph convolution, the absence of this detail leaves open whether the reported 91.07% accuracy reflects learned topology or inadvertent leakage of subject- or label-specific correlations into the graph.
[Abstract, results paragraph] Abstract, results paragraph: the superiority claim ('surpasses state-of-the-art deep learning methods') is stated without any description of the train/test split, cross-validation procedure, number of subjects, statistical testing, or exact baseline implementations and hyper-parameters. These elements are load-bearing for the central empirical claim on the 65-gesture task.

minor comments (2)

[Abstract] Abstract: 'fall short in fully exploit the specific spatial topology' contains a grammatical error; the intended phrasing appears to be 'fall short in fully exploiting'.
[Abstract] Abstract: 'these studies are often limited number of gestures and lack generality' is grammatically incomplete and should be rephrased for clarity (e.g., 'are often limited to a small number of gestures and lack generality').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional methodological and experimental detail will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested clarifications.

read point-by-point responses

Referee: [Abstract, method-description paragraph] Abstract, method-description paragraph: the construction of the muscle network 'based on functional connectivity between channels' supplies neither an equation nor an algorithm for the similarity metric or threshold. Because the adjacency matrix directly determines the support of the subsequent spatial graph convolution, the absence of this detail leaves open whether the reported 91.07% accuracy reflects learned topology or inadvertent leakage of subject- or label-specific correlations into the graph.

Authors: We agree that the current description of graph construction is insufficiently precise. In the revised manuscript we will add an explicit equation for the functional-connectivity similarity (Pearson correlation computed on the time series of each electrode pair) together with the exact thresholding rule and whether the resulting adjacency matrix is formed subject-specifically or from pooled data. This addition will allow readers to verify that no label or subject information leaks into the graph topology. revision: yes
Referee: [Abstract, results paragraph] Abstract, results paragraph: the superiority claim ('surpasses state-of-the-art deep learning methods') is stated without any description of the train/test split, cross-validation procedure, number of subjects, statistical testing, or exact baseline implementations and hyper-parameters. These elements are load-bearing for the central empirical claim on the 65-gesture task.

Authors: We concur that the abstract and results section must supply these experimental details to support the performance claim. The full manuscript already employs a subject-independent leave-one-subject-out protocol on the public 65-gesture dataset; the revision will move the relevant numbers (subject count, split ratios, cross-validation scheme, baseline hyper-parameters, and any statistical tests) into the abstract and expand the results paragraph accordingly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical result on held-out data

full rationale

The paper reports an empirical accuracy of 91.07% on a public 65-gesture HD-sEMG dataset using STGCN-GR after constructing graphs from functional connectivity and applying spatio-temporal convolutions. No equations, derivations, or self-citations are provided that reduce this performance metric to a fitted parameter, input statistic, or prior result by construction. The central claim remains an externally falsifiable outcome on held-out data with no load-bearing step that collapses to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that functional-connectivity graphs faithfully encode spatial muscle topology and on standard supervised-learning assumptions about i.i.d. train/test splits. No invented physical entities; hyperparameters such as graph edge thresholds or layer counts are implicit free parameters not quantified in the abstract.

free parameters (1)

functional connectivity threshold or similarity metric
Used to decide which channels are connected in the graph; value not stated in abstract.

axioms (1)

domain assumption HD-sEMG channels form a meaningful graph whose edges reflect functional connectivity
Invoked when the paper states it constructs muscle networks based on functional connectivity.

pith-pipeline@v0.9.0 · 5786 in / 1274 out tokens · 26401 ms · 2026-05-24T05:32:15.381464+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we construct muscle networks based on functional connectivity between channels... weighted adjacency matrix is obtained by Pearson correlation... k-nearest neighbors (k-NN) strategy... Θ∗𝒢𝒙≈∑ θ_k T_k(tilde L) x
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

spatial graph convolution module... temporal convolution module... 65 gestures

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1]

An approach to continuous hand movement recognition using sEMG based on features fusion,

J. Li, L. Wei, Y. Wen, X. Liu, and H. Wang, "An approach to continuous hand movement recognition using sEMG based on features fusion," Vis. Comput., vol. 39, no. 5, pp. 2065-2079,

work page 2065
[2]

Graph neural networks for HD EMG-based movement intention recognition: an initial investigation,

S. M. Massa, D. Riboni, and K. Nazarpour, "Graph neural networks for HD EMG-based movement intention recognition: an initial investigation,” in 2022 IEEE International Conference on Recent Advances in Systems Science and Engineering, RASSE,

work page 2022
[3]

Multiplex recurrence network analysis of inter-muscular coordination during sustained grip and pinch contractions at different force levels,

N. Zhang, K. Li, G. Li, R. Nataraj, and N. Wei, "Multiplex recurrence network analysis of inter-muscular coordination during sustained grip and pinch contractions at different force levels," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 29, pp. 2055-2066,

work page 2055
[4]

Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting,

B. Yu, H. Yin, and Z. Zhu, "Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting,” in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 3634–3640

work page 2018
[5]

ViT-HGR: vision transformer-based hand gesture recognition from high density surface EMG signals,

M. Montazerin, S. Zabihi, E. Rahimian, A. Mohammadi, and F. Naderkhani, "ViT-HGR: vision transformer-based hand gesture recognition from high density surface EMG signals,” in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2022, pp. 5115-5119

work page 2022
[6]

HYDRA-HGR: A hybrid transformer-based architecture for fusion of macroscopic and microscopic neural drive information,

M. Montazerin, E. Rahimian, F. Naderkhani, S, H. Alinejad-Rokny, and A. Mohammadi, "HYDRA-HGR: A hybrid transformer-based architecture for fusion of macroscopic and microscopic neural drive information," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2023, pp. 1-5

work page 2023

[1] [1]

An approach to continuous hand movement recognition using sEMG based on features fusion,

J. Li, L. Wei, Y. Wen, X. Liu, and H. Wang, "An approach to continuous hand movement recognition using sEMG based on features fusion," Vis. Comput., vol. 39, no. 5, pp. 2065-2079,

work page 2065

[2] [2]

Graph neural networks for HD EMG-based movement intention recognition: an initial investigation,

S. M. Massa, D. Riboni, and K. Nazarpour, "Graph neural networks for HD EMG-based movement intention recognition: an initial investigation,” in 2022 IEEE International Conference on Recent Advances in Systems Science and Engineering, RASSE,

work page 2022

[3] [3]

Multiplex recurrence network analysis of inter-muscular coordination during sustained grip and pinch contractions at different force levels,

N. Zhang, K. Li, G. Li, R. Nataraj, and N. Wei, "Multiplex recurrence network analysis of inter-muscular coordination during sustained grip and pinch contractions at different force levels," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 29, pp. 2055-2066,

work page 2055

[4] [4]

Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting,

B. Yu, H. Yin, and Z. Zhu, "Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting,” in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 3634–3640

work page 2018

[5] [5]

ViT-HGR: vision transformer-based hand gesture recognition from high density surface EMG signals,

M. Montazerin, S. Zabihi, E. Rahimian, A. Mohammadi, and F. Naderkhani, "ViT-HGR: vision transformer-based hand gesture recognition from high density surface EMG signals,” in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc., 2022, pp. 5115-5119

work page 2022

[6] [6]

HYDRA-HGR: A hybrid transformer-based architecture for fusion of macroscopic and microscopic neural drive information,

M. Montazerin, E. Rahimian, F. Naderkhani, S, H. Alinejad-Rokny, and A. Mohammadi, "HYDRA-HGR: A hybrid transformer-based architecture for fusion of macroscopic and microscopic neural drive information," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2023, pp. 1-5

work page 2023