arxiv: 2605.05712 · v1 · submitted 2026-05-07 · 💻 cs.CV

EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation

Pith reviewed 2026-05-08 15:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric vision · electromyography · hand pose estimation · multimodal dataset · bimanual gestures · EMG fusion · wrist EMG · joint angle prediction

The pith

Bilateral wrist EMG fused with egocentric vision improves bimanual hand pose estimation over vision alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoEMG, a dataset that records 16 channels of bilateral wrist EMG at 2 kHz together with egocentric wide-angle RGB video and motion-capture ground truth for hand joint angles. Data come from 41 participants performing 60 gesture classes (30 single-hand, 30 bimanual) for more than 10 hours, with additional IMU and external RGB-D streams. The authors define three shared benchmark tasks—EMG-to-pose, vision-to-pose, and multimodal fusion—using a common joint-angle target and the same cross-gesture and cross-user splits. Baselines include EMGFormer for the EMG path and ResNet/ViT models for vision; a residual fusion architecture is shown to outperform matched lightweight vision-only models on the same splits.

Core claim

EgoEMG supplies synchronized bilateral EMG (8 channels per wrist at 2 kHz), 120 Hz IMU, egocentric RGB, external RGB-D, and mocap-derived wrist and finger angles for 41 participants across 60 gesture classes. Under a common joint-angle regression target and generalization axes (cross-gesture, cross-user, combined), a residual fusion model that adds EMG features to a vision backbone improves accuracy relative to vision-only baselines of comparable size.
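The generalization axes named above can be illustrated with a small split-assignment helper. This is a minimal sketch, assuming that a recording belongs to the gesture split when only its gesture is held out, to the user split when only its participant is held out, and to the combined split when both are. The function name, the string labels, and the identifier format are hypothetical and do not come from the dataset release.

```python
from typing import Set


def assign_split(participant: str, gesture: str,
                 held_out_users: Set[str],
                 held_out_gestures: Set[str]) -> str:
    """Place one (participant, gesture) cell of the matrix into a split.

    Assumption: the labels mirror the Train / Gesture / User / Both
    regions of the participant-gesture matrix; names are illustrative.
    """
    user_held = participant in held_out_users
    gesture_held = gesture in held_out_gestures
    if user_held and gesture_held:
        return "both"      # unseen user performing an unseen gesture
    if user_held:
        return "user"      # unseen user, gesture seen in training
    if gesture_held:
        return "gesture"   # seen user, unseen gesture
    return "train"
```

Keeping the assignment a pure function of the two held-out sets makes it easy to reuse the same partition for the EMG-only, vision-only, and fused models.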

What carries the argument

The residual fusion architecture that injects EMG-derived features into a vision backbone for joint-angle prediction.
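One way to picture a residual fusion step is a vision prediction plus an EMG-driven correction. The sketch below assumes the simplest possible form, a linear map from an EMG feature vector to a per-angle offset. The paper's actual fusion layers are not described on this page, so the function name, the linear head, and the parameter layout are all illustrative assumptions.

```python
from typing import List, Sequence


def residual_fusion(vision_angles: Sequence[float],
                    emg_features: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Return vision-predicted joint angles plus a linear EMG correction.

    Assumption: one weight row and one bias term per joint angle; this
    stands in for whatever learned head the real model uses.
    """
    if not (len(vision_angles) == len(weights) == len(bias)):
        raise ValueError("need one weight row and one bias per joint angle")
    fused = []
    for angle, row, b in zip(vision_angles, weights, bias):
        if len(row) != len(emg_features):
            raise ValueError("weight row length must match EMG feature size")
        correction = sum(w * f for w, f in zip(row, emg_features)) + b
        fused.append(angle + correction)
    return fused
```

With all weights and biases at zero the output equals the vision prediction, which is the property that makes a residual design easy to compare against a matched vision-only baseline.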

If this is right

  • EMG supplies fine finger articulation under occlusion or poor lighting where vision fails.
  • Vision supplies global hand configuration that EMG alone cannot determine.
  • The shared benchmark splits let future work compare EMG-only, vision-only, and fused approaches on identical generalization axes.
  • The dataset size (over 10 hours, 41 users) supports training and testing of lightweight wearable models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wearable wristbands plus a head-mounted camera could enable precise bimanual tracking for AR/VR without external cameras.
  • The same fusion approach may extend to continuous everyday hand actions beyond the 60 discrete gestures studied.
  • Prosthetic or rehabilitation systems could combine surface EMG with egocentric video for more robust control.

Load-bearing premise

The EMG, video, and mocap streams are accurately time-aligned and the 41-participant, 60-gesture collection covers enough real-world hand-movement variation for the benchmark results to generalize.
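To make the alignment premise concrete, the sketch below maps a video frame timestamp to the range of EMG samples around it, assuming both streams already share one clock. Centering the EMG window on the frame is an assumption made for illustration, and the function name and arguments are hypothetical. Only the 2 kHz rate comes from the summary above.

```python
from typing import Tuple

EMG_RATE_HZ = 2000  # EMG sampling rate stated in the summary above


def emg_window_for_frame(frame_time_s: float, window_s: float,
                         n_samples: int,
                         rate_hz: int = EMG_RATE_HZ) -> Tuple[int, int]:
    """Return (start, stop) EMG sample indices centered on a video frame.

    Assumption: frame_time_s is measured on the same clock as the EMG
    stream; the window is clipped to the bounds of the recording.
    """
    if window_s <= 0 or n_samples <= 0:
        raise ValueError("window_s and n_samples must be positive")
    center = round(frame_time_s * rate_hz)
    half = round(window_s * rate_hz / 2)
    start = max(0, center - half)
    stop = min(n_samples, center + half)
    return start, stop
```

Any residual offset between the two clocks shifts this window directly, which is why the alignment assumption matters for every benchmark task.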

What would settle it

Release of a new synchronized multimodal recording with fresh users or gestures where the residual fusion model shows no accuracy gain over the matched vision-only baseline.

Figures

Figures reproduced from arXiv: 2605.05712 by Jianjiang Feng, Jiayi Yu, Jie Zhou, Yanbo Duan, Yitao Wang, Ziheng Xi.

Figure 1: Data collection setup for EgoEMG. Bilateral EMG wristbands, head-mounted egocentric RGB, external ZED 2i RGB-D, and optical motion capture with hand markers for pose labels. Transformer-based reconstructors [Pavlakos et al., 2024, Potamias et al., 2024]. Egocentric systems and datasets such as UmeTrack [Han et al., 2022], HOT3D [Banerjee et al., 2024], H2O [Kwon et al., 2021], and WristPP [Xi et al., 2026]…
Figure 2: Representative synchronized samples from …
Figure 3: (a) MANO shape diversity across EgoEMG participants (82 hands). (b) Evaluation splits on the participant–gesture matrix (Train, Gesture/User/Both). 3.4 Evaluation Splits and Release Format. Following the protocol of EMG2Pose [Salter et al., 2024], we define two primary generalization axes: gesture and user. Based on these axes, we evaluate three splits: a gesture split, where held-out gestures are evaluated…
Figure 4: EMGFormer architecture for EMG-to-pose. Bilateral EMG is encoded by a TDS-style …
Figure 5: Vision-to-pose baseline. A 256 × 256 hand crop from the center frame is processed by a ResNet or ViT backbone, followed by an MLP head that regresses 22 joint angles. 4.3 EMG Baseline. As the EMG-to-pose baseline, we introduce EMGFormer (…
Figure 6: EMG+vision fusion baseline. Vision branch predicts …
Figure 7: Qualitative comparison on EgoEMG. The fused model (green) recovers articulation closer to GT under motion blur, self-occlusion, and depth ambiguity. a strong hand-specialized visual model rather than only generic backbones. A per-gesture analysis across all 60 gesture classes (Appendix D.5) shows that fusion improves over vision-only on nearly all gestures. An occlusion-stratified analysis (Appendix D.6) s…
Figure 8: Architecture of markers2mano pipeline. The primary inference model is a Graph Transformer that maps the 21 marker positions $M \in \mathbb{R}^{21 \times 3}$ to MANO pose $\theta \in \mathbb{R}^{48}$ and shape $\beta \in \mathbb{R}^{10}$ parameters. The architecture is shown in …
Figure 9: Data augmentation pipeline for markers2mano training. frames. To balance the training distribution, we assign each dataset a sampling ratio based on its scale tier: large-scale datasets (100M+ frames, e.g., GigaHand) at 0.14%; medium-scale datasets (1M+, e.g., HOT3D, InterHand2.6M) at 1.5%; and small-scale datasets (100K+, e.g., FreiHand, HO3D) at 15–20%. This yields approximately 0.7M training meshes (…
Figure 10: Composition of the markers2mano training data. Left: original frame counts of 11 public datasets (log scale), spanning 76K to 183M frames. Each dataset is assigned a sampling ratio by scale tier, shown as annotations: 0.14% (GigaHand, 100M+), 1.5% (1M+), or 15–20% (100K+). Right: target distribution of the training mixture after applying the tiered sampling ratios, yielding approximately 0.7M training meshe…
Figure 11: Mesh reprojection and visualization of UmeTrack and MANO mesh reconstruction on …
Figure 12: Per-gesture vision-only vs. fusion MAE on …
Figure 13: Fusion benefit as a function of hand self-occlusion on the …
Original abstract

Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilateral wristband EMG with 16 total channels (8 per wrist) sampled at 2 kHz, 120 Hz IMU, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. The dataset covers 41 participants performing 60 gesture classes, including 30 single-hand gestures and 30 bimanual gestures, totaling more than 10 hours of recording. We also introduce a benchmark with three tasks -- EMG-to-pose, vision-to-pose, and EMG+vision fusion -- under a shared joint-angle prediction target and common generalization split axes (cross-gesture, cross-user, and combined). As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. We further study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, EgoEMG and its benchmark establish a foundation for future research on multimodal hand pose estimation with EMG and vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EgoEMG, a new multimodal egocentric dataset for bimanual hand pose estimation comprising bilateral wristband EMG (16 channels total at 2 kHz), 120 Hz IMU, egocentric wide-angle RGB, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. It covers 41 participants performing 60 gesture classes (30 single-hand and 30 bimanual) for over 10 hours of data. The authors define three benchmark tasks—EMG-to-pose (using EMGFormer), vision-to-pose (ResNet/ViT backbones), and EMG+vision residual fusion—under shared joint-angle targets and generalization splits (cross-gesture, cross-user, combined), reporting that the fusion architecture improves over matched lightweight vision-only baselines.

Significance. If synchronization and labeling quality hold, EgoEMG would fill a notable gap as the first publicly released dataset synchronizing high-rate bilateral EMG with egocentric vision and precise mocap ground truth for hand articulation. The scale (41 participants, 60 gestures, >10 h), bimanual coverage, and explicit cross-user/cross-gesture splits make it a useful resource for multimodal sensing research. The residual fusion baseline provides an initial demonstration of complementary modality benefits.

major comments (2)
  1. §3.2 (Synchronization and Alignment): The central claim that EMG, video, IMU, and mocap labels share a common timeline with sub-frame precision is load-bearing for all three benchmark tasks, yet the manuscript provides no quantitative validation of residual alignment error (e.g., no cross-correlation results, hardware trigger latency measurements, or post-alignment drift statistics). Without such evidence, reported EMG-to-pose and fusion gains cannot be interpreted as modality-specific rather than alignment artifacts.
  2. §5.3 (Fusion Results): The statement that the residual fusion architecture 'improves over matched lightweight vision-only baselines' is presented without tabulated error metrics, confidence intervals, or statistical tests across the cross-user and cross-gesture splits. This weakens the empirical support for the multimodal contribution.
minor comments (2)
  1. Table 1: Confirm that the listed sampling rates and channel counts match the text description of 2 kHz bilateral EMG and 120 Hz IMU.
  2. §2 (Related Work): Add a brief comparison row or paragraph quantifying how EgoEMG's participant count, gesture diversity, and modality synchronization exceed the closest prior EMG-vision datasets.
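The cross-correlation check requested in the first major comment could look roughly like the following. This is a minimal sketch, assuming two one-dimensional signals that have already been resampled to a common rate (for example an EMG envelope and a motion-derived signal). The function name and sign convention are illustrative choices, and this is not the authors' validation procedure.

```python
from typing import Sequence


def estimate_lag(reference: Sequence[float], delayed: Sequence[float],
                 max_lag: int) -> int:
    """Return the lag in samples that maximizes mean-removed cross-correlation.

    A positive result means `delayed` trails `reference` by that many
    samples. Assumption: both inputs share one sampling rate.
    """
    n = min(len(reference), len(delayed))
    if n == 0 or max_lag < 0:
        raise ValueError("need non-empty signals and a non-negative max_lag")
    ref_mean = sum(reference[:n]) / n
    dly_mean = sum(delayed[:n]) / n
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += (reference[i] - ref_mean) * (delayed[j] - dly_mean)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

At a 2 kHz sampling rate one sample corresponds to 0.5 ms, so a per-session table of estimated lags would give a direct view of residual offset and drift.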

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. The feedback highlights important aspects of validation and reporting that we will address in the revision to strengthen the paper.

Point-by-point responses
  1. Referee: §3.2 (Synchronization and Alignment): The central claim that EMG, video, IMU, and mocap labels share a common timeline with sub-frame precision is load-bearing for all three benchmark tasks, yet the manuscript provides no quantitative validation of residual alignment error (e.g., no cross-correlation results, hardware trigger latency measurements, or post-alignment drift statistics). Without such evidence, reported EMG-to-pose and fusion gains cannot be interpreted as modality-specific rather than alignment artifacts.

    Authors: We agree that quantitative evidence of alignment precision is necessary to support the claims. The manuscript describes the hardware-based synchronization using shared triggers and timestamp logging but does not report explicit error metrics. In the revised version we will add a dedicated paragraph in §3.2 with measured hardware trigger latencies, cross-correlation results between EMG and video streams, and statistics on residual drift across sessions to confirm sub-frame alignment. revision: yes

  2. Referee: §5.3 (Fusion Results): The statement that the residual fusion architecture 'improves over matched lightweight vision-only baselines' is presented without tabulated error metrics, confidence intervals, or statistical tests across the cross-user and cross-gesture splits. This weakens the empirical support for the multimodal contribution.

    Authors: We concur that the fusion results require more detailed quantitative support. The current text and figures provide only summary statements. In the revision we will expand §5.3 with full tables reporting mean errors, standard deviations, and 95% confidence intervals for all three tasks under each generalization split. We will also add paired statistical tests (e.g., Wilcoxon signed-rank) to evaluate the significance of the fusion improvements over the matched vision-only baselines. revision: yes
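The paired test named in the second response can be sketched in plain Python. The version below computes the Wilcoxon signed-rank statistic with a normal approximation and no tie or continuity correction, which is an assumption made for brevity. For small samples an exact test, or a library routine such as `scipy.stats.wilcoxon`, would be the more careful choice. The inputs are assumed to be paired per-subject (or per-gesture) errors for the vision-only and fused models, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(baseline: Sequence[float],
                         fused: Sequence[float]) -> Tuple[float, float, float]:
    """Return (W+, z, two-sided p) for paired errors, normal approximation.

    Zero differences are dropped; tied absolute differences share the
    average of the ranks they span.
    """
    if len(baseline) != len(fused):
        raise ValueError("inputs must be paired")
    diffs = [b - f for b, f in zip(baseline, fused) if b != f]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_two_sided
```

A positive z here means the baseline error tends to exceed the fused error, which is the direction the fusion claim predicts.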

Circularity Check

0 steps flagged

No circularity: dataset release with empirical benchmarks only

full rationale

The paper presents a new multimodal dataset (EgoEMG) and standard ML baselines (EMGFormer, ResNet/ViT, residual fusion) evaluated on cross-gesture and cross-user splits. No derivation chain, equations, or predictions are claimed that reduce by construction to fitted inputs, self-definitions, or self-citation load-bearing premises. Synchronization and labeling are empirical assumptions subject to external verification, not internal reductions. This matches the default non-circular case for data-collection papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical data-collection and benchmarking effort using off-the-shelf sensors and existing neural architectures; no free parameters are introduced in a derivation, no new axioms are postulated, and no invented entities are required.

pith-pipeline@v0.9.0 · 5575 in / 1331 out tokens · 54739 ms · 2026-05-08T15:07:30.994353+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    Per-channel mean subtraction (DC removal)

  2. [2]

    Narrow-band notch filters at 50 Hz (power line frequency) and 100 Hz (first harmonic)

  3. [3]

    Broadband bandpass filter 20–850 Hz

  4. [4]

    I Love You

    Soft roll-off above 900 Hz to suppress high-frequency noise. No amplitude normalization is applied; filtered EMG values are retained in mV. Both raw (observation.emg.left/right) and filtered (observation.emg.left_filtered/right_filtered) streams are preserved in the release. Training uses the filtered EMG by default. Table 4: Complete gesture vocab...

  5. [5]

    Conv1d: 16→256, kernel 11, stride 5

  6. [6]

    Conv1d: 256→256, kernel 5, stride 2

  7. [7]

    TDS Stage 1: in-conv kernel 9, stride 5; 1 block of TDSConv (kernel width 5, channels 8, feature width 32)

  8. [8]

    C.2 Vision and Fusion Training Details. Vision baseline. We provide six fully fine-tuned generic vision backbones: ResNet-{18, 50, 152} and ViT-{Small, Base, Large}

    TDS Stage 2: in-conv kernel 3, stride 1; 1 block of TDSConv (kernel width 3, channels 8, feature width 32). The final output is a (256, 146) feature map, corresponding to an effective frame rate of approximately 37 Hz with a receptive field spanning the full input window. C.2 Vision and Fusion Training Details. Vision baseline. We provide six fully fine-tuned ...

  9. [9]

    The MANO mesh vertices are transformed from world coordinates to camera coordinates using the calibrated extrinsic matrix ($v_{\mathrm{cam}} = R_{C \leftarrow W}\, v_W + t_{C \leftarrow W}$)

  10. [10]

    For each pixel covered by a triangle, the interpolated depth is stored if it is smaller than the current buffer value

    All mesh triangles are rasterized into a depth buffer (z-buffer) at the native video resolution via pinhole projection with the calibrated intrinsic matrix K. For each pixel covered by a triangle, the interpolated depth is stored if it is smaller than the current buffer value

  11. [11]

    A vertex is deemed visible if, within a 5×5 pixel neighbourhood of its projected location, any z-buffer entry agrees with the vertex depth to within ϵ = 5 mm

  12. [12]

    Each triangle’s 3-D surface area (half the cross-product norm of two edge vectors) is distributed equally to its three vertices, yielding a per-vertex area weight $w_i$. [Figure 12 plot: axes Vision-only MAE (°) and Fusion MAE (°); regions "Fusion worse" and "Fusion better"; legend Single-hand, Symmetric bimanual, Asymmetric bimanual; annotated gestures ASL7, FingerPullRight, HandClasp, ASL9.] Figure 12: Per-g...

  13. [13]

    Limitations

    The self-occlusion score is defined as $s_{\mathrm{occ}} = 1 - \frac{\sum_{i \in V_{\mathrm{vis}}} w_i}{\sum_{i=1}^{V} w_i}$ (4), where $V_{\mathrm{vis}}$ is the set of visible vertices. A score of 0 means the hand is fully visible; a score of 1 means it is fully occluded from the camera viewpoint. Results. Figure 13(a) shows that both vision-only and fusion MAE vary with occlusion, while the gap between them, the fusion gain...

  14. [14]

    Risk disclosure to participants was part of the consent process

    Institutional review board (IRB) approvals or equivalent for research with human subjects. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country ...
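Entries 12 and 13 above describe per-vertex area weights and the self-occlusion score built from them. The sketch below follows those two descriptions: each triangle's area is split equally across its three vertices, and the score is one minus the visible share of the total weight. The per-vertex visibility flags are taken as given here, so the z-buffer test from entries 10 and 11 is not implemented. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def triangle_area(a: Vec3, b: Vec3, c: Vec3) -> float:
    """Half the norm of the cross product of two edge vectors."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def vertex_area_weights(vertices: Sequence[Vec3],
                        triangles: Sequence[Tuple[int, int, int]]) -> List[float]:
    """Distribute each triangle's area equally to its three vertices."""
    weights = [0.0] * len(vertices)
    for i, j, k in triangles:
        share = triangle_area(vertices[i], vertices[j], vertices[k]) / 3.0
        weights[i] += share
        weights[j] += share
        weights[k] += share
    return weights


def self_occlusion_score(weights: Sequence[float],
                         visible: Sequence[bool]) -> float:
    """Return 1 minus the visible fraction of the total area weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("total area weight must be positive")
    visible_weight = sum(w for w, v in zip(weights, visible) if v)
    return 1.0 - visible_weight / total
```

Weighting by surface area rather than counting vertices keeps densely tessellated regions such as fingertips from dominating the score.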