EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation
Pith reviewed 2026-05-08 15:07 UTC · model grok-4.3
The pith
Bilateral wrist EMG fused with egocentric vision improves bimanual hand pose estimation over vision alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EgoEMG supplies synchronized bilateral EMG (8 channels per wrist at 2 kHz), 120 Hz IMU, egocentric RGB, external RGB-D, and mocap-derived wrist and finger angles for 41 participants across 60 gesture classes. Under a common joint-angle regression target and generalization axes (cross-gesture, cross-user, combined), a residual fusion model that adds EMG features to a vision backbone improves accuracy relative to vision-only baselines of comparable size.
What carries the argument
The residual fusion architecture that injects EMG-derived features into a vision backbone for joint-angle prediction.
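As a rough illustration of what residual fusion can mean here, the sketch below adds a linearly projected EMG feature vector to a vision feature vector, so that the vision-only model is the special case of a zero projection. This is a toy under stated assumptions, not the paper's architecture: the source only says that EMG features are added to a vision backbone, and the function names, the linear projection, and all dimensions are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project(features: Vector, weights: Matrix) -> List[float]:
    """Linear map: one output value per row of `weights` (hypothetical EMG projection)."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def residual_fuse(vision_feat: Vector, emg_feat: Vector, emg_proj: Matrix) -> List[float]:
    """Return vision features plus a projected EMG correction (the residual branch)."""
    correction = project(emg_feat, emg_proj)
    if len(correction) != len(vision_feat):
        raise ValueError("projection must map EMG features to the vision feature size")
    return [v + c for v, c in zip(vision_feat, correction)]


# Toy values, chosen only to make the arithmetic easy to follow.
vision_feat = [1.0, 2.0, 3.0]
emg_feat = [0.5, -1.0]

# With a zero projection the fused features equal the vision features,
# so the vision-only model is recovered as a special case.
zero_proj = [[0.0, 0.0] for _ in vision_feat]
fused_zero = residual_fuse(vision_feat, emg_feat, zero_proj)

# A non-zero projection shifts each vision feature by an EMG-dependent amount.
toy_proj = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
fused_toy = residual_fuse(vision_feat, emg_feat, toy_proj)
```

One reason a design like this is attractive is that the EMG branch only has to learn a correction on top of what vision already provides, which makes a comparison against a matched vision-only baseline straightforward.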
If this is right
- EMG supplies fine finger articulation under occlusion or poor lighting where vision fails.
- Vision supplies global hand configuration that EMG alone cannot determine.
- The shared benchmark splits let future work compare EMG-only, vision-only, and fused approaches on identical generalization axes.
- The dataset size (over 10 hours, 41 users) supports training and testing of lightweight wearable models.
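The three generalization axes can be pictured as a partition over (participant, gesture) pairs. The sketch below is one plausible reading, assuming that "combined" means unseen participants performing unseen gestures; the hold-out sizes, the function name `make_splits`, and the one-recording-per-pair layout are illustrative assumptions, not the benchmark's published protocol.

```python
from typing import Dict, List, Set, Tuple

Recording = Tuple[int, int]  # (participant_id, gesture_id)


def make_splits(
    recordings: List[Recording],
    held_out_users: Set[int],
    held_out_gestures: Set[int],
) -> Dict[str, List[Recording]]:
    """Partition recordings into a training set and three generalization test sets."""
    splits: Dict[str, List[Recording]] = {
        "train": [], "cross_gesture": [], "cross_user": [], "combined": [],
    }
    for user, gesture in recordings:
        new_user = user in held_out_users
        new_gesture = gesture in held_out_gestures
        if new_user and new_gesture:
            splits["combined"].append((user, gesture))
        elif new_user:
            splits["cross_user"].append((user, gesture))
        elif new_gesture:
            splits["cross_gesture"].append((user, gesture))
        else:
            splits["train"].append((user, gesture))
    return splits


# 41 participants x 60 gesture classes, as stated in the source;
# one recording per pair is an assumption made for this sketch.
recordings = [(u, g) for u in range(41) for g in range(60)]

# Hypothetical hold-outs: the last 8 participants and the last 12 gestures.
splits = make_splits(recordings, set(range(33, 41)), set(range(48, 60)))
sizes = {name: len(items) for name, items in splits.items()}
```

Because every pair lands in exactly one bucket, EMG-only, vision-only, and fused models evaluated this way see identical train and test sets.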
Where Pith is reading between the lines
- Wearable wristbands plus a head-mounted camera could enable precise bimanual tracking for AR/VR without external cameras.
- The same fusion approach may extend to continuous everyday hand actions beyond the 60 discrete gestures studied.
- Prosthetic or rehabilitation systems could combine surface EMG with egocentric video for more robust control.
Load-bearing premise
The EMG, video, and mocap streams are accurately time-aligned and the 41-participant, 60-gesture collection covers enough real-world hand-movement variation for the benchmark results to generalize.
What would settle it
Release of a new synchronized multimodal recording with fresh users or gestures where the residual fusion model shows no accuracy gain over the matched vision-only baseline.
Original abstract
Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilateral wristband EMG with 16 total channels (8 per wrist) sampled at 2 kHz, 120 Hz IMU, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. The dataset covers 41 participants performing 60 gesture classes, including 30 single-hand gestures and 30 bimanual gestures, totaling more than 10 hours of recording. We also introduce a benchmark with three tasks -- EMG-to-pose, vision-to-pose, and EMG+vision fusion -- under a shared joint-angle prediction target and common generalization split axes (cross-gesture, cross-user, and combined). As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. We further study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, EgoEMG and its benchmark establish a foundation for future research on multimodal hand pose estimation with EMG and vision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EgoEMG, a new multimodal egocentric dataset for bimanual hand pose estimation comprising bilateral wristband EMG (16 channels total at 2 kHz), 120 Hz IMU, egocentric wide-angle RGB, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. It covers 41 participants performing 60 gesture classes (30 single-hand and 30 bimanual) for over 10 hours of data. The authors define three benchmark tasks—EMG-to-pose (using EMGFormer), vision-to-pose (ResNet/ViT backbones), and EMG+vision residual fusion—under shared joint-angle targets and generalization splits (cross-gesture, cross-user, combined), reporting that the fusion architecture improves over matched lightweight vision-only baselines.
Significance. If synchronization and labeling quality hold, EgoEMG would fill a notable gap as the first publicly released dataset synchronizing high-rate bilateral EMG with egocentric vision and precise mocap ground truth for hand articulation. The scale (41 participants, 60 gestures, >10 h), bimanual coverage, and explicit cross-user/cross-gesture splits make it a useful resource for multimodal sensing research. The residual fusion baseline provides an initial demonstration of complementary modality benefits.
Major comments (2)
- [§3.2] §3.2 (Synchronization and Alignment): The central claim that EMG, video, IMU, and mocap labels share a common timeline with sub-frame precision is load-bearing for all three benchmark tasks, yet the manuscript provides no quantitative validation of residual alignment error (e.g., no cross-correlation results, hardware trigger latency measurements, or post-alignment drift statistics). Without such evidence, reported EMG-to-pose and fusion gains cannot be interpreted as modality-specific rather than alignment artifacts.
- [§5.3] §5.3 (Fusion Results): The statement that the residual fusion architecture 'improves over matched lightweight vision-only baselines' is presented without tabulated error metrics, confidence intervals, or statistical tests across the cross-user and cross-gesture splits. This weakens the empirical support for the multimodal contribution.
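The cross-correlation check requested in the first major comment could look roughly like the sketch below: slide one stream against another and report the lag with the highest normalized correlation. Which two signals to compare (for example an EMG envelope and a mocap-derived motion signal, resampled to a common rate) is an assumption here, as are the function name `best_lag` and the synthetic test signal; the manuscript does not describe its validation procedure.

```python
import math
from typing import Sequence


def best_lag(reference: Sequence[float], delayed: Sequence[float], max_lag: int) -> int:
    """Return the shift (in samples) of `delayed` relative to `reference`
    that maximizes their normalized cross-correlation."""
    def score(lag: int) -> float:
        # Pair each reference sample with the candidate-shifted sample, where both exist.
        pairs = [
            (reference[i], delayed[i + lag])
            for i in range(len(reference))
            if 0 <= i + lag < len(delayed)
        ]
        ref_mean = sum(r for r, _ in pairs) / len(pairs)
        del_mean = sum(d for _, d in pairs) / len(pairs)
        num = sum((r - ref_mean) * (d - del_mean) for r, d in pairs)
        den = math.sqrt(
            sum((r - ref_mean) ** 2 for r, _ in pairs)
            * sum((d - del_mean) ** 2 for _, d in pairs)
        )
        return num / den if den > 0 else 0.0

    return max(range(-max_lag, max_lag + 1), key=score)


# Synthetic check: a smooth bump, and a copy of it delayed by 7 samples.
n = 400
envelope = [math.exp(-((i - 150) / 20.0) ** 2) for i in range(n)]
true_delay = 7
shifted = [envelope[i - true_delay] if i >= true_delay else 0.0 for i in range(n)]

estimated = best_lag(envelope, shifted, max_lag=30)
residual_ms = 1000.0 * estimated / 2000.0  # converted at the 2 kHz EMG rate
```

Reporting the distribution of such lags across sessions, after alignment, would give the residual-error statistic the referee asks for.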
Minor comments (2)
- [Table 1] Table 1: Confirm that the listed sampling rates and channel counts match the text description of 2 kHz bilateral EMG and 120 Hz IMU.
- [§2] §2 (Related Work): Add a brief comparison row or paragraph quantifying how EgoEMG's participant count, gesture diversity, and modality synchronization exceed the closest prior EMG-vision datasets.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. The feedback highlights important aspects of validation and reporting that we will address in the revision to strengthen the paper.
- Referee: §3.2 (Synchronization and Alignment): The central claim that EMG, video, IMU, and mocap labels share a common timeline with sub-frame precision is load-bearing for all three benchmark tasks, yet the manuscript provides no quantitative validation of residual alignment error (e.g., no cross-correlation results, hardware trigger latency measurements, or post-alignment drift statistics). Without such evidence, reported EMG-to-pose and fusion gains cannot be interpreted as modality-specific rather than alignment artifacts.
  Authors: We agree that quantitative evidence of alignment precision is necessary to support the claims. The manuscript describes the hardware-based synchronization using shared triggers and timestamp logging but does not report explicit error metrics. In the revised version we will add a dedicated paragraph in §3.2 with measured hardware trigger latencies, cross-correlation results between EMG and video streams, and statistics on residual drift across sessions to confirm sub-frame alignment.
  Revision: yes
- Referee: §5.3 (Fusion Results): The statement that the residual fusion architecture 'improves over matched lightweight vision-only baselines' is presented without tabulated error metrics, confidence intervals, or statistical tests across the cross-user and cross-gesture splits. This weakens the empirical support for the multimodal contribution.
  Authors: We concur that the fusion results require more detailed quantitative support. The current text and figures provide only summary statements. In the revision we will expand §5.3 with full tables reporting mean errors, standard deviations, and 95% confidence intervals for all three tasks under each generalization split. We will also add paired statistical tests (e.g., Wilcoxon signed-rank) to evaluate the significance of the fusion improvements over the matched vision-only baselines.
  Revision: yes
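To make the promised reporting concrete, the sketch below computes a paired mean difference with a percentile bootstrap 95% confidence interval. Note that this is a bootstrap interval, not the Wilcoxon signed-rank test the authors mention; it is shown because it needs only the standard library. The per-participant MAE values are made up for illustration and do not come from the paper, and `paired_bootstrap_ci` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    baseline: Sequence[float],
    candidate: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean paired difference (candidate - baseline) with a percentile bootstrap CI."""
    if len(baseline) != len(candidate):
        raise ValueError("inputs must be paired")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample participants with replacement, keeping each pair intact.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / len(diffs), lo, hi


# Made-up per-participant MAE values in degrees, for illustration only.
vision_mae = [8.0, 9.5, 7.2, 10.1, 8.8, 9.0, 7.9, 8.4]
fusion_mae = [7.1, 8.9, 6.8, 9.0, 8.1, 8.5, 7.5, 7.6]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(vision_mae, fusion_mae)
```

An interval that lies entirely below zero would indicate that fusion lowers error on these pairs; the same table would need to be produced separately for each generalization split.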
Circularity Check
No circularity: dataset release with empirical benchmarks only
The paper presents a new multimodal dataset (EgoEMG) and standard ML baselines (EMGFormer, ResNet/ViT, residual fusion) evaluated on cross-gesture and cross-user splits. No derivation chain, equations, or predictions are claimed that reduce by construction to fitted inputs, self-definitions, or self-citation load-bearing premises. Synchronization and labeling are empirical assumptions subject to external verification, not internal reductions. This matches the default non-circular case for data-collection papers.
Reference graph
Works this paper leans on
- [1] Per-channel mean subtraction (DC removal)
- [2] Narrow-band notch filters at 50 Hz (power line frequency) and 100 Hz (first harmonic)
- [3] Broadband bandpass filter 20–850 Hz
- [4] Soft roll-off above 900 Hz to suppress high-frequency noise. No amplitude normalization is applied; filtered EMG values are retained in mV. Both raw (observation.emg.left/right) and filtered (observation.emg.left_filtered/right_filtered) streams are preserved in the release. Training uses the filtered EMG by default. Table 4: Complete gesture vocab...
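The first two preprocessing steps listed above (DC removal and mains notches) can be sketched with the standard library alone. The excerpt does not state the filter designs, so the second-order notch form and the value Q = 30 are assumptions, and the bandpass and roll-off steps are deliberately left out. The test signals are synthetic sine tones, not EMG.

```python
import math
from typing import List, Sequence


def remove_dc(signal: Sequence[float]) -> List[float]:
    """Step 1: subtract the channel mean."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]


def notch(signal: Sequence[float], f0: float, fs: float, q: float = 30.0) -> List[float]:
    """Step 2: second-order IIR notch centred on f0 (Q is an assumed value)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]   # normalized feed-forward taps
    a = [-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]   # normalized feedback taps
    out: List[float] = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in signal:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


def preprocess(signal: Sequence[float], fs: float = 2000.0) -> List[float]:
    """DC removal followed by notches at 50 Hz and 100 Hz."""
    cleaned = remove_dc(signal)
    for f0 in (50.0, 100.0):
        cleaned = notch(cleaned, f0, fs)
    return cleaned


def rms(values: Sequence[float]) -> float:
    """Root-mean-square amplitude."""
    return math.sqrt(sum(v * v for v in values) / len(values))


fs = 2000.0
t = [i / fs for i in range(4000)]  # two seconds of synthetic signal at 2 kHz
hum = [0.5 + math.sin(2 * math.pi * 50.0 * ti) for ti in t]    # offset plus mains tone
tone = [0.5 + math.sin(2 * math.pi * 200.0 * ti) for ti in t]  # offset plus in-band tone

hum_rms = rms(preprocess(hum)[-1000:])    # measured after the filters have settled
tone_rms = rms(preprocess(tone)[-1000:])
```

On these synthetic inputs the 50 Hz tone is suppressed once the filter transient has decayed, while the 200 Hz tone passes at close to its original amplitude. A production pipeline would more likely use a tested DSP library and zero-phase filtering.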
- [5] Conv1d: 16→256, kernel 11, stride 5
- [6] Conv1d: 256→256, kernel 5, stride 2
- [7] TDS Stage 1: in-conv kernel 9, stride 5; 1 block of TDSConv (kernel width 5, channels 8, feature width 32)
- [8] TDS Stage 2: in-conv kernel 3, stride 1; 1 block of TDSConv (kernel width 3, channels 8, feature width 32). The final output is a (256, 146) feature map, corresponding to an effective frame rate of approximately 37 Hz with a receptive field spanning the full input window. C.2 Vision and Fusion Training Details. Vision baseline. We provide six fully fine-tuned ...
- [9] The MANO mesh vertices are transformed from world coordinates to camera coordinates using the calibrated extrinsic matrix ($v_{\text{cam}} = R_{C \leftarrow W}\, v_W + t_{C \leftarrow W}$).
- [10] All mesh triangles are rasterized into a depth buffer (z-buffer) at the native video resolution via pinhole projection with the calibrated intrinsic matrix $K$. For each pixel covered by a triangle, the interpolated depth is stored if it is smaller than the current buffer value.
- [11] A vertex is deemed visible if, within a 5×5 pixel neighbourhood of its projected location, any z-buffer entry agrees with the vertex depth to within ε = 5 mm.
- [12] Each triangle's 3-D surface area (half the cross-product norm of two edge vectors) is distributed equally to its three vertices, yielding a per-vertex area weight $w_i$.
  [Figure 12 plot: fusion MAE (°) against vision-only MAE (°), both axes spanning 4 to 12, with regions marked "Fusion worse" and "Fusion better"; labelled gestures ASL7, FingerPullRight, HandClasp, ASL9; legend: single-hand, symmetric bimanual, asymmetric bimanual.] Figure 12: Per-g...
- [13] The self-occlusion score is defined as
  $$s_{\text{occ}} = 1 - \frac{\sum_{i \in V_{\text{vis}}} w_i}{\sum_{i=1}^{V} w_i}, \qquad (4)$$
  where $V_{\text{vis}}$ is the set of visible vertices. A score of 0 means the hand is fully visible; a score of 1 means it is fully occluded from the camera viewpoint. Results. Figure 13(a) shows that both vision-only and fusion MAE vary with occlusion, while the gap between them, the fusion gain...
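The area weighting and the self-occlusion score described above reduce to a few lines of arithmetic. The sketch below follows the stated definitions (one third of each triangle's area per vertex, then one minus the visible share of total weight); the toy square mesh, the function names, and the hand-picked visible sets are assumptions standing in for a MANO mesh and the z-buffer visibility test.

```python
import math
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[int, int, int]


def vertex_area_weights(vertices: Sequence[Point], triangles: Sequence[Triangle]) -> List[float]:
    """Give each vertex one third of the area of every triangle it belongs to."""
    weights = [0.0] * len(vertices)
    for i, j, k in triangles:
        ax, ay, az = vertices[i]
        e1 = (vertices[j][0] - ax, vertices[j][1] - ay, vertices[j][2] - az)
        e2 = (vertices[k][0] - ax, vertices[k][1] - ay, vertices[k][2] - az)
        cross = (
            e1[1] * e2[2] - e1[2] * e2[1],
            e1[2] * e2[0] - e1[0] * e2[2],
            e1[0] * e2[1] - e1[1] * e2[0],
        )
        area = 0.5 * math.sqrt(sum(c * c for c in cross))  # half the cross-product norm
        for idx in (i, j, k):
            weights[idx] += area / 3.0
    return weights


def self_occlusion_score(weights: Sequence[float], visible: Set[int]) -> float:
    """One minus the visible share of the total vertex area weight."""
    total = sum(weights)
    seen = sum(w for idx, w in enumerate(weights) if idx in visible)
    return 1.0 - seen / total


# Toy mesh: a unit square split into two triangles (not a MANO mesh).
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
w = vertex_area_weights(square, faces)

fully_visible = self_occlusion_score(w, {0, 1, 2, 3})
fully_hidden = self_occlusion_score(w, set())
partly_hidden = self_occlusion_score(w, {0, 2})
```

In the toy mesh, vertices 0 and 2 belong to both triangles and so carry twice the weight of vertices 1 and 3, which is why hiding the two lighter vertices yields a score of one third, not one half.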
- [14] Risk disclosure to participants was part of the consent process
  Institutional review board (IRB) approvals or equivalent for research with human subjects. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country ...