MediaPipe Hands: On-device Real-time Hand Tracking
We present a real-time on-device hand tracking pipeline that predicts a hand skeleton from a single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector and 2) a hand landmark model. It is implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.
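The two-model structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the MediaPipe implementation: `Box`, `crop`, `track_hands`, `fake_palm_detector`, and `fake_landmark_model` are hypothetical names, the image is a plain nested list, and both "models" are stubs returning fixed values. The sketch assumes the landmark model returns coordinates normalized to the crop, and it uses 21 landmarks per hand, which matches the MediaPipe Hands output. Only the detector-then-landmark ordering comes from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy image: rows of pixel values
Point = Tuple[float, float]      # (x, y) in image pixels


@dataclass
class Box:
    """Axis-aligned palm region in pixel coordinates (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of the image."""
    return [row[box.x:box.x + box.w] for row in image[box.y:box.y + box.h]]


def track_hands(
    image: Image,
    detect_palms: Callable[[Image], List[Box]],
    predict_landmarks: Callable[[Image], List[Point]],
) -> List[List[Point]]:
    """Stage 1 finds palm boxes; stage 2 predicts landmarks inside each crop."""
    hands = []
    for box in detect_palms(image):
        local = predict_landmarks(crop(image, box))
        # Assumption: the landmark model returns (u, v) in [0, 1] relative
        # to the crop, so map each point back to full-image pixels.
        hands.append([(box.x + u * box.w, box.y + v * box.h) for u, v in local])
    return hands


# Stub models standing in for the real networks (fixed outputs, assumptions).
def fake_palm_detector(image: Image) -> List[Box]:
    return [Box(x=10, y=20, w=40, h=40)]


def fake_landmark_model(patch: Image) -> List[Point]:
    return [(0.5, 0.5)] * 21  # 21 landmarks, all at the crop centre


frame = [[0] * 64 for _ in range(64)]
hands = track_hands(frame, fake_palm_detector, fake_landmark_model)
```

Passing the two models in as callables keeps the pipeline logic independent of any particular inference runtime, so real detector and landmark networks could be swapped in for the stubs.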
Forward citations
Cited by 29 Pith papers
-
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
-
SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
-
DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild
DIPSER supplies multi-view RGB video and smartwatch data from natural in-person classes with attention and emotion labels from self-report plus four experts, including underrepresented ethnicities.
-
Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization
Adversarial Posture Regularization matches RL policy posture distributions to casual human piano-playing data to enforce human-like kinematics in dexterous hands, outperforming baselines on cPSI, BSE, and FAC metrics.
-
Contextual Role Modulates Object Representational Geometry in the Human Brain
fMRI during naturalistic movies shows context-dependent double dissociation in object representational geometry: action targets organized by affordances in parietal networks, passive objects by semantics in occipito-t...
-
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
MobileEgo Anywhere releases an open mobile app, processing pipeline, and 200-hour dataset for long-horizon egocentric data collection on commodity hardware to support vision-language-action model training.
-
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
MobileEgo Anywhere supplies an open mobile app, 200-hour long-form egocentric dataset, and processing pipeline that enables collection of persistent-state egocentric trajectories on commodity hardware for VLA and foun...
-
Decoding High-Dimensional Finger Motion from EMG Using Riemannian Features and RNNs
TRR combines multi-band Riemannian features with a GRU to decode high-dimensional finger kinematics from EMG, achieving 9.79° intra-subject and 16.71° cross-subject average absolute error while running at ~10 Hz on a ...
-
VRSafe: A Secure Virtual Keyboard to Mitigate Keystroke Inference in Virtual Reality
VRSafe adds false positive keystrokes to VR typing data to reduce keystroke inference attack accuracy and includes an efficient malicious login detector.
-
Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay
A data-free class-incremental learning method for gesture recognition using prototype-guided pseudo feature replay with four components that achieves 11.8% and 12.8% mean global accuracy gains on SHREC 2017 3D and Ego...
-
Cross-view Multimodal Vision-Based Assessment Framework for Traditional Chinese Medicine Rehabilitation Training
A cross-view multimodal AQA framework for TCM rehabilitation training that reports over 10% relative F1 improvement on dual-view datasets for tasks like needle depth and insertion time.
-
Hand Trajectory Fusion for Egocentric Natural Language Query Grounding
Hand trajectory encoding fused with video-text features via cross-attention improves Ego4D NLQ grounding performance, with largest gains on hand-object interaction and quantity/state queries.
-
RoboHitch: Learning Visual Affordance from Disordered Keypoints for Hitch Knots Tying
A learning framework that predicts pick-and-place affordances for hitch knots from unordered keypoints and images via graph and convolutional autoencoders fused by cross-attention.
-
AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking
AVI-HT adaptively fuses vision and IMU data via attention to cut 3D hand keypoint error by 16.1% (24.2% wrist-aligned) on a new 100K+ sample DexGloveHOI dataset in occluded hand-object scenarios.
-
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
MobileEgo Anywhere provides an open infrastructure and 200-hour dataset for collecting long-horizon egocentric trajectories on commodity phones to support VLA model training.
-
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
MobileEgo Anywhere releases an open-source smartphone app, 200-hour egocentric dataset with persistent tracking, and pipeline to enable long-horizon data collection for VLA and foundation model research on commodity hardware.
-
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
An open framework with a free smartphone app, STERA pipeline, and 200-hour dataset enables hour-plus egocentric data collection on commodity hardware and demonstrates utility by lowering VLA action-prediction error af...
-
Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
Prompt-driven image-to-video generation produces deictic gestures that match real data visually, add useful variety, and improve downstream recognition models when mixed with human recordings.
-
Real-Time Hand Gesture Recognition: Integrating Skeleton-Based Data Fusion and Multi-Stream CNN
Skeleton data from hand gestures is fused into RGB images and classified by an e2eET multi-stream CNN, yielding competitive accuracy on five datasets and real-time operation on consumer hardware.
-
SignVLA: Real-Time Sign Language-Guided Robotic Manipulation via Attention LSTM and Vision-Language-Action Models
SignVLA adds a sign-to-text module with attention LSTM and temporal stabilization to enable sign-language input for VLA-based robotic manipulation tasks.
-
Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications
Hand pose estimation accuracy generalizes to hand-impaired populations from spinal cord injury with negligible effects from object occlusions.
-
EaDex: A Cross-Embodiment Dexterous Manipulation Framework from Low-Cost Demonstrations
EaDex combines single-camera RGB-D capture, MANO retargeting, and dynamic demonstration annealing to achieve 55.3% relative improvement over baseline on nine cross-embodiment dexterous object-opening tasks across three hands.
-
Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
Multi-stage training that first mixes real and inpainted synthetic hand images then fine-tunes on real data improves mAP on glove-wearing test images over real-only baselines.
-
Contextual Role Modulates Object Representational Geometry in the Human Brain
fMRI during naturalistic movies shows context (action target vs passive) remaps object representational geometry via double dissociation in affordance vs semantic dimensions within selective networks.
-
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
MobileEgo Anywhere releases a 200-hour long-form egocentric dataset with persistent state tracking plus the STERA open infrastructure and processing pipeline to convert commodity mobile captures into training-ready fo...
-
Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics
An egocentric vision pipeline with MediaPipe hand tracking and damped-least-squares IK achieves 86.7% success on structured pick-and-place for the SO-ARM101 robot but falls to 9.3% in real-world environments with occlusions.
-
Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance
An empirical evaluation of a multi-modal touch detector using MediaPipe, HSV skin filtering, motion differencing, and Canny edges finds low F1 scores on staged video and excessive false positives on real videos, concl...
-
Touchless Intraoperative Image Access System Based on Vision-Based Hand Tracking
A vision-based system maps real-time hand gestures from a single camera to image translation, rotation, and zoom commands for touchless intraoperative navigation.
-
Real-Time Cellist Postural Evaluation With On-Device Computer Vision
Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.