MediaPipe Hands: On-device Real-time Hand Tracking

Andrei Tkachenka; Andrey Vakunov; Chuo-Ling Chang; Fan Zhang; George Sung; Matthias Grundmann; Valentin Bazarevsky

arxiv: 2006.10214 · v1 · pith:J4RU3MJHnew · submitted 2020-06-18 · 💻 cs.CV

classification 💻 cs.CV

keywords: hand, mediapipe, pipeline, real-time, hands, model, on-device, tracking

We present a real-time on-device hand tracking pipeline that predicts a hand skeleton from a single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark model. It is implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.
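
The two-model structure described in the abstract can be illustrated with a minimal sketch. This is not the MediaPipe implementation or its API: `HandPipeline`, `stub_palm_detector`, and `stub_landmark_model` are hypothetical names, the frame is a plain nested list rather than an RGB image, and both "models" are toy stand-ins. The sketch only shows the data flow, in which a detector proposes a hand region and a landmark model predicts points inside that region, which are then mapped back to full-frame coordinates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[Sequence[float]]  # hypothetical stand-in for a camera frame (rows of pixels)
Box = Tuple[int, int, int, int]    # (x, y, width, height) in pixels
Point = Tuple[float, float]        # (x, y) coordinates


@dataclass
class HandPipeline:
    """Two-stage sketch: a palm detector proposes a box, a landmark model works on the crop."""
    detect_palm: Callable[[Frame], Optional[Box]]
    predict_landmarks: Callable[[Frame], List[Point]]  # points normalized to [0, 1] within the crop

    def run(self, frame: Frame) -> Optional[List[Point]]:
        box = self.detect_palm(frame)
        if box is None:
            return None  # no hand region proposed for this frame
        x, y, w, h = box
        crop = [row[x:x + w] for row in frame[y:y + h]]
        # Map crop-relative predictions back to full-frame pixel coordinates.
        return [(x + u * w, y + v * h) for u, v in self.predict_landmarks(crop)]


def stub_palm_detector(frame: Frame) -> Optional[Box]:
    """Toy detector (assumption): bounding box of all non-zero pixels."""
    coords = [(c, r) for r, row in enumerate(frame) for c, v in enumerate(row) if v > 0]
    if not coords:
        return None
    xs = [c for c, _ in coords]
    ys = [r for _, r in coords]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


def stub_landmark_model(crop: Frame) -> List[Point]:
    """Toy landmark model (assumption): a single point at the crop center."""
    return [(0.5, 0.5)]


if __name__ == "__main__":
    frame = [[0] * 6 for _ in range(6)]
    for r in (2, 3):
        for c in (1, 2, 3, 4):
            frame[r][c] = 1
    pipeline = HandPipeline(stub_palm_detector, stub_landmark_model)
    print(pipeline.run(frame))
```

Keeping the two stages behind separate callables means either one can be swapped for a real model without touching the coordinate mapping in `run`.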

This paper has not been read by Pith yet.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
cs.RO 2026-05 unverdicted novelty 7.0

GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
cs.HC 2026-05 unverdicted novelty 7.0

SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild
cs.CV 2025-02 unverdicted novelty 7.0

DIPSER supplies multi-view RGB video and smartwatch data from natural in-person classes with attention and emotion labels from self-report plus four experts, including underrepresented ethnicities.
Enforcing Human-like Kinematics in Dexterous Piano Playing via Adversarial Posture Regularization
cs.RO 2026-06 unverdicted novelty 6.0

Adversarial Posture Regularization matches RL policy posture distributions to casual human piano-playing data to enforce human-like kinematics in dexterous hands, outperforming baselines on cPSI, BSE, and FAC metrics.
Contextual Role Modulates Object Representational Geometry in the Human Brain
q-bio.NC 2026-05 unverdicted novelty 6.0

fMRI during naturalistic movies shows context-dependent double dissociation in object representational geometry: action targets organized by affordances in parietal networks, passive objects by semantics in occipito-t...
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
cs.CV 2026-05 unverdicted novelty 6.0

MobileEgo Anywhere releases an open mobile app, processing pipeline, and 200-hour dataset for long-horizon egocentric data collection on commodity hardware to support vision-language-action model training.
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
cs.CV 2026-05 unverdicted novelty 6.0

MobileEgo Anywhere supplies an open mobile app, 200-hour long-form egocentric dataset, and processing pipeline that enables collection of persistent-state egocentric trajectories on commodity hardware for VLA and foun...
Decoding High-Dimensional Finger Motion from EMG Using Riemannian Features and RNNs
cs.LG 2026-04 unverdicted novelty 6.0

TRR combines multi-band Riemannian features with a GRU to decode high-dimensional finger kinematics from EMG, achieving 9.79° intra-subject and 16.71° cross-subject average absolute error while running at ~10 Hz on a ...
VRSafe: A Secure Virtual Keyboard to Mitigate Keystroke Inference in Virtual Reality
cs.CR 2026-04 unverdicted novelty 6.0

VRSafe adds false positive keystrokes to VR typing data to reduce keystroke inference attack accuracy and includes an efficient malicious login detector.
Data-Free Class-Incremental Gesture Recognition with Prototype-Guided Pseudo Feature Replay
cs.CV 2025-05 unverdicted novelty 6.0

A data-free class-incremental learning method for gesture recognition using prototype-guided pseudo feature replay with four components that achieves 11.8% and 12.8% mean global accuracy gains on SHREC 2017 3D and Ego...
Cross-view Multimodal Vision-Based Assessment Framework for Traditional Chinese Medicine Rehabilitation Training
cs.CV 2026-06 unverdicted novelty 5.0

A cross-view multimodal AQA framework for TCM rehabilitation training that reports over 10% relative F1 improvement on dual-view datasets for tasks like needle depth and insertion time.
Hand Trajectory Fusion for Egocentric Natural Language Query Grounding
cs.CV 2026-06 unverdicted novelty 5.0

Hand trajectory encoding fused with video-text features via cross-attention improves Ego4D NLQ grounding performance, with largest gains on hand-object interaction and quantity/state queries.
RoboHitch: Learning Visual Affordance from Disordered Keypoints for Hitch Knots Tying
cs.RO 2026-05 unverdicted novelty 5.0

A learning framework that predicts pick-and-place affordances for hitch knots from unordered keypoints and images via graph and convolutional autoencoders fused by cross-attention.
AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking
cs.CV 2026-05 unverdicted novelty 5.0

AVI-HT adaptively fuses vision and IMU data via attention to cut 3D hand keypoint error by 16.1% (24.2% wrist-aligned) on a new 100K+ sample DexGloveHOI dataset in occluded hand-object scenarios.
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
cs.CV 2026-05 unverdicted novelty 5.0

MobileEgo Anywhere provides an open infrastructure and 200-hour dataset for collecting long-horizon egocentric trajectories on commodity phones to support VLA model training.
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
cs.CV 2026-05 unverdicted novelty 5.0

MobileEgo Anywhere releases an open-source smartphone app, 200-hour egocentric dataset with persistent tracking, and pipeline to enable long-horizon data collection for VLA and foundation model research on commodity hardware.
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
cs.CV 2026-05 unverdicted novelty 5.0

An open framework with a free smartphone app, STERA pipeline, and 200-hour dataset enables hour-plus egocentric data collection on commodity hardware and demonstrates utility by lowering VLA action-prediction error af...
Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation
cs.CV 2026-04 unverdicted novelty 5.0

Prompt-driven image-to-video generation produces deictic gestures that match real data visually, add useful variety, and improve downstream recognition models when mixed with human recordings.
Real-Time Hand Gesture Recognition: Integrating Skeleton-Based Data Fusion and Multi-Stream CNN
cs.CV 2024-06 unverdicted novelty 5.0

Skeleton data from hand gestures is fused into RGB images and classified by an e2eET multi-stream CNN, yielding competitive accuracy on five datasets and real-time operation on consumer hardware.
SignVLA: Real-Time Sign Language-Guided Robotic Manipulation via Attention LSTM and Vision-Language-Action Models
cs.AI 2026-06 unverdicted novelty 4.0

SignVLA adds a sign-to-text module with attention LSTM and temporal stabilization to enable sign-language input for VLA-based robotic manipulation tasks.
Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications
cs.CV 2026-06 unverdicted novelty 4.0

Hand pose estimation accuracy generalizes to hand-impaired populations from spinal cord injury with negligible effects from object occlusions.
EaDex: A Cross-Embodiment Dexterous Manipulation Framework from Low-Cost Demonstrations
cs.RO 2026-06 unverdicted novelty 4.0

EaDex combines single-camera RGB-D capture, MANO retargeting, and dynamic demonstration annealing to achieve 55.3% relative improvement over baseline on nine cross-embodiment dexterous object-opening tasks across three hands.
Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
cs.CV 2026-06 unverdicted novelty 4.0

Multi-stage training that first mixes real and inpainted synthetic hand images then fine-tunes on real data improves mAP on glove-wearing test images over real-only baselines.
Contextual Role Modulates Object Representational Geometry in the Human Brain
q-bio.NC 2026-05 unverdicted novelty 4.0

fMRI during naturalistic movies shows context (action target vs passive) remaps object representational geometry via double dissociation in affordance vs semantic dimensions within selective networks.
MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware
cs.CV 2026-05 unverdicted novelty 4.0

MobileEgo Anywhere releases a 200-hour long-form egocentric dataset with persistent state tracking plus the STERA open infrastructure and processing pipeline to convert commodity mobile captures into training-ready fo...
Vision-Based Hand Shadowing for Robotic Manipulation via Inverse Kinematics
cs.RO 2026-03 conditional novelty 4.0

An egocentric vision pipeline with MediaPipe hand tracking and damped-least-squares IK achieves 86.7% success on structured pick-and-place for the SO-ARM101 robot but falls to 9.3% in real-world environments with occlusions.
Empirical Evaluation of Multi-Modal Touch Detection in Over-the-Shoulder Video Surveillance
cs.CV 2026-06 accept novelty 3.0

An empirical evaluation of a multi-modal touch detector using MediaPipe, HSV skin filtering, motion differencing, and Canny edges finds low F1 scores on staged video and excessive false positives on real videos, concl...
Touchless Intraoperative Image Access System Based on Vision-Based Hand Tracking
cs.CV 2026-04 unverdicted novelty 3.0

A vision-based system maps real-time hand gestures from a single camera to image translation, rotation, and zoom commands for touchless intraoperative navigation.
Real-Time Cellist Postural Evaluation With On-Device Computer Vision
cs.HC 2026-04 unverdicted novelty 3.0

Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.