BlazePose: On-device Real-time Body Pose tracking
read the original abstract
We present BlazePose, a lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices. During inference, the network produces 33 body keypoints for a single person and runs at over 30 frames per second on a Pixel 2 phone. This makes it particularly suited to real-time use cases like fitness tracking and sign language recognition. Our main contributions include a novel body pose tracking solution and a lightweight body pose estimation neural network that uses both heatmaps and regression to keypoint coordinates.
This paper has not been read by Pith yet.
Forward citations
Cited by 8 Pith papers
-
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
The paper presents AIGaitor, a privacy-preserving on-device monocular motion analysis system that performs end-to-end pose estimation and deep learning gait analysis on consumer smartphones.
-
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
AIGaitor is the first claimed end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.
-
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation...
-
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation featuring four dimensions, challenging scenarios, and an adaptive hybrid evaluation framework that achieves 91.5% Spearman correlati...
-
BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton
BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.
-
DIPSER: A Dataset for In-Person Student Engagement Recognition in the Wild
DIPSER supplies multi-view RGB video and smartwatch data from natural in-person classes with attention and emotion labels from self-report plus four experts, including underrepresented ethnicities.
-
From Multimodal Signals to Adaptive XR Experiences for De-escalation Training
An early multimodal XR prototype fuses five signal streams with an interpretation layer to detect escalation cues and enable adaptive de-escalation training.
-
Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model
Facial emotion embeddings improve short-term pose forecasting accuracy for emotion-driven motions when fused via normalized gating in a lightweight LSTM world model, but not with simple multimodal fusion.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.