MediaPipe: A Framework for Building Perception Pipelines

Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays · 2019 · cs.DC · arXiv 1906.08172

28 Pith papers cite this work. Polarity classification is still indexing.

abstract

Building applications that perceive the world around them is challenging. A developer needs to (a) select and develop corresponding machine learning algorithms and models, (b) build a series of prototypes and demos, (c) balance resource consumption against the quality of the solutions, and finally (d) identify and mitigate problematic cases. The MediaPipe framework addresses all of these challenges. A developer can use MediaPipe to build prototypes by combining existing perception components, to advance them to polished cross-platform applications and measure system performance and resource consumption on target platforms. We show that these features enable a developer to focus on the algorithm or model development and use MediaPipe as an environment for iteratively improving their application with results reproducible across different devices and platforms. MediaPipe will be open-sourced at https://github.com/google/mediapipe.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.

SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.

Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

First 3D SMPL-X annotations for the Ishara-500 Saudi Sign Language dataset plus a specialized monocular reconstruction pipeline claiming up to 32% hand accuracy gains.

D-Rex : Diffusion Rendering for Relightable Expressive Avatars

cs.GR · 2026-04-30 · conditional · novelty 7.0

D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.

Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

A new intervention-based SSL paradigm for rPPG uses video editing and falsifiability checks to learn the true physiological signal instead of dominant artifacts.

Face Anything: 4D Face Reconstruction from Any Image Sequence

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.

AvatarPointillist: AutoRegressive 4D Gaussian Avatarization

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.

SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.

FaceValue: Exploring Real-Time Self-View Overlays to Prompt Meaning-Oriented Self-Awareness in Remote Meetings

cs.HC · 2026-04-30 · unverdicted · novelty 6.0

A technology probe called FaceValue uses real-time self-view overlays to support meaning-oriented self-awareness in remote meetings, with participants reporting increased cue awareness and communication improvements.

FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

A multimodal CNN on 87,547 Vogue images classifies fashion houses at 78.2% top-1 accuracy, decades at 88.6%, and years at 58.3% with 2.2-year mean error, and shows texture and luminance carry most of the house-identity signal.

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.

Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection

cs.CV · 2026-04-18 · unverdicted · novelty 6.0

BioLip detects lip-sync deepfakes via temporal lip jitter, a measurable elevation in lip position variance caused by generative models violating biomechanical articulation constraints.

AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

AIFIND stabilizes incremental face forgery detection by aligning volatile features to invariant semantic anchors from low-level artifacts using attention and harmonization modules.

Bootstrapping Sign Language Annotations with Sign Language Models

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

A pseudo-annotation pipeline combines fingerspelling and isolated sign recognizers with K-Shot LLM estimation to produce ranked time-aligned gloss annotations from signed video and English input.

A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

A replay pipeline on a 3D eye simulator generates 144 sessions of synthetic eye movement video that preserves source temporal dynamics for script-reading detection.

Facial-Expression-Aware Prompting for Empathetic LLM Tutoring

cs.HC · 2026-03-10 · unverdicted · novelty 6.0

Facial expression signals via prompt integration improve empathetic responsiveness in LLM-based tutoring systems.

Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos

cs.CV · 2026-01-11 · conditional · novelty 6.0

VLMs exhibit demographic biases in occupation and salary decisions even when only faces are altered in otherwise identical real photos.

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

cs.RO · 2024-10-08 · unverdicted · novelty 6.0

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

A subject-invariant cross-modal prompt-tuning method with decoupled shared-specific adapters fuses facial and rPPG features in a frozen ViT to improve video-based emotion recognition accuracy and cross-subject generalization on MAHNOB-HCI and DEAP.

Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior

cs.RO · 2026-05-05 · unverdicted · novelty 5.0 · 2 refs

Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.

UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks

cs.CR · 2026-04-25 · unverdicted · novelty 5.0

UNSEEN combines AR access control, LLM unlearning to suppress profiles, and agent guardrails to defend against AR-LLM social engineering attacks, tested in a 60-person user study with 360 conversations.

Sentiment Analysis of German Sign Language Fairy Tales

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

A new dataset and XGBoost model predict sentiment in German Sign Language fairy tale videos from motion features at 0.631 balanced accuracy, showing body movements contribute equally to facial ones.

HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining computational efficiency for real-time use.

On Optimizing Electrode Configuration for Wrist-Worn sEMG-Based Thumb Gesture Recognition

cs.HC · 2026-04-06 · unverdicted · novelty 5.0

Extensor-side monopolar electrodes outperform flexor-side and bipolar setups for wrist sEMG thumb gesture recognition, with performance rising but leveling off as channel count increases.

citing papers explorer

Showing 28 of 28 citing papers.

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras cs.CV · 2026-05-12 · unverdicted · none · ref 51 · internal anchor
SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition cs.HC · 2026-05-07 · unverdicted · none · ref 57 · internal anchor
Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video cs.CV · 2026-05-06 · unverdicted · none · ref 16 · internal anchor
D-Rex : Diffusion Rendering for Relightable Expressive Avatars cs.GR · 2026-04-30 · conditional · none · ref 28 · internal anchor
Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography cs.CV · 2026-04-26 · unverdicted · none · ref 24 · internal anchor
Face Anything: 4D Face Reconstruction from Any Image Sequence cs.CV · 2026-04-21 · unverdicted · none · ref 36 · internal anchor
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization cs.CV · 2026-04-06 · unverdicted · none · ref 43 · internal anchor
SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages cs.CV · 2026-05-03 · unverdicted · none · ref 9 · internal anchor
FaceValue: Exploring Real-Time Self-View Overlays to Prompt Meaning-Oriented Self-Awareness in Remote Meetings cs.HC · 2026-04-30 · unverdicted · none · ref 68 · internal anchor
FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing cs.CV · 2026-04-29 · unverdicted · none · ref 20 · internal anchor
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation cs.CV · 2026-04-21 · unverdicted · none · ref 29 · internal anchor
Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection cs.CV · 2026-04-18 · unverdicted · none · ref 3 · internal anchor
AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection cs.CV · 2026-04-17 · unverdicted · none · ref 21 · internal anchor
Bootstrapping Sign Language Annotations with Sign Language Models cs.CV · 2026-04-08 · unverdicted · none · ref 28 · internal anchor
A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator cs.CV · 2026-04-07 · unverdicted · none · ref 10 · internal anchor
Facial-Expression-Aware Prompting for Empathetic LLM Tutoring cs.HC · 2026-03-10 · unverdicted · none · ref 14 · internal anchor
Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos cs.CV · 2026-01-11 · conditional · none · ref 9 · internal anchor
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation cs.RO · 2024-10-08 · unverdicted · none · ref 13 · internal anchor
Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition cs.CV · 2026-05-07 · unverdicted · none · ref 49 · internal anchor
Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior cs.RO · 2026-05-05 · unverdicted · none · ref 23 · 2 links · internal anchor
UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks cs.CR · 2026-04-25 · unverdicted · none · ref 25 · internal anchor
Sentiment Analysis of German Sign Language Fairy Tales cs.CL · 2026-04-17 · unverdicted · none · ref 12 · internal anchor
HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment cs.CV · 2026-04-09 · unverdicted · none · ref 27 · internal anchor
On Optimizing Electrode Configuration for Wrist-Worn sEMG-Based Thumb Gesture Recognition cs.HC · 2026-04-06 · unverdicted · none · ref 21 · internal anchor
Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction cs.CV · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
A robot detects initiation of interaction via audio-visual fusion of speech localization and face/gaze cues, implemented as a state machine in ROS and tested on a mobile platform.
Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model cs.CV · 2026-04-26 · unverdicted · none · ref 20 · internal anchor
Facial emotion embeddings improve short-term pose forecasting accuracy for emotion-driven motions when fused via normalized gating in a lightweight LSTM world model, but not with simple multimodal fusion.
Real-Time Cellist Postural Evaluation With On-Device Computer Vision cs.HC · 2026-04-19 · unverdicted · none · ref 10 · internal anchor
Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.
AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering cs.CE · 2026-04-07 · unverdicted · none · ref 35 · internal anchor
A modular XR platform integrates Whisper, NLLB, AWS Polly, RoBERTa, flan-t5, and MediaPipe to deliver real-time multilingual and International Sign support for education, with benchmarks showing AWS Polly's low latency and EuroLLM's higher BLEU score.
