COSM is a cooperative scheduling framework for concurrent PIM and CPU execution on mobile devices that hides PIM latency and overlaps execution with data transfer, achieving up to 2.8x PIM throughput with less than 2% CPU performance loss.
hub Mixed citations
MediaPipe: A Framework for Building Perception Pipelines
Mixed citation behavior. Most common role is background (60%).
abstract
Building applications that perceive the world around them is challenging. A developer needs to (a) select and develop corresponding machine learning algorithms and models, (b) build a series of prototypes and demos, (c) balance resource consumption against the quality of the solutions, and finally (d) identify and mitigate problematic cases. The MediaPipe framework addresses all of these challenges. A developer can use MediaPipe to build prototypes by combining existing perception components, to advance them to polished cross-platform applications and measure system performance and resource consumption on target platforms. We show that these features enable a developer to focus on the algorithm or model development and use MediaPipe as an environment for iteratively improving their application with results reproducible across different devices and platforms. MediaPipe will be open-sourced at https://github.com/google/mediapipe.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.
CHOIR reconstructs articulated hand motion, object shape with 6D pose, and contact from monocular videos via coarse initialization, generative spatial rectification, and contact-aware joint optimization.
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.
SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
Tamaththul3D releases SMPL-X annotations for the Ishara-500 dataset and a decoupled reconstruction pipeline using geometric inverse kinematics that cuts hand error by up to 32% and runs 32x faster while generalizing across sign languages.
D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.
A new intervention-based SSL paradigm for rPPG uses video editing and falsifiability checks to learn the true physiological signal instead of dominant artifacts.
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
DeepSpeak provides over 100 hours of consented, identity-matched real and modern deepfake audiovisual content focused on talking heads, with evaluations showing existing detectors fail to generalize without retraining.
MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.
A cascaded LoRA diffusion pipeline in UV space with cross-intrinsic attention and differentiable BRDF shading produces 4K PBR avatar assets from single images after training on under 100 scans.
EMOSH proposes an Expressive Human Model with disentangled parameters, coarse-to-fine motion injection, and spatially-aligned conditioning to generate high-fidelity expressive human videos without driving-subject shape leakage.
Bengal-HP_RU is the first publicly available head pose dataset for Bengali subjects, with 12,894 images collected from Wikimedia Commons and partitioned by uploader identity.
Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.
CogPortrait uses MLLM-based hierarchical planning to convert high-level labels into eye keypoints and a conditioned DiT model to produce portrait animations with improved eye-region accuracy on the new EMH benchmark.
PaintCopilot models painting as an open-ended autoregressive process that predicts coherent brushstrokes from partial canvas observations using a ViT target predictor, flow-matching stroke generator, and VAE region sampler.
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
A technology probe called FaceValue uses real-time self-view overlays to support meaning-oriented self-awareness in remote meetings, with participants reporting increased cue awareness and communication improvements.
A multimodal CNN on 87,547 Vogue images classifies fashion houses at 78.2% top-1 accuracy, decades at 88.6%, and years at 58.3% with 2.2-year mean error, and shows texture and luminance carry most of the house-identity signal.
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
AIFIND stabilizes incremental face forgery detection by aligning volatile features to invariant semantic anchors from low-level artifacts using attention and harmonization modules.
A pseudo-annotation pipeline combines fingerspelling and isolated sign recognizers with K-Shot LLM estimation to produce ranked time-aligned gloss annotations from signed video and English input.
citing papers explorer
-
COSM: A Cooperative Scheduling Framework for Concurrent PIM and CPU Execution on Mobile Devices
COSM is a cooperative scheduling framework for concurrent PIM and CPU execution on mobile devices that hides PIM latency and overlaps execution with data transfer, achieving up to 2.8x PIM throughput with less than 2% CPU performance loss.
-
Recognizing Co-Speech Gestures in-the-Wild
Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.
-
CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction
CHOIR reconstructs articulated hand motion, object shape with 6D pose, and contact from monocular videos via coarse initialization, generative spatial rectification, and contact-aware joint optimization.
-
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
-
EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.
-
SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
-
Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video
Tamaththul3D releases SMPL-X annotations for the Ishara-500 dataset and a decoupled reconstruction pipeline using geometric inverse kinematics that cuts hand error by up to 32% and runs 32x faster while generalizing across sign languages.
-
D-Rex : Diffusion Rendering for Relightable Expressive Avatars
D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.
-
Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography
A new intervention-based SSL paradigm for rPPG uses video editing and falsifiability checks to learn the true physiological signal instead of dominant artifacts.
-
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
-
The DeepSpeak Dataset
DeepSpeak provides over 100 hours of consented, identity-matched real and modern deepfake audiovisual content focused on talking heads, with evaluations showing existing detectors fail to generalize without retraining.
-
MirrorPPR: Exemplar-Based Portrait Photo Retouching
MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.
-
Monocular Avatar Reconstruction via Cascaded Diffusion Priors and UV-Space Differentiable Shading
A cascaded LoRA diffusion pipeline in UV space with cross-intrinsic attention and differentiable BRDF shading produces 4K PBR avatar assets from single images after training on under 100 scans.
-
EMOSH: Expressive Motion and Shape Disentanglement for Human Animation
EMOSH proposes an Expressive Human Model with disentangled parameters, coarse-to-fine motion injection, and spatially-aligned conditioning to generate high-fidelity expressive human videos without driving-subject shape leakage.
-
Bengal-HP_RU: A Dataset of Bengal People For Head Pose Estimation
Bengal-HP_RU is the first publicly available head pose dataset for Bengali subjects, with 12,894 images collected from Wikimedia Commons and partitioned by uploader identity.
-
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification
Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.
-
CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning
CogPortrait uses MLLM-based hierarchical planning to convert high-level labels into eye keypoints and a conditioned DiT model to produce portrait animations with improved eye-region accuracy on the new EMH benchmark.
-
PaintCopilot: Modeling Painting as Autonomous Artistic Continuation
PaintCopilot models painting as an open-ended autoregressive process that predicts coherent brushstrokes from partial canvas observations using a ViT target predictor, flow-matching stroke generator, and VAE region sampler.
-
SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
-
FaceValue: Exploring Real-Time Self-View Overlays to Prompt Meaning-Oriented Self-Awareness in Remote Meetings
A technology probe called FaceValue uses real-time self-view overlays to support meaning-oriented self-awareness in remote meetings, with participants reporting increased cue awareness and communication improvements.
-
FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing
A multimodal CNN on 87,547 Vogue images classifies fashion houses at 78.2% top-1 accuracy, decades at 88.6%, and years at 58.3% with 2.2-year mean error, and shows texture and luminance carry most of the house-identity signal.
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
-
AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection
AIFIND stabilizes incremental face forgery detection by aligning volatile features to invariant semantic anchors from low-level artifacts using attention and harmonization modules.
-
Bootstrapping Sign Language Annotations with Sign Language Models
A pseudo-annotation pipeline combines fingerspelling and isolated sign recognizers with K-Shot LLM estimation to produce ranked time-aligned gloss annotations from signed video and English input.
-
A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator
A replay pipeline on a 3D eye simulator generates 144 sessions of synthetic eye movement video that preserves source temporal dynamics for script-reading detection.
-
Facial-Expression-Aware Prompting for Empathetic LLM Tutoring
Facial expression signals via prompt integration improve empathetic responsiveness in LLM-based tutoring systems.
-
Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos
VLMs exhibit demographic biases in occupation and salary decisions even when only faces are altered in otherwise identical real photos.
-
Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
DenseMarks learns a canonical 3D embedding space for human head images by training a Vision Transformer with contrastive loss on pairwise point tracks from in-the-wild videos, plus landmark and segmentation supervision.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
MindFlow: Harmonizing Cognitive Semantics and Acoustic Dynamics for Facial Animation Generation in Dyadic Conversations
MindFlow presents a neuroscience-inspired dual-stream generative model that uses chunk-state emotional modeling and conditional flow matching to produce facial animations with improved semantic fit and motion realism in dyadic conversations.
-
SAC$^2$-Net: Semantic Anchoring and Complementary-Consensus Fusion for Multimodal Micro-Expression Recognition
SAC²-Net uses semantic anchoring soft alignment and complementary-consensus fusion to report SOTA or competitive results on five MER benchmarks.
-
TRAFA: Anticipating User Actions to Reduce Errors in Procedural Tasks with Predictive Feedback
TRAFA introduces a Track-Forecast-Act pipeline for predictive feedback in procedural tasks and reports improved accuracy and efficiency over reactive systems in a user study.
-
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
-
Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition
A subject-invariant cross-modal prompt-tuning method with decoupled shared-specific adapters fuses facial and rPPG features in a frozen ViT to improve video-based emotion recognition accuracy and cross-subject generalization on MAHNOB-HCI and DEAP.
-
Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior
Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.
-
UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks
UNSEEN combines AR access control, LLM unlearning to suppress profiles, and agent guardrails to defend against AR-LLM social engineering attacks, tested in a 60-person user study with 360 conversations.
-
Sentiment Analysis of German Sign Language Fairy Tales
A new dataset and XGBoost model predict sentiment in German Sign Language fairy tale videos from motion features at 0.631 balanced accuracy, showing body movements contribute equally to facial ones.
-
HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining computational efficiency for real-time use.
-
On Optimizing Electrode Configuration for Wrist-Worn sEMG-Based Thumb Gesture Recognition
Extensor-side monopolar electrodes outperform flexor-side and bipolar setups for wrist sEMG thumb gesture recognition, with performance rising but leveling off as channel count increases.
-
A Comprehensive Survey of Action Quality Assessment: Method and Benchmark
This survey proposes a modality-driven hierarchical taxonomy for AQA methods, establishes a unified benchmark for video-based approaches across datasets, and outlines research trends and challenges.
-
Evaluating Idle Animation Believability: a User Perspective
Users cannot distinguish genuine from acted idle animations but perceive handmade and recorded ones differently; ReActIdle dataset released to simplify future recording.
-
Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction
A robot detects initiation of interaction via audio-visual fusion of speech localization and face/gaze cues, implemented as a state machine in ROS and tested on a mobile platform.
-
Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model
Facial emotion embeddings improve short-term pose forecasting accuracy for emotion-driven motions when fused via normalized gating in a lightweight LSTM world model, but not with simple multimodal fusion.
-
Real-Time Cellist Postural Evaluation With On-Device Computer Vision
Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.
-
AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering
A modular XR platform integrates Whisper, NLLB, AWS Polly, RoBERTa, flan-t5, and MediaPipe to deliver real-time multilingual and International Sign support for education, with benchmarks showing AWS Polly's low latency and EuroLLM's higher BLEU score.
- Face Anything: 4D Face Reconstruction from Any Image Sequence
- BioLip: Language-Generalizable Lip-Sync Deepfake Detection via Biomechanical Constraint Violation Modeling