Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.
Mixed citations
MediaPipe: A Framework for Building Perception Pipelines
Mixed citation behavior. Most common role is background (60%).
abstract
Building applications that perceive the world around them is challenging. A developer needs to (a) select and develop corresponding machine learning algorithms and models, (b) build a series of prototypes and demos, (c) balance resource consumption against the quality of the solutions, and finally (d) identify and mitigate problematic cases. The MediaPipe framework addresses all of these challenges. A developer can use MediaPipe to build prototypes by combining existing perception components, to advance them to polished cross-platform applications and measure system performance and resource consumption on target platforms. We show that these features enable a developer to focus on the algorithm or model development and use MediaPipe as an environment for iteratively improving their application with results reproducible across different devices and platforms. MediaPipe will be open-sourced at https://github.com/google/mediapipe.
representative citing papers
CHOIR reconstructs articulated hand motion, object shape with 6D pose, and contact from monocular videos via coarse initialization, generative spatial rectification, and contact-aware joint optimization.
LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested hardware.
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.
SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
Tamaththul3D releases SMPL-X annotations for the Ishara-500 dataset and a decoupled reconstruction pipeline using geometric inverse kinematics that cuts hand error by up to 32% and runs 32x faster while generalizing across sign languages.
D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.
A new intervention-based SSL paradigm for rPPG uses video editing and falsifiability checks to learn the true physiological signal instead of dominant artifacts.
AvatarPointillist autoregressively generates adaptive 3D point clouds via a Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
DeepSpeak provides over 100 hours of consented, identity-matched real and modern deepfake audiovisual content focused on talking heads, with evaluations showing existing detectors fail to generalize without retraining.
MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.
A cascaded LoRA diffusion pipeline in UV space with cross-intrinsic attention and differentiable BRDF shading produces 4K PBR avatar assets from single images after training on under 100 scans.
EMOSH proposes an Expressive Human Model with disentangled parameters, coarse-to-fine motion injection, and spatially-aligned conditioning to generate high-fidelity expressive human videos without driving-subject shape leakage.
Bengal-HP_RU is the first publicly available head pose dataset for Bengali subjects, with 12,894 images collected from Wikimedia Commons and partitioned by uploader identity.
Introduces the VIP identification task, releases the Temporal-VIP dataset, and presents the VIP-Net framework, which achieves 67.3% accuracy on identifying important persons in videos with a rationale similarity of 0.63.
CogPortrait uses MLLM-based hierarchical planning to convert high-level labels into eye keypoints and a conditioned DiT model to produce portrait animations with improved eye-region accuracy on the new EMH benchmark.
PaintCopilot models painting as an open-ended autoregressive process that predicts coherent brushstrokes from partial canvas observations using a ViT target predictor, flow-matching stroke generator, and VAE region sampler.
SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
A technology probe called FaceValue uses real-time self-view overlays to support meaning-oriented self-awareness in remote meetings, with participants reporting increased cue awareness and communication improvements.
A multimodal CNN on 87,547 Vogue images classifies fashion houses at 78.2% top-1 accuracy, decades at 88.6%, and years at 58.3% with 2.2-year mean error, and shows texture and luminance carry most of the house-identity signal.
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.
AIFIND stabilizes incremental face forgery detection by aligning volatile features to invariant semantic anchors from low-level artifacts using attention and harmonization modules.
A pseudo-annotation pipeline combines fingerspelling and isolated sign recognizers with K-Shot LLM estimation to produce ranked time-aligned gloss annotations from signed video and English input.
A replay pipeline on a 3D eye simulator generates 144 sessions of synthetic eye movement video that preserves source temporal dynamics for script-reading detection.
citing papers explorer
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU