RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose
21 Pith papers cite this work.
citing papers explorer
-
iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning
iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotion understanding.
-
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
AIGaitor is claimed to be the first end-to-end, on-device pipeline for monocular motion capture and deep-learning gait analysis demonstrated on consumer smartphones.
-
PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition
PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.
-
OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
-
BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
BarbieGait is a new synthetic gait dataset with identity-consistent cloth changes. It is paired with the GaitCLIF model, which improves cross-clothing recognition on the new data and on existing benchmarks.
-
Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement
The paper presents a multimodal framework, dataset, and reconstruction pipeline to create immersive volumetric videos supporting large 6-DoF audiovisual interaction from real multi-view captures.
-
LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition
LA-Sign achieves state-of-the-art skeleton-based sign language recognition on WLASL and MSASL by using recurrent looped transformers with adaptive hyperbolic geometry alignment.
-
BackTranslation2.0 -- A Linguistically Motivated Metric to Assess Sign Language Production
BackTranslation2.0 is a linguistically motivated evaluation metric for sign language production that uses an agentic tool pipeline and LLM cross-referencing to score four dimensions and shows strong human correlation on a BSL dataset.
-
SIGNET: Motion-Level Knowledge Transfer for Cross-Language Sign Language Translation
SIGNET uses attention-based aggregation of multiple pretrained sign language backbones and gated fusion to achieve state-of-the-art cross-language sign language translation on How2Sign, Phoenix14T, CSL-Daily, and MeineDGS.
-
Emotion Recognition in Sign Language Conversation
Introduces the eJSL Dialog dataset (1,920 videos in 480 dialogues from the STUDIES corpus) for conversational sign language emotion recognition, and benchmarks models, revealing a domain gap with generic multimodal approaches.
-
High-Fidelity Single-Image Head Modeling with Industry-Grade Topology
A single-image head reconstruction method uses coarse-to-fine optimization with normal consistency, landmarks, and geometry-aware constraints on curvature and conformality to produce meshes with industry-grade topology and preserved facial identity.
-
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
-
Dense Structural Priors for Sparse Functional Landmark Localization in Surgical Videos
A multi-frame network with SAM 3-derived mask priors achieves 72.4% F1 on tip localization and 58.0% F1 on anchor localization in surgical videos, without manual mask annotations for training.
-
VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography
VDSB-GWSyn uses a diffusion Schrödinger bridge (DSB) conditioned on vessel masks and a shape prior to synthesize guidewires, yielding downstream localization gains when used for pre-training.
-
OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
-
DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos
DanceHMR proposes a temporal whole-body mesh recovery method that fuses body context with hand observations for improved hand detail and stability in monocular videos.
-
Inferring World Belief States in Dynamic Real-World Environments
A robot infers human world belief states from observations in dynamic 3D household environments to enable fluent human-robot teamwork.
-
Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation
Extends online 2D multi-camera tracking to 3D via depth-based point cloud reconstruction, clustering for 3D boxes, and local ID consistency for global data association, placing 3rd on the 2025 AI City Challenge 3D MTMC dataset.
-
SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints
SPARK applies keypoint detection with YOLO models to monocular images for low-latency 3D pose estimation of racing opponents, claiming better accuracy and speed than prior camera methods on real racing data.
-
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
YOLO26 presents a unified real-time vision model family with dual-head end-to-end design, new training components, and task-specific heads that reports improved mAP-latency tradeoffs on COCO and LVIS benchmarks across detection, segmentation, pose, and oriented detection.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.