RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose
15 Pith papers cite this work.
citing papers explorer
- iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning
  iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotion understanding.
- PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition
  PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.
- OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction
  A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
- BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
  BarbieGait is a new synthetic gait dataset with identity-consistent cloth changes, paired with the GaitCLIF model, which improves cross-clothing recognition on the new data and existing benchmarks.
- Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement
  The paper presents a multimodal framework, dataset, and reconstruction pipeline to create immersive volumetric videos supporting large 6-DoF audiovisual interaction from real multi-view captures.
- LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition
  LA-Sign achieves state-of-the-art skeleton-based sign language recognition on WLASL and MSASL by using recurrent looped transformers with adaptive hyperbolic geometry alignment.
- Emotion Recognition in Sign Language Conversation
  Introduces the eJSL Dialog dataset (1,920 videos in 480 dialogues from the STUDIES corpus) for conversational sign language emotion recognition and benchmarks models, revealing a domain gap with generic multimodal approaches.
- High-Fidelity Single-Image Head Modeling with Industry-Grade Topology
  A single-image head reconstruction method uses coarse-to-fine optimization with normal consistency, landmarks, and geometry-aware constraints on curvature and conformality to produce meshes with industry-grade topology and preserved facial identity.
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
  EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
- OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
  The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
- Inferring World Belief States in Dynamic Real-World Environments
  A robot infers human world belief states from observations in dynamic 3D household environments to enable fluent human-robot teamwork.
- Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation
  Extends online 2D multi-camera tracking to 3D via depth-based point cloud reconstruction, clustering for 3D boxes, and local ID consistency for global data association, placing 3rd on the 2025 AI City Challenge 3D MTMC dataset.
- World Simulation with Video Foundation Models for Physical AI
  Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
- AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
- DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos