iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotion understanding.
RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose
15 Pith papers cite this work.
representative citing papers
PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.
A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.
BarbieGait is a new synthetic gait dataset with identity-consistent cloth changes, paired with the GaitCLIF model, which improves cross-clothing recognition on the new data and on existing benchmarks.
The paper presents a multimodal framework, dataset, and reconstruction pipeline to create immersive volumetric videos supporting large 6-DoF audiovisual interaction from real multi-view captures.
LA-Sign achieves state-of-the-art skeleton-based sign language recognition on WLASL and MSASL by using recurrent looped transformers with adaptive hyperbolic geometry alignment.
Introduces the eJSL Dialog dataset (1,920 videos in 480 dialogues from the STUDIES corpus) for conversational sign language emotion recognition, and benchmarks models, revealing a domain gap with generic multimodal approaches.
A single-image head reconstruction method uses coarse-to-fine optimization with normal consistency, landmarks, and geometry-aware constraints on curvature and conformality to produce meshes with industry-grade topology and preserved facial identity.
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
A robot infers human world belief states from observations in dynamic 3D household environments to enable fluent human-robot teamwork.
Extends online 2D multi-camera tracking to 3D via depth-based point cloud reconstruction, clustering for 3D boxes, and local ID consistency for global data association, placing 3rd on the 2025 AI City Challenge 3D MTMC dataset.
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
citing papers explorer
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.
- Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation
Extends online 2D multi-camera tracking to 3D via depth-based point cloud reconstruction, clustering for 3D boxes, and local ID consistency for global data association, placing 3rd on the 2025 AI City Challenge 3D MTMC dataset.
- World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.