RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose

2023 · arXiv 2303.07399

21 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method: 2 · background: 1

citation-polarity summary

use method: 2 · background: 1

representative citing papers

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

cs.CV · 2026-05-16 · unverdicted · novelty 8.0

iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotion understanding.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

AIGaitor is claimed to be the first end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.

PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.

OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.

BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

BarbieGait is a new synthetic gait dataset with identity-consistent cloth changes paired with the GaitCLIF model that improves cross-clothing recognition on the new data and existing benchmarks.

Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

The paper presents a multimodal framework, dataset, and reconstruction pipeline to create immersive volumetric videos supporting large 6-DoF audiovisual interaction from real multi-view captures.

LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

cs.CV · 2026-03-30 · unverdicted · novelty 7.0

LA-Sign achieves state-of-the-art skeleton-based sign language recognition on WLASL and MSASL by using recurrent looped transformers with adaptive hyperbolic geometry alignment.

BackTranslation2.0 -- A Linguistically Motivated Metric to Assess Sign Language Production

cs.CV · 2026-06-27 · unverdicted · novelty 6.0

BackTranslation2.0 is a linguistically motivated evaluation metric for sign language production that uses an agentic tool pipeline and LLM cross-referencing to score four dimensions and shows strong human correlation on a BSL dataset.

SIGNET: Motion-Level Knowledge Transfer for Cross-Language Sign Language Translation

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

SIGNET uses attention-based aggregation of multiple pretrained sign language backbones and gated fusion to achieve state-of-the-art cross-language sign language translation on How2Sign, Phoenix14T, CSL-Daily, and MeineDGS.

Emotion Recognition in Sign Language Conversation

cs.CL · 2026-05-22 · unverdicted · novelty 6.0

Introduces the eJSL Dialog dataset (1,920 videos in 480 dialogues from the STUDIES corpus) for conversational sign language emotion recognition and benchmarks models, revealing a domain gap with generic multimodal approaches.

High-Fidelity Single-Image Head Modeling with Industry-Grade Topology

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

A single-image head reconstruction method uses coarse-to-fine optimization with normal consistency, landmarks, and geometry-aware constraints on curvature and conformality to produce meshes with industry-grade topology and preserved facial identity.

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

cs.CV · 2025-09-24 · unverdicted · novelty 6.0

EditVerse unifies image and video editing and generation in one transformer model via unified token sequences and in-context learning, trained jointly on curated video editing data plus image/video corpora and evaluated on a new instruction-based benchmark.

Dense Structural Priors for Sparse Functional Landmark Localization in Surgical Videos

cs.CV · 2026-06-30 · unverdicted · novelty 5.0

A multi-frame network with SAM 3-derived mask priors achieves 72.4% F1 tip and 58.0% F1 anchor localization in surgical videos without manual mask annotations for training.

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

VDSB-GWSyn uses DSB conditioned on vessel masks and a shape prior to synthesize guidewires, yielding downstream localization gains when used for pre-training.

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

cs.CV · 2026-05-21 · accept · novelty 5.0

The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.

DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos

cs.CV · 2026-05-18 · unverdicted · novelty 5.0 · 2 refs

DanceHMR proposes a temporal whole-body mesh recovery method that fuses body context with hand observations for improved hand detail and stability in monocular videos.

Inferring World Belief States in Dynamic Real-World Environments

cs.RO · 2026-04-13 · unverdicted · novelty 5.0

A robot infers human world belief states from observations in dynamic 3D household environments to enable fluent human-robot teamwork.

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

cs.CV · 2025-09-12 · unverdicted · novelty 5.0

Extends online 2D multi-camera tracking to 3D via depth-based point cloud reconstruction, clustering for 3D boxes, and local ID consistency for global data association, placing 3rd on the 2025 AI City Challenge 3D MTMC dataset.

SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints

cs.RO · 2026-06-16 · unverdicted · novelty 4.0

SPARK applies keypoint detection with YOLO models to monocular images for low-latency 3D pose estimation of racing opponents, claiming better accuracy and speed than prior camera methods on real racing data.

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

cs.CV · 2026-06-02 · unverdicted · novelty 4.0

YOLO26 presents a unified real-time vision model family with dual-head end-to-end design, new training components, and task-specific heads that reports improved mAP-latency tradeoffs on COCO and LVIS benchmarks across detection, segmentation, pose, and oriented detection.

World Simulation with Video Foundation Models for Physical AI

cs.CV · 2025-10-28 · unverdicted · novelty 4.0

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

citing papers explorer

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning · cs.CV · 2026-05-16 · unverdicted · none · ref 86
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing · cs.CV · 2026-05-20 · unverdicted · none · ref 41 · 2 links
PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition · cs.CV · 2026-05-12 · unverdicted · none · ref 11
OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction · cs.RO · 2026-04-30 · unverdicted · none · ref 17
BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition · cs.CV · 2026-04-14 · unverdicted · none · ref 31
Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement · cs.CV · 2026-04-10 · unverdicted · none · ref 72
LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition · cs.CV · 2026-03-30 · unverdicted · none · ref 29
BackTranslation2.0 -- A Linguistically Motivated Metric to Assess Sign Language Production · cs.CV · 2026-06-27 · unverdicted · none · ref 19
SIGNET: Motion-Level Knowledge Transfer for Cross-Language Sign Language Translation · cs.CV · 2026-06-26 · unverdicted · none · ref 26
Emotion Recognition in Sign Language Conversation · cs.CL · 2026-05-22 · unverdicted · none · ref 31
High-Fidelity Single-Image Head Modeling with Industry-Grade Topology · cs.CV · 2026-05-06 · unverdicted · none · ref 182
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning · cs.CV · 2025-09-24 · unverdicted · none · ref 10
Dense Structural Priors for Sparse Functional Landmark Localization in Surgical Videos · cs.CV · 2026-06-30 · unverdicted · none · ref 7
VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography · cs.CV · 2026-05-27 · unverdicted · none · ref 9
OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025 · cs.CV · 2026-05-21 · accept · none · ref 35
DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos · cs.CV · 2026-05-18 · unverdicted · none · ref 9 · 2 links
Inferring World Belief States in Dynamic Real-World Environments · cs.RO · 2026-04-13 · unverdicted · none · ref 9
Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation · cs.CV · 2025-09-12 · unverdicted · none · ref 16
SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints · cs.RO · 2026-06-16 · unverdicted · none · ref 36
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models · cs.CV · 2026-06-02 · unverdicted · none · ref 28
World Simulation with Video Foundation Models for Physical AI · cs.CV · 2025-10-28 · unverdicted · none · ref 37
