PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding

Chunhui Liu , Yueyu Hu , Yanghao Li , Sijie Song , Jiaying Liu

classification 💻 cs.CV

keywords: action, human, dataset, pku-mmd, benchmark, benchmarks, contains, continuous

Although many 3D human activity benchmarks have been proposed, most existing action datasets focus on action recognition in segmented videos. Standard large-scale benchmarks are lacking, especially for the currently popular data-hungry deep-learning-based methods. In this paper, we introduce a new large-scale benchmark (PKU-MMD) for continuous multi-modality 3D human action understanding that covers a wide range of complex human activities with well-annotated information. PKU-MMD contains 1076 long video sequences in 51 action categories, performed by 66 subjects in three camera views. It contains almost 20,000 action instances and 5.4 million frames in total. Our dataset also provides multi-modality data sources, including RGB, depth, infrared radiation, and skeleton. With different modalities, we conduct extensive experiments on our dataset in two scenarios and evaluate different methods by various metrics, including a newly proposed evaluation protocol, 2D-AP. We believe this large-scale dataset will benefit future research on action detection for the community.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition
cs.CV 2026-05 unverdicted novelty 7.0

PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptatio...

SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition
cs.HC 2026-05 unverdicted novelty 7.0

SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.

Universal Skeleton Understanding via Differentiable Rendering and MLLMs
cs.CV 2026-03 unverdicted novelty 7.0

SkeletonLLM translates arbitrary skeleton sequences into visual image sequences via a differentiable renderer DrAction, allowing MLLMs to perform open-vocabulary action recognition, captioning, and QA across heterogen...

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition
cs.CV 2026-04 unverdicted novelty 6.0

FDSM recovers fine-grained motion details in zero-shot skeleton action recognition by integrating semantic-guided spectral residual, timestep-adaptive spectral loss, and curriculum-based semantic abstraction, reaching...