Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Marrying Text-to-Motion Generation with Skeleton-Based Action Recognition
  CoAMD unifies skeleton-based action recognition and text-to-motion generation through autoregressive diffusion guided by a multi-modal recognizer, reporting SOTA results on 13 benchmarks for four tasks.
- Towards Universal Skeleton-Based Action Recognition
  A Transformer model with unified skeleton representation, two-stream motion encoder, and multi-grained motion-text contrastive alignment achieves effective recognition on a new integrated heterogeneous open-vocabulary skeleton dataset.