Motion-X: A large-scale 3D expressive whole-body human motion dataset
1 Pith paper cites this work. Polarity classification is still being indexed.
Citation-role summary: dataset (1)
Fields: eess.SP (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Roles: dataset (1)
Polarities: use dataset (1)
Representative citing papers:
Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference
TOAU compresses human motion videos to 9 bits per frame using pose estimation and a VQ-VAE, then aligns the resulting tokens to a vision-language model via a lightweight projector. It achieves 1% of the transmission payload and 20% of the latency of video codecs while maintaining comparable action understanding accuracy.
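
To make the 9-bits-per-frame figure concrete, here is a minimal sketch of the vector-quantization step, not the paper's implementation. It assumes each frame is encoded as a single token from a 512-entry codebook (2^9 = 512), which is one way to arrive at 9 bits per frame; the summary does not state this. The latent dimension, the random codebook, and the names `nearest_code` and `bits_per_token` are hypothetical.

```python
import math
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def bits_per_token(codebook_size):
    """Number of bits needed to address one entry of the codebook."""
    return math.ceil(math.log2(codebook_size))


# Assumption: one token per frame drawn from a 512-entry codebook.
CODEBOOK_SIZE = 512
# Hypothetical latent size; a real pose encoder would define this.
LATENT_DIM = 8

rng = random.Random(0)
# Stand-in for a learned VQ-VAE codebook (random here, learned in practice).
codebook = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            for _ in range(CODEBOOK_SIZE)]

# Stand-in for the encoder output of one frame's estimated pose.
pose_latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

token = nearest_code(pose_latent, codebook)
payload = format(token, "09b")  # the bit string that would be transmitted

print(bits_per_token(CODEBOOK_SIZE), len(payload))
```

Under this assumption only the token index crosses the network, and the receiver looks up the corresponding codebook vector before passing it to the projector.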