Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions
Pith reviewed 2026-05-22 06:27 UTC · model grok-4.3
The pith
Multiview motion features paired with orientation-aware text prompts improve zero-shot action recognition under viewpoint shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a novel orientation-aware action recognition approach that combines motion cues of multiple camera viewpoints and text descriptions of human actions in the training phase, using a new orientation-aware motion encoding network to learn different motion features and adapting a specific orientation-aware text prompt to match the corresponding features at inference.
What carries the argument
The orientation-aware motion encoding network that learns distinct motion features from multiview data and is aligned at inference with adapted orientation-aware text prompts to reduce the domain gap.
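The matching step described above can be illustrated with a minimal sketch. It assumes a fixed prompt template and cosine-similarity scoring; the review itself notes that the paper does not say whether the prompt adaptation is learned or templated, so `build_prompt`, `zero_shot_classify`, and the toy three-dimensional embeddings are hypothetical stand-ins rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_prompt(action, orientation_deg):
    # Hypothetical fixed template; the paper's actual prompt wording is not given here.
    return f"a person {action}, body oriented at {orientation_deg} degrees to the camera"


def zero_shot_classify(motion_feature, text_features):
    """Pick the action whose prompt embedding is closest to the motion feature.

    text_features maps an action label to the embedding of its
    orientation-aware prompt (produced by some text encoder, assumed).
    """
    scores = {label: cosine(motion_feature, emb) for label, emb in text_features.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for real encoder outputs (illustrative values only).
motion = [0.9, 0.1, 0.2]
text = {"waving": [1.0, 0.0, 0.1], "sitting down": [0.0, 1.0, 0.0]}
label, scores = zero_shot_classify(motion, text)
print(build_prompt(label, 30), scores)
```

In a real pipeline the two embedding sources would be the motion encoder and a text encoder trained into a shared space; the nearest-prompt rule is the only part this sketch is meant to show.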
If this is right
- The approach consistently improves zero-shot action recognition performance across NTU-RGB+D, BABEL, NW-UCLA, and two surveillance datasets.
- It outperforms recent state-of-the-art zero-shot approaches on these benchmarks.
- The learned representations yield competitive results on both cross-domain and same-domain recognition of actions observed during training.
Where Pith is reading between the lines
- This alignment strategy could lower the cost of adapting action recognition systems to new camera setups in deployed environments.
- The same orientation modeling might transfer to other tasks that pair visual motion with language descriptions under geometric variation.
- Further tests on datasets containing more extreme viewpoint differences would show whether the multiview training fully captures the necessary invariance.
Load-bearing premise
The method assumes that an orientation-aware motion encoding network trained on multiview data can be effectively aligned at inference with adapted text prompts to close the domain gap from body orientation and camera viewpoint variations without requiring strong additional annotations.
What would settle it
Performance gains would disappear if the method were tested on actions recorded from body orientations or camera viewpoints lying well outside the range of the multiview training data.
Original abstract
Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where action categories at inference can present important domain shifts or even unseen actions from training. In this context, improving the recognition capabilities of Zero-Shot Action Recognition models (ZSAR), without requiring strong annotation efforts, remains a central challenge. Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training. In practice, variations in human body orientation and camera viewpoint add a significant domain gap in ZSAR, substantially limiting generalization to novel action-motion combinations. In this context, this paper presents a novel orientation-aware action recognition approach with improved cross-domain capabilities. Our approach combines motion cues of multiple camera viewpoints and text descriptions of human actions in the training phase. We present a new orientation-aware motion encoding network to learn different motion features, and adapt a specific orientation-aware text prompt to match the corresponding features at inference. Extensive experiments demonstrate that the proposed method consistently improves ZSAR performance across different recognition benchmarks, outperforming recent state-of-the-art zero-shot approaches on NTU-RGB+D, BABEL, NW-UCLA, and on two surveillance datasets. In addition, the learned representations exhibit strong transfer learning capabilities, yielding competitive performance on both cross-domain and same-domain recognition of seen actions. Code and trained models are available at: https://icb-vision-ai.github.io/OrientationAware-HAR
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an orientation-aware approach for zero-shot action recognition (ZSAR) to address domain shifts from variations in human body orientation and camera viewpoints. It combines multiview motion cues with textual action descriptions during training via a new orientation-aware motion encoding network, and adapts orientation-aware text prompts at inference to align features and close the domain gap without strong additional annotations. Extensive experiments on NTU-RGB+D, BABEL, NW-UCLA, and two surveillance datasets report consistent outperformance over recent SOTA ZSAR methods, with additional results showing strong transfer learning for seen actions in cross-domain and same-domain settings. Code and trained models are released.
Significance. If the reported gains hold under standard evaluation protocols, the work could meaningfully advance practical ZSAR deployment in real-world settings with viewpoint and orientation variability, such as surveillance or robotics, by avoiding heavy annotation requirements. The explicit use of multiview training data paired with prompt adaptation targets a recognized limitation in prior ZSAR literature. Releasing code and models is a clear strength that supports reproducibility and follow-on work.
major comments (1)
- [§5] §5 (Experiments): Although ablations on multiview fusion are reported, the central performance claim would be strengthened by including error bars from multiple random seeds or statistical significance tests for the headline improvements over baselines on NTU-RGB+D and BABEL; without them the magnitude of gains is harder to interpret as robust rather than run-dependent.
minor comments (3)
- [Abstract] Abstract: The two surveillance datasets are referenced but not named; adding their identities (e.g., in parentheses) would improve immediate clarity for readers.
- [§3.2] §3.2 (Method): The exact mechanism for adapting the orientation-aware text prompt at inference (e.g., whether it is a learned mapping or a fixed template) could be stated more explicitly to allow precise replication.
- [Figure 3] Figure 3: The caption and legend could more clearly distinguish the multiview fusion variants from the single-view baselines to avoid reader confusion when comparing curves.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation, recognition of the work's practical relevance for real-world ZSAR, and recommendation for minor revision. We appreciate the constructive suggestion regarding experimental robustness and address it below.
Point-by-point responses
- Referee: [§5] §5 (Experiments): Although ablations on multiview fusion are reported, the central performance claim would be strengthened by including error bars from multiple random seeds or statistical significance tests for the headline improvements over baselines on NTU-RGB+D and BABEL; without them the magnitude of gains is harder to interpret as robust rather than run-dependent.
  Authors: We agree that reporting variability across runs would make the headline gains easier to interpret. In the revised version we will add error bars (mean ± std) computed over five independent random seeds for all reported results on NTU-RGB+D and BABEL, and we will include paired t-test p-values comparing our method against each baseline to establish statistical significance of the observed improvements.
  Revision: yes
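The reporting promised in this response can be sketched as follows. The accuracy lists are made-up placeholders, not results from the paper, and the helper names are hypothetical. The sketch uses only the standard library and stops at the paired t statistic; turning it into a p-value requires the Student t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation (n - 1 denominator) over seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(ours, baseline):
    """Paired t statistic and degrees of freedom for per-seed accuracy pairs."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder accuracies for five seeds (illustrative only, not paper results).
ours = [50.0, 52.0, 51.0, 53.0, 54.0]
baseline = [49.0, 50.0, 50.0, 51.0, 52.0]

m, s = mean_std(ours)
t, df = paired_t_statistic(ours, baseline)
print(f"ours: {m:.1f} ± {s:.2f}, paired t = {t:.3f} with {df} degrees of freedom")
```

Pairing by seed matters here: each seed's run of the method is compared with the same seed's baseline run, so variation shared across seeds cancels in the differences.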
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
This paper presents an empirical ZSAR method combining multiview motion encoding with adapted text prompts. Its central claims of improved cross-domain performance are supported by reported experiments on standard splits of NTU-RGB+D, BABEL, NW-UCLA and surveillance datasets, plus ablations on multiview fusion and transfer results. No derivation chain, equations, or first-principles results are present that reduce by construction to fitted inputs or self-citations; the orientation-aware components are trained and evaluated against independent benchmarks without tautological re-use of the target metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multiview motion data combined with text descriptions can be aligned via orientation awareness to reduce domain gaps in action recognition.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: orientation-aware motion encoding network … positional encoding … γ(θ) = (sin(2^0 πθ), cos(2^0 πθ), …) … L=192 … dual-branch attention … contrastive loss Lsym … λLce + (1−λ)Lsym
- IndisputableMonolith/Foundation/DimensionForcing.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: twelve virtual views … body orientations from −180° to 150° … uniform step of 30° … 8-tick period never mentioned
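The quantities named in the quoted passages can be made concrete with a short sketch. Several details are assumptions because the excerpts elide them: the angle is taken as already normalised before encoding, `num_freqs` is a free parameter (the excerpt does not say whether L=192 counts frequency bands or output dimensions), and `combined_loss` simply evaluates λLce + (1−λ)Lsym for given scalar losses.

```python
import math


def virtual_orientations(start=-180, stop=150, step=30):
    """Body orientations in degrees, from start to stop inclusive."""
    return list(range(start, stop + 1, step))


def positional_encoding(theta, num_freqs):
    """Sinusoidal encoding (sin(2^k pi theta), cos(2^k pi theta)) for k = 0..num_freqs-1.

    theta is assumed to be a normalised angle; the normalisation used in the
    paper is not stated in the excerpt.
    """
    encoded = []
    for k in range(num_freqs):
        angle = (2 ** k) * math.pi * theta
        encoded.extend((math.sin(angle), math.cos(angle)))
    return encoded


def combined_loss(l_ce, l_sym, lam):
    """Weighted sum lam * L_ce + (1 - lam) * L_sym of two scalar losses."""
    return lam * l_ce + (1 - lam) * l_sym


views = virtual_orientations()
enc = positional_encoding(0.0, 4)
print(len(views), views[0], views[-1], len(enc))
```

With the default arguments the grid contains twelve orientations, matching the "twelve virtual views" in the passage, and each encoded angle has `2 * num_freqs` components.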
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.