SASI: Leveraging Sub-Action Semantics for Robust Early Action Recognition in Human-Robot Interaction
Pith reviewed 2026-05-07 10:24 UTC · model grok-4.3
The pith
SASI fuses sub-action semantics with skeleton graph features to raise early action recognition accuracy on partial sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SASI couples a skeleton-based graph convolutional network with a sub-action segmentation model, fusing spatiotemporal features with sub-action semantics. It retains both the fine-grained cues from sub-action units and the broader spatial context while processing at 29 Hz. On the BABEL dataset, the approach yields higher recognition accuracy than conventional methods and performs notably better on partial action sequences, supporting the early recognition needed for proactive human-robot interaction.
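The fusion described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the feature shapes, the histogram-based semantic embedding, and the late-fusion-by-concatenation choice are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_features(skeleton_seq):
    """Stand-in for a skeleton GCN backbone: pools a (T, J, C) joint
    sequence into a fixed-size spatiotemporal feature vector."""
    return skeleton_seq.mean(axis=(0, 1))  # shape (C,)

def sub_action_semantics(segment_labels, num_sub_actions):
    """Stand-in for the segmentation branch: a normalized histogram of
    per-frame sub-action predictions serves as the semantic embedding."""
    hist = np.bincount(segment_labels, minlength=num_sub_actions)
    return hist / max(hist.sum(), 1)  # shape (num_sub_actions,)

def fuse_and_classify(skeleton_seq, segment_labels, W, num_sub_actions):
    """Late fusion by concatenation, followed by a linear classifier."""
    fused = np.concatenate([
        gcn_features(skeleton_seq),
        sub_action_semantics(segment_labels, num_sub_actions),
    ])
    logits = W @ fused
    return int(np.argmax(logits))

# Toy input: 30 frames, 25 joints, 3 channels; 4 sub-action classes, 5 actions.
seq = rng.standard_normal((30, 25, 3))
labels = rng.integers(0, 4, size=30)
W = rng.standard_normal((5, 3 + 4))
pred = fuse_and_classify(seq, labels, W, num_sub_actions=4)
```

The point of the sketch is only the data flow: two modality-specific encoders, one joint representation, one classifier head.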
What carries the argument
SASI's cross-modal fusion, which pairs sub-action segmentation semantics with a skeleton-based graph convolutional network.
Load-bearing premise
Sub-action segmentation supplies reliable semantic cues that improve fusion with spatiotemporal features without adding noise or needing extra tuning to produce the observed gains.
What would settle it
An ablation test on BABEL showing that SASI without the sub-action segmentation module matches or exceeds full SASI accuracy on partial sequences would falsify the contribution of the semantic integration.
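The early-recognition claim the ablation would probe is operationally simple: evaluate the same classifier on truncated prefixes of each sequence at fixed observation ratios. The helper below is a generic evaluation sketch, not code from the paper; the ratio grid and the toy classifier are illustrative.

```python
import numpy as np

def early_recognition_accuracy(sequences, labels, classify,
                               ratios=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Accuracy of `classify` on prefixes of each sequence.
    `classify` maps a (t, J, C) prefix to a predicted class label."""
    results = {}
    for r in ratios:
        correct = 0
        for seq, y in zip(sequences, labels):
            t = max(1, int(round(r * len(seq))))  # observed prefix length
            correct += int(classify(seq[:t]) == y)
        results[r] = correct / len(sequences)
    return results

# Trivially separable toy data: constant-positive vs constant-negative motion.
seqs = [np.full((10, 2, 2), 1.0), np.full((10, 2, 2), -1.0)]
labels = [0, 1]
clf = lambda s: 0 if s.mean() > 0 else 1
acc = early_recognition_accuracy(seqs, labels, clf)
# accuracy is 1.0 at every observation ratio for this toy problem
```

Run with SASI and its semantics-ablated variant on the same prefix grid, and the falsification test above reduces to comparing two such accuracy curves.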
Original abstract
Understanding human actions is critical for advancing behavior analysis in human-robot interaction. Particularly in tasks that demand quick and proactive feedback, robots must recognize human actions as early as possible from incomplete observations. Sub-actions offer the semantic and hierarchical cues needed for this, since human actions are inherently structured and can be decomposed into smaller, meaningful units. However, conventional approaches focus primarily on holistic actions and often overlook the rich semantic structure embedded in sub-actions, making them poorly suited for early recognition. To address this gap, we introduce SASI (Sub-Action Semantics Integrated cross-modal fusion), a novel framework that integrates existing graph convolution networks to fuse spatiotemporal features with sub-action semantics. SASI exploits a segmentation model with a traditional skeleton-based graph convolution network, capturing both fine-grained sub-action semantics and overall spatial context, while operating in real-time at 29 Hz. Experiments on BABEL, a skeleton-based dataset with frame-level annotations, demonstrate that our method improves recognition accuracy over conventional approaches, with additional gains expected as the quality of sub-action segmentation improves. Notably, SASI also achieves superior performance in understanding partial action sequences, revealing its capability for early recognition, which is essential for proactive and seamless Human-Robot Interaction (HRI). Code is available at https://anonymous.4open.science/r/SASI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SASI, a framework that fuses sub-action semantics extracted by a segmentation model with spatiotemporal features from a skeleton-based graph convolutional network for early action recognition in human-robot interaction. It claims improved accuracy over conventional approaches on the BABEL dataset (with gains expected to increase with better segmentation quality) and superior performance on partial action sequences, while operating in real time at 29 Hz. Code is provided via an anonymous link.
Significance. If the reported gains are substantiated with proper controls, the approach could meaningfully advance early recognition for proactive HRI by exploiting hierarchical sub-action structure rather than holistic actions alone. The real-time capability and open code are practical strengths for robotics applications.
Major comments (2)
- [Abstract] The claim that the method 'improves recognition accuracy over conventional approaches' and achieves 'superior performance in understanding partial action sequences' is presented without any numerical results, baselines, error bars, or statistical details. This is load-bearing because the abstract itself states that gains depend on sub-action segmentation quality, yet supplies no evidence to support the superiority assertion.
- [Experiments] No ablation is reported that isolates the contribution of the proposed cross-modal fusion step from the sub-action segmentation input itself (e.g., simple concatenation vs. learned attention, or GCN with vs. without semantic cues). Segmentation accuracy metrics on BABEL are also absent. This directly undermines the central claim, as any downstream GCN could appear improved if the segmentation model already supplies strong frame-level signals.
Minor comments (1)
- [Abstract] The code link is to an anonymous repository; a permanent, non-anonymous link should be provided in the final version to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions to strengthen the presentation of results and experimental validation.
Point-by-point responses
-
Referee: [Abstract] The claim that the method 'improves recognition accuracy over conventional approaches' and achieves 'superior performance in understanding partial action sequences' is presented without any numerical results, baselines, error bars, or statistical details. This is load-bearing because the abstract itself states that gains depend on sub-action segmentation quality, yet supplies no evidence to support the superiority assertion.
Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised version, we will update the abstract to report specific accuracy improvements (with baselines and standard deviations) on the BABEL dataset for both full and partial sequences, while retaining the note on dependence on segmentation quality and referencing the corresponding experimental metrics. revision: yes
-
Referee: [Experiments] No ablation is reported that isolates the contribution of the proposed cross-modal fusion step from the sub-action segmentation input itself (e.g., simple concatenation vs. learned attention, or GCN with vs. without semantic cues). Segmentation accuracy metrics on BABEL are also absent. This directly undermines the central claim, as any downstream GCN could appear improved if the segmentation model already supplies strong frame-level signals.
Authors: We concur that an explicit ablation isolating the fusion mechanism is necessary to substantiate the contribution of SASI. We will add this to the experiments section, comparing the full model against ablated variants (GCN without semantics, and simple concatenation versus learned cross-modal fusion). We will also include the segmentation model's frame-level accuracy on BABEL to quantify input quality and show that downstream gains arise from the integration rather than segmentation alone. revision: yes
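The ablated variants the authors commit to can be made concrete with a small sketch. Everything here is hypothetical: the variant names, the feature dimensions, and the use of a sigmoid gate as a minimal stand-in for learned cross-modal attention are illustrative choices, not the paper's design.

```python
import numpy as np

def fuse(gcn_feat, sem_feat, variant, attn_w=None):
    """Three hypothetical fusion variants for the requested ablation:
    'none'   -> skeleton features only (semantics ablated),
    'concat' -> simple concatenation of the two modalities,
    'gated'  -> semantics reweight the skeleton features per dimension,
                a minimal stand-in for learned cross-modal attention."""
    if variant == "none":
        return gcn_feat
    if variant == "concat":
        return np.concatenate([gcn_feat, sem_feat])
    if variant == "gated":
        # attn_w maps the semantic embedding to one gate per skeleton dim.
        gate = 1.0 / (1.0 + np.exp(-(attn_w @ sem_feat)))  # sigmoid
        return gcn_feat * gate
    raise ValueError(f"unknown variant: {variant}")

g = np.ones(3)                          # toy skeleton feature
s = np.array([0.5, 0.5, 0.0, 0.0])      # toy semantic embedding
A = np.zeros((3, 4))                    # zero weights -> gate of 0.5 per dim
v_none = fuse(g, s, "none")             # shape (3,)
v_cat = fuse(g, s, "concat")            # shape (7,)
v_gate = fuse(g, s, "gated", A)         # [0.5, 0.5, 0.5]
```

Training the same classifier head on each variant's output and reporting all three accuracy columns is what would isolate the fusion mechanism from the segmentation signal.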
Circularity Check
No circularity: empirical pipeline without derivation or self-referential reduction
Full rationale
The paper describes SASI as an integration of off-the-shelf graph convolution networks with a sub-action segmentation model for fusing spatiotemporal features and semantics on the BABEL dataset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text or abstract. Claims of improved accuracy and early recognition are presented as experimental outcomes rather than reductions by construction. The method is self-contained as a standard empirical framework whose validity rests on external benchmarks, not internal definitional loops.
Reference graph
Works this paper leans on
-
[1]
Semi-supervised classification with graph convolutional networks
T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
-
[3]
Action recognition by hierarchical mid-level action elements
T. Lan, Y. Zhu, A. R. Zamir, and S. Savarese, "Action recognition by hierarchical mid-level action elements," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4552–4560.
-
[4]
Action recognition by hierarchical mid-level action elements
——, "Action recognition by hierarchical mid-level action elements," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4552–4560.
-
[5]
Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions
Y. Yang, I. Saleemi, and M. Shah, "Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1635–1648, 2013.
-
[6]
Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments
C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2013.
-
[7]
Hierarchical recurrent neural network for skeleton based action recognition
Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.
-
[8]
Skeleton-based human action recognition with global context-aware attention LSTM networks
J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, "Skeleton-based human action recognition with global context-aware attention LSTM networks," IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586–1599, 2017.
-
[9]
Spatio-temporal attention-based LSTM networks for 3D action recognition and detection
S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "Spatio-temporal attention-based LSTM networks for 3D action recognition and detection," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3459–3471, 2018.
-
[10]
Skeleton based action recognition with convolutional neural network
Y. Du, Y. Fu, and L. Wang, "Skeleton based action recognition with convolutional neural network," in Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 579–583.
-
[11]
Spatial temporal graph convolutional networks for skeleton-based action recognition
S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
-
[12]
Channel-wise topology refinement graph convolution for skeleton-based action recognition
Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, "Channel-wise topology refinement graph convolution for skeleton-based action recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13359–13368.
-
[13]
Skeleton-based action recognition with shift graph convolutional network
K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu, "Skeleton-based action recognition with shift graph convolutional network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 183–192.
-
[14]
DeGCN: Deformable graph convolutional networks for skeleton-based action recognition
W. Myung, N. Su, J.-H. Xue, and G. Wang, "DeGCN: Deformable graph convolutional networks for skeleton-based action recognition," IEEE Transactions on Image Processing, vol. 33, pp. 2477–2490, 2024.
-
[15]
BlockGCN: Redefine topology awareness for skeleton-based action recognition
Y. Zhou, X. Yan, Z.-Q. Cheng, Y. Yan, Q. Dai, and X.-S. Hua, "BlockGCN: Redefine topology awareness for skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2049–2058.
-
[16]
Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition
H. Liu, Y. Liu, M. Ren, H. Wang, Y. Wang, and Z. Sun, "Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition," arXiv preprint arXiv:2411.18941, 2024.
-
[17]
InfoGCN: Representation learning for human skeleton-based action recognition
H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani, "InfoGCN: Representation learning for human skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20186–20196.
-
[18]
Two-stream adaptive graph convolutional networks for skeleton-based action recognition
L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Two-stream adaptive graph convolutional networks for skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
-
[19]
Temporal decoupling graph convolutional network for skeleton-based gesture recognition
J. Liu, X. Wang, C. Wang, Y. Gao, and M. Liu, "Temporal decoupling graph convolutional network for skeleton-based gesture recognition," IEEE Transactions on Multimedia, vol. 26, pp. 811–823, 2023.
-
[20]
InfoGCN++: Learning representation by predicting the future for online human skeleton-based action recognition
S. Chi, H.-g. Chi, Q. Huang, and K. Ramani, "InfoGCN++: Learning representation by predicting the future for online human skeleton-based action recognition," arXiv preprint arXiv:2310.10547, 2023.
-
[21]
NTU RGB+D: A large scale dataset for 3D human activity analysis
A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010–1019.
-
[22]
Contact-aware human motion forecasting
W. Mao, R. I. Hartley, and M. Salzmann, "Contact-aware human motion forecasting," Advances in Neural Information Processing Systems, vol. 35, pp. 7356–7367, 2022.
-
[23]
Exploiting three-dimensional gaze tracking for action recognition during bimanual manipulation to enhance human-robot collaboration
A. Haji Fathaliyan, X. Wang, and V. J. Santos, "Exploiting three-dimensional gaze tracking for action recognition during bimanual manipulation to enhance human-robot collaboration," Frontiers in Robotics and AI, vol. 5, p. 25, 2018.
-
[24]
Spatiotemporal multimodal learning with 3D CNNs for video action recognition
H. Wu, X. Ma, and Y. Li, "Spatiotemporal multimodal learning with 3D CNNs for video action recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1250–1261, 2021.
-
[25]
Revisiting skeleton-based action recognition
H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, "Revisiting skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969–2978.
-
[26]
PeVL: Pose-enhanced vision-language model for fine-grained human action recognition
H. Zhang, M. C. Leong, L. Li, and W. Lin, "PeVL: Pose-enhanced vision-language model for fine-grained human action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18857–18867.
-
[27]
Marker-less kendo motion prediction using high-speed dual-camera system and LSTM method
Y. Cao and Y. Yamakawa, "Marker-less kendo motion prediction using high-speed dual-camera system and LSTM method," in 2022 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2022, pp. 159–164.
-
[28]
The wisdom of crowds: Temporal progressive attention for early action prediction
A. Stergiou and D. Damen, "The wisdom of crowds: Temporal progressive attention for early action prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14709–14719.
-
[29]
Rich action-semantic consistent knowledge for early action prediction
X. Liu, J. Yin, D. Guo, and H. Liu, "Rich action-semantic consistent knowledge for early action prediction," IEEE Transactions on Image Processing, vol. 33, pp. 479–492, 2023.
-
[30]
Multimodal human action recognition in assistive human-robot interaction
I. Rodomagoulakis, N. Kardaris, V. Pitsikalis, E. Mavroudi, A. Katsamanis, A. Tsiami, and P. Maragos, "Multimodal human action recognition in assistive human-robot interaction," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 2702–2706.
-
[31]
Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks
G. J. Maeda, G. Neumann, M. Ewerton, R. Lioutikov, O. Kroemer, and J. Peters, "Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks," Autonomous Robots, vol. 41, no. 3, pp. 593–612, 2017.
-
[32]
Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration
J. Bütepage, H. Kjellström, and D. Kragic, "Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration," arXiv preprint arXiv:1702.08212, 2017.
-
[33]
Efficient and collision-free human–robot collaboration based on intention and trajectory prediction
J. Lyu, P. Ruppel, N. Hendrich, S. Li, M. Görner, and J. Zhang, "Efficient and collision-free human–robot collaboration based on intention and trajectory prediction," IEEE Transactions on Cognitive and Developmental Systems, vol. 15, no. 4, pp. 1853–1863, 2022.
-
[34]
InteRACT: Transformer models for human intent prediction conditioned on robot actions
K. Kedia, A. Bhardwaj, P. Dan, and S. Choudhury, "InteRACT: Transformer models for human intent prediction conditioned on robot actions," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 621–628.
-
[35]
Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation
Z. Fu, T. Z. Zhao, and C. Finn, "Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation," arXiv preprint arXiv:2401.02117, 2024.
-
[36]
FineGym: A hierarchical video dataset for fine-grained action understanding
D. Shao, Y. Zhao, B. Dai, and D. Lin, "FineGym: A hierarchical video dataset for fine-grained action understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2616–2625.
-
[37]
BABEL: Bodies, action and behavior with English labels
A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, A. Quiros-Ramirez, and M. J. Black, "BABEL: Bodies, action and behavior with English labels," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp. 722–731.
-
[38]
Learning transferable visual models from natural language supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
-
[39]
AMASS: Archive of motion capture as surface shapes
N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, "AMASS: Archive of motion capture as surface shapes," in International Conference on Computer Vision, Oct. 2019, pp. 5442–5451.
-
[40]
TEMOS: Generating diverse human motions from textual descriptions
M. Petrovich, M. J. Black, and G. Varol, "TEMOS: Generating diverse human motions from textual descriptions," in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 480–497.
-
[41]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
-
[42]
HumanTOMATO: Text-aligned whole-body motion generation
S. Lu, L.-H. Chen, A. Zeng, J. Lin, R. Zhang, L. Zhang, and H.-Y. Shum, "HumanTOMATO: Text-aligned whole-body motion generation," arXiv preprint arXiv:2310.12978, 2023.
-
[43]
Skeleton MixFormer: Multivariate topology representation for skeleton-based action recognition
W. Xin, Q. Miao, Y. Liu, R. Liu, C.-M. Pun, and C. Shi, "Skeleton MixFormer: Multivariate topology representation for skeleton-based action recognition," in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2211–2220.
-
[44]
SkateFormer: Skeletal-temporal transformer for human action recognition
J. Do and M. Kim, "SkateFormer: Skeletal-temporal transformer for human action recognition," arXiv preprint arXiv:2403.09508, 2024.
-
[45]
Generative action description prompts for skeleton-based action recognition
W. Xiang, C. Li, Y. Zhou, B. Wang, and L. Zhang, "Generative action description prompts for skeleton-based action recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10276–10285.
-
[46]
FACT: Frame-action cross-attention temporal modeling for efficient action segmentation
Z. Lu and E. Elhamifar, "FACT: Frame-action cross-attention temporal modeling for efficient action segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18175–18185.
-
[47]
Multi-modality co-learning for efficient skeleton-based action recognition
J. Liu, C. Chen, and M. Liu, "Multi-modality co-learning for efficient skeleton-based action recognition," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4909–4918.