SASI: Leveraging Sub-Action Semantics for Robust Early Action Recognition in Human-Robot Interaction
Pith reviewed 2026-05-07 10:24 UTC · model grok-4.3
The pith
SASI fuses sub-action semantics with skeleton graph features to raise early action recognition accuracy on partial sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SASI couples a skeleton-based graph convolutional network with a sub-action segmentation model, fusing spatiotemporal features with sub-action semantics. It retains both the fine-grained cues from sub-action units and the broader spatial context while processing at 29 Hz. On the BABEL dataset, the approach yields higher recognition accuracy than conventional methods and performs notably better on partial action sequences, supporting the early recognition needed for proactive human-robot interaction.
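The fusion described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the feature shapes, the histogram-based semantic embedding, and the late-fusion-by-concatenation choice are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_features(skeleton_seq):
    """Stand-in for a skeleton GCN backbone: pools a (T, J, C) joint
    sequence into a fixed-size spatiotemporal feature vector."""
    return skeleton_seq.mean(axis=(0, 1))  # shape (C,)

def sub_action_semantics(segment_labels, num_sub_actions):
    """Stand-in for the segmentation branch: a normalized histogram of
    per-frame sub-action predictions serves as the semantic embedding."""
    hist = np.bincount(segment_labels, minlength=num_sub_actions)
    return hist / max(hist.sum(), 1)  # shape (num_sub_actions,)

def fuse_and_classify(skeleton_seq, segment_labels, W, num_sub_actions):
    """Late fusion by concatenation, followed by a linear classifier."""
    fused = np.concatenate([
        gcn_features(skeleton_seq),
        sub_action_semantics(segment_labels, num_sub_actions),
    ])
    logits = W @ fused
    return int(np.argmax(logits))

# Toy input: 30 frames, 25 joints, 3 channels; 4 sub-action classes, 5 actions.
seq = rng.standard_normal((30, 25, 3))
labels = rng.integers(0, 4, size=30)
W = rng.standard_normal((5, 3 + 4))
pred = fuse_and_classify(seq, labels, W, num_sub_actions=4)
```

The point of the sketch is only the data flow: two modality-specific encoders, one joint representation, one classifier head.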
What carries the argument
SASI's cross-modal fusion, which pairs sub-action segmentation semantics with a skeleton-based graph convolutional network.
Load-bearing premise
Sub-action segmentation supplies reliable semantic cues that improve fusion with spatiotemporal features without adding noise or needing extra tuning to produce the observed gains.
What would settle it
An ablation test on BABEL showing that SASI without the sub-action segmentation module matches or exceeds full SASI accuracy on partial sequences would falsify the contribution of the semantic integration.
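The early-recognition claim the ablation would probe is operationally simple: evaluate the same classifier on truncated prefixes of each sequence at fixed observation ratios. The helper below is a generic evaluation sketch, not code from the paper; the ratio grid and the toy classifier are illustrative.

```python
import numpy as np

def early_recognition_accuracy(sequences, labels, classify,
                               ratios=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Accuracy of `classify` on prefixes of each sequence.
    `classify` maps a (t, J, C) prefix to a predicted class label."""
    results = {}
    for r in ratios:
        correct = 0
        for seq, y in zip(sequences, labels):
            t = max(1, int(round(r * len(seq))))  # observed prefix length
            correct += int(classify(seq[:t]) == y)
        results[r] = correct / len(sequences)
    return results

# Trivially separable toy data: constant-positive vs constant-negative motion.
seqs = [np.full((10, 2, 2), 1.0), np.full((10, 2, 2), -1.0)]
labels = [0, 1]
clf = lambda s: 0 if s.mean() > 0 else 1
acc = early_recognition_accuracy(seqs, labels, clf)
# accuracy is 1.0 at every observation ratio for this toy problem
```

Run with SASI and its semantics-ablated variant on the same prefix grid, and the falsification test above reduces to comparing two such accuracy curves.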
Original abstract
Understanding human actions is critical for advancing behavior analysis in human-robot interaction. Particularly in tasks that demand quick and proactive feedback, robots must recognize human actions as early as possible from incomplete observations. Sub-actions offer the semantic and hierarchical cues needed for this, since human actions are inherently structured and can be decomposed into smaller, meaningful units. However, conventional approaches focus primarily on holistic actions and often overlook the rich semantic structure embedded in sub-actions, making them poorly suited for early recognition. To address this gap, we introduce SASI (Sub-Action Semantics Integrated cross-modal fusion), a novel framework that integrates existing graph convolution networks to fuse spatiotemporal features with sub-action semantics. SASI exploits a segmentation model with a traditional skeleton-based graph convolution network, capturing both fine-grained sub-action semantics and overall spatial context, while operating in real-time at 29 Hz. Experiments on BABEL, a skeleton-based dataset with frame-level annotations, demonstrate that our method improves recognition accuracy over conventional approaches, with additional gains expected as the quality of sub-action segmentation improves. Notably, SASI also achieves superior performance in understanding partial action sequences, revealing its capability for early recognition, which is essential for proactive and seamless Human-Robot Interaction (HRI). Code is available at https://anonymous.4open.science/r/SASI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SASI, a framework that fuses sub-action semantics extracted by a segmentation model with spatiotemporal features from a skeleton-based graph convolutional network for early action recognition in human-robot interaction. It claims improved accuracy over conventional approaches on the BABEL dataset (with gains expected to increase with better segmentation quality) and superior performance on partial action sequences, while operating in real time at 29 Hz. Code is provided via an anonymous link.
Significance. If the reported gains are substantiated with proper controls, the approach could meaningfully advance early recognition for proactive HRI by exploiting hierarchical sub-action structure rather than holistic actions alone. The real-time capability and open code are practical strengths for robotics applications.
Major comments (2)
- [Abstract] The claim that the method 'improves recognition accuracy over conventional approaches' and achieves 'superior performance in understanding partial action sequences' is presented without any numerical results, baselines, error bars, or statistical details. This is load-bearing because the abstract itself states that gains depend on sub-action segmentation quality, yet supplies no evidence to support the superiority assertion.
- [Experiments] No ablation is reported that isolates the contribution of the proposed cross-modal fusion step from the sub-action segmentation input itself (e.g., simple concatenation vs. learned attention, or GCN with vs. without semantic cues). Segmentation accuracy metrics on BABEL are also absent. This directly undermines the central claim, as any downstream GCN could appear improved if the segmentation model already supplies strong frame-level signals.
Minor comments (1)
- [Abstract] The code link is to an anonymous repository; a permanent, non-anonymous link should be provided in the final version to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions to strengthen the presentation of results and experimental validation.
Point-by-point responses
-
Referee: [Abstract] The claim that the method 'improves recognition accuracy over conventional approaches' and achieves 'superior performance in understanding partial action sequences' is presented without any numerical results, baselines, error bars, or statistical details. This is load-bearing because the abstract itself states that gains depend on sub-action segmentation quality, yet supplies no evidence to support the superiority assertion.
Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised version, we will update the abstract to report specific accuracy improvements (with baselines and standard deviations) on the BABEL dataset for both full and partial sequences, while retaining the note on dependence on segmentation quality and referencing the corresponding experimental metrics. revision: yes
-
Referee: [Experiments] No ablation is reported that isolates the contribution of the proposed cross-modal fusion step from the sub-action segmentation input itself (e.g., simple concatenation vs. learned attention, or GCN with vs. without semantic cues). Segmentation accuracy metrics on BABEL are also absent. This directly undermines the central claim, as any downstream GCN could appear improved if the segmentation model already supplies strong frame-level signals.
Authors: We concur that an explicit ablation isolating the fusion mechanism is necessary to substantiate the contribution of SASI. We will add this to the experiments section, comparing the full model against ablated variants (GCN without semantics, and simple concatenation versus learned cross-modal fusion). We will also include the segmentation model's frame-level accuracy on BABEL to quantify input quality and show that downstream gains arise from the integration rather than segmentation alone. revision: yes
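The ablated variants the authors commit to can be made concrete with a small sketch. Everything here is hypothetical: the variant names, the feature dimensions, and the use of a sigmoid gate as a minimal stand-in for learned cross-modal attention are illustrative choices, not the paper's design.

```python
import numpy as np

def fuse(gcn_feat, sem_feat, variant, attn_w=None):
    """Three hypothetical fusion variants for the requested ablation:
    'none'   -> skeleton features only (semantics ablated),
    'concat' -> simple concatenation of the two modalities,
    'gated'  -> semantics reweight the skeleton features per dimension,
                a minimal stand-in for learned cross-modal attention."""
    if variant == "none":
        return gcn_feat
    if variant == "concat":
        return np.concatenate([gcn_feat, sem_feat])
    if variant == "gated":
        # attn_w maps the semantic embedding to one gate per skeleton dim.
        gate = 1.0 / (1.0 + np.exp(-(attn_w @ sem_feat)))  # sigmoid
        return gcn_feat * gate
    raise ValueError(f"unknown variant: {variant}")

g = np.ones(3)                          # toy skeleton feature
s = np.array([0.5, 0.5, 0.0, 0.0])      # toy semantic embedding
A = np.zeros((3, 4))                    # zero weights -> gate of 0.5 per dim
v_none = fuse(g, s, "none")             # shape (3,)
v_cat = fuse(g, s, "concat")            # shape (7,)
v_gate = fuse(g, s, "gated", A)         # [0.5, 0.5, 0.5]
```

Training the same classifier head on each variant's output and reporting all three accuracy columns is what would isolate the fusion mechanism from the segmentation signal.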
Circularity Check
No circularity: empirical pipeline without derivation or self-referential reduction
Full rationale
The paper describes SASI as an integration of off-the-shelf graph convolution networks with a sub-action segmentation model for fusing spatiotemporal features and semantics on the BABEL dataset. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text or abstract. Claims of improved accuracy and early recognition are presented as experimental outcomes rather than reductions by construction. The method is self-contained as a standard empirical framework whose validity rests on external benchmarks, not internal definitional loops.
Reference graph
Works this paper leans on
-
[1]
Semi-supervised classification with graph convolutional networks
T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
-
[3]
Action recognition by hierarchical mid-level action elements
T. Lan, Y. Zhu, A. R. Zamir, and S. Savarese, "Action recognition by hierarchical mid-level action elements," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4552–4560.
-
[4]
Action recognition by hierarchical mid-level action elements
——, "Action recognition by hierarchical mid-level action elements," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4552–4560.
-
[5]
Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions
Y. Yang, I. Saleemi, and M. Shah, "Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1635–1648, 2013.
-
[6]
Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments
C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325–1339, 2013.
-
[7]
Hierarchical recurrent neural network for skeleton based action recognition
Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.
-
[8]
Skeleton-based human action recognition with global context-aware attention LSTM networks
J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, "Skeleton-based human action recognition with global context-aware attention LSTM networks," IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586–1599, 2017.
-
[9]
Spatio-temporal attention-based LSTM networks for 3D action recognition and detection
S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "Spatio-temporal attention-based LSTM networks for 3D action recognition and detection," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3459–3471, 2018.
-
[10]
Skeleton based action recognition with convolutional neural network
Y. Du, Y. Fu, and L. Wang, "Skeleton based action recognition with convolutional neural network," in Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition (ACPR), 2015, pp. 579–583.
-
[11]
Spatial temporal graph convolutional networks for skeleton-based action recognition
S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
-
[12]
Channel-wise topology refinement graph convolution for skeleton-based action recognition
Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, "Channel-wise topology refinement graph convolution for skeleton-based action recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13359–13368.
-
[13]
Skeleton-based action recognition with shift graph convolutional network
K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu, "Skeleton-based action recognition with shift graph convolutional network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 183–192.
-
[14]
DeGCN: Deformable graph convolutional networks for skeleton-based action recognition
W. Myung, N. Su, J.-H. Xue, and G. Wang, "DeGCN: Deformable graph convolutional networks for skeleton-based action recognition," IEEE Transactions on Image Processing, vol. 33, pp. 2477–2490, 2024.
-
[15]
BlockGCN: Redefine topology awareness for skeleton-based action recognition
Y. Zhou, X. Yan, Z.-Q. Cheng, Y. Yan, Q. Dai, and X.-S. Hua, "BlockGCN: Redefine topology awareness for skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2049–2058.
-
[16]
Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition
H. Liu, Y. Liu, M. Ren, H. Wang, Y. Wang, and Z. Sun, "Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition," arXiv preprint arXiv:2411.18941, 2024.
-
[17]
InfoGCN: Representation learning for human skeleton-based action recognition
H.-g. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani, "InfoGCN: Representation learning for human skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20186–20196.
-
[18]
Two-stream adaptive graph convolutional networks for skeleton-based action recognition
L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Two-stream adaptive graph convolutional networks for skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
-
[19]
Temporal decoupling graph convolutional network for skeleton-based gesture recognition
J. Liu, X. Wang, C. Wang, Y. Gao, and M. Liu, "Temporal decoupling graph convolutional network for skeleton-based gesture recognition," IEEE Transactions on Multimedia, vol. 26, pp. 811–823, 2023.
-
[20]
InfoGCN++: Learning representation by predicting the future for online human skeleton-based action recognition
S. Chi, H.-g. Chi, Q. Huang, and K. Ramani, "InfoGCN++: Learning representation by predicting the future for online human skeleton-based action recognition," arXiv preprint arXiv:2310.10547, 2023.
-
[21]
NTU RGB+D: A large scale dataset for 3D human activity analysis
A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010–1019.
-
[22]
Contact-aware human motion forecasting
W. Mao, R. I. Hartley, and M. Salzmann, "Contact-aware human motion forecasting," Advances in Neural Information Processing Systems, vol. 35, pp. 7356–7367, 2022.
-
[23]
Exploiting three-dimensional gaze tracking for action recognition during bimanual manipulation to enhance human-robot collaboration
A. Haji Fathaliyan, X. Wang, and V. J. Santos, "Exploiting three-dimensional gaze tracking for action recognition during bimanual manipulation to enhance human-robot collaboration," Frontiers in Robotics and AI, vol. 5, p. 25, 2018.
-
[24]
Spatiotemporal multimodal learning with 3D CNNs for video action recognition
H. Wu, X. Ma, and Y. Li, "Spatiotemporal multimodal learning with 3D CNNs for video action recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1250–1261, 2021.
-
[25]
Revisiting skeleton-based action recognition
H. Duan, Y. Zhao, K. Chen, D. Lin, and B. Dai, "Revisiting skeleton-based action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969–2978.
-
[26]
PeVL: Pose-enhanced vision-language model for fine-grained human action recognition
H. Zhang, M. C. Leong, L. Li, and W. Lin, "PeVL: Pose-enhanced vision-language model for fine-grained human action recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18857–18867.
-
[27]
Marker-less kendo motion prediction using high-speed dual-camera system and LSTM method
Y. Cao and Y. Yamakawa, "Marker-less kendo motion prediction using high-speed dual-camera system and LSTM method," in 2022 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), 2022, pp. 159–164.
-
[28]
The wisdom of crowds: Temporal progressive attention for early action prediction
A. Stergiou and D. Damen, "The wisdom of crowds: Temporal progressive attention for early action prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14709–14719.
-
[29]
Rich action-semantic consistent knowledge for early action prediction
X. Liu, J. Yin, D. Guo, and H. Liu, "Rich action-semantic consistent knowledge for early action prediction," IEEE Transactions on Image Processing, vol. 33, pp. 479–492, 2023.
-
[30]
Multimodal human action recognition in assistive human-robot interaction
I. Rodomagoulakis, N. Kardaris, V. Pitsikalis, E. Mavroudi, A. Katsamanis, A. Tsiami, and P. Maragos, "Multimodal human action recognition in assistive human-robot interaction," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 2702–2706.
-
[31]
Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks
G. J. Maeda, G. Neumann, M. Ewerton, R. Lioutikov, O. Kroemer, and J. Peters, "Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks," Autonomous Robots, vol. 41, no. 3, pp. 593–612, 2017.
-
[32]
Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration
J. Bütepage, H. Kjellström, and D. Kragic, "Anticipating many futures: Online human motion prediction and synthesis for human-robot collaboration," arXiv preprint arXiv:1702.08212, 2017.
-
[33]
Efficient and collision-free human–robot collaboration based on intention and trajectory prediction
J. Lyu, P. Ruppel, N. Hendrich, S. Li, M. Görner, and J. Zhang, "Efficient and collision-free human–robot collaboration based on intention and trajectory prediction," IEEE Transactions on Cognitive and Developmental Systems, vol. 15, no. 4, pp. 1853–1863, 2022.
-
[34]
InteRACT: Transformer models for human intent prediction conditioned on robot actions
K. Kedia, A. Bhardwaj, P. Dan, and S. Choudhury, "InteRACT: Transformer models for human intent prediction conditioned on robot actions," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 621–628.
-
[35]
Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation
Z. Fu, T. Z. Zhao, and C. Finn, "Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation," arXiv preprint arXiv:2401.02117, 2024.
-
[36]
FineGym: A hierarchical video dataset for fine-grained action understanding
D. Shao, Y. Zhao, B. Dai, and D. Lin, "FineGym: A hierarchical video dataset for fine-grained action understanding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2616–2625.
-
[37]
BABEL: Bodies, action and behavior with English labels
A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, A. Quiros-Ramirez, and M. J. Black, "BABEL: Bodies, action and behavior with English labels," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp. 722–731.
-
[38]
Learning transferable visual models from natural language supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
-
[39]
AMASS: Archive of motion capture as surface shapes
N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, "AMASS: Archive of motion capture as surface shapes," in International Conference on Computer Vision, Oct. 2019, pp. 5442–5451.
-
[40]
TEMOS: Generating diverse human motions from textual descriptions
M. Petrovich, M. J. Black, and G. Varol, "TEMOS: Generating diverse human motions from textual descriptions," in European Conference on Computer Vision (ECCV). Springer, 2022, pp. 480–497.
-
[41]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," arXiv preprint arXiv:1908.10084, 2019.
-
[42]
HumanTOMATO: Text-aligned whole-body motion generation
S. Lu, L.-H. Chen, A. Zeng, J. Lin, R. Zhang, L. Zhang, and H.-Y. Shum, "HumanTOMATO: Text-aligned whole-body motion generation," arXiv preprint arXiv:2310.12978, 2023.
-
[43]
Skeleton MixFormer: Multivariate topology representation for skeleton-based action recognition
W. Xin, Q. Miao, Y. Liu, R. Liu, C.-M. Pun, and C. Shi, "Skeleton MixFormer: Multivariate topology representation for skeleton-based action recognition," in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2211–2220.
-
[44]
SkateFormer: Skeletal-temporal transformer for human action recognition
J. Do and M. Kim, "SkateFormer: Skeletal-temporal transformer for human action recognition," arXiv preprint arXiv:2403.09508, 2024.
-
[45]
Generative action description prompts for skeleton-based action recognition
W. Xiang, C. Li, Y. Zhou, B. Wang, and L. Zhang, "Generative action description prompts for skeleton-based action recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10276–10285.
-
[46]
FACT: Frame-action cross-attention temporal modeling for efficient action segmentation
Z. Lu and E. Elhamifar, "FACT: Frame-action cross-attention temporal modeling for efficient action segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18175–18185.
-
[47]
Multi-modality co-learning for efficient skeleton-based action recognition
J. Liu, C. Chen, and M. Liu, "Multi-modality co-learning for efficient skeleton-based action recognition," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4909–4918.