pith. machine review for the scientific record.

arxiv: 2604.03590 · v1 · submitted 2026-04-04 · 💻 cs.CV

Recognition: 2 Lean theorem links

SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords human action recognition · skeleton representation · optical flow · video analysis · scale map · body map · segmentation network · deep learning

The pith

Augmenting 2D skeleton data with scale-body-flow maps raises video action recognition accuracy without added cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard 2D skeletons omit joint depth, body contours, and human-object interactions, which limits recognition in many common video scenes. The paper introduces Scale-Body-Flow (SBF), three added maps: a scale map volume encoding joint depth, a body map outlining the human subject, and an optical-flow map capturing interactions. All three are predicted by SFSNet, a segmentation network supervised only by existing skeleton and optical-flow data, with no new labels required. Experiments across datasets show the combined pipeline delivers higher recognition accuracy than leading skeleton-only methods while preserving similar speed and model size.
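
To make the pipeline concrete, here is a minimal sketch, our reading rather than the authors' code, of how per-joint skeleton heatmaps and the three SBF maps could be stacked into one input volume for a 3D-convolutional classifier. The shapes, channel layout, and joint count are assumptions; the paper's exact input format is not given here.

    import numpy as np

    # Hypothetical sizes: frames, height, width, joints.
    T, H, W, J = 32, 56, 56, 17

    skeleton_heatmaps = np.zeros((T, J, H, W), dtype=np.float32)  # one heatmap per joint
    scale_maps        = np.zeros((T, J, H, W), dtype=np.float32)  # SBF: per-joint scale (depth proxy)
    body_map          = np.zeros((T, 1, H, W), dtype=np.float32)  # SBF: human contour mask
    flow_map          = np.zeros((T, 2, H, W), dtype=np.float32)  # SBF: optical flow (dx, dy)

    # Channel-wise concatenation: the classifier sees skeleton and SBF jointly.
    x = np.concatenate([skeleton_heatmaps, scale_maps, body_map, flow_map], axis=1)
    print(x.shape)  # (32, 37, 56, 56), i.e. (T, C, H, W) for a 3D convolution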

Core claim

Integrating the SBF representation—scale map volume, body map, and flow map—predicted by SFSNet from skeleton and optical flow inputs into the action recognition pipeline produces significantly higher accuracy than state-of-the-art skeleton-only methods, with comparable compactness and efficiency.

What carries the argument

The Scale-Body-Flow (SBF) representation: a scale map volume encoding joint depths, a body map outlining the human contour, and a flow map capturing human-object interaction, all generated by SFSNet under supervision from skeleton and optical flow alone.
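
The scale map volume is the least familiar component. Purely as an assumption, since the paper's formula is not quoted here, one plausible rasterization is a Gaussian centered at each 2D joint whose amplitude carries that joint's scale:

    import numpy as np

    def scale_map(joints_xy, joint_scales, H=56, W=56, sigma=2.0):
        """Hypothetical rasterization of a scale map volume: a Gaussian at each
        joint, with amplitude set to the joint's scale (a depth proxy)."""
        ys, xs = np.mgrid[0:H, 0:W]
        vol = np.zeros((len(joints_xy), H, W), dtype=np.float32)
        for j, ((x, y), s) in enumerate(zip(joints_xy, joint_scales)):
            vol[j] = s * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        return vol  # (J, H, W): one channel per joint

    # Two joints; the nearer one (larger apparent scale) gets the larger amplitude.
    vol = scale_map([(20, 30), (40, 10)], [1.0, 0.4])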

If this is right

  • Skeleton-based HAR pipelines can reach higher accuracy by predicting and adding depth, contour, and interaction maps without increasing model size or inference time.
  • No extra human annotations beyond standard skeleton extraction are needed to train the augmentation network.
  • The approach generalizes across multiple video datasets while keeping the overall system compact.
  • Scenes involving depth cues or human-object contact become more reliably classified.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This could support real-time applications like surveillance by improving accuracy without heavier 3D sensors or larger models.
  • Similar augmentation might extend to other skeleton-driven tasks such as gesture recognition or pose tracking.
  • The method offers a lightweight bridge between 2D skeleton data and richer scene information using only video-derived signals.

Load-bearing premise

The three SBF components supply the critical missing action details and SFSNet can predict them reliably from skeleton and optical flow alone.

What would settle it

Ablation tests on benchmark datasets that remove each SBF component in turn, or replace SFSNet predictions with ground-truth maps, and measure the resulting accuracy change, especially on videos with occlusions or object interactions.
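
A hedged sketch of that protocol, with model, dataset, and evaluate as hypothetical placeholders rather than anything from the paper:

    COMPONENTS = ["scale", "body", "flow"]

    def ablate(batch, component):
        """Return a copy of an input batch with one SBF component zeroed out."""
        batch = dict(batch)
        batch[component] = batch[component] * 0
        return batch

    def ablation_study(model, dataset, evaluate):
        """Compare full-SBF accuracy against each single-component removal."""
        baseline = evaluate(model, dataset)  # accuracy with all SBF components
        for c in COMPONENTS:
            acc = evaluate(model, dataset, transform=lambda b, c=c: ablate(b, c))
            print(f"without {c}: {acc:.2%} (delta {acc - baseline:+.2%})")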

Figures

Figures reproduced from arXiv: 2604.03590 by S.-H. Gary Chan, Yang Lin, Yiyi Ding, Zhuoxuan Peng.

Figure 1: Comparison of video-based HAR pipelines. Our pro…
Figure 2: Failure cases of HAR based on extracting 2D skeletons…
Figure 3: An example video frame, its extracted skeleton, and our…
Figure 4: The overall structure of SFSNet. The flow estimator is pretrained via unsupervised learning.
Figure 5: A conceptual example of the "waving" action for our…
Figure 6: The accuracy difference (%) between our SBFConv3D…
Figure 7: Visualization of SBF components predicted by SFSNet…
Figure 8: The detailed structure of our Simplified PointRend module compared to Implicit PointRend [6]. Simplified PointRend retains the overall structure of Implicit PointRend, but the dynamic point head is replaced by a common 3-layer perceptron with ReLU as the activation function. During point annotation generation, we use ρ = 10, Npos = 32, Nneg = 128, Nbody = Nflow = 256, α = 19 a… (point sampling is sketched after this list)
Figure 10: Visualization of SBF components predicted by SF…
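
Figure 8's caption quotes the point-annotation budget (Npos = 32, Nneg = 128). As a hedged sketch of what pointly-supervised mask training with such a budget could look like, the snippet below samples positive and negative point labels from a binary mask; the paper's actual sampling rule, and the roles of ρ and α, are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(0)
    N_POS, N_NEG = 32, 128  # point budget quoted in the Figure 8 caption

    def sample_points(mask):
        """Sample N_POS foreground and N_NEG background pixel coordinates from a
        binary mask, as generic point supervision for a segmentation head."""
        pos = np.argwhere(mask)    # (row, col) of foreground pixels
        neg = np.argwhere(~mask)   # (row, col) of background pixels
        pos = pos[rng.choice(len(pos), N_POS, replace=len(pos) < N_POS)]
        neg = neg[rng.choice(len(neg), N_NEG, replace=len(neg) < N_NEG)]
        return pos, neg

    # Toy body mask: a filled rectangle standing in for a person segment.
    mask = np.zeros((56, 56), dtype=bool)
    mask[20:36, 18:40] = True
    pos_pts, neg_pts = sample_points(mask)
    print(pos_pts.shape, neg_pts.shape)  # (32, 2) (128, 2)
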
Original abstract

Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Scale-Body-Flow (SBF), a three-component augmentation (scale map for joint depth, body map for human contour, flow map for human-object interaction) to 2D skeleton representations for video-based human action recognition. It introduces SFSNet, a segmentation network that predicts SBF from skeleton keypoints and optical flow, supervised without extra annotations beyond standard skeleton extraction. The central claim is that the SBF-augmented pipeline yields significantly higher HAR accuracy than skeleton-only state-of-the-art methods while preserving compactness and efficiency, supported by experiments across multiple datasets.

Significance. If the experimental results and supervision claims hold, the work offers a compact, annotation-light way to address known limitations of pure skeleton representations (missing depth, contours, and interactions) in common HAR scenes. This could meaningfully advance efficient skeleton-based pipelines used in surveillance and interaction systems. The absence of free parameters in the core supervision and the use of existing optical flow as a signal are potential strengths worth highlighting if the ablations confirm they are not dataset-specific heuristics.

major comments (2)
  1. [Abstract] The claim of 'significantly higher HAR accuracy' is asserted without quantitative numbers, dataset names, or ablation results. This makes the central experimental claim impossible to evaluate from the provided summary and rests the entire contribution on unverified assertions.
  2. [SFSNet supervision] The statement that SFSNet is supervised 'by the skeleton and optical flow without extra annotation overhead' is load-bearing for the no-extra-cost claim, yet generating pixel-wise scale (depth) and body-contour maps from 2D keypoints alone requires unspecified proxies or heuristics; one illustrative possibility is sketched below. Because these proxies are not shown to be reliable or generalizable, the reported gains risk stemming from dataset-specific approximations rather than the intended 'critical missing information'.
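
To make the second comment concrete, here is one purely illustrative proxy of the kind it has in mind, not taken from the paper: dilating limb segments between 2D keypoints into a coarse body mask. Any such heuristic bakes in shape and width assumptions that may not transfer across datasets.

    import numpy as np

    def limb_dilation_proxy(joints_xy, limbs, H=56, W=56, radius=4):
        """Illustrative proxy (NOT the paper's method): rasterize segments
        between joint pairs and dilate them into a coarse body mask."""
        mask = np.zeros((H, W), dtype=bool)
        ys, xs = np.mgrid[0:H, 0:W]
        for a, b in limbs:
            for t in np.linspace(0.0, 1.0, 20):  # sample points along the limb
                px = (1 - t) * joints_xy[a][0] + t * joints_xy[b][0]
                py = (1 - t) * joints_xy[a][1] + t * joints_xy[b][1]
                mask |= (xs - px) ** 2 + (ys - py) ** 2 <= radius ** 2
        return mask

    # Toy two-limb chain: shoulder -> elbow -> wrist.
    m = limb_dilation_proxy([(28, 10), (30, 25), (26, 40)], [(0, 1), (1, 2)])
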
minor comments (2)
  1. [Method] Clarify the precise computation of the scale map volume from joint distances and how it is rasterized into the volume representation.
  2. [Experiments] Add a table or figure showing the exact accuracy deltas versus the strongest skeleton-only baselines on each dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation and clarity where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'significantly higher HAR accuracy' is asserted without quantitative numbers, dataset names, or ablation results. This makes the central experimental claim impossible to evaluate from the provided summary and rests the entire contribution on unverified assertions.

    Authors: We agree that including specific quantitative results and dataset references in the abstract would make the central claims more immediately verifiable. In the revised manuscript, we have updated the abstract to report key accuracy improvements (e.g., gains over skeleton-only baselines on NTU RGB+D, Kinetics, and other evaluated datasets) along with a brief mention of the ablation studies supporting the contribution. revision: yes

  2. Referee: [SFSNet supervision] The statement that SFSNet is supervised 'by the skeleton and optical flow without extra annotation overhead' is load-bearing for the no-extra-cost claim, yet generating pixel-wise scale (depth) and body-contour maps from 2D keypoints alone requires unspecified proxies or heuristics. Because these proxies are not shown to be reliable or generalizable, the reported gains risk stemming from dataset-specific approximations rather than the intended 'critical missing information'.

    Authors: We appreciate the referee's emphasis on this point. The SFSNet supervision section already describes how the scale and body maps are derived directly from the input 2D skeleton keypoints and optical flow without requiring any additional manual annotations. To improve clarity and address concerns about reliability, we have expanded the section with more explicit descriptions of the generation process and added further cross-dataset ablation results demonstrating that the performance gains generalize and arise from the intended action-related information rather than dataset-specific effects. revision: partial

Circularity Check

0 steps flagged

No significant circularity; SFSNet supervision and HAR evaluation remain independent of the target labels

Full rationale

The paper defines SBF components (scale map, body map, flow map) as quantities derived from skeleton keypoints and optical flow, then trains SFSNet to regress those quantities and feeds the predicted SBF into a separate HAR classifier whose accuracy is measured against held-out action labels. No equation or claim equates the final HAR performance to a direct function of the input skeleton/flow by construction; the supervision signal for SFSNet is generated once from the same low-level inputs but the downstream task is a distinct classification problem. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no explicit free parameters, axioms, or invented physical entities; the approach rests on standard optical-flow computation and skeleton extraction already available in the literature.

pith-pipeline@v0.9.0 · 5514 in / 996 out tokens · 33627 ms · 2026-05-13T18:19:55.643591+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  [1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. ViViT: A Video Vision Transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021.
  [2] Jinmiao Cai, Nianjuan Jiang, Xiaoguang Han, Kui Jia, and Jiangbo Lu. JOLO-GCN: Mining joint-centered light-weight information for skeleton-based action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2735–2744, 2021.
  [3] Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
  [4] Alexandros Andre Chaaraoui, Pau Climent-Pérez, and Francisco Flórez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34(15):1799–1807, 2013.
  [5] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13339–13348, 2021.
  [6] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-Supervised Instance Segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2607–2616, 2022.
  [7] Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Chaewon Park, Donghyeong Kim, and Sangyoun Lee. Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5129–5138, 2023.
  [8] Vasileios Choutas, Philippe Weinzaepfel, Jerome Revaud, and Cordelia Schmid. PoTion: Pose MoTion Representation for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7024–7033, 2018.
  [9] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-Augmented RGB Stream for Action Recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7874–7883, 2019.
  [10] Haodong Duan, Jiaqi Wang, Kai Chen, and Dahua Lin. PYSKL: Towards Good Practices for Skeleton Action Recognition. In Proceedings of the 30th ACM International Conference on Multimedia, pages 7351–7354, New York, NY, USA, 2022. Association for Computing Machinery.
  [11] Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting Skeleton-based Action Recognition. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2959–2968, 2022.
  [12] Christoph Feichtenhofer. X3D: Expanding Architectures for Efficient Video Recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 200–210, 2020.
  [13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast Networks for Video Recognition. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6201–6210, 2019.
  [14] Ryo Hachiuma, Fumiaki Sato, and Taiki Sekii. Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22962–22971, 2023.
  [15] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. Towards Understanding Action Recognition. In 2013 IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
  [16] Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose.
  [17] Will Kay, João Carreira, K. Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, T. Back, A. Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics Human Action Video Dataset. ArXiv, 2017.
  [18] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image Segmentation As Rendering. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9796–9805, 2020.
  [19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. 2011 International Conference on Computer Vision, pages 2556–2563, 2011.
  [20] Minhyeok Lee, Suhwan Cho, Seunghoon Lee, Chaewon Park, and Sangyoun Lee. Unsupervised Video Object Segmentation via Prototype Memory Network. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5913–5923, 2023.
  [21–22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
  [23] Hongda Liu, Yunfan Liu, Min Ren, Hao Wang, Yunlong Wang, and Zhenan Sun. Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29248–29257, 2025.
  [24] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2020.
  [25] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 140–149, 2020.
  [26] Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3618–3627, 2019.
  [27] AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2214–2224, 2023.
  [28] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised Deep Learning for Optical Flow Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
  [29] Laura Sevilla-Lara, Yiyi Liao, Fatma Güney, Varun Jampani, Andreas Geiger, and Michael J. Black. On the Integration of Optical Flow and Action Recognition. In Pattern Recognition, pages 281–297, Cham, 2019. Springer International Publishing.
  [30] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016.
  [31] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Transactions on Image Processing, 29:9532–9545, 2020.
  [32] K. Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. ArXiv.
  [33] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Constructing Stronger and Faster Baselines for Skeleton-Based Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1474–1488, 2023.
  [34] K. Soomro, Amir Zamir, and M. Shah. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. ArXiv, 2012.
  [35] Siddharth Srivastava and Gaurav Sharma. OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27412–27424, 2024.
  [36] Siddharth Srivastava and Gaurav Sharma. OmniVec: Learning robust representations with cross modal sharing. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1225–1237, 2024.
  [37] Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, and Rico Jonschkowski. SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3886–3895, 2021.
  [38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
  [39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A Closer Look at Spatiotemporal Convolutions for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
  [40] Md. Zia Uddin. Human activity recognition using segmented body part and body joint features with hidden Markov models. Multimedia Tools and Applications, 76(11):13585–13614, 2017.
  [41] Haoran Wang, Baosheng Yu, Jiaqi Li, Linlin Zhang, and Dongyue Chen. Multi-Stream Interaction Networks for Human Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(5):3050–3060, 2022.
  [42] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2021.
  [43] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023.
  [44] Xinghan Wang, Xin Xu, and Yadong Mu. Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10597–10607, 2023.
  [45] Liang Xu, Cuiling Lan, Wenjun Zeng, and Cewu Lu. Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition. IEEE Transactions on Multimedia, 25:4415–4425, 2023.
  [46] An Yan, Yali Wang, Zhifeng Li, and Yu Qiao. PA3D: Pose-Action 3D Machine for Video Recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7914–7923, 2019.
  [47] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  [48] Changqian Yu, Bin Xiao, Changxin Gao, Lu Yuan, Lei Zhang, Nong Sang, and Jingdong Wang. Lite-HRNet: A Lightweight High-Resolution Network. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10435–10445, 2021.
  [49] Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, and Denis Demandolx. UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19027–19037, 2024.
  [50] Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Learning Discriminative Representations for Skeleton Based Action Recognition. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10608–10617, 2023.
  [51] Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, and Xian-Sheng Hua. BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2049–2058, 2024.
  [52] Youwei Zhou, Tianyang Xu, Cong Wu, Xiaojun Wu, and Josef Kittler. Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12648–12658, 2025.
  [53] Appendix fragment (implementation details): all experiments were run on two hardware platforms, one with 8 NVIDIA GeForce 2080Ti GPUs and 16 CPUs, the other with 4 3090Ti GPUs and 40 CPUs.
  [54] Appendix fragment (quantitative results): SBF has two variants, "joint" and "limb"; the appendix reports SBFConv3D performance for the "limb" variant.
  [55] Appendix fragment (visualizations): SBF components predicted by SFSNet on NTU120 [23] (rows 1–6), HMDB51 [19] (rows 7–8), and UCF101 [33] (rows 9–10), with each joint depicted in a distinct color (Figure 10).