SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 18:19 UTC · model grok-4.3
The pith
Augmenting 2D skeleton data with scale-body-flow maps raises video action recognition accuracy without added cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Integrating the SBF representation—scale map volume, body map, and flow map—predicted by SFSNet from skeleton and optical flow inputs into the action recognition pipeline produces significantly higher accuracy than state-of-the-art skeleton-only methods, with comparable compactness and efficiency.
What carries the argument
Scale-Body-Flow (SBF) representation consisting of a scale map volume for joint depths, a body map for human contours, and a flow map for interactions, generated by SFSNet supervised solely by skeleton and optical flow.
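A minimal sketch of how such a representation might be assembled for a downstream classifier. The shapes, stacking order, and the function name `assemble_sbf_input` are illustrative assumptions, not the paper's exact layout:

```python
import numpy as np

def assemble_sbf_input(joint_heatmaps, scale_volume, body_map, flow_map):
    """Stack skeleton heatmaps with the three SBF components along the
    channel axis to form the classifier input (shapes are illustrative).

    joint_heatmaps: (T, J, H, W)  per-joint 2D heatmaps over T frames
    scale_volume:   (T, J, H, W)  per-joint scale (depth proxy) maps
    body_map:       (T, 1, H, W)  human-contour mask
    flow_map:       (T, 2, H, W)  pixel-wise optical flow (dx, dy)
    """
    return np.concatenate(
        [joint_heatmaps, scale_volume, body_map, flow_map], axis=1
    )

# Toy example: 4 frames, 17 joints, 56x56 maps.
T, J, H, W = 4, 17, 56, 56
x = assemble_sbf_input(
    np.zeros((T, J, H, W), np.float32),
    np.zeros((T, J, H, W), np.float32),
    np.zeros((T, 1, H, W), np.float32),
    np.zeros((T, 2, H, W), np.float32),
)
print(x.shape)  # (4, 37, 56, 56): 17 + 17 + 1 + 2 channels
```

The point of the sketch is only that the augmentation is channel-wise concatenation of cheap, video-derived maps, which is why it can leave model size and inference cost roughly unchanged.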
If this is right
- Skeleton-based HAR pipelines can reach higher accuracy by predicting and adding depth, contour, and interaction maps without increasing model size or inference time.
- No extra human annotations beyond standard skeleton extraction are needed to train the augmentation network.
- The approach generalizes across multiple video datasets while keeping the overall system compact.
- Scenes involving depth cues or human-object contact become more reliably classified.
Where Pith is reading between the lines
- This could support real-time applications like surveillance by improving accuracy without heavier 3D sensors or larger models.
- Similar augmentation might extend to other skeleton-driven tasks such as gesture recognition or pose tracking.
- The method offers a lightweight bridge between 2D skeleton data and richer scene information using only video-derived signals.
Load-bearing premise
The three SBF components supply the critical missing action details and SFSNet can predict them reliably from skeleton and optical flow alone.
What would settle it
Ablation tests on benchmark datasets showing no accuracy gain when any SBF component is removed, or when SFSNet predictions are replaced by ground-truth maps, especially on videos with occlusions or object interactions.
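The ablation the review asks for amounts to re-running the pipeline with subsets of SBF components removed. A minimal sketch of the run grid, assuming the three component names used above (the enumeration itself is generic):

```python
import itertools

COMPONENTS = ("scale", "body", "flow")

def ablation_grid():
    """Enumerate which SBF components to drop in each ablation run.

    The baseline keeps all components; the full power set of drops
    also covers interaction effects between components.
    """
    runs = [()]  # baseline: drop nothing
    for r in range(1, len(COMPONENTS) + 1):
        runs.extend(itertools.combinations(COMPONENTS, r))
    return runs

for dropped in ablation_grid():
    kept = [c for c in COMPONENTS if c not in dropped]
    print(f"drop={dropped or 'none'}  keep={kept}")
```

Comparing accuracy across these runs, ideally stratified by occlusion-heavy and object-interaction videos, is what would isolate each component's contribution.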
Figures
Original abstract
Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Scale-Body-Flow (SBF), a three-component augmentation (scale map for joint depth, body map for human contour, flow map for human-object interaction) to 2D skeleton representations for video-based human action recognition. It introduces SFSNet, a segmentation network that predicts SBF from skeleton keypoints and optical flow, supervised without extra annotations beyond standard skeleton extraction. The central claim is that the SBF-augmented pipeline yields significantly higher HAR accuracy than skeleton-only state-of-the-art methods while preserving compactness and efficiency, supported by experiments across multiple datasets.
Significance. If the experimental results and supervision claims hold, the work offers a compact, annotation-light way to address known limitations of pure skeleton representations (missing depth, contours, and interactions) in common HAR scenes. This could meaningfully advance efficient skeleton-based pipelines used in surveillance and interaction systems. The absence of free parameters in the core supervision and the use of existing optical flow as a signal are potential strengths worth highlighting if the ablations confirm they are not dataset-specific heuristics.
Major comments (2)
- [Abstract] Abstract: the claim of 'significantly higher HAR accuracy' is asserted without any quantitative numbers, dataset names, or ablation results; this makes the central experimental claim impossible to evaluate from the provided summary and places the entire contribution on unverified assertions.
- [SFSNet supervision] SFSNet supervision section: the statement that SFSNet is supervised 'by the skeleton and optical flow without extra annotation overhead' is load-bearing for the no-extra-cost claim, yet generating pixel-wise scale (depth) and body-contour maps from 2D keypoints alone requires unspecified proxies or heuristics; these proxies are not shown to be reliable or generalizable, directly risking that reported gains are attributable to dataset-specific approximations rather than the intended 'critical missing information'.
Minor comments (2)
- [Method] Clarify the precise computation of the scale map volume from joint distances and how it is rasterized into the volume representation.
- [Experiments] Add a table or figure showing the exact accuracy deltas versus the strongest skeleton-only baselines on each dataset.
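One plausible reading of the scale-map computation the minor comment asks about, sketched purely as an assumption since the paper's exact formula is not given here: rasterize a Gaussian at each joint whose amplitude encodes apparent scale derived from inter-joint distances (larger apparent bone lengths suggesting a closer, larger-scale subject). The function `rasterize_scale_map` and its parameters are hypothetical.

```python
import numpy as np

def rasterize_scale_map(joints, ref_bone, H=56, W=56, sigma=2.0):
    """Hypothetical scale-map rasterization: paint a Gaussian at each
    joint whose amplitude is the nearest-neighbor joint distance
    divided by a reference length (a crude apparent-scale/depth proxy).

    joints:   (J, 2) array of (x, y) pixel coordinates
    ref_bone: scalar reference length (e.g. a nominal torso length)
    Returns a (J, H, W) volume, one map per joint.
    """
    ys, xs = np.mgrid[0:H, 0:W]
    vol = np.zeros((len(joints), H, W), np.float32)
    for j, (x, y) in enumerate(joints):
        # Distance to the nearest other joint stands in for local scale.
        dists = np.linalg.norm(joints - joints[j], axis=1)
        scale = (np.min(dists[dists > 0]) / ref_bone
                 if np.any(dists > 0) else 1.0)
        vol[j] = scale * np.exp(
            -((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)
        )
    return vol

vol = rasterize_scale_map(
    np.array([[10.0, 10.0], [20.0, 10.0], [10.0, 30.0]]), ref_bone=20.0
)
print(vol.shape)  # (3, 56, 56)
```

Whatever the actual recipe, the referee's point stands: the manuscript should state it explicitly so readers can judge whether the depth proxy generalizes across datasets.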
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation and clarity where appropriate.
Point-by-point responses
Referee: [Abstract] Abstract: the claim of 'significantly higher HAR accuracy' is asserted without any quantitative numbers, dataset names, or ablation results; this makes the central experimental claim impossible to evaluate from the provided summary and places the entire contribution on unverified assertions.
Authors: We agree that including specific quantitative results and dataset references in the abstract would make the central claims more immediately verifiable. In the revised manuscript, we have updated the abstract to report key accuracy improvements (e.g., gains over skeleton-only baselines on NTU RGB+D, Kinetics, and other evaluated datasets) along with a brief mention of the ablation studies supporting the contribution. Revision: yes.
Referee: [SFSNet supervision] SFSNet supervision section: the statement that SFSNet is supervised 'by the skeleton and optical flow without extra annotation overhead' is load-bearing for the no-extra-cost claim, yet generating pixel-wise scale (depth) and body-contour maps from 2D keypoints alone requires unspecified proxies or heuristics; these proxies are not shown to be reliable or generalizable, directly risking that reported gains are attributable to dataset-specific approximations rather than the intended 'critical missing information'.
Authors: We appreciate the referee's emphasis on this point. The SFSNet supervision section already describes how the scale and body maps are derived directly from the input 2D skeleton keypoints and optical flow without requiring any additional manual annotations. To improve clarity and address concerns about reliability, we have expanded the section with more explicit descriptions of the generation process and added further cross-dataset ablation results demonstrating that the performance gains generalize and arise from the intended action-related information rather than dataset-specific effects. Revision: partial.
Circularity Check
No significant circularity: SFSNet supervision and HAR evaluation remain independent of the target labels.
Full rationale
The paper defines SBF components (scale map, body map, flow map) as quantities derived from skeleton keypoints and optical flow, then trains SFSNet to regress those quantities and feeds the predicted SBF into a separate HAR classifier whose accuracy is measured against held-out action labels. No equation or claim equates the final HAR performance to a direct function of the input skeleton/flow by construction; the supervision signal for SFSNet is generated once from the same low-level inputs but the downstream task is a distinct classification problem. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean : reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "SBF consists of three distinct components... scale map volume... body map... flow map... supervised by the skeleton and optical flow without extra annotation overhead"
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "SFSNet... Simplified PointRend... point annotations derived from skeleton and optical flow"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. ViViT: A Video Vision Transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021.
[2] Jinmiao Cai, Nianjuan Jiang, Xiaoguang Han, Kui Jia, and Jiangbo Lu. Jolo-gcn: Mining joint-centered light-weight information for skeleton-based action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2735–2744, 2021.
[3] Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
[4] Alexandros Andre Chaaraoui, Pau Climent-Pérez, and Francisco Flórez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34(15):1799–1807, 2013.
[5] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13339–13348, 2021.
[6] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-Supervised Instance Segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2607–2616, 2022.
[7] Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Chaewon Park, Donghyeong Kim, and Sangyoun Lee. Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5129–5138, 2023.
[8] Vasileios Choutas, Philippe Weinzaepfel, Jerome Revaud, and Cordelia Schmid. PoTion: Pose MoTion Representation for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7024–7033, 2018.
[9] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-Augmented RGB Stream for Action Recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7874–7883, 2019.
[10] Haodong Duan, Jiaqi Wang, Kai Chen, and Dahua Lin. PYSKL: Towards Good Practices for Skeleton Action Recognition. In Proceedings of the 30th ACM International Conference on Multimedia, pages 7351–7354, New York, NY, USA, 2022. Association for Computing Machinery.
[11] Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting Skeleton-based Action Recognition. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2959–2968, 2022.
[12] Christoph Feichtenhofer. X3D: Expanding Architectures for Efficient Video Recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 200–210, 2020.
[13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast Networks for Video Recognition. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6201–6210, 2019.
[14] Ryo Hachiuma, Fumiaki Sato, and Taiki Sekii. Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22962–22971, 2023.
[15] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. Towards Understanding Action Recognition. In 2013 IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
[16] Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose.
[17] Will Kay, João Carreira, K. Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, T. Back, A. Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics Human Action Video Dataset. ArXiv, 2017.
[18] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image Segmentation As Rendering. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9796–9805, 2020.
[19]
[20] Minhyeok Lee, Suhwan Cho, Seunghoon Lee, Chaewon Park, and Sangyoun Lee. Unsupervised Video Object Segmentation via Prototype Memory Network. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5913–5923, 2023.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, pages 740–755, Cham, Springer International Publishing.
[22]
[23] Hongda Liu, Yunfan Liu, Min Ren, Hao Wang, Yunlong Wang, and Zhenan Sun. Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29248–29257, 2025.
[24] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2020.
[25] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 140–149, 2020.
[26] Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3618–3627, 2019.
[27] AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2214–2224, 2023.
[28] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised Deep Learning for Optical Flow Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
[29] Laura Sevilla-Lara, Yiyi Liao, Fatma Güney, Varun Jampani, Andreas Geiger, and Michael J. Black. On the Integration of Optical Flow and Action Recognition. In Pattern Recognition, pages 281–297, Cham, 2019. Springer International Publishing.
[30] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016.
[31] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Transactions on Image Processing, 29:9532–9545, 2020.
[32] K. Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. ArXiv.
[33] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Constructing Stronger and Faster Baselines for Skeleton-Based Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1474–1488, 2023.
[34] K. Soomro, Amir Zamir, and M. Shah. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. ArXiv, 2012.
[35] Siddharth Srivastava and Gaurav Sharma. OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27412–27424, 2024.
[36] Siddharth Srivastava and Gaurav Sharma. OmniVec: Learning robust representations with cross modal sharing. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1225–1237, 2024.
[37] Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, and Rico Jonschkowski. SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3886–3895, 2021.
[38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497.
[39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A Closer Look at Spatiotemporal Convolutions for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[40]
[41] Haoran Wang, Baosheng Yu, Jiaqi Li, Linlin Zhang, and Dongyue Chen. Multi-Stream Interaction Networks for Human Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(5):3050–3060, 2022.
[42] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2021.
[43] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023.
[44] Xinghan Wang, Xin Xu, and Yadong Mu. Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10597–10607, 2023.
[45] Liang Xu, Cuiling Lan, Wenjun Zeng, and Cewu Lu. Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition. IEEE Transactions on Multimedia, 25:4415–4425, 2023.
[46] An Yan, Yali Wang, Zhifeng Li, and Yu Qiao. PA3D: Pose-Action 3D Machine for Video Recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7914–7923, 2019.
[47] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[48] Changqian Yu, Bin Xiao, Changxin Gao, Lu Yuan, Lei Zhang, Nong Sang, and Jingdong Wang. Lite-HRNet: A Lightweight High-Resolution Network. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10435–10445, 2021.
[49] Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, and Denis Demandolx. UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19027–19037, 2024.
[50] Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Learning Discriminative Representations for Skeleton Based Action Recognition. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10608–10617.
[51] Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, and Xian-Sheng Hua. BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2049–2058, 2024.
[52] Youwei Zhou, Tianyang Xu, Cong Wu, Xiaojun Wu, and Josef Kittler. Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12648–12658, 2025.
Supplementary material excerpts
[53] More Implementation Details: "In this section, we elaborate more details of our implementations of SFSNet and SBFConv3D. All our experiments are conducted on two hardware platforms, one with 8 NVIDIA GeForce 2080Ti GPUs and 16 CPUs and the other with 4 3090Ti GPUs and 40 CPUs. 8.1. SFSNet: Fig. 8 illustrates the detailed structure of our Simplified PointRe..."
[54] More Quantitative Results: "In this section, we present more experimental results to further validate the effectiveness of our SBF and SFSNet. 9.1. Results of the 'Limb' Variant of SBF: Our SBF has two variants, 'joint' and 'limb'. While previous discussions focus on the 'joint' variant, this section presents the performance of SBFConv3D based on SBF of..."
[55] More Visualization Results: "This section presents more visualizations of our predicted SBF in various datasets. Figure 10. Visualization of SBF components predicted by SFSNet on NTU120 [23] (row 1-6), HMDB51 [19] (row 7-8) and UCF101 [33] (row 9-10). Each joint is depicted in a distinct color."