pith. machine review for the scientific record.

arxiv: 2604.03590 · v1 · submitted 2026-04-04 · 💻 cs.CV

Recognition: 2 Lean theorem links

SBF: An Effective Representation to Augment Skeleton for Video-based Human Action Recognition

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords human action recognition · skeleton representation · optical flow · video analysis · scale map · body map · segmentation network · deep learning

The pith

Augmenting 2D skeleton data with scale-body-flow maps raises video action recognition accuracy without added cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard 2D skeletons omit joint depth, body contours, and human-object interactions, which limits recognition in many common video scenes. The paper introduces Scale-Body-Flow (SBF), three added maps: a scale map volume encoding joint depth, a body map outlining the human subject, and an optical-flow map capturing interactions. All three are predicted by SFSNet, a segmentation network supervised only by existing skeleton and optical-flow data, with no new labels required. Experiments across datasets show the combined pipeline delivers higher recognition accuracy than leading skeleton-only methods while preserving similar speed and model size.
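
To make the pipeline concrete, here is a minimal sketch, our reading rather than the authors' code, of how per-joint skeleton heatmaps and the three SBF maps could be stacked into one input volume for a 3D-convolutional classifier. The shapes, channel layout, and joint count are assumptions; the paper's exact input format is not given here.

    import numpy as np

    # Hypothetical sizes: frames, height, width, joints.
    T, H, W, J = 32, 56, 56, 17

    skeleton_heatmaps = np.zeros((T, J, H, W), dtype=np.float32)  # one heatmap per joint
    scale_maps        = np.zeros((T, J, H, W), dtype=np.float32)  # SBF: per-joint scale (depth proxy)
    body_map          = np.zeros((T, 1, H, W), dtype=np.float32)  # SBF: human contour mask
    flow_map          = np.zeros((T, 2, H, W), dtype=np.float32)  # SBF: optical flow (dx, dy)

    # Channel-wise concatenation: the classifier sees skeleton and SBF jointly.
    x = np.concatenate([skeleton_heatmaps, scale_maps, body_map, flow_map], axis=1)
    print(x.shape)  # (32, 37, 56, 56), i.e. (T, C, H, W) for a 3D convolution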

Core claim

Integrating the SBF representation—scale map volume, body map, and flow map—predicted by SFSNet from skeleton and optical flow inputs into the action recognition pipeline produces significantly higher accuracy than state-of-the-art skeleton-only methods, with comparable compactness and efficiency.

What carries the argument

The Scale-Body-Flow (SBF) representation: a scale map volume encoding joint depths, a body map outlining the human contour, and a flow map capturing human-object interaction, all generated by SFSNet under supervision from skeleton and optical flow alone.
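
The scale map volume is the least familiar component. Purely as an assumption, since the paper's formula is not quoted here, one plausible rasterization is a Gaussian centered at each 2D joint whose amplitude carries that joint's scale:

    import numpy as np

    def scale_map(joints_xy, joint_scales, H=56, W=56, sigma=2.0):
        """Hypothetical rasterization of a scale map volume: a Gaussian at each
        joint, with amplitude set to the joint's scale (a depth proxy)."""
        ys, xs = np.mgrid[0:H, 0:W]
        vol = np.zeros((len(joints_xy), H, W), dtype=np.float32)
        for j, ((x, y), s) in enumerate(zip(joints_xy, joint_scales)):
            vol[j] = s * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        return vol  # (J, H, W): one channel per joint

    # Two joints; the nearer one (larger apparent scale) gets the larger amplitude.
    vol = scale_map([(20, 30), (40, 10)], [1.0, 0.4])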

If this is right

  • Skeleton-based HAR pipelines can reach higher accuracy by predicting and adding depth, contour, and interaction maps without increasing model size or inference time.
  • No extra human annotations beyond standard skeleton extraction are needed to train the augmentation network.
  • The approach generalizes across multiple video datasets while keeping the overall system compact.
  • Scenes involving depth cues or human-object contact become more reliably classified.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This could support real-time applications like surveillance by improving accuracy without heavier 3D sensors or larger models.
  • Similar augmentation might extend to other skeleton-driven tasks such as gesture recognition or pose tracking.
  • The method offers a lightweight bridge between 2D skeleton data and richer scene information using only video-derived signals.

Load-bearing premise

The three SBF components supply the critical missing action details and SFSNet can predict them reliably from skeleton and optical flow alone.

What would settle it

Ablation tests on benchmark datasets that remove each SBF component in turn, or replace SFSNet predictions with ground-truth maps, and measure the resulting accuracy change, especially on videos with occlusions or object interactions.
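
A hedged sketch of that protocol, with model, dataset, and evaluate as hypothetical placeholders rather than anything from the paper:

    COMPONENTS = ["scale", "body", "flow"]

    def ablate(batch, component):
        """Return a copy of an input batch with one SBF component zeroed out."""
        batch = dict(batch)
        batch[component] = batch[component] * 0
        return batch

    def ablation_study(model, dataset, evaluate):
        """Compare full-SBF accuracy against each single-component removal."""
        baseline = evaluate(model, dataset)  # accuracy with all SBF components
        for c in COMPONENTS:
            acc = evaluate(model, dataset, transform=lambda b, c=c: ablate(b, c))
            print(f"without {c}: {acc:.2%} (delta {acc - baseline:+.2%})")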

Figures

Figures reproduced from arXiv: 2604.03590 by S.-H. Gary Chan, Yang Lin, Yiyi Ding, Zhuoxuan Peng.

Figure 1: Comparison of video-based HAR pipelines. Our pro…
Figure 2: Failure cases of HAR based on extracting 2D skeletons…
Figure 3: An example video frame, its extracted skeleton, and our…
Figure 4: The overall structure of SFSNet. The flow estimator is pretrained via unsupervised learning.
Figure 5: A conceptual example of the "waving" action for our…
Figure 6: The accuracy difference (%) between our SBFConv3D…
Figure 7: Visualization of SBF components predicted by SFSNet…
Figure 8: The detailed structure of our Simplified PointRend module compared to Implicit PointRend [6]. Simplified PointRend retains the overall structure of Implicit PointRend, but the dynamic point head is replaced by a common 3-layer perceptron with ReLU as the activation function. During point annotation generation, we use ρ = 10, Npos = 32, Nneg = 128, Nbody = Nflow = 256, α = 19 a… (point sampling is sketched after this list)
Figure 10: Visualization of SBF components predicted by SF…
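
Figure 8's caption quotes the point-annotation budget (Npos = 32, Nneg = 128). As a hedged sketch of what pointly-supervised mask training with such a budget could look like, the snippet below samples positive and negative point labels from a binary mask; the paper's actual sampling rule, and the roles of ρ and α, are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(0)
    N_POS, N_NEG = 32, 128  # point budget quoted in the Figure 8 caption

    def sample_points(mask):
        """Sample N_POS foreground and N_NEG background pixel coordinates from a
        binary mask, as generic point supervision for a segmentation head."""
        pos = np.argwhere(mask)    # (row, col) of foreground pixels
        neg = np.argwhere(~mask)   # (row, col) of background pixels
        pos = pos[rng.choice(len(pos), N_POS, replace=len(pos) < N_POS)]
        neg = neg[rng.choice(len(neg), N_NEG, replace=len(neg) < N_NEG)]
        return pos, neg

    # Toy body mask: a filled rectangle standing in for a person segment.
    mask = np.zeros((56, 56), dtype=bool)
    mask[20:36, 18:40] = True
    pos_pts, neg_pts = sample_points(mask)
    print(pos_pts.shape, neg_pts.shape)  # (32, 2) (128, 2)
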
Original abstract

Many modern video-based human action recognition (HAR) approaches use 2D skeleton as the intermediate representation in their prediction pipelines. Despite overall encouraging results, these approaches still struggle in many common scenes, mainly because the skeleton does not capture critical action-related information pertaining to the depth of the joints, contour of the human body, and interaction between the human and objects. To address this, we propose an effective approach to augment skeleton with a representation capturing action-related information in the pipeline of HAR. The representation, termed Scale-Body-Flow (SBF), consists of three distinct components, namely a scale map volume given by the scale (and hence depth information) of each joint, a body map outlining the human subject, and a flow map indicating human-object interaction given by pixel-wise optical flow values. To predict SBF, we further present SFSNet, a novel segmentation network supervised by the skeleton and optical flow without extra annotation overhead beyond the existing skeleton extraction. Extensive experiments across different datasets demonstrate that our pipeline based on SBF and SFSNet achieves significantly higher HAR accuracy with similar compactness and efficiency as compared with the state-of-the-art skeleton-only approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Scale-Body-Flow (SBF), a three-component augmentation (scale map for joint depth, body map for human contour, flow map for human-object interaction) to 2D skeleton representations for video-based human action recognition. It introduces SFSNet, a segmentation network that predicts SBF from skeleton keypoints and optical flow, supervised without extra annotations beyond standard skeleton extraction. The central claim is that the SBF-augmented pipeline yields significantly higher HAR accuracy than skeleton-only state-of-the-art methods while preserving compactness and efficiency, supported by experiments across multiple datasets.

Significance. If the experimental results and supervision claims hold, the work offers a compact, annotation-light way to address known limitations of pure skeleton representations (missing depth, contours, and interactions) in common HAR scenes. This could meaningfully advance efficient skeleton-based pipelines used in surveillance and interaction systems. The absence of free parameters in the core supervision and the use of existing optical flow as a signal are potential strengths worth highlighting if the ablations confirm they are not dataset-specific heuristics.

major comments (2)
  1. [Abstract] The claim of 'significantly higher HAR accuracy' is asserted without quantitative numbers, dataset names, or ablation results. This makes the central experimental claim impossible to evaluate from the provided summary and rests the entire contribution on unverified assertions.
  2. [SFSNet supervision] The statement that SFSNet is supervised 'by the skeleton and optical flow without extra annotation overhead' is load-bearing for the no-extra-cost claim, yet generating pixel-wise scale (depth) and body-contour maps from 2D keypoints alone requires unspecified proxies or heuristics; one illustrative possibility is sketched below. Because these proxies are not shown to be reliable or generalizable, the reported gains risk stemming from dataset-specific approximations rather than the intended 'critical missing information'.
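
To make the second comment concrete, here is one purely illustrative proxy of the kind it has in mind, not taken from the paper: dilating limb segments between 2D keypoints into a coarse body mask. Any such heuristic bakes in shape and width assumptions that may not transfer across datasets.

    import numpy as np

    def limb_dilation_proxy(joints_xy, limbs, H=56, W=56, radius=4):
        """Illustrative proxy (NOT the paper's method): rasterize segments
        between joint pairs and dilate them into a coarse body mask."""
        mask = np.zeros((H, W), dtype=bool)
        ys, xs = np.mgrid[0:H, 0:W]
        for a, b in limbs:
            for t in np.linspace(0.0, 1.0, 20):  # sample points along the limb
                px = (1 - t) * joints_xy[a][0] + t * joints_xy[b][0]
                py = (1 - t) * joints_xy[a][1] + t * joints_xy[b][1]
                mask |= (xs - px) ** 2 + (ys - py) ** 2 <= radius ** 2
        return mask

    # Toy two-limb chain: shoulder -> elbow -> wrist.
    m = limb_dilation_proxy([(28, 10), (30, 25), (26, 40)], [(0, 1), (1, 2)])
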
minor comments (2)
  1. [Method] Clarify the precise computation of the scale map volume from joint distances and how it is rasterized into the volume representation.
  2. [Experiments] Add a table or figure showing the exact accuracy deltas versus the strongest skeleton-only baselines on each dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation and clarity where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'significantly higher HAR accuracy' is asserted without quantitative numbers, dataset names, or ablation results. This makes the central experimental claim impossible to evaluate from the provided summary and rests the entire contribution on unverified assertions.

    Authors: We agree that including specific quantitative results and dataset references in the abstract would make the central claims more immediately verifiable. In the revised manuscript, we have updated the abstract to report key accuracy improvements (e.g., gains over skeleton-only baselines on NTU RGB+D, Kinetics, and other evaluated datasets) along with a brief mention of the ablation studies supporting the contribution. revision: yes

  2. Referee: [SFSNet supervision] The statement that SFSNet is supervised 'by the skeleton and optical flow without extra annotation overhead' is load-bearing for the no-extra-cost claim, yet generating pixel-wise scale (depth) and body-contour maps from 2D keypoints alone requires unspecified proxies or heuristics. Because these proxies are not shown to be reliable or generalizable, the reported gains risk stemming from dataset-specific approximations rather than the intended 'critical missing information'.

    Authors: We appreciate the referee's emphasis on this point. The SFSNet supervision section already describes how the scale and body maps are derived directly from the input 2D skeleton keypoints and optical flow without requiring any additional manual annotations. To improve clarity and address concerns about reliability, we have expanded the section with more explicit descriptions of the generation process and added further cross-dataset ablation results demonstrating that the performance gains generalize and arise from the intended action-related information rather than dataset-specific effects. revision: partial

Circularity Check

0 steps flagged

No significant circularity; SFSNet supervision and HAR evaluation remain independent of the target labels

Full rationale

The paper defines SBF components (scale map, body map, flow map) as quantities derived from skeleton keypoints and optical flow, then trains SFSNet to regress those quantities and feeds the predicted SBF into a separate HAR classifier whose accuracy is measured against held-out action labels. No equation or claim equates the final HAR performance to a direct function of the input skeleton/flow by construction; the supervision signal for SFSNet is generated once from the same low-level inputs but the downstream task is a distinct classification problem. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no explicit free parameters, axioms, or invented physical entities; the approach rests on standard optical-flow computation and skeleton extraction already available in the literature.

pith-pipeline@v0.9.0 · 5514 in / 996 out tokens · 33627 ms · 2026-05-13T18:19:55.643591+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  [1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. ViViT: A Video Vision Transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6816–6826, 2021.
  [2] Jinmiao Cai, Nianjuan Jiang, Xiaoguang Han, Kui Jia, and Jiangbo Lu. JOLO-GCN: Mining joint-centered light-weight information for skeleton-based action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2735–2744, 2021.
  [3] Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
  [4] Alexandros Andre Chaaraoui, Pau Climent-Pérez, and Francisco Flórez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34(15):1799–1807, 2013.
  [5] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13339–13348, 2021.
  [6] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-Supervised Instance Segmentation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2607–2616, 2022.
  [7] Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Chaewon Park, Donghyeong Kim, and Sangyoun Lee. Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5129–5138, 2023.
  [8] Vasileios Choutas, Philippe Weinzaepfel, Jerome Revaud, and Cordelia Schmid. PoTion: Pose MoTion Representation for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7024–7033, 2018.
  [9] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-Augmented RGB Stream for Action Recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7874–7883, 2019.
  [10] Haodong Duan, Jiaqi Wang, Kai Chen, and Dahua Lin. PYSKL: Towards Good Practices for Skeleton Action Recognition. In Proceedings of the 30th ACM International Conference on Multimedia, pages 7351–7354, New York, NY, USA, 2022. Association for Computing Machinery.
  [11] Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai. Revisiting Skeleton-based Action Recognition. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2959–2968, 2022.
  [12] Christoph Feichtenhofer. X3D: Expanding Architectures for Efficient Video Recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 200–210, 2020.
  [13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast Networks for Video Recognition. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 6201–6210, 2019.
  [14] Ryo Hachiuma, Fumiaki Sato, and Taiki Sekii. Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22962–22971, 2023.
  [15] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. Towards Understanding Action Recognition. In 2013 IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
  [16] Tao Jiang, Peng Lu, Li Zhang, Ningsheng Ma, Rui Han, Chengqi Lyu, Yining Li, and Kai Chen. RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose.
  [17] Will Kay, João Carreira, K. Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, T. Back, A. Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics Human Action Video Dataset. ArXiv, 2017.
  [18] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image Segmentation As Rendering. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9796–9805, 2020.
  [19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. 2011 International Conference on Computer Vision, pages 2556–2563, 2011.
  [20] Minhyeok Lee, Suhwan Cho, Seunghoon Lee, Chaewon Park, and Sangyoun Lee. Unsupervised Video Object Segmentation via Prototype Memory Network. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 5913–5923, 2023.
  [21–22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
  [23] Hongda Liu, Yunfan Liu, Min Ren, Hao Wang, Yunlong Wang, and Zhenan Sun. Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29248–29257, 2025.
  [24] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2020.
  [25] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 140–149, 2020.
  [26] Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3618–3627, 2019.
  [27] AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova. Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2214–2224, 2023.
  [28] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised Deep Learning for Optical Flow Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
  [29] Laura Sevilla-Lara, Yiyi Liao, Fatma Güney, Varun Jampani, Andreas Geiger, and Michael J. Black. On the Integration of Optical Flow and Action Recognition. In Pattern Recognition, pages 281–297, Cham, 2019. Springer International Publishing.
  [30] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016.
  [31] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Transactions on Image Processing, 29:9532–9545, 2020.
  [32] K. Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. ArXiv.
  [33] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Constructing Stronger and Faster Baselines for Skeleton-Based Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1474–1488, 2023.
  [34] K. Soomro, Amir Zamir, and M. Shah. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. ArXiv, 2012.
  [35] Siddharth Srivastava and Gaurav Sharma. OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27412–27424, 2024.
  [36] Siddharth Srivastava and Gaurav Sharma. OmniVec: Learning robust representations with cross modal sharing. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1225–1237, 2024.
  [37] Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, and Rico Jonschkowski. SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3886–3895, 2021.
  [38] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
  [39] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A Closer Look at Spatiotemporal Convolutions for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
  [40] Md. Zia Uddin. Human activity recognition using segmented body part and body joint features with hidden Markov models. Multimedia Tools and Applications, 76(11):13585–13614, 2017.
  [41] Haoran Wang, Baosheng Yu, Jiaqi Li, Linlin Zhang, and Dongyue Chen. Multi-Stream Interaction Networks for Human Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(5):3050–3060, 2022.
  [42] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2021.
  [43] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14549–14560, 2023.
  [44] Xinghan Wang, Xin Xu, and Yadong Mu. Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10597–10607, 2023.
  [45] Liang Xu, Cuiling Lan, Wenjun Zeng, and Cewu Lu. Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition. IEEE Transactions on Multimedia, 25:4415–4425, 2023.
  [46] An Yan, Yali Wang, Zhifeng Li, and Yu Qiao. PA3D: Pose-Action 3D Machine for Video Recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7914–7923, 2019.
  [47] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  [48] Changqian Yu, Bin Xiao, Changxin Gao, Lu Yuan, Lei Zhang, Nong Sang, and Jingdong Wang. Lite-HRNet: A Lightweight High-Resolution Network. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10435–10445, 2021.
  [49] Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, and Denis Demandolx. UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19027–19037, 2024.
  [50] Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Learning Discriminative Representations for Skeleton Based Action Recognition. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10608–10617, 2023.
  [51] Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, and Xian-Sheng Hua. BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2049–2058, 2024.
  [52] Youwei Zhou, Tianyang Xu, Cong Wu, Xiaojun Wu, and Josef Kittler. Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12648–12658, 2025.
  [53] Appendix fragment (implementation details): all experiments were run on two hardware platforms, one with 8 NVIDIA GeForce 2080Ti GPUs and 16 CPUs, the other with 4 3090Ti GPUs and 40 CPUs.
  [54] Appendix fragment (quantitative results): SBF has two variants, "joint" and "limb"; the appendix reports SBFConv3D performance for the "limb" variant.
  [55] Appendix fragment (visualizations): SBF components predicted by SFSNet on NTU120 [23] (rows 1–6), HMDB51 [19] (rows 7–8), and UCF101 [33] (rows 9–10), with each joint depicted in a distinct color (Figure 10).