Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization
Pith reviewed 2026-05-16 21:35 UTC · model grok-4.3
The pith
Contrasting non-overlapping skeleton snippets plus U-shaped fusion produces temporally fine-grained features for action boundary detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The method splits skeleton sequences into non-overlapping snippets, uses contrastive learning to force features to discriminate those snippets across videos, and fuses intermediate backbone features with a U-shaped module. The resulting representations are claimed to capture the subtle frame-to-frame differences needed to localize action boundaries more accurately than prior skeleton contrastive approaches.
What carries the argument
The snippet discrimination pretext task that densely projects sequences into non-overlapping segments and promotes cross-video distinction via contrastive learning, together with the U-shaped module for multiscale feature fusion to boost localization resolution.
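The mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the page does not say which contrastive loss or snippet length is used, so an InfoNCE-style loss is assumed, the positive is assumed to be a second view of the same snippet, and the negatives are assumed to be snippets from other videos. The names `split_into_snippets`, `info_nce`, `snippet_len`, and `temperature` are hypothetical.

```python
import math


def split_into_snippets(frames, snippet_len):
    """Partition a frame sequence into non-overlapping snippets.

    Trailing frames that do not fill a whole snippet are dropped
    (an assumption; padding would be an equally valid choice).
    """
    n = len(frames) // snippet_len
    return [frames[i * snippet_len:(i + 1) * snippet_len] for i in range(n)]


def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one query snippet embedding.

    query, positive: embeddings of two views of the same snippet (assumed).
    negatives: embeddings of other snippets, e.g. from other videos (assumed).
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]
```

With this loss, a query that matches its positive and differs from its negatives scores near zero, while a query that lies closer to a negative than to its positive is penalized heavily.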
If this is right
- Existing skeleton-based contrastive methods gain consistent improvements on BABEL across diverse subsets and protocols.
- State-of-the-art transfer learning performance is reached on PKUMMD after pretraining on NTU RGB+D and BABEL.
- Temporally sensitive features become available for any downstream skeleton task that requires frame-accurate action timing.
- The approach reduces reliance on dense frame-level labels during pretraining while still supporting precise localization.
Where Pith is reading between the lines
- The same snippet-level contrastive signal could be applied to other sequential modalities such as optical flow or joint trajectories without skeleton data.
- Pretraining in this manner may lower annotation costs for building detectors in domains like clinical gait analysis or sports coaching.
- Combining the U-shaped fusion with longer context windows might further sharpen boundaries in actions that span many seconds.
Load-bearing premise
Forcing features to discriminate non-overlapping snippets across videos will automatically create the temporally precise representations needed to pinpoint action boundaries, and the U-shaped fusion will raise resolution without adding alignment errors or overfitting to the pretraining data.
What would settle it
If the proposed pretraining produces no gain or a drop in frame-level localization metrics such as mean average precision on BABEL compared with standard video-level contrastive baselines, the central claim that snippet discrimination yields boundary-sensitive features would be falsified.
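Localization mAP is usually computed by matching predicted segments to ground-truth segments at a temporal intersection-over-union (tIoU) threshold. The page does not say which protocol the paper follows, so the sketch below only illustrates that assumed matching criterion; `temporal_iou` is a hypothetical helper name.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) with start <= end.

    Units (frames or seconds) only need to be consistent between the two.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction would count as a true positive only if its tIoU with an unmatched ground-truth segment of the same class meets the chosen threshold, so boundary errors of a few frames matter most for short actions.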
Original abstract
The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level {action} recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a self-supervised pretraining framework for skeleton-based temporal action localization. It introduces a snippet discrimination pretext task that densely segments skeleton sequences into non-overlapping snippets and applies contrastive learning to distinguish snippets across different videos. A U-shaped multiscale feature fusion module is added to existing skeleton backbones to improve feature resolution at the frame level. The approach is reported to yield consistent gains over prior skeleton contrastive methods on BABEL across subsets and protocols, plus state-of-the-art transfer results on PKUMMD after pretraining on NTU RGB+D and BABEL.
Significance. If the empirical improvements prove robust, the work would supply a practical pretraining recipe that bridges the gap between video-level skeleton recognition and frame-level localization by encouraging temporally discriminative features. The combination of a simple snippet-level contrastive objective with a U-shaped fusion module on established backbones could be adopted as a default initialization step for downstream skeleton TAL pipelines.
major comments (2)
- [§3.2] §3.2 (Snippet Discrimination Pretext): The contrastive loss operates exclusively on whole-snippet embeddings; nothing in the objective explicitly penalizes or rewards intra-snippet temporal variation. Consequently the learned representation may remain nearly constant inside each snippet while still separating different videos, undermining the claim that the pretext automatically produces the frame-level gradients required for boundary detection.
- [§4.3, Table 4] §4.3 and Table 4 (Transfer results on PKUMMD): The SOTA claim rests on a single pretraining combination (NTU+BABEL) without reporting variance across random seeds, statistical significance tests, or ablation of the U-shaped module alone. Without these controls it is impossible to attribute the reported gains specifically to the proposed snippet discrimination rather than to backbone capacity or training schedule.
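The first major comment can be made concrete with a toy case. The sketch below assumes the snippet embedding is a mean over per-frame features, which the page does not confirm; `mean_pool` and both feature lists are hypothetical. Under that assumption, a snippet whose features change every frame and a snippet whose features never change can share the same embedding, so a loss computed only on the pooled vector cannot tell them apart.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one snippet embedding."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Two hypothetical 4-frame snippets with 2-D frame features.
flat = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]      # no temporal variation
varying = [[0.0, 2.0], [2.0, 2.0], [0.5, 2.0], [1.5, 2.0]]   # changes every frame

# Both snippets pool to the same embedding.
print(mean_pool(flat), mean_pool(varying))
```

Whether this failure mode occurs in practice depends on the actual aggregation and on how the backbone shares computation across frames, which is what the requested analysis would need to show.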
minor comments (2)
- [Abstract] Abstract: The phrase 'video-level {action} recognition' contains an apparent LaTeX artifact that should be cleaned.
- [§3.3] §3.3 (U-shaped module): The description of how skip connections are aligned across the encoder-decoder stages lacks explicit equations for the upsampling and concatenation operations; adding them would improve reproducibility.
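As an indication of what the requested equations might look like, one common way to write a U-shaped temporal decoder is given below. The notation is an assumption for illustration and is not taken from the paper: $F_s \in \mathbb{R}^{T_s \times C_s}$ denotes the encoder feature at stage $s$ with temporal lengths $T_1 > \dots > T_S$, $\operatorname{Up}(\cdot\,; T_s)$ is temporal interpolation to length $T_s$, $\Vert$ is channel-wise concatenation, and $\phi_s$ is a learned fusion layer such as a 1D convolution.

```latex
\begin{align}
D_S &= F_S, \\
D_s &= \phi_s\!\left(\left[\,\operatorname{Up}(D_{s+1};\, T_s) \;\Vert\; F_s\,\right]\right),
      \qquad s = S-1, \dots, 1 .
\end{align}
```

In this form $D_1$ has temporal length $T_1$ and, if $T_1$ equals the input length, would serve as the frame-level feature. Stating the interpolation mode and the channel width of each $\phi_s$ would be enough to make the skip-connection alignment reproducible.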
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and indicate the revisions we will make to strengthen the paper.
Referee: [§3.2] §3.2 (Snippet Discrimination Pretext): The contrastive loss operates exclusively on whole-snippet embeddings; nothing in the objective explicitly penalizes or rewards intra-snippet temporal variation. Consequently the learned representation may remain nearly constant inside each snippet while still separating different videos, undermining the claim that the pretext automatically produces the frame-level gradients required for boundary detection.
Authors: We appreciate this observation on the nature of the snippet-level contrastive objective. While the loss is computed on aggregated snippet embeddings, the dense non-overlapping segmentation means that distinguishing adjacent snippets from the same video requires the backbone to encode distinct motion patterns at the frame level; otherwise, nearby snippets would be indistinguishable under the contrastive objective. The U-shaped fusion module is specifically designed to maintain and enhance frame-level resolution from the intermediate backbone features. To address the concern directly, we will add a new analysis subsection (including intra-snippet feature variance statistics and t-SNE visualizations of frame embeddings within snippets) in the revised manuscript to demonstrate that temporal variation is indeed preserved and encouraged.
Revision: yes
Referee: [§4.3, Table 4] §4.3 and Table 4 (Transfer results on PKUMMD): The SOTA claim rests on a single pretraining combination (NTU+BABEL) without reporting variance across random seeds, statistical significance tests, or ablation of the U-shaped module alone. Without these controls it is impossible to attribute the reported gains specifically to the proposed snippet discrimination rather than to backbone capacity or training schedule.
Authors: We agree that reporting variance, statistical tests, and targeted ablations would improve the robustness of the SOTA claim. In the revised manuscript we will (i) rerun the PKUMMD transfer experiments over multiple random seeds and report mean ± standard deviation, (ii) include statistical significance tests (e.g., paired t-tests) against the strongest baselines, and (iii) add an ablation isolating the U-shaped module’s contribution on the transfer task. These additions will allow clearer attribution of gains to the proposed snippet discrimination pretext and fusion module.
Revision: yes
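The statistics promised in the second response can be computed with the standard library alone. The sketch below is illustrative: the score lists are made-up placeholders, not results from the paper, and `mean_and_std` and `paired_t_statistic` are hypothetical helper names. It reports the mean and sample standard deviation across seeds and the paired t statistic on per-seed differences; converting that statistic to a p-value needs the Student t distribution with n - 1 degrees of freedom, which is left out here.

```python
import math
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation (n - 1 denominator) across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(ours, baseline):
    """Paired t statistic on per-seed differences (same seeds, same order)."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all per-seed differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Made-up placeholder scores for five seeds; NOT numbers from the paper.
toy_ours = [0.52, 0.55, 0.53, 0.56, 0.54]
toy_baseline = [0.50, 0.52, 0.52, 0.53, 0.51]

mean, std = mean_and_std(toy_ours)
t_stat = paired_t_statistic(toy_ours, toy_baseline)
print(f"{mean:.3f} +/- {std:.3f}, paired t = {t_stat:.2f}")
```

Pairing by seed only makes sense if both methods share the same seeds and data splits; otherwise an unpaired test would be the safer choice.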
Circularity Check
No circularity; method and claims are empirically grounded
The paper defines a new snippet-level contrastive pretext task and a U-shaped multiscale fusion module on top of published skeleton backbones. Performance gains are asserted via transfer experiments on BABEL and PKUMMD rather than by algebraic reduction to the inputs. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claim remains falsifiable through the reported downstream metrics.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "we formulate a snippet discrimination pretext task ... densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "U-shaped module ... progressively upsample the final output to the original temporal resolution"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.