Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Alan Whone; Catherine Morgan; Jingjing Liu; Majid Mirmehdi; Qiushuo Cheng

arxiv: 2512.16504 · v3 · submitted 2025-12-18 · 💻 cs.CV

Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi

Pith reviewed 2026-05-16 21:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords skeleton action localization · contrastive learning · self-supervised pretraining · multiscale feature fusion · temporal action detection · 3D skeleton data · BABEL dataset · transfer learning

The pith

Contrasting non-overlapping skeleton snippets plus U-shaped fusion produces temporally fine-grained features for action boundary detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates a snippet discrimination pretext task that divides skeleton sequences into non-overlapping segments and trains models to distinguish these segments across different videos through contrastive learning. This is paired with a U-shaped module that fuses intermediate multiscale features from existing skeleton backbones to raise frame-level resolution. The resulting representations are shown to improve localization performance on the BABEL dataset under multiple subsets and evaluation protocols. The same pretraining also delivers state-of-the-art transfer results on PKUMMD after training on NTU RGB+D and BABEL.

Core claim

By projecting skeleton sequences into non-overlapping snippets and using contrastive learning to force features to discriminate them across videos, combined with U-shaped fusion of intermediate features, the method yields representations that capture subtle frame-to-frame differences required to localize action boundaries more accurately than prior skeleton contrastive approaches.

What carries the argument

The snippet discrimination pretext task that densely projects sequences into non-overlapping segments and promotes cross-video distinction via contrastive learning, together with the U-shaped module for multiscale feature fusion to boost localization resolution.
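The snippet-level objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`split_into_snippets`, `mean_pool`, `info_nce`), the mean pooling of frames into one snippet embedding, the InfoNCE form of the loss, and the temperature value are all hypothetical choices made for the example.

```python
import math

def split_into_snippets(frames, snippet_len):
    """Split a frame sequence into non-overlapping snippets, dropping any remainder."""
    n = len(frames) // snippet_len
    return [frames[i * snippet_len:(i + 1) * snippet_len] for i in range(n)]

def mean_pool(snippet):
    """Average the per-frame feature vectors of one snippet (assumed pooling)."""
    dim = len(snippet[0])
    return [sum(frame[d] for frame in snippet) / len(snippet) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query snippet: negative log-softmax of the positive logit."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

In this sketch the positive would be the same snippet seen under a different augmentation, and the negatives would be snippets drawn from other videos; the loss is small when the query is closer to its positive than to any negative.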

If this is right

Existing skeleton-based contrastive methods gain consistent improvements on BABEL across diverse subsets and protocols.
State-of-the-art transfer learning performance is reached on PKUMMD after pretraining on NTU RGB+D and BABEL.
Temporally sensitive features become available for any downstream skeleton task that requires frame-accurate action timing.
The approach reduces reliance on dense frame-level labels during pretraining while still supporting precise localization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same snippet-level contrastive signal could be applied to other sequential modalities such as optical flow or joint trajectories without skeleton data.
Pretraining in this manner may lower annotation costs for building detectors in domains like clinical gait analysis or sports coaching.
Combining the U-shaped fusion with longer context windows might further sharpen boundaries in actions that span many seconds.

Load-bearing premise

Forcing features to discriminate non-overlapping snippets across videos will automatically create the temporally precise representations needed to pinpoint action boundaries, and the U-shaped fusion will raise resolution without adding alignment errors or overfitting to the pretraining data.

What would settle it

If the proposed pretraining produces no gain or a drop in frame-level localization metrics such as mean average precision on BABEL compared with standard video-level contrastive baselines, the central claim that snippet discrimination yields boundary-sensitive features would be falsified.
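Localization metrics of the kind mentioned above are usually built on temporal overlap between predicted and ground-truth segments. The sketch below shows that building block only; the function names and the half-open `(start, end)` frame convention are assumptions for illustration, and the paper's exact BABEL protocol is not reproduced here.

```python
def segments_from_labels(labels):
    """Collapse frame-wise labels into (label, start, end) segments, end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current segment at the end of the sequence or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments

def temporal_iou(pred, gt):
    """Intersection over union of two half-open frame intervals (start, end)."""
    intersection = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

A prediction is typically counted as correct when its label matches and its temporal IoU with a ground-truth segment exceeds a chosen threshold, so boundary errors of even a few frames lower the score on short actions.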

Figures

Figures reproduced from arXiv: 2512.16504 by Alan Whone, Catherine Morgan, Jingjing Liu, Majid Mirmehdi, Qiushuo Cheng.

**Figure 2.** Overall pipeline of pretraining and finetuning in proposed designs.

**Figure 3.** Similarity-based matching. In our pretext task, each snippet is treated as a distinct instance in its own right to discriminate against other snippets across videos. We therefore adopt a similarity-based matching strategy [44] to establish temporal correspondence between snippets in different views (see ...)

**Figure 4.** Qualitative visualization of action predictions on BABEL.
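The similarity-based matching named in the Figure 3 caption can be illustrated with a small sketch. It assumes cosine similarity and a per-snippet argmax, which is one plausible reading of the caption rather than a confirmed detail of the paper; `match_snippets` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def match_snippets(view_a, view_b):
    """For each snippet embedding in view_a, return the index of its most similar snippet in view_b."""
    return [
        max(range(len(view_b)), key=lambda j: cosine(a, view_b[j]))
        for a in view_a
    ]
```

Matching by similarity rather than by position lets the two views differ in temporal cropping or speed, since the positive pair is chosen by feature agreement instead of a fixed index.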

Original abstract

The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level {action} recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The snippet discrimination pretext plus U-shaped fusion gives usable gains on BABEL and PKUMMD transfer, but the boundary sensitivity still rests on an indirect mechanism.

The paper's main contribution is a snippet discrimination pretext task that contrasts non-overlapping segments from skeleton sequences across videos, paired with a U-shaped multiscale fusion module built on top of existing recognition backbones to raise temporal resolution for localization. This produces consistent improvements on BABEL across subsets and protocols, plus state-of-the-art transfer results on PKUMMD after pretraining on NTU RGB+D and BABEL. The approach is practical because it reuses published backbones rather than requiring entirely new architectures. The experiments cover multiple evaluation settings, which helps show the method is not tied to one narrow protocol. The fusion step is a straightforward way to combine intermediate features without heavy redesign.

The soft spot is the stress-test concern about intra-snippet gradients. The contrastive loss operates at the whole-snippet level, so nothing in the objective directly encourages the model to represent small frame-to-frame differences inside a snippet; the embeddings could stay relatively flat within each segment while still separating videos. The U-shaped upsampling then increases resolution, but without an explicit alignment or boundary term it can smooth transitions or introduce interpolation artifacts. The paper would be stronger with ablations that remove the snippet task while keeping the fusion, plus error bars or multiple seeds to confirm the reported lifts are not from tuning alone.

This is for groups already working on self-supervised skeleton recognition who need to move to localization without starting over. It deserves a serious referee because the method is concrete, the claims are testable on public benchmarks, and the gap it targets is real even if the gains turn out incremental.

Referee Report

2 major / 2 minor

Summary. The paper proposes a self-supervised pretraining framework for skeleton-based temporal action localization. It introduces a snippet discrimination pretext task that densely segments skeleton sequences into non-overlapping snippets and applies contrastive learning to distinguish snippets across different videos. A U-shaped multiscale feature fusion module is added to existing skeleton backbones to improve feature resolution at the frame level. The approach is reported to yield consistent gains over prior skeleton contrastive methods on BABEL across subsets and protocols, plus state-of-the-art transfer results on PKUMMD after pretraining on NTU RGB+D and BABEL.

Significance. If the empirical improvements prove robust, the work would supply a practical pretraining recipe that bridges the gap between video-level skeleton recognition and frame-level localization by encouraging temporally discriminative features. The combination of a simple snippet-level contrastive objective with a U-shaped fusion module on established backbones could be adopted as a default initialization step for downstream skeleton TAL pipelines.

major comments (2)

§3.2 (Snippet Discrimination Pretext): The contrastive loss operates exclusively on whole-snippet embeddings; nothing in the objective explicitly penalizes or rewards intra-snippet temporal variation. Consequently the learned representation may remain nearly constant inside each snippet while still separating different videos, undermining the claim that the pretext automatically produces the frame-level gradients required for boundary detection.
§4.3 and Table 4 (Transfer results on PKUMMD): The SOTA claim rests on a single pretraining combination (NTU+BABEL) without reporting variance across random seeds, statistical significance tests, or ablation of the U-shaped module alone. Without these controls it is impossible to attribute the reported gains specifically to the proposed snippet discrimination rather than to backbone capacity or training schedule.

minor comments (2)

Abstract: The phrase 'video-level {action} recognition' contains an apparent LaTeX artifact that should be cleaned.
§3.3 (U-shaped module): The description of how skip connections are aligned across the encoder-decoder stages lacks explicit equations for the upsampling and concatenation operations; adding them would improve reproducibility.
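The upsampling and concatenation steps that the last comment asks the authors to formalize can be illustrated generically. The sketch below is not the paper's module: nearest-neighbour temporal upsampling, plain channel concatenation, and the names `upsample_nearest`, `fuse`, and `u_shaped_decode` are assumptions, and the learned layers that a real U-shaped decoder applies after each fusion are omitted.

```python
def upsample_nearest(features, factor):
    """Repeat each time step `factor` times (nearest-neighbour temporal upsampling)."""
    return [list(step) for step in features for _ in range(factor)]

def fuse(coarse, fine):
    """Upsample `coarse` to the length of `fine`, then concatenate channels per time step."""
    factor = len(fine) // len(coarse)
    if factor * len(coarse) != len(fine):
        raise ValueError("fine length must be a multiple of coarse length")
    upsampled = upsample_nearest(coarse, factor)
    return [list(f) + u for f, u in zip(fine, upsampled)]

def u_shaped_decode(pyramid):
    """Fuse feature maps ordered fine-to-coarse, starting from the coarsest scale."""
    output = pyramid[-1]
    for skip in reversed(pyramid[:-1]):
        output = fuse(output, skip)
    return output
```

With intermediate features at 8, 4, and 2 time steps, the decoder returns 8 time steps whose channels stack all three scales, which is the frame-level resolution a localization head needs.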

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and indicate the revisions we will make to strengthen the paper.

Referee: §3.2 (Snippet Discrimination Pretext): The contrastive loss operates exclusively on whole-snippet embeddings; nothing in the objective explicitly penalizes or rewards intra-snippet temporal variation. Consequently the learned representation may remain nearly constant inside each snippet while still separating different videos, undermining the claim that the pretext automatically produces the frame-level gradients required for boundary detection.

Authors: We appreciate this observation on the nature of the snippet-level contrastive objective. While the loss is computed on aggregated snippet embeddings, the dense non-overlapping segmentation means that distinguishing adjacent snippets from the same video requires the backbone to encode distinct motion patterns at the frame level; otherwise, nearby snippets would be indistinguishable under the contrastive pull. The U-shaped fusion module is specifically designed to maintain and enhance frame-level resolution from the intermediate backbone features. To address the concern directly, we will add a new analysis subsection (including intra-snippet feature variance statistics and t-SNE visualizations of frame embeddings within snippets) in the revised manuscript to demonstrate that temporal variation is indeed preserved and encouraged.

Revision: yes

Referee: §4.3 and Table 4 (Transfer results on PKUMMD): The SOTA claim rests on a single pretraining combination (NTU+BABEL) without reporting variance across random seeds, statistical significance tests, or ablation of the U-shaped module alone. Without these controls it is impossible to attribute the reported gains specifically to the proposed snippet discrimination rather than to backbone capacity or training schedule.

Authors: We agree that reporting variance, statistical tests, and targeted ablations would improve the robustness of the SOTA claim. In the revised manuscript we will (i) rerun the PKUMMD transfer experiments over multiple random seeds and report mean ± standard deviation, (ii) include statistical significance tests (e.g., paired t-tests) against the strongest baselines, and (iii) add an ablation isolating the U-shaped module’s contribution on the transfer task. These additions will allow clearer attribution of gains to the proposed snippet discrimination pretext and fusion module.

Revision: yes
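The intra-snippet feature variance statistic promised in the first response could be computed along the following lines. This is a hedged sketch: the definition used here (population variance per channel, averaged over channels, one score per snippet) and the name `intra_snippet_variance` are assumptions, since the rebuttal does not specify the statistic.

```python
from statistics import pvariance

def intra_snippet_variance(frame_features, snippet_len):
    """Mean per-channel population variance of frame features inside each snippet."""
    num_snippets = len(frame_features) // snippet_len
    dim = len(frame_features[0])
    scores = []
    for i in range(num_snippets):
        snippet = frame_features[i * snippet_len:(i + 1) * snippet_len]
        # Variance over time for each channel, then averaged across channels.
        per_channel = [pvariance([frame[d] for frame in snippet]) for d in range(dim)]
        scores.append(sum(per_channel) / dim)
    return scores
```

Scores near zero for every snippet would support the referee's concern that frame features stay flat within a segment; clearly non-zero scores would indicate that temporal variation survives the snippet-level objective.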

Circularity Check

0 steps flagged

No circularity; method and claims are empirically grounded

The paper defines a new snippet-level contrastive pretext task and a U-shaped multiscale fusion module on top of published skeleton backbones. Performance gains are asserted via transfer experiments on BABEL and PKUMMD rather than by algebraic reduction to the inputs. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claim remains falsifiable through the reported downstream metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: we formulate a snippet discrimination pretext task ... densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: U-shaped module ... progressively upsample the final output to the original temporal resolution

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages

[1] Abdelfattah, M., Hassan, M., Alahi, A.: MaskCLR: Attention-guided contrastive learning for robust action representation learning. In: CVPR. pp. 18678–18687 (2024)
[2] Adeli, V., Mehraban, S., Mirmehdi, M., Whone, A., Filtjens, B., Dadashzadeh, A., Fasano, A., Iaboni, A., Taati, B.: GAITGen: Disentangled motion-pathology impaired gait generative model–bringing motion generation to the clinical domain. In: WACV (2026)
[3] Alwassel, H., Giancola, S., Ghanem, B.: TSP: Temporally-sensitive pretraining of video encoders for localization tasks. In: ICCV (2021)
[4] Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.J., Yuan, J., Thalmann, N.M.: Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In: ICCV (2019)
[5] Chen, B., Nie, W., Ji, H., Ren, W., Tong, Q., Wang, Z., Liu, H.: Multiscale skeleton-based temporal action segmentation using hierarchical temporal modeling and prediction ensemble. IEEE Trans. Cybern. (2025)
[6] Chen, Y., Zhao, L., Yuan, J., Tian, Y., Xia, Z., Geng, S., Han, L., Metaxas, D.N.: Hierarchically self-supervised transformer for human skeleton representation learning. In: ECCV (2022)
[7] Chen, Z., Li, S., Yang, B., Li, Q., Liu, H.: Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: AAAI (2021)
[8] Cheng, Q., Morgan, C., Sikdar, A., Masullo, A., Whone, A., Mirmehdi, M.: Your turn: At home turning angle estimation for parkinson’s disease severity assessment. Artif. Intell. Med. p. 103194 (2025)
[9] Dadashzadeh, A., Duan, S., Whone, A., Mirmehdi, M.: Pecop: Parameter efficient continual pretraining for action quality assessment. In: WACV (2024)
[10] Dave, I., Gupta, R., Rizve, M.N., Shah, M.: TCLR: Temporal contrastive learning for video representation. Comput. Vis. Image Underst. (2022)
[11] Do, J., Kim, M.: SkateFormer: skeletal-temporal transformer for human action recognition. In: ECCV (2024)
[12] Dong, J., Sun, S., Liu, Z., Chen, S., Liu, B., Wang, X.: Hierarchical contrast for unsupervised skeleton-based action representation learning. In: AAAI (2023)
[13] Fang, B., Wu, W., Liu, C., Zhou, Y., He, D., Wang, W.: Mamico: Macro-to-micro semantic correspondence for self-supervised video representation learning. In: ACM MM (2022)
[14] Gao, R., Liu, X., Yang, J., Yue, H.: CdCLR: Clip-driven contrastive learning for skeleton-based action recognition. In: 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP) (2022)
[15] Ghosh, P., Yao, Y., Davis, L., Divakaran, A.: Stacked spatio-temporal graph convolutional networks for action segmentation. In: WACV (2020)
[16] Gökay, U., Spurio, F., Bach, D.R., Gall, J.: Skeleton motion words for unsupervised skeleton-based temporal action segmentation. In: ICCV (2025)
[17] Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: AAAI (2022)
[18] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)
[19] Hu, J., Hou, Y., Guo, Z., Gao, J.: Global and local contrastive learning for self-supervised skeleton-based action recognition. IEEE TCSVT (2024)
[20] Hua, G., Liu, H., Li, W., Zhang, Q., Ding, R., Xu, X.: Weakly-supervised 3d human pose estimation with cross-view u-shaped graph convolutional network. ACM MM 25 (2022)
[21] Jang, S., Lee, H., Kim, W.J., Lee, J., Woo, S., Lee, S.: Multi-scale structural graph convolutional network for skeleton-based action recognition. IEEE TCSVT (2024)
[22] Li, L., Wang, M., Ni, B., Wang, H., Yang, J., Zhang, W.: 3D human action representation learning via cross-view consistency pursuit. In: CVPR (2021)
[23] Li, T., Liu, J., Zhang, W., Ni, Y., Wang, W., Li, Z.: UAV-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In: CVPR (2021)
[24] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
[25] Liu, C., Hu, Y., Li, Y., Song, S., Liu, J.: PKU-MMD: A large scale benchmark for skeleton-based human action understanding. In: ACM MM (2017)
[26] Liu, H., Liu, Y., Ren, M., Wang, H., Wang, Y., Sun, Z.: Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. In: CVPR (2025)
[27] Liu, Y., Wang, K., Liu, L., Lan, H., Lin, L.: TCGL: Temporal contrastive graph for self-supervised video representation learning. IEEE TIP (2022)
[28] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: CVPR (2019)
[29] Morgan, C., Tonkin, E.L., Masullo, A., Jovan, F., Sikdar, A., Khaire, P., Mirmehdi, M., McConville, R., Tourte, G.J., Whone, A., et al.: A multimodal dataset of real world mobility activities in parkinson’s disease. Scientific Data (2023)
[30] Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: Bodies, action and behavior with english labels. In: CVPR (2021)
[31] Ray, A., Raj, A., Kolekar, M.H.: Autoregressive adaptive hypergraph transformer for skeleton-based activity recognition. In: WACV (2025)
[32] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
[33] Sager, C., Janiesch, C., Zschech, P.: A survey of image labelling for computer vision applications. J. Bus. Anal. 4(2) (2021)
[34] Sardari, S., Sharifzadeh, S., Daneshkhah, A., Nakisa, B., Loke, S.W., Palade, V., Duncan, M.J.: Artificial intelligence for skeleton-based physical rehabilitation action evaluation: A systematic review. Comput. Biol. Med. 158, 106835 (2023)
[35] Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D a large scale dataset for 3d human activity analysis. In: CVPR (2016)
[36] Shao, Y., Zhang, F., Xu, C.: Snippet-to-prototype contrastive consensus network for weakly supervised temporal action localization. IEEE TMM (2024)
[37] Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: CVPR (2019)
[38] Shu, X., Xu, B., Zhang, L., Tang, J.: Multi-granularity anchor-contrastive representation learning for semi-supervised skeleton-based action recognition. IEEE TPAMI 45(6) (2022)
[39] Su, H., Gan, W., Wu, W., Qiao, Y., Yan, J.: BSN++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. In: AAAI (2021)
[40] Tao, L., Wang, X., Yamasaki, T.: Self-supervised video representation learning using inter-intra contrastive framework. In: ACM MM (2020)
[41] Tao, L., Wang, X., Yamasaki, T.: An improved inter-intra contrastive learning framework on self-supervised video representation. IEEE TCSVT (2022)
[42] Wang, H., Mirmehdi, M., Damen, D., Perrett, T.: Centre Stage: Centricity-based audio-visual temporal action detection. In: VUA workshop at BMVC (2023)
[43] Wang, H., Ma, X., Kuang, J., Gui, J.: Heterogeneous skeleton-based action representation learning. In: CVPR (2025)
[44] Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. In: CVPR (2021)
[45] Weng, W., Wang, H., Wang, J., He, L., Xie, G.S.: USDRL: Unified skeleton-based dense representation learning with multi-grained feature decorrelation. In: AAAI (2025)
[46] Wu, L., Lin, L., Zhang, J., Ma, Y., Liu, J.: MacDiff: Unified skeleton modeling with masked conditional diffusion. In: ECCV (2024)
[47] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)
[48] Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., Hu, H.: Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In: CVPR (2021)
[49] Xu, M., Pérez-Rúa, J.M., Escorcia, V., Martinez, B., Zhu, X., Zhang, L., Ghanem, B., Xiang, T.: Boundary-sensitive pre-training for temporal localization in videos. In: ICCV (2021)
[50] Xu, R., Liu, C., Chen, Y., Lei, Z.: Snippet-level supervised contrastive learning-based transformer for temporal action detection. In: IJCNN (2023)
[51] Yan, H., Liu, Y., Wei, Y., Li, Z., Li, G., Lin, L.: SkeletonMAE: graph-based masked autoencoder for skeleton sequence pre-training. In: ICCV (2023)
[52] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI (2018)
[53] Yan, X., Pun, C.M., Li, H., Liu, M., Gao, H.: Hierarchical local temporal feature enhancing for transformer-based 3d human pose estimation. In: ICME (2024)
[54] Yang, D., Wang, Y., Dantcheva, A., Garattoni, L., Francesca, G., Brémond, F.: UNIK: A unified framework for real-world skeleton-based action recognition. In: BMVC (2021)
[55] Yang, D., Wang, Y., Dantcheva, A., Kong, Q., Garattoni, L., Francesca, G., Bremond, F.: LAC-latent action composition for skeleton-based action segmentation. In: CVPR (2023)
[56] Yang, H., Guo, L., Zhang, Y., Wu, X.: U-shaped spatial–temporal transformer network for 3d human pose estimation. Mach. Vis. Appl. 33(6) (2022)
[57] Yu, Q., Fujiwara, K.: Frame-level label refinement for skeleton-based weakly-supervised action recognition. In: AAAI (2023)
[58] Zhang, C., Cao, M., Yang, D., Chen, J., Zou, Y.: Cola: Weakly-supervised temporal action localization with snippet contrastive learning. In: CVPR (2021)
[59] Zhang, J., Lin, L., Liu, J.: Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. In: AAAI (2023)
[60] Zhang, J., Lin, L., Liu, J.: Prompted contrast with masked motion modeling: Towards versatile 3d action representation learning. In: ACM MM (2023)
[61] Zhou, Y., Xu, T., Wu, C., Wu, X., Kittler, J.: Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. In: ICCV (2025)
[62] Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: MICCAI (2018)
[63] Zhu, Y., Han, H., Yu, Z., Liu, G.: Modeling the relative visual tempo for self-supervised skeleton-based action recognition. In: CVPR (2023)

[1] [1]

In: CVPR

Abdelfattah, M., Hassan, M., Alahi, A.: MaskCLR: Attention-guided contrastive learning for robust action representation learning. In: CVPR. pp. 18678–18687 (2024) 2, 3, 4

work page 2024

[2] [2]

In: WACV (2026) 12

Adeli, V., Mehraban, S., Mirmehdi, M., Whone, A., Filtjens, B., Dadashzadeh, A., Fasano, A., Iaboni, A., Taati, B.: GAITGen: Disentangled motion-pathology impaired gait generative model–bringing motion generation to the clinical domain. In: WACV (2026) 12

work page 2026

[3] [3]

In: ICCV (2021) 2

Alwassel, H., Giancola, S., Ghanem, B.: TSP: Temporally-sensitive pretraining of video encoders for localization tasks. In: ICCV (2021) 2

work page 2021

[4] [4]

In: ICCV (2019) 4

Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.J., Yuan, J., Thalmann, N.M.: Exploit- ing spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In: ICCV (2019) 4

work page 2019

[5] [5]

IEEE Trans

Chen,B.,Nie,W.,Ji,H.,Ren,W.,Tong,Q.,Wang,Z.,Liu,H.:Multiscaleskeleton- based temporal action segmentation using hierarchical temporal modeling and pre- diction ensemble. IEEE Trans. Cybern. (2025) 4

work page 2025

[6] [6]

In: ECCV (2022) 2, 3, 7, 9, 10

Chen, Y., Zhao, L., Yuan, J., Tian, Y., Xia, Z., Geng, S., Han, L., Metaxas, D.N.: Hierarchically self-supervised transformer for human skeleton representation learn- ing. In: ECCV (2022) 2, 3, 7, 9, 10

work page 2022

[7] [7]

In: AAAI (2021) 4

Chen, Z., Li, S., Yang, B., Li, Q., Liu, H.: Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: AAAI (2021) 4

work page 2021

[8] [8]

Cheng, Q., Morgan, C., Sikdar, A., Masullo, A., Whone, A., Mirmehdi, M.: Your turn: At home turning angle estimation for parkinson’s disease severity assessment. Artif. Intell. Med. p. 103194 (2025) 1, 12

work page 2025

[9] [9]

In: WACV (2024) 12

Dadashzadeh, A., Duan, S., Whone, A., Mirmehdi, M.: Pecop: Parameter efficient continual pretraining for action quality assessment. In: WACV (2024) 12

work page 2024

[10] [10]

Dave, I., Gupta, R., Rizve, M.N., Shah, M.: TCLR: Temporal contrastive learning for video representation. Comput. Vis. Image. Underst (2022) 3

work page 2022

[11] [11]

In: ECCV (2024) 2

Do, J., Kim, M.: SkateFormer: skeletal-temporal transformer for human action recognition. In: ECCV (2024) 2

work page 2024

[12] [12]

In: AAAI (2023) 3

Dong, J., Sun, S., Liu, Z., Chen, S., Liu, B., Wang, X.: Hierarchical contrast for unsupervised skeleton-based action representation learning. In: AAAI (2023) 3

work page 2023

[13] Fang, B., Wu, W., Liu, C., Zhou, Y., He, D., Wang, W.: Mamico: Macro-to-micro semantic correspondence for self-supervised video representation learning. In: ACM MM (2022)

[14] Gao, R., Liu, X., Yang, J., Yue, H.: CdCLR: Clip-driven contrastive learning for skeleton-based action recognition. In: 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP) (2022)

[15] Ghosh, P., Yao, Y., Davis, L., Divakaran, A.: Stacked spatio-temporal graph convolutional networks for action segmentation. In: WACV (2020)

[16] Gökay, U., Spurio, F., Bach, D.R., Gall, J.: Skeleton motion words for unsupervised skeleton-based temporal action segmentation. In: ICCV (2025)

[17] Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: AAAI (2022)

[18] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)

[19] Hu, J., Hou, Y., Guo, Z., Gao, J.: Global and local contrastive learning for self-supervised skeleton-based action recognition. IEEE TCSVT (2024)

[20] Hua, G., Liu, H., Li, W., Zhang, Q., Ding, R., Xu, X.: Weakly-supervised 3d human pose estimation with cross-view u-shaped graph convolutional network. ACM MM 25 (2022)

[21] Jang, S., Lee, H., Kim, W.J., Lee, J., Woo, S., Lee, S.: Multi-scale structural graph convolutional network for skeleton-based action recognition. IEEE TCSVT (2024)

[22] Li, L., Wang, M., Ni, B., Wang, H., Yang, J., Zhang, W.: 3D human action representation learning via cross-view consistency pursuit. In: CVPR (2021)

[23] Li, T., Liu, J., Zhang, W., Ni, Y., Wang, W., Li, Z.: UAV-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In: CVPR (2021)

[24] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)

[25] Liu, C., Hu, Y., Li, Y., Song, S., Liu, J.: PKU-MMD: A large scale benchmark for skeleton-based human action understanding. In: ACM MM (2017)

[26] Liu, H., Liu, Y., Ren, M., Wang, H., Wang, Y., Sun, Z.: Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. In: CVPR (2025)

[27] Liu, Y., Wang, K., Liu, L., Lan, H., Lin, L.: TCGL: Temporal contrastive graph for self-supervised video representation learning. IEEE TIP (2022)

[28] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: CVPR (2019)

[29] Morgan, C., Tonkin, E.L., Masullo, A., Jovan, F., Sikdar, A., Khaire, P., Mirmehdi, M., McConville, R., Tourte, G.J., Whone, A., et al.: A multimodal dataset of real world mobility activities in Parkinson’s disease. Scientific Data (2023)

[30] Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: Bodies, action and behavior with English labels. In: CVPR (2021)

[31] Ray, A., Raj, A., Kolekar, M.H.: Autoregressive adaptive hypergraph transformer for skeleton-based activity recognition. In: WACV (2025)

[32] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)

[33] Sager, C., Janiesch, C., Zschech, P.: A survey of image labelling for computer vision applications. J. Bus. Anal. 4(2) (2021)

[34] Sardari, S., Sharifzadeh, S., Daneshkhah, A., Nakisa, B., Loke, S.W., Palade, V., Duncan, M.J.: Artificial intelligence for skeleton-based physical rehabilitation action evaluation: A systematic review. Comput. Biol. Med. 158, 106835 (2023)

[35] Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: A large scale dataset for 3d human activity analysis. In: CVPR (2016)

[36] Shao, Y., Zhang, F., Xu, C.: Snippet-to-prototype contrastive consensus network for weakly supervised temporal action localization. IEEE TMM (2024)

[37] Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: CVPR (2019)

[38] Shu, X., Xu, B., Zhang, L., Tang, J.: Multi-granularity anchor-contrastive representation learning for semi-supervised skeleton-based action recognition. IEEE TPAMI 45(6) (2022)

[39] Su, H., Gan, W., Wu, W., Qiao, Y., Yan, J.: BSN++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. In: AAAI (2021)

[40] Tao, L., Wang, X., Yamasaki, T.: Self-supervised video representation learning using inter-intra contrastive framework. In: ACM MM (2020)

[41] Tao, L., Wang, X., Yamasaki, T.: An improved inter-intra contrastive learning framework on self-supervised video representation. IEEE TCSVT (2022)

[42] Wang, H., Mirmehdi, M., Damen, D., Perrett, T.: Centre Stage: Centricity-based audio-visual temporal action detection. In: VUA workshop at BMVC (2023)

[43] Wang, H., Ma, X., Kuang, J., Gui, J.: Heterogeneous skeleton-based action representation learning. In: CVPR (2025)

[44] Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. In: CVPR (2021)

[45] Weng, W., Wang, H., Wang, J., He, L., Xie, G.S.: USDRL: Unified skeleton-based dense representation learning with multi-grained feature decorrelation. In: AAAI (2025)

[46] Wu, L., Lin, L., Zhang, J., Ma, Y., Liu, J.: MacDiff: Unified skeleton modeling with masked conditional diffusion. In: ECCV (2024)

[47] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)

[48] Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., Hu, H.: Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In: CVPR (2021)

[49] Xu, M., Pérez-Rúa, J.M., Escorcia, V., Martinez, B., Zhu, X., Zhang, L., Ghanem, B., Xiang, T.: Boundary-sensitive pre-training for temporal localization in videos. In: ICCV (2021)

[50] Xu, R., Liu, C., Chen, Y., Lei, Z.: Snippet-level supervised contrastive learning-based transformer for temporal action detection. In: IJCNN (2023)

[51] Yan, H., Liu, Y., Wei, Y., Li, Z., Li, G., Lin, L.: SkeletonMAE: graph-based masked autoencoder for skeleton sequence pre-training. In: ICCV (2023)

[52] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI (2018)

[53] Yan, X., Pun, C.M., Li, H., Liu, M., Gao, H.: Hierarchical local temporal feature enhancing for transformer-based 3d human pose estimation. In: ICME (2024)

[54] Yang, D., Wang, Y., Dantcheva, A., Garattoni, L., Francesca, G., Brémond, F.: UNIK: A unified framework for real-world skeleton-based action recognition. In: BMVC (2021)

[55] Yang, D., Wang, Y., Dantcheva, A., Kong, Q., Garattoni, L., Francesca, G., Bremond, F.: LAC-latent action composition for skeleton-based action segmentation. In: CVPR (2023)

[56] Yang, H., Guo, L., Zhang, Y., Wu, X.: U-shaped spatial–temporal transformer network for 3d human pose estimation. Mach. Vis. Appl. 33(6) (2022)

[57] Yu, Q., Fujiwara, K.: Frame-level label refinement for skeleton-based weakly-supervised action recognition. In: AAAI (2023)

[58] Zhang, C., Cao, M., Yang, D., Chen, J., Zou, Y.: Cola: Weakly-supervised temporal action localization with snippet contrastive learning. In: CVPR (2021)

[59] Zhang, J., Lin, L., Liu, J.: Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. In: AAAI (2023)

[60] Zhang, J., Lin, L., Liu, J.: Prompted contrast with masked motion modeling: Towards versatile 3d action representation learning. In: ACM MM (2023)

[61] Zhou, Y., Xu, T., Wu, C., Wu, X., Kittler, J.: Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. In: ICCV (2025)

[62] Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: MICCAI (2018)

[63] Zhu, Y., Han, H., Yu, Z., Liu, G.: Modeling the relative visual tempo for self-supervised skeleton-based action recognition. In: CVPR (2023)

Supplementary materials for "Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization"

Q. Cheng1, J. Liu1, C. Morgan2, A. Whone2, and M. Mirmehdi...