arxiv: 2512.16504 · v3 · submitted 2025-12-18 · 💻 cs.CV

Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

Pith reviewed 2026-05-16 21:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords skeleton action localization · contrastive learning · self-supervised pretraining · multiscale feature fusion · temporal action detection · 3D skeleton data · BABEL dataset · transfer learning

The pith

Contrasting non-overlapping skeleton snippets plus U-shaped fusion produces temporally fine-grained features for action boundary detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates a snippet discrimination pretext task that divides skeleton sequences into non-overlapping segments and trains models to distinguish these segments across different videos through contrastive learning. This is paired with a U-shaped module that fuses intermediate multiscale features from existing skeleton backbones to raise frame-level resolution. The resulting representations are shown to improve localization performance on the BABEL dataset under multiple subsets and evaluation protocols. The same pretraining also delivers state-of-the-art transfer results on PKUMMD after training on NTU RGB+D and BABEL.

Core claim

By projecting skeleton sequences into non-overlapping snippets and using contrastive learning to force features to discriminate them across videos, combined with U-shaped fusion of intermediate features, the method yields representations that capture subtle frame-to-frame differences required to localize action boundaries more accurately than prior skeleton contrastive approaches.
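
The core claim can be made concrete with a minimal sketch, assuming each snippet has already been encoded into a single embedding vector. The helper names (`split_into_snippets`, `info_nce`), the temperature value, and the choice of cosine similarity are illustrative assumptions and are not confirmed by the paper; the loss shown is a generic InfoNCE form, not necessarily the authors' exact objective.

```python
import math

def split_into_snippets(frames, snippet_len):
    """Split a frame sequence into non-overlapping snippets; a short remainder is dropped."""
    n = len(frames) // snippet_len
    return [frames[i * snippet_len:(i + 1) * snippet_len] for i in range(n)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one snippet embedding (hypothetical setup).

    query     : embedding of a snippet in one augmented view
    positive  : embedding of the matching snippet in the other view
    negatives : embeddings of other snippets (same or different videos)
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy usage: ten frame indices cut into snippets of four frames.
snippets = split_into_snippets(list(range(10)), 4)
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_swapped = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the query agrees with its positive and large when it agrees with a negative instead, which is the pressure that would push adjacent snippets toward distinct embeddings.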

What carries the argument

The snippet discrimination pretext task that densely projects sequences into non-overlapping segments and promotes cross-video distinction via contrastive learning, together with the U-shaped module for multiscale feature fusion to boost localization resolution.
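
As a rough illustration of U-shaped fusion along the time axis, the sketch below pools features to coarser temporal scales, then upsamples and concatenates them back onto the finer scales. Everything here is an assumption made for illustration: the function names, average pooling as a stand-in for backbone stages, nearest-neighbour upsampling, and plain concatenation in place of learned layers. The paper's actual module is not specified in the text available here.

```python
def downsample(seq, factor=2):
    """Average-pool a list of feature vectors along time (stand-in for a backbone stage)."""
    return [
        [sum(col) / len(col) for col in zip(*seq[i:i + factor])]
        for i in range(0, len(seq), factor)
    ]

def upsample(seq, target_len):
    """Nearest-neighbour upsampling along time to target_len steps."""
    return [seq[min(i * len(seq) // target_len, len(seq) - 1)] for i in range(target_len)]

def fuse(fine, coarse):
    """Upsample coarse features to the fine resolution and concatenate channel-wise."""
    up = upsample(coarse, len(fine))
    return [f + c for f, c in zip(fine, up)]

# Toy encoder path: 4 frames -> 2 steps -> 1 step.
fine = [[0.0], [1.0], [2.0], [3.0]]
mid = downsample(fine)
coarse = downsample(mid)

# Toy decoder path: fuse back up to frame resolution via skip connections.
decoded_mid = fuse(mid, coarse)
decoded_fine = fuse(fine, decoded_mid)
```

The output keeps one feature vector per input frame while carrying context from every coarser scale, which is the property the review attributes to the fusion module.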

If this is right

  • Existing skeleton-based contrastive methods gain consistent improvements on BABEL across diverse subsets and protocols.
  • State-of-the-art transfer learning performance is reached on PKUMMD after pretraining on NTU RGB+D and BABEL.
  • Temporally sensitive features become available for any downstream skeleton task that requires frame-accurate action timing.
  • The approach reduces reliance on dense frame-level labels during pretraining while still supporting precise localization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same snippet-level contrastive signal could be applied to other sequential modalities such as optical flow or joint trajectories without skeleton data.
  • Pretraining in this manner may lower annotation costs for building detectors in domains like clinical gait analysis or sports coaching.
  • Combining the U-shaped fusion with longer context windows might further sharpen boundaries in actions that span many seconds.

Load-bearing premise

Forcing features to discriminate non-overlapping snippets across videos will automatically create the temporally precise representations needed to pinpoint action boundaries, and the U-shaped fusion will raise resolution without adding alignment errors or overfitting to the pretraining data.

What would settle it

If the proposed pretraining produces no gain or a drop in frame-level localization metrics such as mean average precision on BABEL compared with standard video-level contrastive baselines, the central claim that snippet discrimination yields boundary-sensitive features would be falsified.
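
For readers unfamiliar with how such a comparison is scored, here is a hedged sketch of average precision at a temporal IoU threshold for a single action class. It uses a simple non-interpolated AP with greedy matching; the function names and the 0.5 threshold are assumptions, and the BABEL and PKUMMD benchmarks may define their protocols differently.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thr=0.5):
    """Non-interpolated AP for one class.

    preds : list of (start, end, score) predictions
    gts   : list of (start, end) ground-truth segments
    """
    ranked = sorted(preds, key=lambda p: p[2], reverse=True)
    matched = set()
    hits = []
    for start, end, _ in ranked:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each ground-truth segment can be claimed once
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            hits.append(1)
        else:
            hits.append(0)
    ap, true_pos = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_pos += 1
            ap += true_pos / rank  # precision at this recall point
    return ap / len(gts) if gts else 0.0

# Toy usage with made-up segments (not results from the paper).
toy_preds = [(0, 10, 0.9), (50, 60, 0.8), (20, 30, 0.7)]
toy_gts = [(0, 10), (20, 30)]
toy_ap = average_precision(toy_preds, toy_gts)
```

Mean average precision would then average this quantity over classes, and often over several IoU thresholds.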

Figures

Figures reproduced from arXiv: 2512.16504 by Alan Whone, Catherine Morgan, Jingjing Liu, Majid Mirmehdi, Qiushuo Cheng.

Figure 1: The basic concept of our contrastive approach.
Figure 2: Overall pipeline of pretraining and finetuning in proposed designs.
Figure 3: Similarity-based matching. In our pretext task, each snippet is treated as a distinct instance in its own right to discriminate against other snippets across videos. We therefore adopt a similarity-based matching strategy [44] to establish temporal correspondence between snippets in different views (see
Figure 4: Qualitative visualization of action predictions on BABEL.
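
The similarity-based matching mentioned in the Figure 3 caption can be sketched as a nearest-neighbour lookup between the snippet embeddings of two augmented views. This is a simplified reading of the dense matching idea in [44]; the function name `match_snippets`, the use of cosine similarity, and the choice of which features drive the matching are assumptions, since the caption text is truncated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def match_snippets(view_a, view_b):
    """For each snippet embedding in view_a, return the index of its most similar snippet in view_b."""
    matches = []
    for a in view_a:
        sims = [cosine(a, b) for b in view_b]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches

# Toy usage: view_b holds the same two snippets in reversed temporal order.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.1, 0.9], [0.9, 0.1]]
correspondence = match_snippets(view_a, view_b)
```

Matching by similarity rather than by position is what lets correspondence survive temporal augmentations such as cropping or reordering, under the assumptions above.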
Original abstract

The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level {action} recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a self-supervised pretraining framework for skeleton-based temporal action localization. It introduces a snippet discrimination pretext task that densely segments skeleton sequences into non-overlapping snippets and applies contrastive learning to distinguish snippets across different videos. A U-shaped multiscale feature fusion module is added to existing skeleton backbones to improve feature resolution at the frame level. The approach is reported to yield consistent gains over prior skeleton contrastive methods on BABEL across subsets and protocols, plus state-of-the-art transfer results on PKUMMD after pretraining on NTU RGB+D and BABEL.

Significance. If the empirical improvements prove robust, the work would supply a practical pretraining recipe that bridges the gap between video-level skeleton recognition and frame-level localization by encouraging temporally discriminative features. The combination of a simple snippet-level contrastive objective with a U-shaped fusion module on established backbones could be adopted as a default initialization step for downstream skeleton TAL pipelines.

major comments (2)
  1. [§3.2] Snippet Discrimination Pretext: The contrastive loss operates exclusively on whole-snippet embeddings; nothing in the objective explicitly penalizes or rewards intra-snippet temporal variation. Consequently the learned representation may remain nearly constant inside each snippet while still separating different videos, undermining the claim that the pretext automatically produces the frame-level gradients required for boundary detection.
  2. [§4.3, Table 4] Transfer results on PKUMMD: The SOTA claim rests on a single pretraining combination (NTU+BABEL) without reporting variance across random seeds, statistical significance tests, or ablation of the U-shaped module alone. Without these controls it is impossible to attribute the reported gains specifically to the proposed snippet discrimination rather than to backbone capacity or training schedule.
minor comments (2)
  1. [Abstract] The phrase 'video-level {action} recognition' contains an apparent LaTeX artifact that should be cleaned.
  2. [§3.3] U-shaped module: The description of how skip connections are aligned across the encoder-decoder stages lacks explicit equations for the upsampling and concatenation operations; adding them would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with point-by-point responses and indicate the revisions we will make to strengthen the paper.

  1. Referee: [§3.2] Snippet Discrimination Pretext: The contrastive loss operates exclusively on whole-snippet embeddings; nothing in the objective explicitly penalizes or rewards intra-snippet temporal variation. Consequently the learned representation may remain nearly constant inside each snippet while still separating different videos, undermining the claim that the pretext automatically produces the frame-level gradients required for boundary detection.

    Authors: We appreciate this observation on the nature of the snippet-level contrastive objective. While the loss is computed on aggregated snippet embeddings, the dense non-overlapping segmentation means that distinguishing adjacent snippets from the same video requires the backbone to encode distinct motion patterns at the frame level; otherwise, nearby snippets would be indistinguishable under the contrastive pull. The U-shaped fusion module is specifically designed to maintain and enhance frame-level resolution from the intermediate backbone features. To address the concern directly, we will add a new analysis subsection (including intra-snippet feature variance statistics and t-SNE visualizations of frame embeddings within snippets) in the revised manuscript to demonstrate that temporal variation is indeed preserved and encouraged. revision: yes

  2. Referee: [§4.3, Table 4] Transfer results on PKUMMD: The SOTA claim rests on a single pretraining combination (NTU+BABEL) without reporting variance across random seeds, statistical significance tests, or ablation of the U-shaped module alone. Without these controls it is impossible to attribute the reported gains specifically to the proposed snippet discrimination rather than to backbone capacity or training schedule.

    Authors: We agree that reporting variance, statistical tests, and targeted ablations would improve the robustness of the SOTA claim. In the revised manuscript we will (i) rerun the PKUMMD transfer experiments over multiple random seeds and report mean ± standard deviation, (ii) include statistical significance tests (e.g., paired t-tests) against the strongest baselines, and (iii) add an ablation isolating the U-shaped module’s contribution on the transfer task. These additions will allow clearer attribution of gains to the proposed snippet discrimination pretext and fusion module. revision: yes
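
The multi-seed reporting and paired test promised in the second response could look roughly like the sketch below. The scores are placeholder numbers invented purely for illustration and are not results from the paper; the function names are hypothetical. Only the t statistic and degrees of freedom are computed here, and a p-value would need a t-distribution table or a statistics library.

```python
import math
import statistics

def mean_std(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for per-seed scores a versus b.

    Assumes the per-seed differences are not all identical (non-zero deviation).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1

# Placeholder per-seed scores (made up, NOT from the paper).
proposed = [0.62, 0.64, 0.63, 0.65, 0.61]
baseline = [0.60, 0.61, 0.62, 0.62, 0.60]

mean_p, std_p = mean_std(proposed)
t_stat, dof = paired_t_statistic(proposed, baseline)
```

Pairing by seed removes seed-to-seed variance that both methods share, which is why a paired test is usually preferred over an unpaired one when the same seeds are reused.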

Circularity Check

0 steps flagged

No circularity; method and claims are empirically grounded

The paper defines a new snippet-level contrastive pretext task and a U-shaped multiscale fusion module on top of published skeleton backbones. Performance gains are asserted via transfer experiments on BABEL and PKUMMD rather than by algebraic reduction to the inputs. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claim remains falsifiable through the reported downstream metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the provided text.


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages

  1. [1] Abdelfattah, M., Hassan, M., Alahi, A.: MaskCLR: Attention-guided contrastive learning for robust action representation learning. In: CVPR. pp. 18678–18687 (2024)

  2. [2] Adeli, V., Mehraban, S., Mirmehdi, M., Whone, A., Filtjens, B., Dadashzadeh, A., Fasano, A., Iaboni, A., Taati, B.: GAITGen: Disentangled motion-pathology impaired gait generative model–bringing motion generation to the clinical domain. In: WACV (2026)

  3. [3] Alwassel, H., Giancola, S., Ghanem, B.: TSP: Temporally-sensitive pretraining of video encoders for localization tasks. In: ICCV (2021)

  4. [4] Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.J., Yuan, J., Thalmann, N.M.: Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In: ICCV (2019)

  5. [5] Chen, B., Nie, W., Ji, H., Ren, W., Tong, Q., Wang, Z., Liu, H.: Multiscale skeleton-based temporal action segmentation using hierarchical temporal modeling and prediction ensemble. IEEE Trans. Cybern. (2025)

  6. [6] Chen, Y., Zhao, L., Yuan, J., Tian, Y., Xia, Z., Geng, S., Han, L., Metaxas, D.N.: Hierarchically self-supervised transformer for human skeleton representation learning. In: ECCV (2022)

  7. [7] Chen, Z., Li, S., Yang, B., Li, Q., Liu, H.: Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In: AAAI (2021)

  8. [8] Cheng, Q., Morgan, C., Sikdar, A., Masullo, A., Whone, A., Mirmehdi, M.: Your turn: At home turning angle estimation for parkinson’s disease severity assessment. Artif. Intell. Med. p. 103194 (2025)

  9. [9] Dadashzadeh, A., Duan, S., Whone, A., Mirmehdi, M.: Pecop: Parameter efficient continual pretraining for action quality assessment. In: WACV (2024)

  10. [10] Dave, I., Gupta, R., Rizve, M.N., Shah, M.: TCLR: Temporal contrastive learning for video representation. Comput. Vis. Image Underst. (2022)

  11. [11] Do, J., Kim, M.: SkateFormer: skeletal-temporal transformer for human action recognition. In: ECCV (2024)

  12. [12] Dong, J., Sun, S., Liu, Z., Chen, S., Liu, B., Wang, X.: Hierarchical contrast for unsupervised skeleton-based action representation learning. In: AAAI (2023)

  13. [13] Fang, B., Wu, W., Liu, C., Zhou, Y., He, D., Wang, W.: Mamico: Macro-to-micro semantic correspondence for self-supervised video representation learning. In: ACM MM (2022)

  14. [14] Gao, R., Liu, X., Yang, J., Yue, H.: CdCLR: Clip-driven contrastive learning for skeleton-based action recognition. In: 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP) (2022)

  15. [15] Ghosh, P., Yao, Y., Davis, L., Divakaran, A.: Stacked spatio-temporal graph convolutional networks for action segmentation. In: WACV (2020)

  16. [16] Gökay, U., Spurio, F., Bach, D.R., Gall, J.: Skeleton motion words for unsupervised skeleton-based temporal action segmentation. In: ICCV (2025)

  17. [17] Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: AAAI (2022)

  18. [18] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)

  19. [19] Hu, J., Hou, Y., Guo, Z., Gao, J.: Global and local contrastive learning for self-supervised skeleton-based action recognition. IEEE TCSVT (2024)

  20. [20] Hua, G., Liu, H., Li, W., Zhang, Q., Ding, R., Xu, X.: Weakly-supervised 3d human pose estimation with cross-view u-shaped graph convolutional network. ACM MM 25 (2022)

  21. [21] Jang, S., Lee, H., Kim, W.J., Lee, J., Woo, S., Lee, S.: Multi-scale structural graph convolutional network for skeleton-based action recognition. IEEE TCSVT (2024)

  22. [22] Li, L., Wang, M., Ni, B., Wang, H., Yang, J., Zhang, W.: 3D human action representation learning via cross-view consistency pursuit. In: CVPR (2021)

  23. [23] Li, T., Liu, J., Zhang, W., Ni, Y., Wang, W., Li, Z.: UAV-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In: CVPR (2021)

  24. [24] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)

  25. [25] Liu, C., Hu, Y., Li, Y., Song, S., Liu, J.: PKU-MMD: A large scale benchmark for skeleton-based human action understanding. In: ACM MM (2017)

  26. [26] Liu, H., Liu, Y., Ren, M., Wang, H., Wang, Y., Sun, Z.: Revealing key details to see differences: A novel prototypical perspective for skeleton-based action recognition. In: CVPR (2025)

  27. [27] Liu, Y., Wang, K., Liu, L., Lan, H., Lin, L.: TCGL: Temporal contrastive graph for self-supervised video representation learning. IEEE TIP (2022)

  28. [28] Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: Archive of motion capture as surface shapes. In: CVPR (2019)

  29. [29] Morgan, C., Tonkin, E.L., Masullo, A., Jovan, F., Sikdar, A., Khaire, P., Mirmehdi, M., McConville, R., Tourte, G.J., Whone, A., et al.: A multimodal dataset of real world mobility activities in parkinson’s disease. Scientific Data (2023)

  30. [30] Punnakkal, A.R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, A., Black, M.J.: BABEL: Bodies, action and behavior with english labels. In: CVPR (2021)

  31. [31] Ray, A., Raj, A., Kolekar, M.H.: Autoregressive adaptive hypergraph transformer for skeleton-based activity recognition. In: WACV (2025)

  32. [32] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)

  33. [33] Sager, C., Janiesch, C., Zschech, P.: A survey of image labelling for computer vision applications. J. Bus. Anal. 4(2) (2021)

  34. [34] Sardari, S., Sharifzadeh, S., Daneshkhah, A., Nakisa, B., Loke, S.W., Palade, V., Duncan, M.J.: Artificial intelligence for skeleton-based physical rehabilitation action evaluation: A systematic review. Comput. Biol. Med. 158, 106835 (2023)

  35. [35] Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D a large scale dataset for 3d human activity analysis. In: CVPR (2016)

  36. [36] Shao, Y., Zhang, F., Xu, C.: Snippet-to-prototype contrastive consensus network for weakly supervised temporal action localization. IEEE TMM (2024)

  37. [37] Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: CVPR (2019)

  38. [38] Shu, X., Xu, B., Zhang, L., Tang, J.: Multi-granularity anchor-contrastive representation learning for semi-supervised skeleton-based action recognition. IEEE TPAMI 45(6) (2022)

  39. [39] Su, H., Gan, W., Wu, W., Qiao, Y., Yan, J.: BSN++: Complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation. In: AAAI (2021)

  40. [40] Tao, L., Wang, X., Yamasaki, T.: Self-supervised video representation learning using inter-intra contrastive framework. In: ACM MM (2020)

  41. [41] Tao, L., Wang, X., Yamasaki, T.: An improved inter-intra contrastive learning framework on self-supervised video representation. IEEE TCSVT (2022)

  42. [42] Wang, H., Mirmehdi, M., Damen, D., Perrett, T.: Centre Stage: Centricity-based audio-visual temporal action detection. In: VUA workshop at BMVC (2023)

  43. [43] Wang, H., Ma, X., Kuang, J., Gui, J.: Heterogeneous skeleton-based action representation learning. In: CVPR (2025)

  44. [44] Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. In: CVPR (2021)

  45. [45] Weng, W., Wang, H., Wang, J., He, L., Xie, G.S.: USDRL: Unified skeleton-based dense representation learning with multi-grained feature decorrelation. In: AAAI (2025)

  46. [46] Wu, L., Lin, L., Zhang, J., Ma, Y., Liu, J.: MacDiff: Unified skeleton modeling with masked conditional diffusion. In: ECCV (2024)

  47. [47] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: CVPR (2018)

  48. [48] Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., Hu, H.: Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In: CVPR (2021)

  49. [49] Xu, M., Pérez-Rúa, J.M., Escorcia, V., Martinez, B., Zhu, X., Zhang, L., Ghanem, B., Xiang, T.: Boundary-sensitive pre-training for temporal localization in videos. In: ICCV (2021)

  50. [50] Xu, R., Liu, C., Chen, Y., Lei, Z.: Snippet-level supervised contrastive learning-based transformer for temporal action detection. In: IJCNN (2023)

  51. [51] Yan, H., Liu, Y., Wei, Y., Li, Z., Li, G., Lin, L.: SkeletonMAE: graph-based masked autoencoder for skeleton sequence pre-training. In: ICCV (2023)

  52. [52] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI (2018)

  53. [53] Yan, X., Pun, C.M., Li, H., Liu, M., Gao, H.: Hierarchical local temporal feature enhancing for transformer-based 3d human pose estimation. In: ICME (2024)

  54. [54] Yang, D., Wang, Y., Dantcheva, A., Garattoni, L., Francesca, G., Brémond, F.: UNIK: A unified framework for real-world skeleton-based action recognition. In: BMVC (2021)

  55. [55] Yang, D., Wang, Y., Dantcheva, A., Kong, Q., Garattoni, L., Francesca, G., Bremond, F.: LAC-latent action composition for skeleton-based action segmentation. In: CVPR (2023)

  56. [56] Yang, H., Guo, L., Zhang, Y., Wu, X.: U-shaped spatial–temporal transformer network for 3d human pose estimation. Mach. Vis. Appl. 33(6) (2022)

  57. [57] Yu, Q., Fujiwara, K.: Frame-level label refinement for skeleton-based weakly-supervised action recognition. In: AAAI (2023)

  58. [58] Zhang, C., Cao, M., Yang, D., Chen, J., Zou, Y.: Cola: Weakly-supervised temporal action localization with snippet contrastive learning. In: CVPR (2021)

  59. [59] Zhang, J., Lin, L., Liu, J.: Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. In: AAAI (2023)

  60. [60] Zhang, J., Lin, L., Liu, J.: Prompted contrast with masked motion modeling: Towards versatile 3d action representation learning. In: ACM MM (2023)

  61. [61] Zhou, Y., Xu, T., Wu, C., Wu, X., Kittler, J.: Adaptive hyper-graph convolution network for skeleton-based human action recognition with virtual connections. In: ICCV (2025)

  62. [62] Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: MICCAI (2018)

  63. [63] Zhu, Y., Han, H., Yu, Z., Liu, G.: Modeling the relative visual tempo for self-supervised skeleton-based action recognition. In: CVPR (2023)