arxiv: 2604.22388 · v1 · submitted 2026-04-24 · 💻 cs.CV

HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos

Pith reviewed 2026-05-08 12:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords prostate cancer classification · TRUS videos · heuristic frame selection · three-branch network · SAM model · wavelet transform · feature learning · ultrasound analysis

The pith

HFS-TriNet uses heuristic frame selection and three collaborative branches to classify prostate cancer from TRUS videos by cutting redundancy and extracting multi-scale features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a way to diagnose prostate cancer from transrectal ultrasound videos, which contain richer information than single images yet suffer from repeated frames, overlapping healthy and cancerous appearances, and noisy signals. The method first applies heuristic frame selection that picks clips at regular intervals but shifts the starting point each time so the sampled data covers the full sequence without excessive overlap. The selected clips then enter a network whose three branches work together: a standard ResNet50 for ordinary image features, a pre-trained segmentation model branch that adds attention to track consistency across time, and a wavelet-based branch that pulls out sharp lesion edges from high frequencies while cleaning noise from low frequencies.
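
The sampling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sample_clip_indices`, its parameters, and the choice of a uniformly random start offset are assumptions, since the summary only says the start of each clip is initialized dynamically.

```python
import random


def sample_clip_indices(num_frames, clip_len, interval, rng=None):
    """Return frame indices for one training clip.

    Frames are taken every `interval` frames. The start offset is drawn
    at random (an assumption) so repeated calls land on different parts
    of the video instead of always reusing the same frames.
    """
    if num_frames <= 0 or clip_len <= 0 or interval <= 0:
        raise ValueError("num_frames, clip_len and interval must be positive")
    if rng is None:
        rng = random.Random()
    span = (clip_len - 1) * interval + 1   # frames covered by one clip
    max_start = max(num_frames - span, 0)  # last start that still fits
    start = rng.randint(0, max_start)      # inclusive on both ends
    # Videos shorter than one span repeat their final frame.
    return [min(start + k * interval, num_frames - 1) for k in range(clip_len)]
```

Calling the function once per training iteration yields a different clip each time, which is one simple way to cover the whole sequence over many epochs while still skipping the near-duplicate frames between sampled positions.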

Core claim

The central claim is that heuristic frame selection, which samples video clips at intervals with dynamically initialized starts to span the entire sequence, combined with a three-branch network of ResNet50, a pre-trained SAM model plus normalization-based attention for temporal consistency, and a wavelet transform convolutional residual branch for high-frequency edge extraction and low-frequency denoising, sufficiently mitigates redundancy, intra- and inter-class similarity, and low signal-to-noise ratio to improve prostate cancer classification accuracy from TRUS videos.

What carries the argument

Heuristic frame selection (HFS) for dynamic interval-based clip sampling together with the TriNet three-branch architecture that integrates ResNet50, SAM-based deep feature extraction with normalization attention, and WTCR for frequency-domain processing.

If this is right

  • Interval sampling with dynamic starts reduces redundant frames and computational load while covering the full video.
  • The SAM branch with normalization attention captures temporal consistency across frames.
  • The WTCR branch isolates lesion edges in high-frequency components and denoises low-frequency components.
  • Collaborative training across the three branches produces more robust features for distinguishing cancer cases.
  • The overall pipeline enables more accurate video-based CAD without requiring new hardware.
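
To make the high-frequency versus low-frequency split in the WTCR bullet concrete, here is a minimal single-level 2D wavelet decomposition in plain Python. It is a sketch under stated assumptions: the Haar wavelet and the function name `haar_dwt2` are illustrative choices, and the learned convolutions and residual connections of the actual branch are not reproduced.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of an even-sized grayscale image.

    `image` is a list of equal-length rows. Returns four sub-bands:
    low (local averages), col_detail (change across columns),
    row_detail (change across rows) and diag_detail (diagonal change).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    low, col_detail, row_detail, diag_detail = [], [], [], []
    for r in range(0, rows, 2):
        low_row, col_row, row_row, diag_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]          # top pixel pair
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom pixel pair
            low_row.append((a + b + d + e) / 2)   # smooth content
            col_row.append((a - b + d - e) / 2)   # responds to vertical edges
            row_row.append((a + b - d - e) / 2)   # responds to horizontal edges
            diag_row.append((a - b - d + e) / 2)  # responds to diagonal change
        low.append(low_row)
        col_detail.append(col_row)
        row_detail.append(row_row)
        diag_detail.append(diag_row)
    return low, col_detail, row_detail, diag_detail
```

In a uniform region all three detail bands are zero, while a sharp boundary produces large detail coefficients. That is why edge information is sought in the high-frequency bands and smoothing or denoising is applied to the low-frequency band.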

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might extend to video analysis in other ultrasound applications where noise and repetition are common.
  • Testing the contribution of each branch separately on varied datasets could reveal which component drives gains in low-SNR settings.
  • If the frequency separation in WTCR proves key, similar wavelet modules could be added to other medical imaging networks.
  • Real-world use would need checks on diverse patient groups to ensure the dynamic sampling does not miss critical short-duration features.

Load-bearing premise

That interval-based clip selection with shifting starts plus the three specific branches will reduce redundancy, similarity problems, and low SNR enough to raise classification performance on real clinical TRUS videos without creating new biases or overfitting.

What would settle it

A test on an independent TRUS video dataset in which HFS-TriNet shows no improvement in accuracy or other metrics over a plain ResNet50 baseline or fixed-frame sampling would falsify the central claim.
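
Such a comparison comes down to scoring both models with the same metrics on the same held-out videos. The helper below is a hedged sketch: the function name `binary_metrics`, the label convention (1 for cancer, 0 for benign), and the particular set of metrics are assumptions for illustration, not the paper's evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and F1 for 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Report 0.0 rather than failing when a class is absent.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # recall on positive cases
        "specificity": ratio(tn, tn + fp),  # recall on negative cases
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

Running this once on the baseline's predictions and once on HFS-TriNet's predictions for the same independent test set gives directly comparable numbers. The claim would be undermined if none of them improve.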

Figures

Figures reproduced from arXiv: 2604.22388 by Chuan Yang, Qianhong Peng, Qihao Zhou, Shaopeng Liu, Xiuqin Ye, Xu Lu, Yuan Yuan.

Figure 1: Overall architecture of the proposed HFS-TriNet. TRUS videos are first processed using the heuristic frame sampling (HFS) strategy to generate representative training clips. Each clip is then fed into three parallel feature extraction branches, whose outputs are fused using a pyramid fusion strategy, as illustrated in (d). The fused representation is finally passed to a classification head to produce the v…
Figure 2: Detailed architecture of the three feature extraction branches in HFS-TriNet: (a) a backbone residual spatio-temporal network; (b) a normalization-based attention module applied to features extracted from a frozen pre-trained medical SAM encoder (denoted as SAM NAM); and (c) a wavelet transform convolutional residual (WTCR) branch. location $(i, j)$ is defined as: $F_{\mathrm{norm}}(i, j) = F_{\mathrm{BN}}(i, j) \big/ \sqrt{\tfrac{1}{C} \sum_{c=1}^{C} F^{c}_{\mathrm{BN}}(\ldots)}$ …
Figure 3: Bar plots of mean F1-score and accuracy (ACC) with corresponding 95% confidence intervals for the full HFS-TriNet model and its ablated variants. with all tested differences reaching statistical significance (p < 0.001). To further assess the robustness of the proposed method under different patient-level data partitions, we additionally report patient-level five-fold cross-validation results in Table II…
Figure 5: Illustration of three fusion strategies for integrating SAM embeddings into the TPN backbone: fusion after the residual blocks (Strategy I), after spatial modulation (Strategy II), and after temporal modulation (Strategy III). 3) Ablation Study on SAM Feature Fusion Position: To investigate how the fusion position of SAM-derived semantic features influences TRUS video classification, we evaluate three fus…
Figure 6: Bar chart comparing the performance of different SAM feature fusion strategies across multiple evaluation metrics. HFS-TriNet exhibits a more balanced performance profile than alternative fusion strategies. precision–sensitivity trade-off. These results indicate that delayed semantic fusion better complements temporal modeling in TRUS videos. Building upon Strategy III, the final HFS-TriNet additionally …
Abstract

Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. Computer-aided diagnosis (CAD) based on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) strategy and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame, with a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a heuristic frame selection (HFS) strategy and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. HFS selects video clips at fixed intervals with dynamically initialized starting points to reduce redundancy while spanning the sequence; the network combines a standard ResNet50 branch, a pre-trained SAM branch with normalization-based attention to capture temporal consistency, and a wavelet transform convolutional residual (WTCR) branch to extract high-frequency lesion edges and denoise low-frequency components, addressing intra/inter-class similarity and low SNR.

Significance. If empirically validated, the work could advance CAD systems for prostate cancer by exploiting richer spatial-temporal information in TRUS videos rather than static images. The combination of heuristic sampling, large-model feature extraction via SAM, and wavelet-domain processing is a reasonable attempt to tackle video-specific challenges in medical imaging. No machine-checked proofs, reproducible code, or parameter-free derivations are present, so credit is limited to the architectural novelty of the three-branch design.

major comments (1)
  1. [Abstract] Abstract and method description: no experimental results, performance metrics, datasets, baselines, ablation studies, or validation protocols are supplied. Without these, it is impossible to determine whether interval sampling with dynamic starts actually preserves diagnostic frames across variable-length videos or whether the ResNet50 + SAM-attention + WTCR branches produce complementary features that improve discrimination.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will incorporate revisions to improve clarity regarding experimental validation.

  1. Referee: [Abstract] Abstract and method description: no experimental results, performance metrics, datasets, baselines, ablation studies, or validation protocols are supplied. Without these, it is impossible to determine whether interval sampling with dynamic starts actually preserves diagnostic frames across variable-length videos or whether the ResNet50 + SAM-attention + WTCR branches produce complementary features that improve discrimination.

    Authors: We agree that the abstract, due to length constraints, omits specific numerical results and protocol details. The full manuscript contains an Experiments section reporting results on a TRUS video dataset for prostate cancer, with metrics such as accuracy, sensitivity, specificity, and AUC. It includes comparisons to baselines (e.g., standard ResNet50, other video classifiers), ablation studies isolating the HFS strategy and each branch (ResNet50, SAM-temporal attention, WTCR), and validation via patient-level cross-validation to account for variable-length videos. These experiments show that dynamic-start interval sampling improves coverage of diagnostic frames over fixed or random sampling, and ablations confirm complementary contributions from the branches in handling intra-class similarity and low SNR. We will revise the abstract to include a concise summary of key results (e.g., overall accuracy and gains over baselines) and expand the method description with explicit references to the validation setup and how it verifies frame preservation and feature complementarity. revision: yes
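
Patient-level cross-validation, as mentioned in the response above, is usually taken to mean that all videos from one patient stay in the same fold, so that no patient contributes to both training and testing. The following is a minimal sketch of such a split. The function name `patient_level_folds`, the dictionary input format, and the greedy balancing rule are assumptions for illustration, and the paper's actual protocol may differ.

```python
from collections import defaultdict


def patient_level_folds(video_to_patient, k):
    """Assign videos to k folds without splitting any patient.

    `video_to_patient` maps a video id to a patient id (both strings).
    Returns a list of k lists of video ids.
    """
    by_patient = defaultdict(list)
    for video, patient in video_to_patient.items():
        by_patient[patient].append(video)
    if k < 2 or k > len(by_patient):
        raise ValueError("k must be between 2 and the number of patients")
    folds = [[] for _ in range(k)]
    # Place patients with the most videos first, always into the fold
    # that currently holds the fewest videos (simple greedy balancing).
    order = sorted(by_patient, key=lambda p: (-len(by_patient[p]), p))
    for patient in order:
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(sorted(by_patient[patient]))
    return folds
```

Each fold then serves once as the test set while the remaining folds are used for training. Because the split is made over patients rather than videos, near-identical clips from the same examination cannot leak across the train/test boundary.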

Circularity Check

0 steps flagged

No circularity: HFS heuristic and TriNet architecture are direct design proposals

The paper introduces HFS as an interval-based sampling heuristic with dynamic starts and HFS-TriNet as a three-branch network (ResNet50 + SAM-normalization-attention + WTCR) motivated by the stated problems of redundancy, intra/inter-class similarity, and low SNR. No equations, predictions, or first-principles derivations are present that reduce to fitted inputs, self-definitions, or self-citation chains. The methods are presented as engineering responses rather than results forced by construction, with all claims resting on empirical validation rather than tautological reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of the newly introduced HFS strategy and the three branches, which are postulated in the abstract without independent evidence or prior validation shown.

free parameters (2)
  • frame selection interval and dynamic start
    Chosen to reduce redundancy while spanning the video; specific values or rules not provided in abstract.
  • normalization parameters in attention module
    Used within the SAM branch to explore temporal consistency; details not specified.
axioms (2)
  • domain assumption: Pre-trained medical SAM model extracts useful deep features from TRUS frames
    Invoked for the large model branch without justification in abstract.
  • domain assumption: Wavelet transform convolutional residual branch extracts lesion edges in high-frequency domain and denoises in low-frequency domain
    Basis for the WTCR branch as stated in abstract.

pith-pipeline@v0.9.0 · 5588 in / 1512 out tokens · 86367 ms · 2026-05-08T12:42:02.969340+00:00 · methodology
