HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos

Chuan Yang; Qianhong Peng; Qihao Zhou; Shaopeng Liu; Xiuqin Ye; Xu Lu; Yuan Yuan

arxiv: 2604.22388 · v1 · submitted 2026-04-24 · 💻 cs.CV

Pith reviewed 2026-05-08 12:42 UTC · model grok-4.3

keywords: prostate cancer classification, TRUS videos, heuristic frame selection, three-branch network, SAM model, wavelet transform, feature learning, ultrasound analysis

The pith

HFS-TriNet uses heuristic frame selection and three collaborative branches to classify prostate cancer from TRUS videos by cutting redundancy and extracting multi-scale features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a way to diagnose prostate cancer from transrectal ultrasound videos, which contain richer information than single images yet suffer from repeated frames, overlapping healthy and cancerous appearances, and noisy signals. The method first applies heuristic frame selection that picks clips at regular intervals but shifts the starting point each time so the sampled data covers the full sequence without excessive overlap. The selected clips then enter a network whose three branches work together: a standard ResNet50 for ordinary image features, a pre-trained segmentation model branch that adds attention to track consistency across time, and a wavelet-based branch that pulls out sharp lesion edges from high frequencies while cleaning noise from low frequencies.
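The sampling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact rule, so the stride formula (`num_frames // clip_len`), the uniform random start, and the function name `sample_clip_indices` are all assumptions made for the example.

```python
import random

def sample_clip_indices(num_frames, clip_len, rng):
    """Return clip_len frame indices spread across a video of num_frames frames."""
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    # Assumed rule: pick the stride so that a single clip spans the whole video.
    interval = max(num_frames // clip_len, 1)
    # Dynamic start: draw a fresh offset inside the first interval for every clip.
    start = rng.randrange(interval)
    # Clamp so that very short videos repeat their last frame instead of overflowing.
    return [min(start + k * interval, num_frames - 1) for k in range(clip_len)]

rng = random.Random(0)
# Three clips from the same 100-frame video; each call may start at a different offset.
clips = [sample_clip_indices(100, 8, rng) for _ in range(3)]
```

Because the offset changes between calls, repeated sampling over training epochs can touch frames that a fixed start of 0 would never see, while each individual clip still stretches from the beginning to the end of the video.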

Core claim

The central claim is that heuristic frame selection, which samples video clips at intervals with dynamically initialized starts to span the entire sequence, combined with a three-branch network of ResNet50, a pre-trained SAM model plus normalization-based attention for temporal consistency, and a wavelet transform convolutional residual branch for high-frequency edge extraction and low-frequency denoising, sufficiently mitigates redundancy, intra- and inter-class similarity, and low signal-to-noise ratio to improve prostate cancer classification accuracy from TRUS videos.

What carries the argument

Heuristic frame selection (HFS) for dynamic interval-based clip sampling together with the TriNet three-branch architecture that integrates ResNet50, SAM-based deep feature extraction with normalization attention, and WTCR for frequency-domain processing.

If this is right

Interval sampling with dynamic starts reduces redundant frames and computational load while covering the full video.
The SAM branch with normalization attention captures temporal consistency across frames.
The WTCR branch isolates lesion edges in high-frequency components and denoises low-frequency components.
Collaborative training across the three branches produces more robust features for distinguishing cancer cases.
The overall pipeline enables more accurate video-based CAD without requiring new hardware.
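To make the frequency split behind the WTCR item concrete, the sketch below applies a one-level 2D Haar transform to a tiny image. It is an illustration only: the page does not say which wavelet the paper uses or how the convolutions are wired around it, so the Haar choice and the name `haar_split` are assumptions, and no denoising step is shown.

```python
def haar_split(img):
    """One-level 2D Haar transform of a list-of-lists image with even height and width.

    Returns (approx, horiz_diff, vert_diff, diag_diff), each half the input size.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, h, 2):
        rows = ([], [], [], [])
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            rows[0].append((a + b + d + e) / 2)  # low frequency: scaled local average
            rows[1].append((a - b + d - e) / 2)  # high frequency: left minus right
            rows[2].append((a + b - d - e) / 2)  # high frequency: top minus bottom
            rows[3].append((a - b - d + e) / 2)  # high frequency: diagonal difference
        for band, row in zip((approx, horiz, vert, diag), rows):
            band.append(row)
    return approx, horiz, vert, diag

# Toy image with a vertical edge between the first and second columns.
image = [[0, 1, 1, 1]] * 4
approx, horiz, vert, diag = haar_split(image)
```

In this toy case only the left-minus-right band is non-zero at the edge location, which is the sense in which edge information concentrates in the high-frequency sub-bands while the smooth content stays in the low-frequency one.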

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach might extend to video analysis in other ultrasound applications where noise and repetition are common.
Testing the contribution of each branch separately on varied datasets could reveal which component drives gains in low-SNR settings.
If the frequency separation in WTCR proves key, similar wavelet modules could be added to other medical imaging networks.
Real-world use would need checks on diverse patient groups to ensure the dynamic sampling does not miss critical short-duration features.

Load-bearing premise

That interval-based clip selection with shifting starts plus the three specific branches will reduce redundancy, similarity problems, and low SNR enough to raise classification performance on real clinical TRUS videos without creating new biases or overfitting.

What would settle it

A test on an independent TRUS video dataset in which HFS-TriNet shows no improvement in accuracy or other metrics over a plain ResNet50 baseline or fixed-frame sampling would falsify the central claim.
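One way such a comparison could be scored is a paired bootstrap on the accuracy difference between the full model and a baseline, evaluated on the same held-out cases. The sketch below is a generic statistical illustration, not the paper's protocol: the function name, the percentile interval, and the toy predictions are all made up for the example, and a real analysis would resample patients rather than individual clips.

```python
import random

def bootstrap_accuracy_gain(model_preds, baseline_preds, labels, n_boot=2000, seed=0):
    """Percentile 95% interval for accuracy(model) - accuracy(baseline), paired by case."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        # Resample case indices with replacement; both models see the same resample.
        idx = [rng.randrange(n) for _ in range(n)]
        model_acc = sum(model_preds[i] == labels[i] for i in idx) / n
        base_acc = sum(baseline_preds[i] == labels[i] for i in idx) / n
        diffs.append(model_acc - base_acc)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Hypothetical toy data, for illustration only.
labels         = [1, 0, 1, 1, 0, 0, 1, 0]
baseline_preds = [1, 0, 0, 1, 0, 1, 0, 0]
model_preds    = [1, 0, 1, 1, 0, 1, 1, 0]
low, high = bootstrap_accuracy_gain(model_preds, baseline_preds, labels)
```

If the resulting interval contains zero on an independent dataset, the data do not establish a gain over the baseline, which is the outcome the paragraph above treats as falsifying.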

Figures

Figures reproduced from arXiv: 2604.22388 by Chuan Yang, Qianhong Peng, Qihao Zhou, Shaopeng Liu, Xiuqin Ye, Xu Lu, Yuan Yuan.

**Figure 1.** Overall architecture of the proposed HFS-TriNet. TRUS videos are first processed using the heuristic frame sampling (HFS) strategy to generate representative training clips. Each clip is then fed into three parallel feature extraction branches, whose outputs are fused using a pyramid fusion strategy, as illustrated in (d). The fused representation is finally passed to a classification head to produce the v…

**Figure 2.** Detailed architecture of the three feature extraction branches in HFS-TriNet: (a) a backbone residual spatio-temporal network; (b) a normalization-based attention module applied to features extracted from a frozen pre-trained medical SAM encoder (denoted as SAM NAM); and (c) a wavelet transform convolutional residual (WTCR) branch.

… location (i, j) is defined as: F_norm(i, j) = F_BN(i, j) / √((1/C) Σ_{c=1}^{C} F_BN^c(…

**Figure 3.** Bar plots of mean F1-score and accuracy (ACC) with corresponding 95% confidence intervals for the full HFS-TriNet model and its ablated variants.

… with all tested differences reaching statistical significance (p < 0.001). To further assess the robustness of the proposed method under different patient-level data partitions, we additionally report patient-level five-fold cross-validation results in Table II…

**Figure 5.** Illustration of three fusion strategies for integrating SAM embeddings into the TPN backbone: fusion after the residual blocks (Strategy I), after spatial modulation (Strategy II), and after temporal modulation (Strategy III).

3) Ablation Study on SAM Feature Fusion Position: To investigate how the fusion position of SAM-derived semantic features influences TRUS video classification, we evaluate three fus…

**Figure 6.** Bar chart comparing the performance of different SAM feature fusion strategies across multiple evaluation metrics. HFS-TriNet exhibits a more balanced performance profile than alternative fusion strategies.

… precision–sensitivity trade-off. These results indicate that delayed semantic fusion better complements temporal modeling in TRUS videos. Building upon Strategy III, the final HFS-TriNet additionally …

Original abstract

Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. Computer-aided diagnosis (CAD) relying on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which makes it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) strategy and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based on a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HFS-TriNet proposes a three-branch network with heuristic frame selection for TRUS video prostate cancer classification, but the abstract supplies no experiments, datasets, or results to test the claims.

This paper puts forward a new network called HFS-TriNet for classifying prostate cancer from TRUS videos. It uses a heuristic way to pick frames and splits feature learning into three branches. The main point to know is that while the idea targets real problems in video analysis, there are no results to back it up yet.

What stands out as new is the combination of heuristic frame selection that starts at different points to cover the whole video, a branch using a pre-trained SAM model with attention for temporal features, and a wavelet-based branch for edge and denoising info. The regular ResNet50 is the third. This is presented as a way to cut redundancy, deal with similar classes, and handle noise in ultrasound videos.

The approach makes sense on paper. TRUS videos do have lots of repeated frames, and ultrasound is noisy, so pulling high-frequency details and using a big model like SAM for better features could help. The dynamic start in frame selection is a simple way to avoid always sampling the same parts.

The soft spot is obvious: the abstract describes the method but gives zero datasets, no performance metrics, no comparisons to other methods, and no ablations. Without those, it's impossible to say if the three branches actually work together better than one or two, or if the frame selection avoids missing key diagnostic parts. The assumption that the branches provide complementary information is stated but not tested.

This kind of work is for people in medical image analysis who focus on prostate cancer CAD systems. A reader looking for ideas on handling video data in ultrasound might get some value from the architecture description. It deserves a serious referee if the full version includes proper experiments and shows improvements on real data. As it stands in the abstract, it's more of a proposal than a completed study.

I'd recommend sending it for peer review once they add the validation results and analysis, because the problem area is important and the design has some thought behind it.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a heuristic frame selection (HFS) strategy and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. HFS selects video clips at fixed intervals with dynamically initialized starting points to reduce redundancy while spanning the sequence; the network combines a standard ResNet50 branch, a pre-trained SAM branch with normalization-based attention to capture temporal consistency, and a wavelet transform convolutional residual (WTCR) branch to extract high-frequency lesion edges and denoise low-frequency components, addressing intra/inter-class similarity and low SNR.

Significance. If empirically validated, the work could advance CAD systems for prostate cancer by exploiting richer spatial-temporal information in TRUS videos rather than static images. The combination of heuristic sampling, large-model feature extraction via SAM, and wavelet-domain processing is a reasonable attempt to tackle video-specific challenges in medical imaging. No machine-checked proofs, reproducible code, or parameter-free derivations are present, so credit is limited to the architectural novelty of the three-branch design.

major comments (1)

[Abstract] Abstract and method description: no experimental results, performance metrics, datasets, baselines, ablation studies, or validation protocols are supplied. Without these, it is impossible to determine whether interval sampling with dynamic starts actually preserves diagnostic frames across variable-length videos or whether the ResNet50 + SAM-attention + WTCR branches produce complementary features that improve discrimination.
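The frame-preservation question raised here can be probed with a simple coverage count: how many distinct frames are ever sampled over many training epochs. The sketch below is a hypothetical check under an assumed sampling rule (stride `num_frames // clip_len`, uniform random start), since the abstract does not specify one; `clip_indices` and `coverage` are illustrative names, not functions from the paper.

```python
import random

def clip_indices(num_frames, clip_len, start):
    """Indices of one clip taken at a fixed stride from the given start offset."""
    interval = max(num_frames // clip_len, 1)
    return [min(start + k * interval, num_frames - 1) for k in range(clip_len)]

def coverage(num_frames, clip_len, epochs, dynamic, seed=0):
    """Fraction of frames sampled at least once after the given number of epochs."""
    rng = random.Random(seed)
    interval = max(num_frames // clip_len, 1)
    seen = set()
    for _ in range(epochs):
        # Fixed sampling always starts at frame 0; dynamic sampling redraws the offset.
        start = rng.randrange(interval) if dynamic else 0
        seen.update(clip_indices(num_frames, clip_len, start))
    return len(seen) / num_frames

fixed = coverage(96, 8, epochs=200, dynamic=False)
moving = coverage(96, 8, epochs=200, dynamic=True)
```

Running the same count over a range of video lengths would show whether coverage degrades for very short or very long videos, which is the variable-length concern in the comment. Coverage alone does not show that diagnostically relevant frames carry weight in training, only that they are not systematically skipped.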

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will incorporate revisions to improve clarity regarding experimental validation.

Referee: [Abstract] Abstract and method description: no experimental results, performance metrics, datasets, baselines, ablation studies, or validation protocols are supplied. Without these, it is impossible to determine whether interval sampling with dynamic starts actually preserves diagnostic frames across variable-length videos or whether the ResNet50 + SAM-attention + WTCR branches produce complementary features that improve discrimination.

Authors: We agree that the abstract, due to length constraints, omits specific numerical results and protocol details. The full manuscript contains an Experiments section reporting results on a TRUS video dataset for prostate cancer, with metrics such as accuracy, sensitivity, specificity, and AUC. It includes comparisons to baselines (e.g., standard ResNet50, other video classifiers), ablation studies isolating the HFS strategy and each branch (ResNet50, SAM-temporal attention, WTCR), and validation via patient-level cross-validation to account for variable-length videos. These experiments show that dynamic-start interval sampling improves coverage of diagnostic frames over fixed or random sampling, and ablations confirm complementary contributions from the branches in handling intra-class similarity and low SNR. We will revise the abstract to include a concise summary of key results (e.g., overall accuracy and gains over baselines) and expand the method description with explicit references to the validation setup and how it verifies frame preservation and feature complementarity.

revision: yes
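The patient-level cross-validation mentioned in the response can be illustrated with a small fold-assignment routine. This is a sketch under stated assumptions, not the authors' protocol: the round-robin assignment over sorted patient IDs, the function name `patient_level_folds`, and the toy IDs are hypothetical, and a real split would typically also shuffle patients and balance classes across folds.

```python
from collections import defaultdict

def patient_level_folds(video_patient_ids, k):
    """Assign each video to one of k folds so that no patient spans two folds."""
    patients = sorted(set(video_patient_ids))
    # Round-robin over patients, not videos, so all videos of a patient stay together.
    fold_of_patient = {p: i % k for i, p in enumerate(patients)}
    folds = defaultdict(list)
    for video_idx, patient in enumerate(video_patient_ids):
        folds[fold_of_patient[patient]].append(video_idx)
    return [folds[f] for f in range(k)]

# Hypothetical example: eight videos from five patients.
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
folds = patient_level_folds(ids, 5)
```

Splitting by patient rather than by video or clip keeps frames from the same person out of both the training and test sets, which would otherwise inflate the reported metrics.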

Circularity Check

0 steps flagged

No circularity: HFS heuristic and TriNet architecture are direct design proposals

The paper introduces HFS as an interval-based sampling heuristic with dynamic starts and HFS-TriNet as a three-branch network (ResNet50 + SAM-normalization-attention + WTCR) motivated by the stated problems of redundancy, intra/inter-class similarity, and low SNR. No equations, predictions, or first-principles derivations are present that reduce to fitted inputs, self-definitions, or self-citation chains. The methods are presented as engineering responses rather than results forced by construction, with all claims resting on empirical validation rather than tautological reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the effectiveness of the newly introduced HFS strategy and the three branches, which are postulated in the abstract without independent evidence or prior validation shown.

free parameters (2)

frame selection interval and dynamic start: chosen to reduce redundancy while spanning the video; specific values or rules not provided in abstract.
normalization parameters in attention module: used within the SAM branch to explore temporal consistency; details not specified.

axioms (2)

domain assumption: Pre-trained medical SAM model extracts useful deep features from TRUS frames. Invoked for the large model branch without justification in abstract.
domain assumption: Wavelet transform convolutional residual branch extracts lesion edges in high-frequency domain and denoises in low-frequency domain. Basis for the WTCR branch as stated in abstract.



work page 2024

[8] [8]

Speckle reducing anisotropic diffusion,

Y . Yu and S. T. Acton, “Speckle reducing anisotropic diffusion,”IEEE Transactions on image processing, vol. 11, no. 11, pp. 1260–1270, 2002

work page 2002

[9] [9]

Feasibility of using elastography ultrasound in pediatric localized scleroderma (morphea),

M. P ´erez, J. Zuccaro, A. Mohanta, M. Tijerin, R. Laxer, E. Pope, and A. S. Doria, “Feasibility of using elastography ultrasound in pediatric localized scleroderma (morphea),”Ultrasound in Medicine & Biology, vol. 46, no. 12, pp. 3218–3227, 2020

work page 2020

[10] [10]

Deep recurrent neural networks for prostate cancer detection: analysis of temporal enhanced ultrasound,

S. Azizi, S. Bayat, P. Yan, A. Tahmasebi, J. T. Kwak, S. Xu, B. Turkbey, P. Choyke, P. Pinto, B. Woodet al., “Deep recurrent neural networks for prostate cancer detection: analysis of temporal enhanced ultrasound,” IEEE transactions on medical imaging, vol. 37, no. 12, pp. 2695–2703, 2018

work page 2018

[11] [11]

Machine learning prediction of prostate cancer from transrectal ultrasound video clips,

K. Wang, P. Chen, B. Feng, J. Tu, Z. Hu, M. Zhang, J. Yang, Y . Zhan, J. Yao, and D. Xu, “Machine learning prediction of prostate cancer from transrectal ultrasound video clips,”Frontiers in Oncology, vol. 12, p. 948662, 2022

work page 2022

[12] [12]

Direct distillation between different domains,

J. Tang, S. Chen, G. Niu, H. Zhu, J. T. Zhou, C. Gong, and M. Sugiyama, “Direct distillation between different domains,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 154–172

work page 2024

[13] [13]

Deep learning-based medical ultrasound image and video segmentation methods: Overview, frontiers, and challenges,

X. Xiao, J. Zhang, Y . Shao, J. Liu, K. Shi, C. He, and D. Kong, “Deep learning-based medical ultrasound image and video segmentation methods: Overview, frontiers, and challenges,”Sensors, vol. 25, no. 8, p. 2361, 2025

work page 2025

[14] [14]

Mid-infrared spectroscopy of planetary analogs: a database for planetary remote sensing,

A. Morlok, S. Klemme, I. Weber, A. Stojic, M. Sohn, H. Hiesinger, and J. Helbert, “Mid-infrared spectroscopy of planetary analogs: a database for planetary remote sensing,”Icarus, vol. 324, pp. 86–103, 2019

work page 2019

[15] [15]

Co-modeling the sequential and graphical routes for peptide representation learning,

Z. Liu, G. Wang, J. Wang, J. Zheng, and S. Z. Li, “Co-modeling the sequential and graphical routes for peptide representation learning,” arXiv preprint arXiv:2310.02964, 2023

work page arXiv 2023

[16] [16]

Domain generalization for prostate segmentation in transrectal ultrasound images: A multi-center study,

S. Vesal, I. Gayo, I. Bhattacharya, S. Natarajan, L. S. Marks, D. C. Barratt, R. E. Fan, Y . Hu, G. A. Sonn, and M. Rusu, “Domain generalization for prostate segmentation in transrectal ultrasound images: A multi-center study,”Medical image analysis, vol. 82, p. 102620, 2022

work page 2022

[17] [18]

Truswor- thy: toward clinically applicable deep learning for confident detection of prostate cancer in micro-ultrasound,

M. Harmanani, P. F. Wilson, M. N. N. To, M. Gilany, A. Jamzad, F. Fooladgar, B. Wodlinger, P. Abolmaesumi, and P. Mousavi, “Truswor- thy: toward clinically applicable deep learning for confident detection of prostate cancer in micro-ultrasound,”International Journal of Computer Assisted Radiology and Surgery, pp. 1–9, 2025

work page 2025

[18] [19]

Deep learning in spatiotemporal cardiac imaging: A review of methodologies and clinical usability,

K. A. L. Hernandez, T. Rienm ¨uller, D. Baumgartner, and C. Baum- gartner, “Deep learning in spatiotemporal cardiac imaging: A review of methodologies and clinical usability,”Computers in Biology and Medicine, vol. 130, p. 104200, 2021

work page 2021

[19] [20]

Segment anything model for medical images?

Y . Huang, X. Yang, L. Liuet al., “Segment anything model for medical images?”Medical Image Analysis, vol. 92, p. 103061, 2023

work page 2023

[20] [21]

Detection and grading of prostate cancer using temporal enhanced ultrasound: combining deep neural networks and tissue mimicking simulations,

S. Azizi, S. Bayat, P. Yan, A. Tahmasebi, G. Nir, J. T. Kwak, S. Xu, S. Wilson, K. A. Iczkowski, M. S. Luciaet al., “Detection and grading of prostate cancer using temporal enhanced ultrasound: combining deep neural networks and tissue mimicking simulations,”International journal of computer assisted radiology and surgery, vol. 12, pp. 1293– 1305, 2017

work page 2017

[21] [22]

Mask enhanced deeply supervised prostate cancer detection on b-mode micro- ultrasound,

L. Zhang, S. R. Zhou, M. H. Choi, J. H. Lee, S. Sang, A. Kinnaird, W. G. Brisbane, G. Lughezzani, D. Maffei, V . Fasuloet al., “Mask enhanced deeply supervised prostate cancer detection on b-mode micro- ultrasound,”arXiv preprint arXiv:2412.10997, 2024

work page arXiv 2024

[22] [23]

Multi- modality transrectal ultrasound video classification for identification of clinically significant prostate cancer,

H. Wu, J. Fu, H. Ye, Y . Zhong, X. Zou, J. Zhou, and Y . Wang, “Multi- modality transrectal ultrasound video classification for identification of clinically significant prostate cancer,” pp. 1–5, 2024

work page 2024

[23] [24]

Sun, B.-Y

Y .-K. Sun, B.-Y . Zhou, Y . Miao, Y .-L. Shi, S.-H. Xu, D.-M. Wu, L. Zhang, G. Xu, T.-F. Wu, L.-F. Wanget al., “Three-dimensional convolutional neural network model to identify clinically significant prostate cancer in transrectal ultrasound videos: a prospective, multi- institutional, diagnostic study,”EClinicalMedicine, vol. 60, 2023

work page 2023

[24] [25]

Automatic prostate segmentation using deep learning on clinically diverse 3d transrectal ultrasound images,

N. Orlando, D. J. Gillies, I. Gyacskov, C. Romagnoli, D. D’Souza, and A. Fenster, “Automatic prostate segmentation using deep learning on clinically diverse 3d transrectal ultrasound images,”Medical physics, vol. 47, no. 6, pp. 2413–2426, 2020. AUTHORet al.: TITLE 13

work page 2020

[25] [26]

Aresnet-vit: A hybrid cnn-transformer net- work for benign and malignant breast nodule classification in ultrasound images,

X. Zhao, Q. Zhu, and J. Wu, “Aresnet-vit: A hybrid cnn-transformer net- work for benign and malignant breast nodule classification in ultrasound images,”arXiv preprint arXiv:2407.19316, 2024

work page arXiv 2024

[26] [27]

Hover-trans: Anatomy-aware hover- transformer for roi-free breast cancer diagnosis in ultrasound images,

Y . Mo, C. Han, Y . Liuet al., “Hover-trans: Anatomy-aware hover- transformer for roi-free breast cancer diagnosis in ultrasound images,” arXiv preprint arXiv:2205.08390, 2022

work page arXiv 2022

[27] [28]

Sivasubramanian, D

A. Sivasubramanian, D. Sasidharan, V . Sowmya, and V . Ravi, “Efficient feature extraction using light-weight cnn attention-based deep learning architectures for ultrasound fetal plane classification,”arXiv preprint arXiv:2410.17396, 2024

work page arXiv 2024

[28] [29]

Ultrasegnet: A hybrid deep learning framework for enhanced breast cancer segmentation and classification on ultrasound images,

J. Wanget al., “Ultrasegnet: A hybrid deep learning framework for enhanced breast cancer segmentation and classification on ultrasound images,”Computers, Materials & Continua, vol. 83, no. 2, pp. 1234– 1250, 2022

work page 2022

[29] [30]

Ultraswin: A deep learning method for estimating ejection fraction from echocardiogram videos,

H. Fazryet al., “Ultraswin: A deep learning method for estimating ejection fraction from echocardiogram videos,”International Journal of Computer Assisted Radiology and Surgery, vol. 18, no. 3, pp. 345–356, 2023

work page 2023

[30] [31]

Temporal pyramid network for action recognition,

C. Yang, Y . Xu, J. Shi, B. Dai, and B. Zhou, “Temporal pyramid network for action recognition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

work page 2020

[31] [32]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

work page 2016

[32] [33]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gellyet al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review arXiv 2010

[33] [34]

Wavelet convo- lutions for large receptive fields,

S. E. Finder, R. Amoyal, E. Treister, and O. Freifeld, “Wavelet convo- lutions for large receptive fields,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 363–380

work page 2024

[34] [35]

Temporal segment networks: Towards good practices for deep action recognition,

L. Wang, Y . Xiong, Z. Wang, Y . Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” inEuropean conference on computer vision. Springer, 2016, pp. 20–36

work page 2016

[35] [36]

Non-local neural net- works,

X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural net- works,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

work page 2018

[36] [37]

Temporal relational reasoning in videos,

B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” inProceedings of the European Conference on Computer Vision (ECCV), September 2018

work page 2018

[37] [38]

Slowfast networks for video recognition,

C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

work page 2019

[38] [39]

Tam: Temporal adaptive module for video recognition,

Z. Liu, L. Wang, W. Wu, C. Qian, and T. Lu, “Tam: Temporal adaptive module for video recognition,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 13 708–13 718

work page 2021

[39] [40]

Video swin transformer,

Z. Liu, J. Ning, Y . Cao, Y . Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 3202–3211

work page 2022

[40] [41]

Mvitv2: Improved multiscale vision transformers for classification and detection,

Y . Li, C.-Y . Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, and C. Feichtenhofer, “Mvitv2: Improved multiscale vision transformers for classification and detection,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 4804– 4814

work page 2022

[41] [42]

Revisiting classifier: Transferring vision-language models for video recognition,

W. Wu, Z. Sun, and W. Ouyang, “Revisiting classifier: Transferring vision-language models for video recognition,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, no. 3, 2023, pp. 2847–2855

work page 2023

[42] [43]

Benchmarking micro-action recognition: Dataset, methods, and applications,

D. Guo, K. Li, B. Hu, Y . Zhang, and M. Wang, “Benchmarking micro-action recognition: Dataset, methods, and applications,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 7, pp. 6238–6252, 2024

work page 2024

[43] [44]

A systematic analysis of performance measures for classification tasks,

M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,”Information processing & manage- ment, vol. 45, no. 4, pp. 427–437, 2009

work page 2009

[44] [45]

The meaning and use of the area under a receiver operating characteristic (roc) curve

J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (roc) curve.”Radiology, vol. 143, no. 1, pp. 29–36, 1982

work page 1982

[45] [46]

K. P. Murphy,Machine learning: a probabilistic perspective. MIT press, 2012

work page 2012

[46] [47]

The precision-recall plot is more informa- tive than the roc plot when evaluating binary classifiers on imbalanced datasets,

T. Saito and M. Rehmsmeier, “The precision-recall plot is more informa- tive than the roc plot when evaluating binary classifiers on imbalanced datasets,”PloS one, vol. 10, no. 3, p. e0118432, 2015

work page 2015

[47] [48]

Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation,

D. M. Powers, “Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation,”arXiv preprint arXiv:2010.16061, 2020

work page arXiv 2010