Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy
Pith reviewed 2026-05-17 02:18 UTC · model grok-4.3
The pith
A Siamese Swin Transformer with dual cross-attention distinguishes rectal tumor regrowth from complete response in paired endoscopy images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SSDCA combines longitudinal endoscopic images taken at restaging and at follow-up, using a Siamese Swin Transformer architecture with dual cross-attention, to predict whether a rectal cancer patient has a clinical complete response or local regrowth. On a held-out set of 62 patients it achieved the highest balanced accuracy (81.76% ± 0.04), with sensitivity of 90.07% ± 0.08 and specificity of 72.86% ± 0.05. Performance was stable irrespective of imaging artifacts, and UMAP analysis showed that SSDCA produced the most discriminative feature clusters.
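As a reminder of how the headline metric is defined, balanced accuracy is the mean of sensitivity and specificity. The sketch below uses hypothetical confusion-matrix counts (not taken from the paper) and treats local regrowth (LR) as the positive class, which is an assumption about the paper's labeling.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true LR (positive) cases that were predicted as LR."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true cCR (negative) cases that were predicted as cCR."""
    return tn / (tn + fp)


def balanced_accuracy(sens: float, spec: float) -> float:
    """Mean of sensitivity and specificity; insensitive to class imbalance."""
    return (sens + spec) / 2.0


# Hypothetical counts for illustration only (not from the paper).
tp, fn, tn, fp = 18, 2, 30, 12
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
bal_acc = balanced_accuracy(sens, spec)
```

Applied to the reported mean sensitivity and specificity, the same formula gives (90.07 + 72.86) / 2 = 81.465 percent, close to but not identical with the reported 81.76 percent; the summary does not state how the figures were aggregated across runs.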
What carries the argument
Dual cross-attention inside a Siamese Swin Transformer that fuses features from two unaligned longitudinal scans to classify response.
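To make the fusion idea concrete, here is a minimal pure-Python sketch of dual cross-attention between two token sets. It is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and the mean pooling and concatenation steps, along with every function name, are illustrative choices. In SSDCA the tokens would come from pretrained Swin encoders.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys_values: Matrix) -> Matrix:
    """Each query token attends over all tokens of the other image."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def mean_pool(tokens: Matrix) -> List[float]:
    """Average the token vectors into a single feature vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def dual_cross_attention(restaging: Matrix, followup: Matrix) -> List[float]:
    """Attend in both directions, pool each result, and concatenate."""
    r_to_f = cross_attention(restaging, followup)   # restaging queries follow-up
    f_to_r = cross_attention(followup, restaging)   # follow-up queries restaging
    return mean_pool(r_to_f) + mean_pool(f_to_r)


# Toy token sets of different lengths: no positional correspondence is needed.
restaging_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
followup_tokens = [[0.5, 0.5], [2.0, 0.0]]
fused = dual_cross_attention(restaging_tokens, followup_tokens)
```

Because attention weights are computed between every token of one image and every token of the other, the two inputs need not share a token count or spatial layout, which is the property the summary refers to as requiring no spatial alignment.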
If this is right
- Early objective detection of local regrowth could trigger salvage therapy before distant spread occurs.
- The absence of any spatial-alignment requirement makes the method directly usable on routine clinical image pairs.
- Stability across blood, stool, telangiectasia, and poor quality suggests the model can operate in typical endoscopy conditions.
- Improved inter-cluster separation in feature space indicates the architecture learns more separable representations than standard Swin baselines.
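One way to check the artifact-stability claim in the list above is to stratify test predictions by artifact type and compare balanced accuracy per stratum. The sketch below assumes each test pair carries a single artifact tag; the tags, records, and function name are hypothetical and do not reflect the paper's actual analysis.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (artifact tag, true label, predicted label); 1 = LR, 0 = cCR.
Record = Tuple[str, int, int]


def balanced_accuracy_by_group(records: List[Record]) -> Dict[str, float]:
    """Compute balanced accuracy separately for each artifact tag."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for group, truth, pred in records:
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "tn" if pred == 0 else "fp"
        counts[group][key] += 1

    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        if positives == 0 or negatives == 0:
            continue  # balanced accuracy is undefined without both classes
        result[group] = 0.5 * (c["tp"] / positives + c["tn"] / negatives)
    return result


# Hypothetical predictions for illustration only.
records = [
    ("blood", 1, 1), ("blood", 1, 1), ("blood", 0, 0), ("blood", 0, 1),
    ("stool", 1, 1), ("stool", 1, 0), ("stool", 0, 0), ("stool", 0, 0),
    ("none", 1, 1), ("none", 0, 0),
]
per_group = balanced_accuracy_by_group(records)
```

Similar values across strata would be consistent with the stability claim, although small subgroup sizes make such comparisons noisy.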
Where Pith is reading between the lines
- If the performance holds on external data, the approach could be embedded in surveillance software to flag regrowth during routine follow-up visits.
- The dual cross-attention pattern may transfer to other longitudinal medical imaging tasks where perfect alignment between time points is unavailable.
- Real-time deployment during endoscopy could provide immediate risk scores and reduce reliance on subjective visual assessment.
Load-bearing premise
The single-center held-out test set of 62 patients sufficiently represents real-world clinical variability and the model will generalize to new centers without external validation.
What would settle it
Evaluation of the same model on an independent multi-center collection of watch-and-wait endoscopy image pairs would show whether balanced accuracy remains above 80 percent outside the original data distribution.
Original abstract
Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, objectively accurate methods to early detect local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin transformers to extract domain agnostic features and enhance robustness to imaging variations. Dual cross attention is implemented to emphasize features from the two scans without requiring any spatial alignment of images to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% ± 0.04), sensitivity (90.07% ± 0.08), and specificity (72.86% ± 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 ± 0.18) and minimal intra-cluster dispersion (1.07 ± 0.19) with SSDCA, confirming discriminative representation learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to classify clinical complete response (cCR) versus local regrowth (LR) from paired restaging and follow-up endoscopic images in rectal cancer watch-and-wait patients. Pretrained Swin backbones extract features, dual cross-attention fuses longitudinal information without spatial alignment, and the model is trained on 135 patients and tested on a held-out set of 62 patients, reporting balanced accuracy 81.76% ± 0.04, sensitivity 90.07% ± 0.08, and specificity 72.86% ± 0.05 together with artifact-robustness and UMAP separation claims.
Significance. If the internal performance holds under broader testing, the work could supply an objective tool for earlier regrowth detection during watch-and-wait surveillance, potentially improving patient management and reducing metastasis risk. The technical approach, pretrained Swin encoders plus dual cross-attention over paired but spatially unaligned longitudinal endoscopy images, offers a clear, reproducible direction for longitudinal medical-image analysis in oncology.
major comments (2)
- [Abstract / Results] Abstract and Results section: the central claims of balanced accuracy, sensitivity, specificity, and robustness to blood, stool, telangiectasia, and poor image quality are supported only by a single-center held-out split (135/62 patients). No external or multi-center validation cohort is described, so the reported stability cannot be assumed to generalize to differing endoscopic protocols, demographics, or imaging hardware.
- [Methods] Methods: the manuscript supplies no description of the loss function, optimizer, hyper-parameter search, data augmentation, or statistical tests used to compare SSDCA against Swin-based baselines. Without these details the reported means and standard deviations cannot be reproduced or assessed for significance.
minor comments (1)
- [Abstract] Abstract: the UMAP separation numbers (inter-cluster 1.45 ± 0.18, intra-cluster 1.07 ± 0.19) are presented without stating the distance metric, number of neighbors, or clustering algorithm, making the claim of “maximal inter-cluster separation” difficult to interpret.
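Since the metric behind the separation numbers is not stated, the sketch below shows one plausible reading: inter-cluster separation as the Euclidean distance between class centroids, and intra-cluster dispersion as the mean distance of points to their own centroid. Both definitions, and the 2-D points standing in for UMAP embeddings, are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a set of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def intra_dispersion(points: Sequence[Point]) -> float:
    """Mean Euclidean distance from each point to its cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def inter_separation(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(a), centroid(b))


# Hypothetical 2-D embeddings for the two classes (not from the paper).
ccr_points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
lr_points = [(5.0, 1.0), (7.0, 1.0)]

separation = inter_separation(ccr_points, lr_points)
dispersion_ccr = intra_dispersion(ccr_points)
dispersion_lr = intra_dispersion(lr_points)
```

Under a different metric or different UMAP settings (number of neighbors, minimum distance) the same features can yield different values, which is why the referee asks for these details.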
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript accordingly to improve clarity and completeness.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and Results section: the central claims of balanced accuracy, sensitivity, specificity, and robustness to blood, stool, telangiectasia, and poor image quality are supported only by a single-center held-out split (135/62 patients). No external or multi-center validation cohort is described, so the reported stability cannot be assumed to generalize to differing endoscopic protocols, demographics, or imaging hardware.
  Authors: We agree that the evaluation relies on a single-center held-out split and that external validation would strengthen claims of generalizability. In the revised manuscript we have added an explicit limitations paragraph in the Discussion section stating that the reported performance is internal and that multi-center studies are required to assess robustness across protocols, demographics, and hardware. We retain the artifact-robustness and UMAP analyses as evidence of stability within the current dataset while clearly qualifying the scope of the claims.
  Revision: yes
- Referee: [Methods] Methods: the manuscript supplies no description of the loss function, optimizer, hyper-parameter search, data augmentation, or statistical tests used to compare SSDCA against Swin-based baselines. Without these details the reported means and standard deviations cannot be reproduced or assessed for significance.
  Authors: We thank the referee for identifying this omission. The revised Methods section now includes the following details: binary cross-entropy loss, AdamW optimizer with learning rate 1e-4 and weight decay 0.05, hyper-parameter selection via grid search over learning rate and batch size using 5-fold cross-validation on the training set, data augmentation consisting of random horizontal flips, rotations (±15°), and color jitter, and statistical comparisons performed with paired t-tests and Bonferroni correction. These additions enable full reproduction of the reported means, standard deviations, and significance assessments.
  Revision: yes
- Absence of an external or multi-center validation cohort; the study is limited to single-center data and additional patient cohorts cannot be obtained within the current work.
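The simulated rebuttal above mentions paired t-tests with Bonferroni correction for comparing SSDCA with the baselines. A minimal sketch of those two ingredients follows, assuming per-run scores for two models evaluated on matched runs; the scores and the number of comparisons are hypothetical. It computes only the t statistic and the corrected significance threshold, since a p-value needs the t-distribution CDF (available, for example, through `scipy.stats.ttest_rel`).

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n))


def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / num_comparisons


# Hypothetical balanced accuracies (percent) over five matched runs.
ssdca_runs = [82.0, 80.0, 84.0, 81.0, 83.0]
baseline_runs = [78.0, 77.0, 80.0, 79.0, 78.0]

t_stat = paired_t_statistic(ssdca_runs, baseline_runs)  # n - 1 = 4 degrees of freedom
alpha_per_test = bonferroni_alpha(0.05, num_comparisons=3)
```

With three baseline comparisons, each individual test would need a p-value below roughly 0.0167 to be declared significant at an overall level of 0.05.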
Circularity Check
No circularity: empirical ML evaluation on held-out internal split
full rationale
The paper describes a standard supervised training pipeline for a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) on longitudinal endoscopic image pairs. Training uses 135 patients and evaluation uses a held-out set of 62 patients, with reported metrics (balanced accuracy 81.76% ± 0.04, sensitivity 90.07% ± 0.08, specificity 72.86% ± 0.05) obtained directly from this split. Post-hoc analyses such as artifact robustness checks and UMAP clustering of extracted features are interpretations of model outputs on the same data distribution. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are present that would reduce any central claim to its inputs by construction. The evaluation chain is self-contained against the internal benchmark and does not rely on tautological definitions or load-bearing self-references.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR... Dual cross attention is implemented to emphasize features from the two scans without requiring any spatial alignment
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. The Organ Preservation in Rectal Adenocarcinoma (OPRA) clinical trial (NCT02008656) showed that 47% of patients with rectal adenocarcinoma treated with total neoadjuvant therapy (TNT) followed by a watch-and-wait (WW) surveillance avoid surgery and achieve sustained clinical response without reducing their chance of cure [1]. Subjective cli...
- [2] RELATED WORKS. Multiple prior works have addressed the problem of predicting tumor treatment response by combining pairs of images either using radiomic [7] or DL methods involving radiographic CT and MR images [8, 9, 10, 11, 12], demonstrating the benefit of combining additional time point in improving prediction accuracy. Methods to fuse informatio...
- [3] METHODS. 3.1. Datasets. One hundred ninety-seven patients with LARC treated with induction or consolidation total neoadjuvant therapy (TNT) and selected for WW surveillance at restaging (8 to 12 weeks after completing TNT) were analyzed to classify LR vs. cCR using pairs of endoscopic images at restaging and a follow-up. A total of 2,278 images (LR = 768, ...
- [4] RESULTS. 4.1. SSDCA more accurately predicts LR. As shown in Table 1, SSDCA was the most accurate, followed by SSFC. As shown, SSDCA had higher sensitivity than SSFC but lower specificity than the latter method. On the other hand, using only single image prediction was the least accurate, with a much lower sensitivity (p = 0.0029, 95% CI: [0.1454, 0.3613]) and ...
- [5] DISCUSSION AND CONCLUSION. We developed a longitudinal analysis framework to predict tumor regrowth from endoscopic images of rectal cancer patients. Our findings show that combining pairs of images with dual cross attention leads to more accurate performance compared to single time-point analysis. SSDCA balanced accuracy of 82% is comparable to surgeo...
- [6] COMPLIANCE WITH ETHICAL STANDARDS. This retrospective research study was conducted in line with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committee of Memorial Sloan Kettering Cancer Center.
- [7] ACKNOWLEDGEMENTS. This research was supported by Department of Surgery at Memorial Sloan Kettering. We thank Maria Widmar, Iris H Wei, Emmanouil P Pappou, Garrett M Nash, Martin R Weiser, and Philip B Paty, along with Hannah Thompson, Hannah Williams, Joshua Jesse Smith and Julio Garcia-Aguilar, for their assistance in collecting endoscopic images.
- [8] Julio Garcia-Aguilar, Sujata Patil, Marc J. Gollub, Jin K. Kim, Jonathan B. Yuval, Hannah M. Thompson, Floris S. Verheij, Dana M. Omer, Meghan Lee, Richard F. Dunne, Jorge Marcet, Peter Cataldo, Blase Polite, Daniel O. Herzig, David Liska, Samuel Oommen, Charles M. Friel, Charles Ternent, Andrew L. Coveler, Steven Hunt, Anita Gregory, Madhulika G. V... “Organ preservation in patients with rectal adenocarcinoma treated with total neoadjuvant therapy,” 2022.
- [9] Seth I. Felder, Sujata Patil, Erin Kennedy, and Julio Garcia-Aguilar, “Endoscopic feature and response reproducibility in tumor assessment after neoadjuvant therapy for rectal adenocarcinoma,” Annals of Surgical Oncology, vol. 28, no. 9, pp. 5205–5223, 2021.
- [10] Jorge Tapias Gomez, Aneesh Rangnekar, Hannah Williams, Hannah M. Thompson, Julio Garcia-Aguilar, Joshua Jesse Smith, and Harini Veeraraghavan, “Swin transformers are robust to distribution and concept drift in endoscopy-based longitudinal rectal cancer assessment,” in Medical Imaging 2025: Image Processing, Olivier Colliot and Jhimli Mitra, Eds. Internat... 2025.
- [11] Hannah Williams, Hannah M. Thompson, Christina Lee, Aneesh Rangnekar, Jorge T. Gomez, Maria Widmar, Iris H. Wei, Emmanouil P. Pappou, Garrett M. Nash, Martin R. Weiser, Philip B. Paty, J. Joshua Smith, Harini Veeraraghavan, and Julio Garcia-Aguilar, “Assessing endoscopic response in locally advanced rectal cancer treated with total neoadjuvant therapy... 2024.
- [12] Hester Haak, Xinpei Gao, Monique Maas, Selam Waktola, Sean Benson, Regina Beets-Tan, Geerard Beets, Monique Leerdam, and Jarno Melenhorst, “The use of deep learning on endoscopic images to assess the response of rectal cancer after chemoradiation,” Surgical Endoscopy, vol. 36, pp. 1–9, 10 2021.
- [13] H Thompson, J.K Kim, R.M Jimenez-Rodriguez, J Garcia-Aguilar, and H Veeraraghavan, “Deep learning-based model for identifying tumors in endoscopic images from patients with locally advanced rectal cancer treated with total neoadjuvant therapy,” Dis Colon Rectum, vol. 66, no. 3, pp. 383–391, 2023.
- [14] Elizabeth J. Sutton, Natsuko Onishi, Duc A. Fehr, Brittany Z. Dashevsky, Meredith Sadinski, Katja Pinker, Danny F. Martinez, Edi Brogi, Lior Braunstein, Pedram Razavi, Mahmoud El-Tamer, Virgilio Sacchini, Joseph O. Deasy, Elizabeth A. Morris, and Harini Veeraraghavan, “A machine learning model that classifies breast cancer pathologic complete respon... 2020.
- [15] Cheng Jin, Heng Yu, Jia Ke, Peirong Ding, Yongju Yi, Xiaofeng Jiang, Xin Duan, Jinghua Tang, Daniel T. Chang, Xiaojian Wu, Feng Gao, and Ruijiang Li, “Predicting treatment response from longitudinal images using multi-task deep learning,” Nature Communications, vol. 12, no. 1, pp. 1851, 2021.
- [16] Yuchen Sun, Kunwei Li, Duanduan Chen, Yi Hu, and Shuaitong Zhang, “Lomia-t: A transformer-based longitudinal medical image analysis framework for predicting treatment response of esophageal cancer,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, Marius George Linguraru, Qi Dou, Aasa Feragen, Stamatia Giannarou, Ben Glo... 2024.
- [17] Hailin Yue, Jin Liu, Junjian Li, Hulin Kuang, Jinyi Lang, Jianhong Cheng, Lin Peng, Yongtao Han, Harrison Bai, Yuping Wang, Qifeng Wang, and Jianxin Wang, “Mldrl: Multi-loss disentangled representation learning for predicting esophageal cancer response to neoadjuvant chemoradiotherapy using longitudinal ct images,” Medical Image Analysis, vol. 79, pp.... 2022.
- [18] Matthew Li, Ken Chang, Ben Bearce, Connie Chang, Ambrose Huang, John Campbell, James Brown, Praveer Singh, Katharina Hoebel, Deniz Erdogmus, Stratis Ioannidis, William Palmer, Michael Chiang, and Jayashree Kalpathy-Cramer, “Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging,” npj Digital Medicine,... 2020.
- [19] Heejong Kim, Batuhan K. Karaman, Qingyu Zhao, Alan Q. Wang, Mert R. Sabuncu, and the Alzheimer’s Disease Neuroimaging Initiative, “Learning-based inference of longitudinal image changes: Applications in embryo development, wound healing, and aging brain,” Proceedings of the National Academy of Sciences, vol. 122, no. 8, pp. e2411492122, 2025.
- [20] Yiwen Xu, Ahmed Hosny, Roman Zeleznik, Chintan Parmar, Thibaud Coroller, Idalid Franco, Raymond H. Mak, and Hugo J. W. L. Aerts, “Deep learning predicts lung cancer treatment response from serial medical imaging,” Clinical Cancer Research, vol. 25, no. 11, pp. 3266–3275, 2019.
- [21] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah, “Signature verification using a “siamese” time delay neural network,” in Proceedings of the 7th International Conference on Neural Information Processing Systems, San Francisco, CA, USA, 1993, NIPS’93, pp. 737–744, Morgan Kaufmann Publishers Inc.
- [22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” 2021.
- [23] Wen Lu, Lu Wei, and Minh Nguyen, “Bitemporal attention transformer for building change detection and building damage assessment,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 4917–4935, 2024.
- [24] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
- [25] Sanat Ramesh, Vinkle Srivastav, Deepak Alapatt, Tong Yu, Aditya Murali, Luca Sestini, Chinedu Innocent Nwoye, Idris Hamoud, Saurav Sharma, Antoine Fleurentin, Georgios Exarchakis, Alexandros Karargyris, and Nicolas Padoy, “Dissecting self-supervised learning methods for surgical computer vision,” 2023.
- [26] Hannah Williams, Hannah M Thompson, Sabrina T Lin, Floris S Verheij, Dana M Omer, Li-Xuan Qin, Julio Garcia-Aguilar, OPRA Consortium, et al., “Endoscopic predictors of residual tumor after total neoadjuvant therapy: a post hoc analysis from the organ preservation in rectal adenocarcinoma trial,” Diseases of the Colon & Rectum, vol. 67, no. 3, pp. 369–376, 2024.