pith. machine review for the scientific record.

arxiv: 2605.12855 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords rectal cancer · endoscopy · deep learning · tumor regrowth · longitudinal imaging · watch-and-wait · computer vision · Swin Transformer

The pith

A longitudinal deep learning model detects rectal cancer regrowth from paired endoscopy images with 97 percent sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TREX to analyze pairs of endoscopic images from restaging and follow-up visits in rectal cancer patients under watch-and-wait surveillance. Standard clinical checks currently lack objective early signals for local regrowth, which can delay intervention. If the approach holds, it would flag regrowth 3 to 12 months before clinical confirmation while matching the accuracy of attending surgeons. The model also shows preliminary ability to predict initial treatment response from pre-treatment and restaging pairs.

Core claim

TREX uses siamese Swin Transformers with dual cross-attention on unregistered longitudinal image pairs to distinguish complete clinical response from local regrowth, reaching 97 percent sensitivity and 90 percent balanced accuracy on held-out data while outperforming baselines at early time points of 3–6 and 6–12 months before clinical detection.
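The two headline numbers are standard confusion-matrix quantities; a minimal sketch of how they relate (the counts below are hypothetical, chosen only to illustrate the arithmetic, not taken from the paper):

```python
def sensitivity_and_balanced_accuracy(tp, fn, tn, fp):
    """Sensitivity = recall on local regrowth; balanced accuracy
    averages it with specificity (recall on complete response)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, (sensitivity + specificity) / 2

# Hypothetical counts, not from the paper:
sens, bal = sensitivity_and_balanced_accuracy(tp=29, fn=1, tn=25, fp=5)
print(f"sensitivity={sens:.2f}, balanced accuracy={bal:.2f}")
# → sensitivity=0.97, balanced accuracy=0.90
```

Balanced accuracy rather than raw accuracy is the relevant summary here because sustained complete responses outnumber regrowths during surveillance, so raw accuracy would reward always predicting CR.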

What carries the argument

TREX (Temporal Rectal Endoscopy Cross-attention) extracts features from image pairs via pretrained Swin Transformers in a siamese setup and fuses them with dual cross-attention without spatial co-registration.
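The fusion step can be sketched without the learned projection matrices: query tokens from one visit take a content-weighted average over the other visit's tokens, which is why no pixel-level co-registration is needed. A minimal single-head sketch (the token counts, feature dimension, and concatenation before the classification head are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Each query token softly attends over the other image's tokens;
    the learned Q/K/V projections of a real block are omitted here."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context

rng = np.random.default_rng(0)
restaging = rng.standard_normal((49, 96))   # Swin feature tokens, restaging visit
follow_up = rng.standard_normal((49, 96))   # Swin feature tokens, follow-up visit

# "Dual" cross-attention: each direction attends to the other; the two
# fused maps are concatenated for a downstream classification head.
fused = np.concatenate([cross_attention(restaging, follow_up),
                        cross_attention(follow_up, restaging)], axis=-1)
print(fused.shape)  # (49, 192)
```

The attention weights do the spatial alignment implicitly, which is the design choice that sidesteps registering two endoscopic views taken months apart.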

Load-bearing premise

That the clinical trial dataset used for training and testing is representative of broader patient populations and imaging conditions, and that performance on held-out data will translate to prospective real-world use without significant domain shift.

What would settle it

A prospective multi-center study of new patients under watch-and-wait surveillance in which TREX sensitivity fell below 85 percent would falsify the central performance claim.
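A point estimate below 85 percent would not by itself settle it; the decisive reading is whether the prospective confidence interval excludes the claimed region. A sketch using a Wilson score lower bound (the counts are hypothetical):

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower limit of the Wilson score interval for a binomial rate."""
    p = successes / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / (1 + z * z / n)

# Hypothetical prospective result: 80 of 100 regrowth cases detected.
lb = wilson_lower_bound(80, 100)
print(f"sensitivity lower bound ≈ {lb:.3f}")  # ≈ 0.711, clearly below 0.85
```

With counts like these, even the upper range of plausible sensitivities sits under the claim, so the falsification would be unambiguous; a study barely under 85 percent would need a larger sample to settle anything.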

Figures

Figures reproduced from arXiv: 2605.12855 by Aneesh Rangnekar, Christina Lee, Despoina Kanata, Francisco Sanchez-Vega, Hannah Thompson, Hannah Williams, Harini Veeraraghavan, J. Joshua Smith, Jorge Tapias Gomez, Julio Garcia-Aguilar, Mert R. Sabuncu.

Figure 1. Prediction and surveillance system (TREX) for locally advanced rectal cancer under total neoadjuvant therapy (TNT) and watch-and-wait (WW) management. (a) Clinical workflow: patients undergo TNT treatment followed by either surgical resection for persistent/recurrent disease or WW surveillance with endoscopy every 3 months, achieving a complete or near-complete clinical response (CR) at 2 years. (b) TREX a…
Figure 2. Performance of the baseline models and TREX models across longitudinal follow-up timepoints, where clinicians labeled the image at the last available follow-up (timepoint 0) and any other previous follow-ups are retrospectively assigned that label. TREX achieved the highest balanced accuracy and sensitivity at clinically relevant timepoints, particularly 3–6 months before clinical detection and at the fina…
Figure 3. Representative trajectories of three patients during the WW period. Patients 1 and 2 developed local regrowth, with TREX correctly identifying tumor presence as early as 3–6 months prior to clinical detection in both cases, and 6–12 months prior in Patient 1. Patient 3 achieved a sustained CR, which TREX correctly classified throughout the entire WW period.
Figure 5. Representative Grad-CAMs for misclassified survey images. The top row shows false negatives, with white arrows indicating regions of residual disease. From left to right: (a) nodules, (b) vascular abnormalities, and (c) nodules with stool and partial visualization of the scope. The bottom row shows false positives: (d) normal mucosal fold, (e) stool, and (f) telangiectasia.
Figure 6. TREX specificity and sensitivity on common endoscopic image artifacts for the analyzed timepoints (averaged over folds). Timepoint labels are abbreviated for readability: ‘0’ = clinical detection at the last follow-up, ‘3–6’ = 3–6 months before detection, ‘6–12’ = 6–12 months before detection, and ‘12–24’ = 12–24 months before detection.
Figure 7. Grad-CAM and attention maps for four representative test cases produced by TREX, illustrating good correspondence of relevant spatial features between image pairs.
Figure 8. Ablation experiments evaluating the contribution of key TREX components and design choices. (a) Removing balanced sampling or data augmentation produced the largest reduction in balanced accuracy across all timepoints, while removing dual cross-attention (DCA) or temporal encoding (∆t) also consistently reduced performance, confirming the importance of both architectural and training components. (b) Perfor…
Figure 9. Pairwise Temporal Rectal Endoscopy Cross-Attention architecture. Restaging (res) and follow-up (fup) images are processed through siamese Swin Transformer encoders. The final feature maps undergo dual cross-attention through two CA blocks to model temporal changes, followed by an MLP for classification of CR versus LR across variable follow-up timepoints (∆t).
Original abstract

Clinical trial studies indicate benefit of watch-and-wait (WW) surveillance for patients with rectal cancer showing a complete or near clinical response (CR) directly after treatment (restaging). However, there are no objectively accurate methods to early detect local tumor regrowth (LR) in patients undergoing WW from follow-up exams. Hence, we developed Temporal Rectal Endoscopy Cross-attention (TREX), a longitudinal deep learning approach that combines pairs of images acquired at restaging and follow-up to distinguish CR from LR. TREX uses pretrained Swin Transformers in a siamese setting to extract features from longitudinal images and dual cross-attention to combine the features without spatial co-registration between image pairs. TREX and Swin-based baselines were trained under two settings: (a) detecting LR or CR at the last available follow-up and (b) early detection of LR at 3--6, 6--12, and 12--24 months before clinical confirmation. TREX achieved the highest accuracy in detecting LR with a high sensitivity of 97% $\pm$ 6% and a balanced accuracy of 90% $\pm$ 3%, and outperformed all baselines in early detection at both 3--6 (74% $\pm$ 1%) and 6--12 months (62% $\pm$ 4%) prior to clinical detection. Clinical validation via a surgeon survey showed that TREX matched attending-level overall accuracy (TREX: 86.21% vs.\ Clinicians: 87.84% $\pm$ 1.28%). Finally, we explored TREX's ability to predict treatment response by combining pre-treatment (pre-TNT) and restaging endoscopies, achieving a balanced accuracy of 73% $\pm$ 12%. These results show that longitudinal deep learning analysis of endoscopy may improve surveillance and enable earlier identification of rectal cancer regrowth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TREX, a siamese Swin Transformer architecture with dual cross-attention for analyzing pairs of longitudinal rectal endoscopy images to distinguish complete response from local regrowth in watch-and-wait rectal cancer patients. It reports TREX achieving 97% sensitivity ±6% and 90% balanced accuracy ±3% for LR detection at the last follow-up, outperforming baselines in early detection at 3-6 months (74% ±1%) and 6-12 months (62% ±4%) prior to clinical confirmation, plus a surgeon survey showing TREX matches attending-level accuracy (86.21% vs 87.84% ±1.28%) and an exploratory pre-treatment response prediction task (73% ±12% balanced accuracy).

Significance. If the performance claims are supported by patient-disjoint validation, the work has clear clinical significance for improving surveillance in non-operative rectal cancer management by enabling automated early detection of regrowth from endoscopy pairs without spatial registration. The surgeon survey provides a useful form of clinical validation, and the longitudinal cross-attention design addresses a practical challenge in serial imaging.

major comments (2)
  1. [Methods and Abstract] The manuscript provides no patient count, total image count, exclusion criteria, or explicit statement on whether train/test splits were performed at the patient level (rather than image or pair level). Given the longitudinal setup where regrowth labels derive from later clinical confirmation, this omission leaves open the possibility of patient-level data leakage, which directly undermines confidence in the held-out metrics of 97% ±6% sensitivity and 90% ±3% balanced accuracy reported in the abstract and results.
  2. [Results] The early-detection experiments (3-6 and 6-12 months prior) use the same held-out evaluation protocol as the primary detection task. Without confirmation of strictly patient-disjoint partitioning, the claim that TREX outperforms all baselines at these time horizons rests on potentially optimistic estimates and requires explicit verification to support the central generalization argument.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by immediately stating the number of patients and images to contextualize the reported standard deviations.
  2. [Results] Clarify the exact composition of the surgeon survey (number of participants, experience levels, and how cases were selected) to allow readers to assess the clinical validation strength.
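The leakage concern in major comment 1 reduces to a simple invariant: the patient ID, not the image pair, is the unit of splitting. A minimal sketch of that constraint (record names are hypothetical):

```python
def patient_level_split(records, test_patients):
    """Partition (patient_id, image_pair) records so that no patient
    contributes data to both the training and the test side."""
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test

records = [("p1", "pair_a"), ("p1", "pair_b"),
           ("p2", "pair_c"), ("p3", "pair_d")]
train, test = patient_level_split(records, test_patients={"p2"})

# The invariant a reviewer would check: the patient sets are disjoint.
assert {p for p, _ in train}.isdisjoint({p for p, _ in test})
```

In practice a grouped splitter such as scikit-learn's GroupKFold enforces the same constraint automatically; the referee's point is that the manuscript should state which unit was grouped.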

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and for identifying key areas where additional detail will strengthen the manuscript. We address each major comment below and will revise the manuscript to incorporate the requested clarifications on dataset composition and validation strategy.

Point-by-point responses
  1. Referee: [Methods and Abstract] The manuscript provides no patient count, total image count, exclusion criteria, or explicit statement on whether train/test splits were performed at the patient level (rather than image or pair level). Given the longitudinal setup where regrowth labels derive from later clinical confirmation, this omission leaves open the possibility of patient-level data leakage, which directly undermines confidence in the held-out metrics of 97% ±6% sensitivity and 90% ±3% balanced accuracy reported in the abstract and results.

    Authors: We agree that these details are necessary for readers to evaluate the risk of data leakage. In the revised manuscript we will add the total number of patients, total number of images, explicit exclusion criteria, and a clear statement that all train/test splits (including those used for the early-detection experiments) were performed at the patient level with no patient overlap between training and test sets. This partitioning was already enforced in our experiments; the omission was an oversight in the initial submission. revision: yes

  2. Referee: [Results] The early-detection experiments (3-6 and 6-12 months prior) use the same held-out evaluation protocol as the primary detection task. Without confirmation of strictly patient-disjoint partitioning, the claim that TREX outperforms all baselines at these time horizons rests on potentially optimistic estimates and requires explicit verification to support the central generalization argument.

    Authors: We confirm that the early-detection experiments used exactly the same patient-disjoint splits as the primary last-follow-up detection task. In the revision we will explicitly restate this partitioning protocol in both the Methods and Results sections and will add a sentence confirming that no patient contributes images to both training and test sets at any time horizon. This will directly address the concern about optimistic estimates. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical performance reported on held-out longitudinal data

full rationale

The paper trains the TREX siamese Swin Transformer model on pairs of restaging and follow-up endoscopy images under explicit detection and early-detection settings, then reports sensitivity, balanced accuracy, and other metrics on held-out test data. No equations or claims reduce a result to its own inputs by construction, no fitted parameters are relabeled as predictions, and no self-citation chain supplies the central performance numbers. The reported figures (e.g., 97% sensitivity, 74% early detection) are standard empirical outcomes of supervised training and evaluation rather than tautological restatements of the training procedure or prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on transfer learning from general-domain pretraining working for medical endoscopy and on the clinical trial data being sufficient to train a generalizable model.

free parameters (1)
  • Swin Transformer hyperparameters and training settings
    Standard deep learning training involves many tunable parameters fitted to the specific dataset.
axioms (1)
  • domain assumption: pretrained Swin Transformers extract features relevant to rectal endoscopy (no domain-specific fine-tuning details are provided)
    Assumes general image pretraining transfers effectively to this medical imaging task.

pith-pipeline@v0.9.0 · 5691 in / 1285 out tokens · 40831 ms · 2026-05-14T20:27:11.791585+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 29 canonical work pages

  1. [1]

    Diseases of the Colon and Rectum 67(1), 18–31 (2024)https://doi.org/10.1097/DCR

    Langenfeld, S.J., Davis, B.R., Vogel, J.D., Davids, J.S., Temple, L.K., Cologne, K.G., Hendren, S., Hunt, S., Aguilar, J.G., Feingold, D.L.,et al.: The ameri- can society of colon and rectal surgeons clinical prac- tice guidelines for the management of rectal cancer 2023 supplement. Diseases of the Colon and Rectum 67(1), 18–31 (2024)https://doi.org/10.10...

  2. [2]

    Journal of Clinical Oncology40(23), 2546–2556 (2022)https://doi.org/10.1200/JCO.22.00032 https://ascopubs.org/doi/pdf/10.1200/JCO.22.00032

    Garcia-Aguilar, J., Patil, S., Gollub, M.J., Kim, J.K., Yuval, J.B., Thompson, H.M., Verheij, F.S., Omer, D.M., Lee, M., Dunne, R.F., Marcet, J., Cataldo, P., Polite, B., Herzig, D.O., Liska, D., Oommen, S., Friel, C.M., Ternent, C., Coveler, A.L., Hunt, S., Gregory, A., Varma, M.G., Bello, B.L., Carmichael, J.C., Krauss, J., Gleisner, A., Paty, P.B., Wei...

  3. [3]

    Journal of Clinical Oncology42(5), 500–506 (2024)https://doi.org/10.1200/JCO.23.01208 https://ascopubs.org/doi/pdf/10.1200/JCO.23.01208

    Verheij, F.S., Omer, D.M., Williams, H., Lin, S.T., Qin, L.-X., Buckley, J.T., Thompson, H.M., Yuval, J.B., Kim, J.K., Dunne, R.F., Marcet, J., Cataldo, P., Polite, B., Herzig, D.O., Liska, D., Oommen, S., Friel, C.M., Ternent, C., Coveler, A.L., Hunt, S., Gregory, A., Varma, M.G., Bello, B.L., Carmichael, J.C., Krauss, J., Gleisner, A., Guillem, J.G., Te...

  4. [4]

    An- nals of surgery268(6), 955–967 (2018)

    Dattani, M., Heald, R.J., Goussous, G., Broadhurst, J., Sao Juliao, G.P., Habr-Gama, A., Perez, R.O., Moran, B.J.: Oncological and survival outcomes in watch and wait patients with a clinical complete re- sponse after neoadjuvant chemoradiotherapy for rectal cancer: a systematic review and pooled analysis. An- nals of surgery268(6), 955–967 (2018)

  5. [5]

    The Lancet Gastroenterology & Hep- atology3(12), 825–836 (2018)https://doi.org/10

    Chadi, S.A., Malcomson, L., Ensor, J., Riley, R.D., Vaccaro, C.A., Rossi, G.L., Daniels, I.R., Smart, N.J., Osborne, M.E., Beets, G.L., Maas, M., Bit- terman, D.S., Du, K., Gollins, S., Sun Myint, A., Smith, F.M., Saunders, M.P., Scott, N., O’Dwyer, S.T., de Castro Araujo, R.O., Valadao, M., Lopes, A., Hsiao, C.-W., Lai, C.-L., Smith, R.K., Paulson, E.C.,...

  6. [6]

    Diseases of the Colon and Rectum 67(3), 369–376 (2024)

    Williams, H., Thompson, H.M., Lin, S.T., Verheij, F.S., Omer, D.M., Qin, L.-X., Garcia-Aguilar, J., Con- sortium, O.,et al.: Endoscopic predictors of residual tumor after total neoadjuvant therapy: a post hoc analysis from the organ preservation in rectal adeno- carcinoma trial. Diseases of the Colon and Rectum 67(3), 369–376 (2024)

  7. [7]

    Annals of Surgical Oncology28(9), 5205–5223 (2021)

    Felder, S., Patil, S., Kennedy, E., Garcia-Aguilar, J.: Endoscopic feature and response reproducibility in tu- mor assessment after neoadjuvant therapy for rectal adenocarcinoma. Annals of Surgical Oncology28(9), 5205–5223 (2021)

  8. [8]

    Annals of Surgical Oncology22(12), 3873–3880 (2015)

    Maas, M., Lambregts, D.M., Nelemans, P.J., Heij- nen, L.A., Martens, M.H., Leijtens, J.W., Sosef, M., Hulsew´ e, K.W., Hoff, C., Breukink, S.O.,et al.: As- sessment of clinical complete response after chemoradi- ation for rectal cancer with digital rectal examination, endoscopy, and mri: selection for organ-saving treat- ment. Annals of Surgical Oncology2...

  9. [9]

    Ko, H.M., Choi, Y.H., Lee, J.E., Lee, K.H., Kim, J.Y., Kim, J.S.: Combination assessment of clinical com- plete response of patients with rectal cancer follow- ing chemoradiotherapy with endoscopy and magnetic resonance imaging. Ann. Coloproctol.35(4), 202–208 (2019)

  10. [10]

    Radiology269(1), 101–112 (2013) https://doi.org/10.1148/radiol.13122833 https://doi.org/10.1148/radiol.13122833

    Paardt, M.P., Zagers, M.B., Beets-Tan, R.G.H., Stoker, J., Bipat, S.: Patients who undergo preoperative chemoradiotherapy for locally ad- vanced rectal cancer restaged by using diag- nostic mr imaging: A systematic review and meta-analysis. Radiology269(1), 101–112 (2013) https://doi.org/10.1148/radiol.13122833 https://doi.org/10.1148/radiol.13122833. PMI...

  11. [11]

    Kawai, K., Ishihara, S., Nozawa, H., Hata, K., Kiy- 11 omatsu, T., Morikawa, T., Fukayama, M., Watanabe, T.: Prediction of pathological complete response us- ing endoscopic findings and outcomes of patients who underwent watchful waiting after chemoradiotherapy for rectal cancer. Dis. Colon Rectum60(4), 368–375 (2017)

  12. [12]

    Clinical and Transla- tional Oncology26(4), 825–835 (2024)https://doi

    Safont, M.J., Garc´ ıa-Figueiras, R., Hernando-Requejo, O., Jimenez-Rodriguez, R., Lopez-Vicente, J., Machado, I., Ayuso, J.-R., Bustamante-Bal´ en, M., De Torres-Olombrada, M.V., Dom´ ınguez Tristancho, J.L., Fern´ andez-Ace˜ nero, M.J., Suarez, J., Vera, R.: Interdisciplinary spanish consensus on a watch-and- wait approach for rectal cancer. Clinical an...

  13. [13]

    Medical Im- age Analysis90, 102962 (2023)https://doi.org/10

    Wang, A.Q., Yu, E.M., Dalca, A.V., Sabuncu, M.R.: A robust and interpretable deep learning framework for multi-modal registration via keypoints. Medical Im- age Analysis90, 102962 (2023)https://doi.org/10. 1016/j.media.2023.102962

  14. [14]

    Medical Physics50(8), 4758– 4774 (2023)https://doi.org/10.1002/mp.16527 https://aapm.onlinelibrary.wiley.com/doi/pdf/10.1002/mp.16527

    Jiang, J., Hong, J., Tringale, K., Reyngold, M., Crane, C., Tyagi, N., Veeraraghavan, H.: Progressively refined deep joint registration segmentation (prorseg) of gastrointestinal organs at risk: Application to mri and cone-beam ct. Medical Physics50(8), 4758– 4774 (2023)https://doi.org/10.1002/mp.16527 https://aapm.onlinelibrary.wiley.com/doi/pdf/10.1002/mp.16527

  15. [15]

    Frontiers in OncologyV olume 15 - 2025(2025)https://doi

    Mendes, J., Oliveira, B., Ara´ ujo, C., Galr˜ ao, J., Mota, A.M., Garcia, N.C., Matela, N.: Deep learning in breast cancer risk prediction: a review of recent ap- plications in full-field digital mammography. Frontiers in OncologyV olume 15 - 2025(2025)https://doi. org/10.3389/fonc.2025.1656842

  16. [16]

    Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L., Tse, D., Etemadi, M., Ye, W., Corrado, G., Naidich, D.P., Shetty, S.: End-to- end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med.25(6), 954–961 (2019)

  17. [17]

    International Journal of Radiation Oncology*Biology*Physics122(5), 1380–1390 (2025) https://doi.org/10.1016/j.ijrobp.2025.03.012

    Chen, X., Meng, F., Zhang, P., Wang, L., Yao, S., An, C., Li, H., Zhang, D., Li, H., Li, J., Wang, L., Liu, Y.: Establishing a deep learning model that in- tegrates pretreatment and midtreatment computed to- mography to predict treatment response in non-small cell lung cancer. International Journal of Radiation Oncology*Biology*Physics122(5), 1380–1390 (2...

  18. [18]

    Diseases of the Colon and Rectum66(12), 1195–1206 (2023)https://doi.org/ 10.1097/DCR.0000000000002931

    Ke, J.,et al.: A longitudinal mri-based artificial in- telligence system to predict pathological complete re- sponse after neoadjuvant therapy in rectal cancer: A multicenter validation study. Diseases of the Colon and Rectum66(12), 1195–1206 (2023)https://doi.org/ 10.1097/DCR.0000000000002931

  19. [19]

    Nature Commu- nications12, 1851 (2021)https://doi.org/10.1038/ s41467-021-22188-y

    Jin, C., Yu, H., Ke, J., Ding, P., Yi, Y., Jiang, X., Duan, X., Tang, J., Chang, D.T., Wu, X., Gao, F., Li, R.: Predicting treatment response from longitudinal images using multi-task deep learning. Nature Commu- nications12, 1851 (2021)https://doi.org/10.1038/ s41467-021-22188-y

  20. [20]

    In: Proceedings of Medical Im- age Computing and Computer Assisted Intervention – MICCAI 2024, vol

    Sun, Y., Li, K., Chen, D., Hu, Y., Zhang, S.: LOMIA- T: A Transformer-based LOngitudinal Medical Image Analysis framework for predicting treatment response of esophageal cancer . In: Proceedings of Medical Im- age Computing and Computer Assisted Intervention – MICCAI 2024, vol. LNCS 15005. Springer, ??? (2024)

  21. [21]

    Medical Image Analysis34, 200–219 (2016)https://doi.org/10.1016/j.media

    Gerig, G., Chung, A., Datar, M., Gouttard, S., Lee, J., Shi, Y., Wang, T., Wu, J., Faria, A.V.: Longitu- dinal modeling of appearance and shape and its po- tential for clinical use. Medical Image Analysis34, 200–219 (2016)https://doi.org/10.1016/j.media. 2016.06.011

  22. [22]

    Proceedings of the National Academy of Sciences122(8), 2411492122 (2025)https://doi.org/10.1073/pnas.2411492122 https://www.pnas.org/doi/pdf/10.1073/pnas.2411492122

    Kim, H., Karaman, B.K., Zhao, Q., Wang, A.Q., Sabuncu, M.R., Alzheimer’s Disease Neuroimag- ing Initiative: Learning-based inference of longitudinal image changes: Applications in embryo development, wound healing, and aging brain. Proceedings of the National Academy of Sciences122(8), 2411492122 (2025)https://doi.org/10.1073/pnas.2411492122 https://www.p...

  23. [23]

    In: 2021 7th International Conference on Computer and Communications (ICCC), pp

    Shen, Z., Fu, R., Lin, C., Zheng, S.: Cotr: Convo- lution in transformer network for end to end polyp detection. In: 2021 7th International Conference on Computer and Communications (ICCC), pp. 1757– 1761 (2021).https://doi.org/10.1109/ICCC54389. 2021.9674267

  24. [24]

    Yamada, M., Saito, Y., Imaoka, H., Saiko, M., Ya- mada, S., Kondo, H., Takamaru, H., Sakamoto, T., Sese, J., Kuchiba, A., Shibata, T., Hamamoto, R.: Development of a real-time endoscopic image diagno- sis support system using deep learning technology in colonoscopy. Sci. Rep.9(1), 14465 (2019)

  25. [25]

    Medical Image Analysis70, 102002 (2021)

    Ali, S., Dmitrieva, M., Ghatwary, N., Bano, S., Po- lat, G., Temizel, A., Krenzer, A., Hekalo, A., Guo, Y.B., Matuszewski, B., Gridach, M., Voiculescu, I., Yoganand, V., Chavan, A., Raj, A., Nguyen, N.T., Tran, D.Q., Huynh, L.D., Boutry, N., Rezvy, S., Chen, H., Choi, Y.H., Subramanian, A., Balasubra- manian, V., Gao, X.W., Hu, H., Liao, Y., Stoyanov, D.,...

  26. [26]

    In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp

    Fan, D.-P., Ji, G.-P., Zhou, T., Chen, G., Fu, H., Shen, J., Shao, L.: Pranet: Parallel reverse attention network for polyp segmentation. In: International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 263–273 (2020). Springer

  27. [27]

    CAAI Artificial Intelligence Research2, 9150015 (2023)

    Dong, B., Wang, W., Fan, D.-P., Li, J., Fu, H., Shao, L.: Polyp-pvt: Polyp segmentation with pyra- mid vision transformers. CAAI Artificial Intelligence Research2, 9150015 (2023)

  28. [28]

    Gut68(1), 94–100 (2019) https://doi.org/10.1136/gutjnl-2017-314547 https://gut.bmj.com/content/68/1/94.full.pdf

    Byrne, M.F., Chapados, N., Soudan, F., Oertel, C., Linares P´ erez, M., Kelly, R., Iqbal, N., Chandelier, F., Rex, D.K.: Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model. Gut68(1), 94–100 (2019) https://doi.org/10.1136/gutjnl-2017-...

  29. [29]

    Gastroenterology158(8), 2169– 21798 (2020)https://doi.org/10.1053/j.gastro

    Jin, E.H., Lee, D., Bae, J.H., Kang, H.Y., Kwak, M.- S., Seo, J.Y., Yang, J.I., Yang, S.Y., Lim, S.H., Yim, J.Y., Lim, J.H., Chung, G.E., Chung, S.J., Choi, J.M., Han, Y.M., Kang, S.J., Lee, J., Chan Kim, H., Kim, J.S.: Improved accuracy in optical diagnosis of colorec- tal polyps using convolutional neural networks with visual explanations. Gastroenterol...

  30. [30]

    Medical Image Analysis82, 102587 (2022)

    Turan, M., Durmus, F.:UC-NfNet: Deep learning-enabled assessment of ulcerative colitis from colonoscopy images. Medical Image Analysis82, 102587 (2022)

  31. [31]

    NPJ Digit Med6(1), 64 (2023)

    Dong, Z., Wang, J., Li, Y., Deng, Y., Zhou, W., Zeng, X., Gong, D., Liu, J., Pan, J., Shang, R., Xu, Y., Xu, M., Zhang, L., Zhang, M., Tao, X., Zhu, Y., Du, H., Lu, Z., Yao, L., Wu, L., Yu, H.: Explainable artifi- cial intelligence incorporated with domain knowledge diagnosing early gastric neoplasms under white light endoscopy. NPJ Digit Med6(1), 64 (2023)

  32. [32]

    Endoscopy 57(11), 1254–1260 (2025)

    Almeida, E., Martins, M.L., Marques, D., Delas, R., Almeida, T., Chaves, J., Libˆ anio, D., Renna, F., Coim- bra, M.T., Dinis-Ribeiro, M.: Artificial intelligence for endoscopic grading of gastric intestinal metaplasia: ad- vancing risk stratification for gastric cancer. Endoscopy 57(11), 1254–1260 (2025)

  33. [33]

    NPJ Digit

    Lin, Y., Huang, C., Tian, H., Yang, B., Deng, T., Pan, Y., Wang, H., Li, X.: Improving generalization of polyp detection via conditional StyleGAN augmented training. NPJ Digit. Med.9(1), 113 (2026)

  34. [34]

    IEEE Jour- nal of Biomedical and Health Informatics29(6), 3864– 3873 (2025)https://doi.org/10.1109/JBHI.2024

    Golhar, M.V., Bobrow, T.L., Ngamruengphong, S., Durr, N.J.: Gan inversion for data augmentation to improve colonoscopy lesion classification. IEEE Jour- nal of Biomedical and Health Informatics29(6), 3864– 3873 (2025)https://doi.org/10.1109/JBHI.2024. 3397611

  35. [35]

    In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda- Mahmood, T., Taylor, R

    Wang, Z., Liu, C., Zhang, S., Dou, Q.: Foundation model for endoscopy video analysis via large-scale self- supervised pre-train. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda- Mahmood, T., Taylor, R. (eds.) Medical Image Com- puting and Computer Assisted Intervention – MICCAI 2023, pp. 101–111. Springer, Cham (2023)

  36. [36]

    The Laryngoscope134(11), 4535–4541 (2024)https://doi.org/10.1002/lary

    Paderno, A., Rau, A., Bedi, N., Bossi, P., Mercante, G., Piazza, C., Holsinger, F.C.: Computer vision foundation models in endoscopy: Proof of concept in oropharyngeal cancer. The Laryngoscope134(11), 4535–4541 (2024)https://doi.org/10.1002/lary. 31534

  37. [37]

    Gas- troenterology170(1), 174–187 (2025)https://doi

    Jong, M.R., Boers, T.G.W., Fockens, K.N., Jukema, J.B., Kusters, C.H.J., Jaspers, T.J.M., Heslinga, R.A.H., Slooter, F.C., Struyvenberg, M.R., Bisschops, R., Putten, J.A., With, P.H.N., Sommen, F., Groof, A.J., Bergman, J.J.G.H.M., BONS-AI Consortium: Gastronet-5m: A multicenter dataset for developing foundation models in gastrointestinal endoscopy. Gas- ...

  38. [38]

    Dis Colon Rec- tum66(3), 383–391 (2023)

    Thompson, H., Kim, J.K., Jimenez-Rodriguez, R.M., Garcia-Aguilar, J., Veeraraghavan, H.: Deep learning- based model for identifying tumors in endoscopic im- ages from patients with locally advanced rectal cancer treated with total neoadjuvant therapy. Dis Colon Rec- tum66(3), 383–391 (2023)

  39. [39]

    Surgical Endoscopy36, 1–9 (2021) https://doi.org/10.1007/s00464-021-08685-7

    Haak, H., Gao, X., Maas, M., Waktola, S., Benson, S., Beets-Tan, R., Beets, G., Leerdam, M., Melen- horst, J.: The use of deep learning on endoscopic images to assess the response of rectal cancer after chemoradiation. Surgical Endoscopy36, 1–9 (2021) https://doi.org/10.1007/s00464-021-08685-7

  40. [40]

    Annals of Surgical Oncol- ogy31(10), 6443–6451 (2024)https://doi.org/10

    Williams, H., Thompson, H.M., Lee, C., Rangnekar, A., Gomez, J.T., Widmar, M., Wei, I.H., Pappou, E.P., Nash, G.M., Weiser, M.R., Paty, P.B., Smith, J.J., Veeraraghavan, H., Garcia-Aguilar, J.: Assess- ing endoscopic response in locally advanced rectal can- cer treated with total neoadjuvant therapy: Devel- opment and validation of a highly accurate convo...

  41. [41]

    Remote Sensing15(9) (2023)https://doi

    Zhou, Y., Huo, C., Zhu, J., Huo, L., Pan, C.: Dcat: Dual cross-attention-based transformer for change de- tection. Remote Sensing15(9) (2023)https://doi. org/10.3390/rs15092395

  42. [42]

    IEEE Journal of Selected Top- ics in Applied Earth Observations and Remote Sens- ing17, 4917–4935 (2024)https://doi.org/10.1109/ JSTARS.2024.3354310

    Lu, W., Wei, L., Nguyen, M.: Bitemporal attention transformer for building change detection and build- ing damage assessment. IEEE Journal of Selected Top- ics in Applied Earth Observations and Remote Sens- ing17, 4917–4935 (2024)https://doi.org/10.1109/ JSTARS.2024.3354310

  43. [43]

    In: Interna- tional Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI 2023, pp

    Zhu, Q., Mathai, T.S., Mukherjee, P., Peng, Y., Sum- mers, R.M., Lu, Z.: Utilizing longitudinal chest x-rays and reports to pre-fill radiology reports. In: Interna- tional Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI 2023, pp. 189–198 (2023). Springer

  44. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128(2), 336–359 (2020)

  45. Sande, M.E., Maas, M., Melenhorst, J., Breukink, S.O., Leerdam, M.E., Beets, G.L.: Predictive value of endoscopic features for a complete response after chemoradiotherapy for rectal cancer. Ann. Surg. 274(6), 541–547 (2021)

  46. Gomez, J.T., Rangnekar, A., Williams, H., Thompson, H.M., Garcia-Aguilar, J., Smith, J.J., Veeraraghavan, H.: Swin transformers are robust to distribution and concept drift in endoscopy-based longitudinal rectal cancer assessment. In: Colliot, O., Mitra, J. (eds.) Medical Imaging 2025: Image Processing, vol. 13406, p. 134061. SPIE (2025)

  47. Chino, A., Konishi, T., Ogura, A., Kawachi, H., Osumi, H., Yoshio, T., Kishihara, T., Ide, D., Saito, S., Igarashi, M., Akiyoshi, T., Ueno, M., Fujisaki, J.: Endoscopic criteria to evaluate tumor response of rectal cancer to neoadjuvant chemoradiotherapy using magnifying chromoendoscopy. European Journal of Surgical Oncology 44(8), 1247–1253 (2018) https://doi.org/10.1016/j.ejso.2018.04.013

  48. Stijns, R.C.H., Leijtens, J., Graaf, E., Bach, S.P., Beets, G., Bremers, A.J.A., Beets-Tan, R.G.H., Wilt, J.H.W.: Endoscopy and MRI for restaging early rectal cancer after neoadjuvant treatment. Colorectal Disease 25(2), 211–221 (2023) https://doi.org/10.1111/codi.16341

  49. Garcia-Aguilar, J., Shi, Q., Thomas, C.R., Chan, E., Cataldo, P., Marcet, J., Haller, D., Bergsland, E., et al.: Optimal timing of surgery after chemoradiation for advanced rectal cancer: Preliminary results of a multicenter, nonrandomized phase II prospective trial. Annals of Surgery 254(1), 97–102 (2011) https://doi.org/10.1097/SLA.0b013e3182196e1f

  50. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021). https://arxiv.org/abs/2103.14030

  51. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848

  52. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)

  53. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)

  54. Ramesh, S., Srivastav, V., Alapatt, D., Yu, T., Murali, A., Sestini, L., Nwoye, C.I., Hamoud, I., Sharma, S., Fleurentin, A., Exarchakis, G., Karargyris, A., Padoy, N.: Dissecting self-supervised learning methods for surgical computer vision. Medical Image Analysis 88, 102844 (2023) https://doi.org/10.1016/j.media.2023.102844

  55. McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), 153–157 (1947) https://doi.org/10.1007/BF02295996