arxiv: 2604.13397 · v1 · submitted 2026-04-15 · 💻 cs.CV

A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy

Pith reviewed 2026-05-10 13:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords deformable image registration · proton therapy · longitudinal CT · multimodal learning · coarse-to-fine framework · attention mechanisms · adaptive radiotherapy · clinical priors

The pith

The multimodal coarse-to-fine framework integrates clinical priors to produce faster and more robust deformable registration of longitudinal CT scans in proton therapy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a deep learning framework for aligning planning and repeat CT scans in proton therapy patients. Proton therapy is sensitive to anatomical changes, so accurate and fast registration supports better treatment adaptation, yet current methods either run too slowly or ignore available clinical details beyond the images themselves. The model uses dual CNN encoders and a transformer decoder in a coarse-to-fine setup, then adds anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization to incorporate contours, dose distributions, and treatment planning text. A reader would care because these additions aim to focus the registration on clinically relevant structures across many body sites and cancer types, as shown on 1,222 paired scans with gains over prior approaches.

Core claim

The proposed clinically scalable coarse-to-fine deformable registration framework integrates multimodal information from the proton radiotherapy workflow, employing dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation on a dataset of 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types.
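
The progressive refinement of deformation fields can be illustrated with a toy 1D sketch. This is not the paper's model: the residuals are hand-picked numbers standing in for decoder outputs, `upsample_field` and `refine` are hypothetical helper names, and adding residuals (rather than composing warps) is a simplifying assumption. It only shows the usual coarse-to-fine bookkeeping, where a field expressed in voxel units is resampled and rescaled each time the grid resolution doubles.

```python
def upsample_field(field):
    """Double the resolution of a 1D displacement field (in voxel units).

    Each coarse sample is repeated twice (nearest-neighbour upsampling) and
    its value is doubled, because one coarse voxel spans two fine voxels.
    """
    fine = []
    for d in field:
        fine.extend([2.0 * d, 2.0 * d])
    return fine


def refine(coarse_field, residuals_per_level):
    """Coarse-to-fine refinement: upsample, then add that level's residual."""
    field = list(coarse_field)
    for residual in residuals_per_level:
        field = upsample_field(field)
        if len(residual) != len(field):
            raise ValueError("residual must match the upsampled resolution")
        field = [d + r for d, r in zip(field, residual)]
    return field


# Hypothetical values: a 2-sample coarse field refined over two levels.
coarse = [1.0, -0.5]
residuals = [
    [0.25, 0.0, 0.0, -0.25],                    # level 1 (4 samples)
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, -0.5, 0.0],  # level 2 (8 samples)
]
final_field = refine(coarse, residuals)
print(final_field)
```

The coarse level fixes the bulk motion cheaply, and each finer level only has to predict a small correction, which is the usual motivation for this kind of design.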

What carries the argument

The anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization mechanisms inside a dual-CNN and transformer coarse-to-fine architecture.
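
One common way to realize text-conditioned feature modulation is a FiLM-style per-channel scale and shift predicted from a text embedding. Whether the paper uses exactly this form is not stated in this summary, so the sketch below is an assumption; the function names, weights, and embedding values are all made up for illustration, and a real model would use a tensor library rather than plain lists.

```python
def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film_modulate(features, text_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Apply a per-channel scale (gamma) and shift (beta) to image features.

    features: list of channels, each a flat list of feature values.
    gamma and beta are predicted from the text embedding by two dense layers.
    """
    gamma = linear(w_gamma, b_gamma, text_embedding)
    beta = linear(w_beta, b_beta, text_embedding)
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical numbers: 2 feature channels, a 3-dimensional text embedding.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
text_embedding = [1.0, 0.0, 2.0]
w_gamma = [[0.5, 0.0, 0.25], [0.0, 1.0, 0.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 0.0, 0.5], [1.0, 0.0, 0.0]]
b_beta = [0.0, -1.0]

modulated = film_modulate(features, text_embedding,
                          w_gamma, b_gamma, w_beta, b_beta)
print(modulated)
```

With these values the first channel is scaled by 2 and shifted by 1, while the second passes through unchanged, showing how the same image features can be reweighted differently depending on the accompanying text.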

If this is right

  • The framework enables registration speeds suitable for online adaptive proton therapy workflows.
  • It yields consistent accuracy gains over existing methods on a large multi-region proton therapy dataset.
  • Registrations become more clinically meaningful by prioritizing targets and organs at risk.
  • The design accommodates varied anatomical sites and disease types when the clinical priors are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mechanisms could be tested on other radiotherapy modalities where similar planning data exist.
  • If text from treatment plans adds value here, the conditioning step might transfer to other medical report types.
  • Deployment in real-time treatment rooms would test whether the speed gains translate to reduced setup uncertainty.
  • A version without the priors could still serve as a baseline image-only model for cases where clinical data are missing.

Load-bearing premise

Clinical priors such as contours, dose distributions, and treatment planning text are reliably available and high quality across diverse clinical scenarios.

What would settle it

On the same 1,222-scan dataset, removing the multimodal components and showing no remaining accuracy or speed advantage over standard deep-learning registration methods would falsify the central contribution.

Figures

Figures reproduced from arXiv: 2604.13397 by Caiwen Jiang, Dinggang Shen, Jean-Claude M. Rwigema, Jonathan B. Ashman, Lisa A. McGee, Michele Y. Halyard, Mi Jia, Nathan Y. Yu, Sameer R. Keole, Samir H. Patel, Steven E. Schild, Sujay A. Vora, Terence T. Sio, Wei Liu, William G. Rule, William W. Wong, Yuzhen Ding.

Figure 1: Multimodal information available in the proton radiotherapy workflow for […]
Figure 2: Overview of the proposed multimodal clinically informed coarse-to-fine […]
Figure 4: Multi-region results
Figure 6: Qualitative comparison of registration results for a representative head […]
original abstract

Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.
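
The abstract does not spell out what "foreground-aware optimization" computes. One plausible reading, assumed here, is a similarity loss that weights voxels inside clinically relevant structures more heavily than background. The sketch below follows that reading; `foreground_weighted_mse` and the `fg_weight` hyperparameter are hypothetical, and the intensities are invented.

```python
def foreground_weighted_mse(warped, fixed, foreground_mask, fg_weight=4.0):
    """Weighted mean squared error between a warped and a fixed image.

    Voxels where foreground_mask is 1 get weight fg_weight; background
    voxels get weight 1. Inputs are flat lists of equal length.
    """
    if not (len(warped) == len(fixed) == len(foreground_mask)):
        raise ValueError("inputs must have the same number of voxels")
    weighted_sum = 0.0
    weight_total = 0.0
    for w, f, m in zip(warped, fixed, foreground_mask):
        weight = fg_weight if m else 1.0
        weighted_sum += weight * (w - f) ** 2
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical intensities: voxel 1 is a foreground mismatch,
# voxel 3 is a larger background mismatch.
warped = [0.0, 2.0, 1.0, 3.0]
fixed = [0.0, 1.0, 1.0, 1.0]
mask = [0, 1, 1, 0]

plain = foreground_weighted_mse(warped, fixed, mask, fg_weight=1.0)
weighted = foreground_weighted_mse(warped, fixed, mask, fg_weight=4.0)
print(plain, weighted)
```

In this toy case the foreground mismatch accounts for 20% of the unweighted error sum but 50% of the weighted one, so gradient-based training under such a loss would pay more attention to the masked structures.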

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a multimodal coarse-to-fine deformable image registration (DIR) framework for longitudinal CT scans in proton therapy. It uses dual CNN encoders and a transformer decoder, incorporating clinical priors (target/OAR contours, dose distributions, treatment planning text) via anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization. The central claim is that this yields fast, robust, and clinically meaningful registration with consistent improvements over state-of-the-art methods on an internal dataset of 1,222 paired planning/repeat CT scans spanning multiple anatomical sites and disease types.

Significance. If the performance gains are shown to be robust and generalizable, the work could meaningfully advance online adaptive proton therapy by making DIR both faster and more clinically informed than current image-only approaches. The explicit use of workflow-derived priors (contours, dose, text) addresses a practical gap in existing DL registration methods.

major comments (3)
  1. [§4] §4 (Experiments): The manuscript reports results on the 1,222-pair internal dataset but provides no details on the train/validation/test partitioning, cross-validation strategy, or statistical significance testing of the claimed improvements over baselines. This information is load-bearing for assessing whether the 'consistent improvements' reflect genuine generalization rather than dataset-specific effects.
  2. [§4.3] §4.3 (Ablations): No ablation experiments isolate the individual contributions of anatomy- and risk-guided attention, text-conditioned feature modulation, or foreground-aware optimization. Without these controls, it is impossible to determine whether the multimodal clinical priors drive the reported gains or whether simpler image-only variants would perform comparably.
  3. [§4.2] §4.2 (Results): The evaluation is confined to a single internal cohort with no external validation on data from another proton center and no public release of the dataset or code. This directly limits support for the claim of applicability to 'diverse clinical scenarios' and 'unseen anatomical regions and disease types'.
minor comments (2)
  1. [Abstract] Abstract: The summary states 'consistent improvements' and 'clinically meaningful registration' without any quantitative metrics, error bars, or named baselines, reducing the ability of readers to immediately gauge the practical magnitude of the advance.
  2. [§3] §3 (Method): The precise mechanism by which treatment planning text is tokenized and injected into the feature modulation module could be expanded with a short equation or diagram for reproducibility.

Simulated Authors' Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions that will be incorporated into the next version of the manuscript.

point-by-point responses
  1. Referee: [§4] §4 (Experiments): The manuscript reports results on the 1,222-pair internal dataset but provides no details on the train/validation/test partitioning, cross-validation strategy, or statistical significance testing of the claimed improvements over baselines. This information is load-bearing for assessing whether the 'consistent improvements' reflect genuine generalization rather than dataset-specific effects.

    Authors: We agree that these methodological details are necessary for a rigorous evaluation. In the revised manuscript we will add a dedicated subsection in §4.1 describing the patient-level partitioning (70/15/15 train/validation/test split with no patient overlap), the 5-fold cross-validation protocol used to ensure robustness, and the statistical analysis (paired Wilcoxon signed-rank tests with exact p-values reported for all primary metrics against each baseline). These additions will be supported by tables in the main text and supplementary material. revision: yes

  2. Referee: [§4.3] §4.3 (Ablations): No ablation experiments isolate the individual contributions of anatomy- and risk-guided attention, text-conditioned feature modulation, or foreground-aware optimization. Without these controls, it is impossible to determine whether the multimodal clinical priors drive the reported gains or whether simpler image-only variants would perform comparably.

    Authors: We acknowledge the value of isolating each component. While the original submission contains comparative experiments, the revised §4.3 will include a full set of controlled ablations in which each module (anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization) is disabled individually. Performance changes will be reported on the same metrics (Dice, TRE, and clinical relevance scores) to quantify the incremental benefit of each clinical prior. revision: yes

  3. Referee: [§4.2] §4.2 (Results): The evaluation is confined to a single internal cohort with no external validation on data from another proton center and no public release of the dataset or code. This directly limits support for the claim of applicability to 'diverse clinical scenarios' and 'unseen anatomical regions and disease types'.

    Authors: We recognize this as a genuine limitation of the current study. The dataset originates from a single proton therapy center and contains protected health information, which precludes public release under applicable privacy regulations. External validation would require multi-center data access that is not available for this work. In the revision we will add an explicit limitations paragraph in the discussion that qualifies the generalizability claims, highlights the internal diversity across anatomical sites and disease types within the 1,222 pairs, and outlines the need for future multi-institutional validation. revision: partial
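
The patient-level partitioning promised in the first response can be sketched as follows. This is a minimal illustration of a 70/15/15 split with no patient overlap, not the authors' actual protocol: `patient_level_split`, the ID format, and the cohort are hypothetical, and the 5-fold cross-validation and significance testing mentioned in the response are not covered.

```python
import random


def patient_level_split(pair_to_patient, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split scan pairs into train/val/test so no patient spans two subsets.

    pair_to_patient maps a scan-pair ID to its patient ID. Patients, not
    pairs, are shuffled and partitioned, so the pair-level proportions only
    approximate the requested fractions.
    """
    patients = sorted(set(pair_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {
        name: sorted(p for p, pat in pair_to_patient.items() if pat in members)
        for name, members in groups.items()
    }


# Hypothetical cohort: 20 patients, each with 1 to 3 planning/repeat pairs.
pair_to_patient = {
    f"pair_{patient:02d}_{k}": f"patient_{patient:02d}"
    for patient in range(20)
    for k in range(1 + patient % 3)
}
splits = patient_level_split(pair_to_patient)
print({name: len(pairs) for name, pairs in splits.items()})
```

Splitting by patient rather than by scan pair matters because several repeat CTs of the same patient share anatomy; letting them fall on both sides of the split would inflate test performance.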

standing simulated objections not resolved
  • External validation on data from another proton center
  • Public release of the dataset or code, which the authors say is precluded by patient privacy regulations
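
For readers unfamiliar with the metrics named in the second response, Dice and target registration error (TRE) can be computed as in the sketch below. The masks and landmark coordinates are invented, the function names are hypothetical, and the "clinical relevance scores" mentioned alongside them are not defined in this summary, so they are left out.

```python
import math


def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    if size_sum == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


def target_registration_error(warped_landmarks, fixed_landmarks):
    """Mean Euclidean distance between corresponding landmarks (e.g. in mm)."""
    distances = [math.dist(p, q)
                 for p, q in zip(warped_landmarks, fixed_landmarks)]
    return sum(distances) / len(distances)


# Hypothetical data: a warped contour vs. the repeat-CT contour, two landmarks.
warped_mask = [0, 1, 1, 1, 0, 0]
fixed_mask = [0, 0, 1, 1, 1, 0]
warped_pts = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
fixed_pts = [(3.0, 4.0, 0.0), (10.0, 0.0, 1.0)]

dice_score = dice(warped_mask, fixed_mask)
tre = target_registration_error(warped_pts, fixed_pts)
print(dice_score, tre)
```

Dice rewards overlap of whole structures, while TRE measures point-wise misalignment, so the two can disagree; reporting both per structure is what makes an ablation table interpretable.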

Circularity Check

0 steps flagged

No circularity in empirical evaluation of multimodal registration framework

full rationale

The paper describes a coarse-to-fine multimodal DIR model incorporating contours, dose, and text via attention and modulation mechanisms, then reports empirical improvements over SOTA on an internal 1,222-pair dataset. No derivation chain, equations, or predictions are presented that reduce to fitted inputs or self-citations by construction. Performance claims rest on held-out experimental comparisons rather than self-definitional or ansatz-smuggled steps. This is the standard non-circular pattern for applied ML registration papers.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that multimodal clinical data is available and that learned attention mechanisms can effectively incorporate it for better registration. No new physical entities are postulated.

free parameters (2)
  • neural network weights
    All model parameters are fitted to the training data during optimization.
  • attention modulation parameters
    Parameters controlling text-conditioned and anatomy-guided attention are learned from data.
axioms (2)
  • domain assumption: Paired CT scans with clinical annotations are available for supervised training.
    Invoked in the evaluation on the 1,222-pair dataset.
  • standard math: Deformation fields can be progressively refined via a transformer decoder.
    Core architectural assumption in the coarse-to-fine design.

pith-pipeline@v0.9.0 · 5584 in / 1426 out tokens · 38106 ms · 2026-05-10T13:42:39.285339+00:00 · methodology

