pith. machine review for the scientific record.

arxiv: 2605.09622 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:47 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D diffusion models · dose prediction · radiotherapy planning · knowledge transfer · reinforcement learning · medical imaging · Any2Any conditioning · clinical scorecard

The pith

Transferring priors from video diffusion models via modality-aware conditioning and reinforcement learning improves 3D dose prediction accuracy for radiotherapy planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Radiotherapy dose prediction must balance tumor coverage against damage to nearby healthy organs, yet models trained from scratch on limited clinical data often fail to generalize. The paper proposes DiffKT3D, a single 3D diffusion framework that imports statistical knowledge from large pretrained video diffusion models and adapts it to medical volumes. Flexible conditioning on CT images, anatomical structures, body outlines, and beam settings occurs through modality-specific embeddings that avoid expensive cross-attention. A final reinforcement-learning stage then tunes the generator to match an institution-specific clinical Scorecard. If the transfer works, the method would deliver lower voxel errors and higher clinical preference scores than bespoke or challenge-winning baselines across varied treatment sites.

Core claim

DiffKT3D is a unified Any2Any 3D diffusion model that transfers generative priors from video diffusion models through modality-specific embeddings for conditioning on CT, structures, body, and beam data, followed by reinforcement learning post-training aligned to a clinical Scorecard; this combination reduces voxel-level mean absolute error from 2.07 to 1.93 while producing dose maps that better match institutional preferences and visual quality standards.
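The headline number in this claim is voxel-level mean absolute error (MAE) over the predicted dose volume. As a minimal sketch of what that metric measures, the snippet below computes MAE over a tiny illustrative dose grid; the values are invented stand-ins, not the paper's data.

```python
# Voxel-level mean absolute error (MAE), the metric behind the 2.07 -> 1.93 claim.
# The dose grids here are tiny illustrative stand-ins, not the paper's data.

def voxel_mae(pred, ref):
    """Mean absolute difference between predicted and reference dose, voxel by voxel."""
    flat_pred = [v for slab in pred for row in slab for v in row]
    flat_ref = [v for slab in ref for row in slab for v in row]
    assert len(flat_pred) == len(flat_ref), "dose grids must have identical shape"
    return sum(abs(p - r) for p, r in zip(flat_pred, flat_ref)) / len(flat_ref)

# Hypothetical 2x2x2 "volumes" in Gy, purely for illustration.
ref = [[[60.0, 58.0], [55.0, 50.0]], [[40.0, 35.0], [30.0, 20.0]]]
pred = [[[61.0, 57.0], [54.0, 52.0]], [[41.0, 33.0], [31.0, 22.0]]]
print(voxel_mae(pred, ref))  # average |error| in Gy over all 8 voxels
```

A 0.14 Gy drop in this average, taken over millions of voxels per plan, is the paper's quantitative claim; whether it is statistically meaningful is a separate question the referee raises below.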

What carries the argument

The DiffKT3D framework, which transfers knowledge from pretrained video diffusion models through modality-aware embeddings for flexible Any2Any conditioning and applies reinforcement learning guided by a clinically informed Scorecard to refine outputs.

If this is right

  • Voxel-wise dose prediction error drops from 2.07 to 1.93 mean absolute error on the evaluated benchmark.
  • Generated dose maps show higher image quality and closer alignment with institutional treatment preferences.
  • The same model handles conditioning on any combination of CT, anatomical structures, body outlines, and beam settings without cross-attention overhead.
  • Clinically aligned RL post-training produces outputs that generalize across diverse radiotherapy scenarios rather than overfitting to one site.
  • The approach supplies a single trainable pipeline that replaces multiple task-specific models for different clinical modalities.
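The third bullet, conditioning on any subset of modalities without cross-attention, can be sketched with a pooled modality-embedding scheme: each available input contributes a learned vector, and the pooled result is injected as a per-channel scale-and-shift (FiLM-style). This is a hedged guess at the mechanism; all names, shapes, and values below are illustrative, not the paper's implementation.

```python
# Hedged sketch of Any2Any conditioning via modality-specific embeddings.
# Each available modality contributes a learned embedding; embeddings are
# pooled and injected as a per-channel scale-and-shift (FiLM-style), so no
# cross-attention over the conditioning inputs is needed. All names and
# numbers are illustrative assumptions, not the paper's implementation.

DIM = 4  # latent channel count, illustrative

# One learned embedding per modality; a missing modality simply contributes
# nothing, which is what makes the conditioning "any-to-any".
MODALITY_EMBED = {
    "ct":         [0.10, 0.00, 0.20, 0.00],
    "structures": [0.00, 0.30, 0.00, 0.10],
    "body":       [0.05, 0.00, 0.00, 0.00],
    "beam":       [0.00, 0.10, 0.10, 0.20],
}

def condition(latent, available):
    """Pool embeddings of the available modalities and apply scale-and-shift."""
    pooled = [0.0] * DIM
    for m in available:
        for i, v in enumerate(MODALITY_EMBED[m]):
            pooled[i] += v
    # FiLM-like injection: scale = 1 + pooled, shift = pooled.
    return [(1.0 + pooled[i]) * latent[i] + pooled[i] for i in range(DIM)]

z = [1.0, 1.0, 1.0, 1.0]
print(condition(z, ["ct", "beam"]))                          # two modalities
print(condition(z, ["ct", "structures", "body", "beam"]))    # all four
```

The cost of this injection is linear in the number of present modalities and independent of the spatial resolution of the latent, which is the plausible source of the "without cross-attention overhead" claim.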

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hospitals could retrain only the RL stage on their own Scorecard to adapt the model without repeating the full diffusion pretraining.
  • The same transfer pattern might extend to other 3D medical generation tasks such as synthetic CT or organ segmentation where large natural-data priors exist.
  • Integration into commercial treatment planning software could shorten planning time by providing higher-quality starting dose maps for physician review.
  • If the domain gap proves larger than expected, targeted medical pretraining on CT volumes before the RL stage could be tested as a lightweight fix.

Load-bearing premise

The assumption that statistical patterns learned from natural video scenes can transfer to three-dimensional medical dose distributions without large domain gaps that would require heavy retraining.

What would settle it

Running the model on a new multi-center test set where the voxel MAE remains above 2.0 or where blinded clinical reviewers consistently prefer the GDP-HMM winner over DiffKT3D outputs.

Figures

Figures reproduced from arXiv: 2605.09622 by Ali Kamen, Dorin Comaniciu, Florin-Cristian Ghesu, Han Liu, Martin Kraus, Riqiang Gao, Simon Arberet, Yuhan Wang, Yuyin Zhou, Zihan Li.

Figure 1. Illustration of the proposed DiffKT3D. We first transfer …
Figure 2. Training mechanism for DiffKT3D. The multi-modal data first pass through the VAE encoder to obtain latent features. With the …
Figure 3. MAE vs. training epochs / inference steps. The single …
Figure 4. Per-structure scorecard value comparison of head-and- …
Figure 5. Qualitative predictions on GDP-HMM and REQUITE …
Figure 6. Architecture of the proposed VAE–DiT-based conditional diffusion model DiffKT3D. Left: multi-branch VAE–DiT pipeline for CT …
Figure 7. Qualitative comparison on representative head-and-neck, lung, and prostate cases. For each case we show CT with delineated …
read the original abstract

Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed Scorecard explicitly tailored to institutional treatment preferences. Compared with winner of GDP-HMM challenge, DiffKT3D sets a new state-of-the-art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes DiffKT3D, a unified Any2Any 3D diffusion framework for voxel-wise dose prediction in radiotherapy planning. It transfers priors from pretrained video diffusion models via modality-specific embeddings for flexible conditioning on inputs such as CT, anatomical structures, body, and beam settings, without cross-attention overhead. A reinforcement learning post-training step is guided by a clinically-informed Scorecard tailored to institutional preferences. The central empirical claim is that this yields a new state-of-the-art, reducing voxel-level MAE from 2.07 to 1.93 versus the GDP-HMM challenge winner, while also improving image quality and preference match.

Significance. If the claims hold after addressing the gaps below, the work would demonstrate a viable path for leveraging billion-scale vision priors in constrained 3D medical tasks, potentially improving generalization across clinical scenarios where bespoke models fail. The Any2Any conditioning and RL alignment with clinical scorecards are conceptually interesting extensions. However, the current manuscript provides no evidence isolating the contribution of the video priors, no error bars or statistical tests on the MAE gain, and insufficient experimental details, so the significance cannot yet be assessed.

major comments (3)
  1. [Abstract] Abstract: The claim that the 2.07→1.93 MAE reduction results from transferring diffusion priors via the Any2Any paradigm is unsupported. No ablation is reported that compares the full DiffKT3D against an otherwise identical 3D diffusion architecture trained from scratch (or randomly initialized) on the same radiotherapy data with the same conditioning scheme. Without this, the gain could equally be attributed to the new conditioning or the RL step alone.
  2. [Abstract] Abstract and Methods (RL post-training): The RL post-training uses a Scorecard explicitly tailored to institutional treatment preferences. This introduces a circularity risk: if the scorecard weights or metrics were selected or tuned with knowledge of the model's outputs, the reported 'preference match' and any associated MAE improvement become partly tautological rather than an independent clinical validation.
  3. [Abstract] Abstract: The SOTA claim is presented without error bars, cross-validation details, data-split descriptions, or baseline re-implementation specifics. Post-hoc comparison to a single challenge winner risks selection bias and prevents assessment of whether the 0.14 MAE drop is statistically meaningful or reproducible.
minor comments (1)
  1. [Abstract] Abstract: The statements 'superior image quality and preference match' are not accompanied by the specific quantitative metrics (e.g., SSIM, PSNR, or clinical scoring protocol) used to establish superiority.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are needed to strengthen the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the 2.07→1.93 MAE reduction results from transferring diffusion priors via the Any2Any paradigm is unsupported. No ablation is reported that compares the full DiffKT3D against an otherwise identical 3D diffusion architecture trained from scratch (or randomly initialized) on the same radiotherapy data with the same conditioning scheme. Without this, the gain could equally be attributed to the new conditioning or the RL step alone.

    Authors: We agree that an ablation isolating the contribution of the transferred video priors is required to substantiate the central claim. The current manuscript does not contain this comparison. In the revised version we will add an experiment training an otherwise identical 3D diffusion model from random initialization on the same radiotherapy data using the identical Any2Any conditioning scheme and RL post-training, and report the resulting MAE to quantify the benefit attributable to the video priors. revision: yes

  2. Referee: [Abstract] Abstract and Methods (RL post-training): The RL post-training uses a Scorecard explicitly tailored to institutional treatment preferences. This introduces a circularity risk: if the scorecard weights or metrics were selected or tuned with knowledge of the model's outputs, the reported 'preference match' and any associated MAE improvement become partly tautological rather than an independent clinical validation.

    Authors: The Scorecard is derived from established institutional clinical guidelines and standard radiotherapy metrics that were defined prior to model development. To mitigate the circularity concern we will expand the Methods section with a precise description of the metrics, fixed weights, and the independent process used to construct the Scorecard. This will demonstrate that the RL objective targets pre-specified clinical criteria rather than being tuned to the reported model outputs. revision: partial

  3. Referee: [Abstract] Abstract: The SOTA claim is presented without error bars, cross-validation details, data-split descriptions, or baseline re-implementation specifics. Post-hoc comparison to a single challenge winner risks selection bias and prevents assessment of whether the 0.14 MAE drop is statistically meaningful or reproducible.

    Authors: We acknowledge that the SOTA claim requires stronger statistical support and transparency. In the revision we will report error bars from repeated runs or cross-validation, provide full details of the data splits, include statistical tests (e.g., paired t-test) for the MAE difference, and clarify the exact procedure used for the GDP-HMM baseline comparison, including whether reported values or re-implementations were employed. revision: yes
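The paired test the authors promise here is straightforward to state: compute per-case MAE for both models on the same cases and test whether the mean of the paired differences is distinguishable from zero. A minimal stdlib-only sketch, with invented per-case numbers:

```python
# Hedged sketch of the paired test the rebuttal promises: per-case MAE from
# two models on the same cases, compared with a paired t statistic.
# The per-case values are invented for illustration; stdlib only.

import math
import statistics

def paired_t(a, b):
    """Paired t statistic for per-case metrics a and b (same cases, same order)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = statistics.fmean(d)
    sd_d = statistics.stdev(d)  # sample standard deviation of the differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1  # (t value, degrees of freedom)

# Invented per-case MAE values (Gy) for a baseline and a challenger model.
baseline   = [2.10, 2.05, 2.00, 2.15, 2.08, 2.03]
challenger = [1.95, 1.90, 1.98, 1.99, 1.92, 1.88]
t, dof = paired_t(baseline, challenger)
print(f"t = {t:.2f} with {dof} degrees of freedom")
```

Pairing matters here because per-case difficulty varies widely across treatment sites; a paired test removes that between-case variance, which an unpaired comparison of the two MAE averages would not.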

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical comparisons without self-referential reduction

full rationale

The paper describes DiffKT3D as combining pretrained video diffusion priors, Any2Any modality-specific embeddings, and RL post-training guided by an institutional Scorecard. Performance is reported via direct comparison to the GDP-HMM challenge winner (MAE 2.07 to 1.93) plus qualitative image quality and preference match. No equations, self-citations, or ansatzes are present in the provided text that would make any claimed prediction equivalent to its inputs by construction. The RL Scorecard is described as clinically-informed and tailored to preferences rather than fitted to the model's outputs; without a quoted reduction showing the evaluation metric is identical to the training reward in a tautological way, the strict criteria for circularity are not met. The derivation chain is therefore self-contained as an empirical engineering contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review; main unverified assumptions center on effective knowledge transfer from video models and clinical validity of the RL scorecard. No explicit free parameters or invented entities are named.

free parameters (1)
  • Scorecard weights or metrics
    The clinically-informed Scorecard is tailored to institutional preferences and likely involves parameters chosen or optimized to guide RL.
axioms (1)
  • domain assumption Pretrained video diffusion models contain priors that transfer usefully to 3D medical dose prediction tasks.
    The entire knowledge-transfer framework rests on this untested assumption about domain shift between natural video and radiotherapy data.
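The Scorecard free parameter flagged above reduces, in the simplest reading, to a weighted sum of pre-specified clinical metrics used as the RL reward. A minimal sketch, with invented metric names, weights, and scores; the paper's actual Scorecard is institution-specific and not described in the text reviewed here.

```python
# Hedged sketch of a clinical Scorecard as an RL reward: a fixed weighted sum
# of pre-specified dose metrics. Metric names, weights, and scores are
# invented for illustration.

# Fixed, pre-registered weights -- defining these *before* seeing model
# outputs is what defuses the referee's circularity objection.
WEIGHTS = {"target_coverage": 0.5, "cord_sparing": 0.3, "parotid_sparing": 0.2}

def scorecard(metrics):
    """Scalar reward in [0, 1] from per-metric scores in [0, 1]."""
    assert set(metrics) == set(WEIGHTS), "every pre-specified metric must be reported"
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

# A plan that covers the target well but spares the cord only moderately.
plan = {"target_coverage": 0.95, "cord_sparing": 0.70, "parotid_sparing": 0.80}
print(scorecard(plan))
```

In RL post-training, this scalar would play the role of the reward that the diffusion policy is fine-tuned against; the circularity audit's verdict hinges on the weights being frozen before training rather than tuned to the model's outputs.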

pith-pipeline@v0.9.0 · 5554 in / 1506 out tokens · 55709 ms · 2026-05-12T04:47:31.074898+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 9 internal anchors

  1. [1]

    https://www.aapm.org/GrandChallenge/ GDP-HMM/, 2025

    Generalizable dose prediction for heterogeneous multi-cohort and multi-site radiotherapy planning (gdp-hmm) grand chal- lenge. https://www.aapm.org/GrandChallenge/ GDP-HMM/, 2025. Accessed: 2025-10-24. 1, 6

  2. [2]

    Kazerouni, I

    Bobby Azad, Reza Azad, Sania Eskandari, Afshin Bozorg- pour, Amirhossein Kazerouni, Islem Rekik, and Dorit Merhof. Foundational models in medical imaging: A comprehensive survey and future vision.arXiv preprint arXiv:2310.18689,

  3. [3]

    McNiven, Adam Diamant, and Timothy C

    Aaron Babier, Rafid Mahmood, Andrea L. McNiven, Adam Diamant, and Timothy C. Y . Chan. Knowledge-based auto- mated planning with three-dimensional generative adversarial networks.Medical Physics, 47(2):297–306, 2020. 2

  4. [4]

    Moore, Thomas G

    Aaron Babier, Binghao Zhang, Rafid Mahmood, Kevin L. Moore, Thomas G. Purdie, Andrea L. McNiven, and Timothy C. Y . Chan. Openkbp: The open-access knowledge-based planning grand challenge.arXiv preprint arXiv:2011.14076,

  5. [5]

    Openkbp-opt: An international and open- source framework for plan optimization in knowledge-based planning.arXiv preprint arXiv:2202.08303, 2022

    Aaron Babier et al. Openkbp-opt: An international and open- source framework for plan optimization in knowledge-based planning.arXiv preprint arXiv:2202.08303, 2022. 1

  6. [6]

    One transformer fits all distributions in multi- modal diffusion at scale.arXiv preprint arXiv:2303.06555,

    Fan Bao et al. One transformer fits all distributions in multi- modal diffusion at scale.arXiv preprint arXiv:2303.06555,

  7. [7]

    Multidiffusion: Fusing diffusion paths for controlled image generation,

    Omer Bar-Tal et al. Multidiffusion: Fusing diffusion paths for controlled image generation.arXiv preprint arXiv:2302.08113, 2023. 2

  8. [8]

    Three-dimensional dose predic- tion for lung imrt patients with deep neural networks: robust learning from heterogeneous beam configurations.Medical Physics, 2019

    Ana Mar ´ıa Barrag´an-Montero, Dan Nguyen, Weiguo Lu, Mu Han Lin, Roya Norouzi-Kandalan, Xavier Geets, Edmond Sterpin, and Steve Jiang. Three-dimensional dose predic- tion for lung imrt patients with deep neural networks: robust learning from heterogeneous beam configurations.Medical Physics, 2019. 2

  9. [9]

    Bentzen, Louis S

    Søren M. Bentzen, Louis S. Constine, Joseph O. Deasy, Avra- ham Eisbruch, Andrew Jackson, Lawrence B. Marks, Ran- dall K. Ten Haken, and Ellen D. Yorke. Quantitative analyses of normal tissue effects in the clinic (QUANTEC): an in- troduction to the scientific issues.International Journal of Radiation Oncology Biology Physics, 2010. 2

  10. [10]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023. 2

  11. [11]

    Diakogiannis, Franc ¸ois Waldner, Peter Caccetta, and Chen Wu

    Foivos I. Diakogiannis, Franc ¸ois Waldner, Peter Caccetta, and Chen Wu. Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data.ISPRS Journal of Photogrammetry and Remote Sensing, 162:94–114, 2020. 2

  12. [12]

    Ezzell, Jay W

    Gary A. Ezzell, Jay W. Burmeister, Nesrin Dogan, Thomas J. LoSasso, John G. Mechalakos, Dimitris Mihailidis, Aimee Molineu, Jatinder R. Palta, Chester R. Ramsey, Brian J. Salter, Jianguo Shi, Ping Xia, Cedric X. Yu, and Ying Xiao. IMRT commissioning: multiple institution planning and dosimetry comparisons, a report from AAPM task group 119.Medical Physics...

  13. [13]

    Diffdp: Radiotherapy dose prediction via a diffusion model.arXiv preprint arXiv:2307.09794, 2023

    Zhenghao Feng et al. Diffdp: Radiotherapy dose prediction via a diffusion model.arXiv preprint arXiv:2307.09794, 2023. 1, 2, 19

  14. [14]

    Flexible-cm gan: Towards precise 3d dose predic- tion in radiotherapy

    Riqiang Gao, Bin Lou, Zhoubing Xu, Dorin Comaniciu, and Ali Kamen. Flexible-cm gan: Towards precise 3d dose predic- tion in radiotherapy. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 715–725, 2023. 1, 2, 6, 8

  15. [15]

    Multi-agent reinforcement learning meets leaf sequencing in radiotherapy.arXiv preprint arXiv:2406.01853,

    Riqiang Gao, Florin-Cristian Ghesu, Simon Arberet, Shahab Basiri, Esa Kuusela, Martin Kraus, Dorin Comaniciu, and Ali Kamen. Multi-agent reinforcement learning meets leaf sequencing in radiotherapy.arXiv preprint arXiv:2406.01853,

  16. [16]

    Automating rt planning at scale: High quality data for ai training.arXiv preprint arXiv:2501.11803, 2025

    Riqiang Gao, Mamadou Diallo, Han Liu, Anthony Magliari, Jonathan Sackett, Wilko Verbakel, Sandra Meyers, Rafe Mc- beth, Masoud Zarepisheh, Simon Arberet, et al. Automating rt planning at scale: High quality data for ai training.arXiv preprint arXiv:2501.11803, 2025. 1, 2, 6, 8, 15, 17

  17. [17]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. InAdvances in Neural Information Processing Systems, pages 2672–2680,

  18. [18]

    Deep learning–based dose prediction for automated, individualized quality assurance of head and neck radiation therapy plans

    Mary P Gronberg, Beth M Beadle, Adam S Garden, Heath Skinner, Skylar Gay, Tucker Netherton, Wenhua Cao, Car- los E Cardenas, Christine Chung, David T Fuentes, et al. Deep learning–based dose prediction for automated, individualized quality assurance of head and neck radiation therapy plans. Practical radiation oncology, 13(3):e282–e291, 2023. 1

  19. [19]

    Text2ct: Towards 3d ct volume generation from free-text descriptions using diffusion model.arXiv preprint arXiv:2505.04522,

    Pengfei Guo, Can Zhao, Dong Yang, Yufan He, Vishwesh Nath, Ziyue Xu, Pedro RAS Bassi, Zongwei Zhou, Ben- jamin D Simon, Stephanie Anne Harmon, et al. Text2ct: Towards 3d ct volume generation from free-text descriptions using diffusion model.arXiv preprint arXiv:2505.04522,

  20. [20]

    Maisi: Medical ai for synthetic imaging

    Pengfei Guo, Can Zhao, Dong Yang, Ziyue Xu, Vish- wesh Nath, Yucheng Tang, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, et al. Maisi: Medical ai for synthetic imaging. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4430–4441. IEEE, 2025. 15, 21

  21. [21]

    Denoising diffu- sion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InAdvances in Neural Information Processing Systems, pages 6840–6851, 2020. 2, 20 9

  22. [22]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations (ICLR),

  23. [23]

    Adapting visual-language mod- els for generalizable anomaly detection in medical images

    Chengwei Huang, Yicheng Zhang, Chen Chen, Meng Wang, Bing Li, and Xiangming He. Adapting visual-language mod- els for generalizable anomaly detection in medical images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3312–3322,

  24. [24]

    Jaeger, Simon A

    Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Pe- tersen, and Klaus H. Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmenta- tion.Nature Methods, 18(2):203–211, 2021. 15

  25. [25]

    Domain knowledge driven 3d dose prediction using moment-based loss function

    Gourav Jhanwar, Navdeep Dahiya, Parmida Ghahremani, Ma- soud Zarepisheh, and Saad Nadeem. Domain knowledge driven 3d dose prediction using moment-based loss function. Physics in Medicine & Biology, 67(18):185017, 2022. 2, 8

  26. [26]

    Repurpos- ing diffusion-based image generators for monocular depth estimation.arXiv preprint arXiv:2312.02145, 2023

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation.arXiv preprint arXiv:2312.02145, 2023. 2

  27. [27]

    Dosenet: a volumetric dose prediction algorithm using 3d fully-convolutional neural net- works.Physics in Medicine & Biology, 63(23):235022, 2018

    Vasant Kearney, Jason W Chan, Samuel Haaf, Martina De- scovich, and Timothy D Solberg. Dosenet: a volumetric dose prediction algorithm using 3d fully-convolutional neural net- works.Physics in Medicine & Biology, 63(23):235022, 2018. 2

  28. [28]

    Chan, Tianqi Wang, Alan Perry, Martina Descovich, Olivier Morin, Sue S

    Vasant Kearney, Jason W. Chan, Tianqi Wang, Alan Perry, Martina Descovich, Olivier Morin, Sue S. Yom, and Timo- thy D. Solberg. DoseGAN: a generative adversarial network for synthetic dose prediction using attention-gated discrimi- nation and generation.Scientific Reports, 2020. 1, 2

  29. [29]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding varia- tional bayes.arXiv preprint arXiv:1312.6114, 2014. 14

  30. [30]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain et al. Pick-a-pic: An open dataset of user preferences for text-to-image generation. InAdvances in Neural Information Processing Systems, 2023. 2, 3

  31. [31]

    A review of dose prediction methods for tumor radiation therapy

    Xiaoyan Kui, Fang Liu, Min Yang, Hao Wang, Canwei Liu, Dan Huang, Qinsong Li, Liming Chen, and Beiji Zou. A review of dose prediction methods for tumor radiation therapy. Meta-Radiology, 2(1):100057, 2024. 1

  32. [32]

    Om- niflow: Any-to-any generation with multi-modal rectified flows.arXiv preprint arXiv:2412.01169, 2024

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. Om- niflow: Any-to-any generation with multi-modal rectified flows.arXiv preprint arXiv:2412.01169, 2024. 2

  33. [33]

    H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes.IEEE Transactions on Medical Imaging, 2018

    Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng-Ann Heng. H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes.IEEE Transactions on Medical Imaging, 2018. 2

  34. [34]

    PMC-CLIP: Con- trastive language-image pre-training using biomedical docu- ments.arXiv preprint arXiv:2303.07240, 2023

    Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-CLIP: Con- trastive language-image pre-training using biomedical docu- ments.arXiv preprint arXiv:2303.07240, 2023. 1

  35. [35]

    Pixwizard: Versatile image-to-image visual assistant with open-language instructions.arXiv preprint arXiv:2409.15278, 2024

    Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open-language instructions.arXiv preprint arXiv:2409.15278, 2024. 2

  36. [36]

    Yaron Lipman, Ricky T. Q. Chen, and Heli Ben-Hamu. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 16

  37. [37]

    Technical note: A cascade 3d u-net for dose prediction in radiotherapy.Medical Physics, 48(11):7132–7141, 2021

    Shuolin Liu, Jingjing Zhang, Teng Li, Hui Yan, and Jianfei Liu. Technical note: A cascade 3d u-net for dose prediction in radiotherapy.Medical Physics, 48(11):7132–7141, 2021. 2

  38. [38]

    DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. InAdvances in Neural Information Processing Systems, 2022. 16

  39. [39]

    Anthony Magliari, Ryan Clark, Lesley Rosa, and Sushil Beri- wal. Hn-sib-bpi: A single click, sub-site specific, dosimetric scorecard tuned rapidplan model created from a foundation model for treating head and neck with bilateral neck.Medical Dosimetry, 50(1):63–69, 2025. 3

  40. [40]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou et al. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023. 2

  41. [41]

    3d radiotherapy dose predic- tion on head and neck cancer patients with a hierarchically densely connected u-net deep learning architecture.Physics in Medicine & Biology, 2019

    Dan Nguyen, Xun Jia, David Sher, Mu-Han Lin, Zohaib Iqbal, Hui Liu, and Steve Jiang. 3d radiotherapy dose predic- tion on head and neck cancer patients with a hierarchically densely connected u-net deep learning architecture.Physics in Medicine & Biology, 2019. 2

  42. [42]

    Dan Nguyen, Rafe McBeth, Azar Sadeghnejad Barkousaraie, Gyanendra Bohara, Chenyang Shen, Xun Jia, and Steve Jiang. Incorporating human and learned domain knowledge into training deep neural networks: A differentiable dose-volume histogram and adversarial inspired framework for generating pareto optimal dose distributions in radiation therapy.Medical Physi...

  43. [43]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Piotr Bojanowski, Gautier Izacard, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 8

  44. [44]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agar- wal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  45. [45]

    Scalable diffusion mod- els with transformers

    William Peebles and Saining Xie. Scalable diffusion mod- els with transformers. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 14, 15, 21

  46. [46]

    FiLM: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. InProceedings of the AAAI Conference on Artificial Intelligence, 2018. 14

  47. [47]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1 10

  48. [48]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290,

  49. [49]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 15

  50. [50]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer Assisted Inter- vention (MICCAI). Springer, 2015. 2

  51. [51]

    Jaeger, and Klaus H

    Saikat Roy, Gregor Koehler, Constantin Ulrich, Michael Baumgartner, Jens Petersen, Fabian Isensee, Paul F. Jaeger, and Klaus H. Maier-Hein. Mednext: Transformer-driven scal- ing of convnets for medical image segmentation. InMedical Image Computing and Computer Assisted Intervention (MIC- CAI), 2023. 2, 15

  52. [52]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations (ICLR), 2022. 20

  53. [53]

    REQUITE: a prospective multicentre cohort study of patients undergoing radiotherapy for breast, lung or prostate cancer

    Petra Seibold, Adam Webb, Miguel E. Aguado-Barrera, David Azria, Celine Bourgier, Muriel Brengues, Erik Briers, Renee Bultijnck, Patricia Calvo-Crespo, Ana Carballo, et al. REQUITE: a prospective multicentre cohort study of patients undergoing radiotherapy for breast, lung or prostate cancer. Radiotherapy and Oncology, 138:212–224, 2019. 2, 6, 15

  54. [54]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025. 8

  55. [55]

    DINO-Reg: General purpose image encoder for training-free multi-modal deformable medical image registration

    Xinrui Song, Xuanang Xu, and Pingkun Yan. DINO-Reg: General purpose image encoder for training-free multi-modal deformable medical image registration. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 608–617. Springer, 2024. 1

  56. [56]

    DeepDoseNet: A deep learning model for 3D dose prediction in radiation therapy

    Mumtaz Hussain Soomro, Victor Gabriel Leandro Alves, Hamidreza Nourzadeh, and Jeffrey V. Siebers. DeepDoseNet: A deep learning model for 3D dose prediction in radiation therapy. arXiv preprint arXiv:2111.00077, 2021. 2

  57. [57]

    Any-to-any generation via composable diffusion

    Zineng Tang et al. Any-to-any generation via composable diffusion. arXiv preprint arXiv:2305.11846, 2023. 2

  58. [58]

    Bilateral head&neck 70/63/56gy (HN-SIB-BPI) [RapidPlan]

    Varian Medical Affairs. Bilateral head&neck 70/63/56gy (HN-SIB-BPI) [RapidPlan]. https://medicalaffairs.varian.com/hn-sib-bpi-rapidplan-vmat2, Accessed: 2024-10-19. 3, 5, 6

  59. [59]

  60. [60]

    Lung – conventional 60gy (NRG LU-004 / Atkins KM 2021)

    Varian Medical Affairs. Lung – conventional 60gy (NRG LU-004 / Atkins KM 2021). https://medicalaffairs.varian.com/lung-conventional-vmat2, 2024. Accessed: 2024-10-19. 3, 5, 6

  61. [61]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. 14

  62. [62]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, 2024. 2

  63. [63]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 2, 14, 15, 21

  64. [64]

    Deep learning-based head and neck radiotherapy planning dose prediction via beam-wise dose decomposition

    Bin Wang, Lin Teng, Lanzhuju Mei, Zhiming Cui, Xuanang Xu, Qianjin Feng, and Dinggang Shen. Deep learning-based head and neck radiotherapy planning dose prediction via beam-wise dose decomposition. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 575–584. Springer, 2022. 2

  65. [65]

    Fluence map prediction using deep learning models – direct plan generation for pancreas stereotactic body radiation therapy

    Wentao Wang, Yang Sheng, Chunhao Wang, Jiahan Zhang, Xinyi Li, Manisha Palta, Brian Czito, Christopher G. Willett, Qiuwen Wu, Yaorong Ge, et al. Fluence map prediction using deep learning models – direct plan generation for pancreas stereotactic body radiation therapy. Frontiers in Artificial Intelligence, 3:68, 2020. 1

  66. [66]

    Medical SAM adapter: Adapting segment anything model for medical image segmentation

    Junde Wu, Wei Ji, Yuanpei Liu, Huazhu Fu, Min Xu, Yanwu Xu, and Yueming Jin. Medical SAM adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620, 2023. 1

  67. [67]

    Human preference score: Better aligning text-to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023. 3

  68. [68]

    OmniGen: Unified image generation

    Shitao Xiao, Yueze Wang, Junjie Zhou, Zhenbang Yang, Chao Shen, Wenrui Dai, Jiarui Gan, Yu Liu, Ke Shang, Zhifeng Chen, and Qingshan Liu. OmniGen: Unified image generation. arXiv preprint arXiv:2409.11340, 2024. 2

  69. [69]

    Learning and evaluating human preferences for text-to-image generation

    Jun Xu, Siyao Ren, Zeqiang Lin, Jiaming Zhu, Zhi Zhang, Yixiao Jiang, Wenwang Ye, Jianzhuang Wang, Tong Lu, Ji-Rong Gu, Xiaoyang Wang, and Shuai Yang. Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems, 2023. 2, 3

  70. [70]

    Versatile Diffusion: Text, images and variations all in one diffusion model

    Xingqian Xu et al. Versatile Diffusion: Text, images and variations all in one diffusion model. arXiv preprint arXiv:2211.08332, 2022. 2

  71. [71]

    Jodi: Unification of visual generation and understanding via joint modeling

    Yifeng Xu, Zhenliang He, Meina Kan, Shiguang Shan, and Xilin Chen. Jodi: Unification of visual generation and understanding via joint modeling. arXiv preprint arXiv:2505.19084, 2025.

  72. [72]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

  73. [73]

    Predicting voxel-level dose distributions for esophageal radiotherapy using densely connected network with dilated convolutions

    Jingjing Zhang, Shuolin Liu, Hui Yan, Teng Li, Ronghu Mao, and Jianfei Liu. Predicting voxel-level dose distributions for esophageal radiotherapy using densely connected network with dilated convolutions. Physics in Medicine & Biology, 65(20):205013, 2020. 2, 11

  74. [74]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023. 2, 19

  75. [75]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023. 1

  76. [76]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8018–8027, 2024.

  77. [77]

    DoseDiff: Distance-aware diffusion model for dose prediction in radiotherapy

    Yiwen Zhang et al. DoseDiff: Distance-aware diffusion model for dose prediction in radiotherapy. arXiv preprint arXiv:2306.16324, 2023. 1, 2

  78. [78]

    MAISI-v2: Accelerated 3D high-resolution medical image synthesis with rectified flow and region-specific contrastive loss

    Can Zhao, Pengfei Guo, Dong Yang, Yucheng Tang, Yufan He, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, and Daguang Xu. MAISI-v2: Accelerated 3D high-resolution medical image synthesis with rectified flow and region-specific contrastive loss. arXiv preprint arXiv:2508.05772, 2025. 2

  79. [79]

    DiffusionNFT: Online diffusion reinforcement with forward process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. DiffusionNFT: Online diffusion reinforcement with forward process. In International Conference on Learning Representations (ICLR), 2026. 2, 5, 16

Supplementary Contents

A. Detailed Model Structures 14
B. Training Details 15
...