MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation
Pith reviewed 2026-05-18 20:57 UTC · model grok-4.3
The pith
MedShift uses flow matching and Schrödinger bridges to translate between synthetic and real X-ray images from a single shared model trained on unpaired data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedShift is a unified class-conditional generative model based on flow matching and Schrödinger bridges that learns a shared domain-agnostic latent space and thereby enables high-fidelity unpaired translation between any pair of X-ray domains (synthetic or real) observed during training.
What carries the argument
The implicit conditional transport realized by flow matching combined with Schrödinger bridges, which performs the mapping between domains inside one class-conditional generative model.
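The mechanism can be made concrete with a class-conditional flow-matching loss on a Brownian-bridge (Schrödinger-bridge) path, in the style of the simulation-free SB-CFM objective from the TorchCFM line of work. This is a minimal NumPy sketch under stated assumptions, not MedShift's code. It assumes the bridge runs from a Gaussian prior sample `x0` to an image `x1` with domain label `y`, and it draws pairs independently, omitting the entropic-OT coupling that SB-CFM uses. The names `sb_path`, `cfm_loss`, `v_fn` and `sigma` are hypothetical.

```python
import numpy as np


def sb_path(x0, x1, t, sigma, rng):
    """Sample x_t on a Brownian bridge from x0 to x1 and return the
    conditional target velocity u_t(x_t | x0, x1).

    t must lie strictly inside (0, 1): the velocity term divides by t(1 - t).
    """
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over feature dims
    mu_t = (1.0 - t) * x0 + t * x1             # bridge mean
    sigma_t = sigma * np.sqrt(t * (1.0 - t))   # bridge standard deviation
    x_t = mu_t + sigma_t * rng.standard_normal(x0.shape)
    # u_t = (sigma_t' / sigma_t) * (x_t - mu_t) + mu_t', with mu_t' = x1 - x0
    u_t = (1.0 - 2.0 * t) / (2.0 * t * (1.0 - t)) * (x_t - mu_t) + (x1 - x0)
    return x_t, u_t


def cfm_loss(v_fn, x0, x1, y, sigma, rng, t_eps=1e-3):
    """Mean squared error between a class-conditional field v_fn(x, t, y)
    and the bridge target velocity, averaged over the batch.
    Assumption: y is the domain label of x1 (e.g. synthetic, real low-dose)."""
    t = rng.uniform(t_eps, 1.0 - t_eps, size=x0.shape[0])
    x_t, u_t = sb_path(x0, x1, t, sigma, rng)
    v = v_fn(x_t, t, y)
    return float(np.mean((v - u_t) ** 2))
```

Because the label `y` enters only through `v_fn`, every domain shares one vector field and one prior. This is the property the "shared domain-agnostic latent space" claim relies on.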
If this is right
- One trained model handles translation in either direction between every pair of domains seen at training time.
- The same model can be adjusted at inference to favor either perceptual quality or geometric fidelity.
- The approach achieves competitive results with a smaller parameter count than diffusion-based domain-adaptation methods.
- A new benchmark dataset of aligned synthetic and real skull X-rays at multiple radiation doses is provided for future comparisons.
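One plausible reading of "implicit conditional transport", by analogy with Dual Diffusion Implicit Bridges, is a two-pass procedure. The source image is integrated backward along the learned ODE under its source label, then forward under the target label. The sketch below illustrates that idea with plain Euler steps. Several parts are assumptions rather than the paper's documented interface: the time convention (prior at t = 0, images at t = 1), the function names, and especially `depth`, a hypothetical stand-in for the perceptual-versus-geometric knob mentioned above.

```python
import numpy as np


def euler_integrate(v_fn, x, y, t_start, t_end, n_steps):
    """Integrate dx/dt = v_fn(x, t, y) from t_start to t_end with Euler steps.
    A negative step size (t_end < t_start) integrates backward in time."""
    ts = np.linspace(t_start, t_end, n_steps + 1)
    for t_a, t_b in zip(ts[:-1], ts[1:]):
        x = x + (t_b - t_a) * v_fn(x, t_a, y)
    return x


def translate(v_fn, x_src, y_src, y_tgt, depth=1.0, n_steps=50):
    """Map x_src from domain y_src to domain y_tgt through a shared latent.

    depth in (0, 1] (hypothetical knob): how far toward the prior to invert.
    depth = 1 inverts fully to t = 0; smaller values stop earlier and are
    expected to preserve more of the source image's structure.
    """
    t_mid = 1.0 - depth
    z = euler_integrate(v_fn, x_src, y_src, 1.0, t_mid, n_steps)  # encode with source label
    return euler_integrate(v_fn, z, y_tgt, t_mid, 1.0, n_steps)   # decode with target label
```

Swapping `y_src` and `y_tgt` reverses the direction, so one field serves every ordered pair of training domains. Setting `y_tgt == y_src` gives an approximately identity round trip, up to discretization error. Whether MedShift exposes a depth-like parameter, or trades off fidelity through another mechanism, is not stated in this summary.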
Where Pith is reading between the lines
- The same transport mechanism could be tested on other medical modalities such as CT or MRI where synthetic data is also abundant.
- If the shared latent space proves stable across institutions, the method might reduce reliance on site-specific real-data collection for model training.
- The inference-time tuning knob offers a practical way to adapt outputs for different clinical priorities without retraining.
Load-bearing premise
The differences in attenuation, noise, and soft-tissue appearance between synthetic and real X-ray images can be captured and bridged by one class-conditional generative model trained only on unpaired examples.
What would settle it
A downstream segmentation or detection model trained on real clinical X-rays shows no accuracy gain when the training set is augmented with MedShift-translated synthetic images instead of raw synthetic images.
Original abstract
Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrödinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedShift, a unified class-conditional generative model based on Flow Matching and Schrödinger Bridges for unpaired image translation across synthetic and real X-ray domains of the head. It claims to learn a shared domain-agnostic latent space enabling seamless translation between any pair of domains. The work also presents the X-DigiSkull dataset and demonstrates that the model achieves strong performance with a smaller parameter count than diffusion-based methods, while offering inference-time flexibility to balance perceptual fidelity and structural consistency.
Significance. If validated, this approach could significantly aid in leveraging synthetic medical data for real-world applications by providing an efficient, flexible domain adaptation technique without requiring paired data. The technical integration of conditional flow matching with Schrödinger bridges represents a meaningful advancement, and the open-sourcing of code and dataset is commendable for promoting reproducibility in the field.
major comments (1)
- Section 5 (experimental results): the quantitative comparisons lack error bars or results from multiple random seeds; this makes it difficult to assess whether the reported improvements over diffusion baselines are statistically significant and undermines confidence in the 'strong performance' claim.
minor comments (3)
- Abstract: the spelling 'Schrodinger' should be corrected to 'Schrödinger'.
- Section 3: provide a more explicit description of how class-conditioning is implemented in the vector field and why it yields a domain-agnostic latent space.
- Figure captions: ensure all visualizations of translated images include clear indications of source/target domains and any quantitative metrics shown.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. We address the single major comment below in a point-by-point manner.
Referee: Section 5 (experimental results): the quantitative comparisons lack error bars or results from multiple random seeds; this makes it difficult to assess whether the reported improvements over diffusion baselines are statistically significant and undermines confidence in the 'strong performance' claim.
Authors: We agree that reporting results across multiple random seeds with error bars would provide a more rigorous evaluation of statistical significance and strengthen confidence in the performance claims. This is a valid and constructive observation. In the revised manuscript we will rerun all quantitative experiments in Section 5 using at least three independent random seeds, report mean values together with standard deviations, and include error bars on the relevant tables and figures. Revision: yes.
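The revision plan above (at least three seeds, mean ± standard deviation) reduces to a small aggregation step. The sketch below uses only the Python standard library. It reports mean ± sample standard deviation per method and Welch's t statistic for a difference in means. The function names are hypothetical and no scores from the paper are used. With only three seeds per method such a test has little power, so the statistic should be read cautiously.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic for mean(a) - mean(b), not assuming equal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


def format_row(name, scores):
    """Format one table row as 'name: mean ± std (n=seeds)'."""
    mean, std = summarize(scores)
    return f"{name}: {mean:.3f} ± {std:.3f} (n={len(scores)})"
```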
Circularity Check
No significant circularity detected
The paper's derivation chain is self-contained. MedShift trains a class-conditional vector field via Flow Matching to match Schrödinger Bridge marginals on unpaired multi-domain X-ray data, with class-conditioning used to encourage a shared latent space. These steps follow directly from the stated training objective and architecture without reducing to a fitted input renamed as prediction, a self-definitional loop, or a load-bearing self-citation. Performance claims are supported by explicit comparisons to diffusion baselines on the newly introduced X-DigiSkull dataset, and the smaller model size is tabulated independently. No equation or claim collapses to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Flow matching and Schrödinger bridges can learn a domain-agnostic latent space that captures shared anatomical structure across synthetic and real X-ray distributions.
Supplementary material (excerpt): The supplementary material is organized as follows: Appendix A describes the implementation details of MedShift. Appendix B contains empirical proof of the shared manifold assumption of Section 3. A. Implementation Details: The model was trained on a workstatio...