pith. sign in

arxiv: 2508.21435 · v3 · submitted 2025-08-29 · 💻 cs.CV · cs.AI

MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation

Pith reviewed 2026-05-18 20:57 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords domain adaptationX-ray imagingflow matchingSchrödinger bridgesunpaired image translationmedical imagingsynthetic datagenerative models
0
0 comments X

The pith

MedShift uses flow matching and Schrödinger bridges to translate between synthetic and real X-ray images from a single shared model trained on unpaired data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a single class-conditional generative model can close the gap between synthetic and real skull X-rays by learning one domain-agnostic latent space. This would matter because synthetic data can be generated at scale but differs from real scans in attenuation, noise, and soft-tissue contrast, limiting its direct use for training clinical models. MedShift supports translation between any pair of domains seen in training without needing new models or paired examples, and it allows tuning the output toward either visual quality or structural accuracy at inference time. The authors also release X-DigiSkull, a dataset of aligned synthetic and real X-rays at different doses, to test such translations.

Core claim

MedShift is a unified class-conditional generative model based on flow matching and Schrödinger bridges that learns a shared domain-agnostic latent space and thereby enables high-fidelity unpaired translation between any pair of X-ray domains (synthetic or real) observed during training.

What carries the argument

The implicit conditional transport realized by flow matching combined with Schrödinger bridges, which performs the mapping between domains inside one class-conditional generative model.

If this is right

  • One trained model handles translation in either direction between every pair of domains seen at training time.
  • The same model can be adjusted at inference to favor either perceptual quality or geometric fidelity.
  • The approach achieves competitive results with a smaller parameter count than diffusion-based domain-adaptation methods.
  • A new benchmark dataset of aligned synthetic and real skull X-rays at multiple radiation doses is provided for future comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same transport mechanism could be tested on other medical modalities such as CT or MRI where synthetic data is also abundant.
  • If the shared latent space proves stable across institutions, the method might reduce reliance on site-specific real-data collection for model training.
  • The inference-time tuning knob offers a practical way to adapt outputs for different clinical priorities without retraining.

Load-bearing premise

The differences in attenuation, noise, and soft-tissue appearance between synthetic and real X-ray images can be captured and bridged by one class-conditional generative model trained only on unpaired examples.

What would settle it

A downstream segmentation or detection model trained on real clinical X-rays shows no accuracy gain when the training set is augmented with MedShift-translated synthetic images instead of raw synthetic images.

Figures

Figures reproduced from arXiv: 2508.21435 by Christiaan Viviers, Fons van der Sommen, Francisco Caetano, Peter H.N. De With.

Figure 1
Figure 1. Figure 1: Overview of MedShift inference. A source image [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Dataset overview. The synthetic domain contains Low and High dosage samples generated using the Mentice VIST® simulator; the real domain includes Low, Normal, and Exposure dosage categories acquired from a skull phantom using the Philips Azurion IGT system. intermediate steps using closed-form conditional distribu￾tions, FM offers a scalable and efficient alternative to tradi￾tional diffusion-based methods… view at source ↗
Figure 3
Figure 3. Figure 3: Trade-off between structural fidelity (SSIM) and real [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: UMAP visualization of the latent-space features for dif [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
read the original abstract

Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at https://caetas.github.io/medshift.html

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces MedShift, a unified class-conditional generative model based on Flow Matching and Schrödinger Bridges for unpaired image translation across synthetic and real X-ray domains of the head. It claims to learn a shared domain-agnostic latent space enabling seamless translation between any pair of domains. The work also presents the X-DigiSkull dataset and demonstrates that the model achieves strong performance with a smaller parameter count than diffusion-based methods, while offering inference-time flexibility to balance perceptual fidelity and structural consistency.

Significance. If validated, this approach could significantly aid in leveraging synthetic medical data for real-world applications by providing an efficient, flexible domain adaptation technique without requiring paired data. The technical integration of conditional flow matching with Schrödinger bridges represents a meaningful advancement, and the open-sourcing of code and dataset is commendable for promoting reproducibility in the field.

major comments (1)
  1. Section 5 (experimental results): the quantitative comparisons lack error bars or results from multiple random seeds; this makes it difficult to assess whether the reported improvements over diffusion baselines are statistically significant and undermines confidence in the 'strong performance' claim.
minor comments (3)
  1. Abstract: the spelling 'Schrodinger' should be corrected to 'Schrödinger'.
  2. Section 3: provide more explicit description of how the class-conditioning is implemented in the vector field to guarantee a domain-agnostic latent space.
  3. Figure captions: ensure all visualizations of translated images include clear indications of source/target domains and any quantitative metrics shown.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. We address the single major comment below in a point-by-point manner.

read point-by-point responses
  1. Referee: Section 5 (experimental results): the quantitative comparisons lack error bars or results from multiple random seeds; this makes it difficult to assess whether the reported improvements over diffusion baselines are statistically significant and undermines confidence in the 'strong performance' claim.

    Authors: We agree that reporting results across multiple random seeds with error bars would provide a more rigorous evaluation of statistical significance and strengthen confidence in the performance claims. This is a valid and constructive observation. In the revised manuscript we will rerun all quantitative experiments in Section 5 using at least three independent random seeds, report mean values together with standard deviations, and include error bars on the relevant tables and figures. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation chain is self-contained. MedShift trains a class-conditional vector field via Flow Matching to match Schrödinger Bridge marginals on unpaired multi-domain X-ray data, with class-conditioning used to encourage a shared latent space. These steps follow directly from the stated training objective and architecture without reducing to a fitted input renamed as prediction, a self-definitional loop, or a load-bearing self-citation. Performance claims are supported by explicit comparisons to diffusion baselines on the newly introduced X-DigiSkull dataset, and the smaller model size is tabulated independently. No equation or claim collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; model likely relies on standard assumptions of flow matching and Schrödinger bridges plus hyperparameters for conditioning and transport cost, but none are enumerated here.

axioms (1)
  • domain assumption Flow matching and Schrödinger bridges can learn a domain-agnostic latent space that captures shared anatomical structure across synthetic and real X-ray distributions.
    This is the core modeling premise invoked to justify unpaired translation.

pith-pipeline@v0.9.0 · 5752 in / 1131 out tokens · 37652 ms · 2026-05-18T20:57:53.945911+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1]

    One-shot unsupervised do- main adaptation with personalized diffusion models

    Yasser Benigmim, Subhankar Roy, Slim Essid, Vicky Kalo- geiton, and St´ephane Lathuili`ere. One-shot unsupervised do- main adaptation with personalized diffusion models. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 698–708, 2023. 1

  2. [2]

    Likelihood training of schr \” odinger bridge using forward-backward sdes theory

    Tianrong Chen, Guan-Horng Liu, and Evangelos A Theodorou. Likelihood training of schr \” odinger bridge using forward-backward sdes theory. arXiv preprint arXiv:2110.11291, 2021. 3

  3. [3]

    Cartoongan: Generative adversarial networks for photo cartoonization

    Yang Chen, Yu-Kun Lai, and Yong-Jin Liu. Cartoongan: Generative adversarial networks for photo cartoonization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9465–9474, 2018. 2

  4. [4]

    Stargan: Unified genera- tive adversarial networks for multi-domain image-to-image translation

    Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified genera- tive adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797,

  5. [5]

    Z*: Zero-shot style transfer via attention reweighting

    Yingying Deng, Xiangyu He, Fan Tang, and Weiming Dong. Z*: Zero-shot style transfer via attention reweighting. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition , pages 6934–6944, 2024. 5, 6

  6. [6]

    Hierarchy flow for high-fidelity image-to-image translation

    Weichen Fan, Jinghuan Chen, and Ziwei Liu. Hierarchy flow for high-fidelity image-to-image translation. arXiv preprint arXiv:2308.06909, 2023. 4, 6

  7. [7]

    Im- age style transfer using convolutional neural networks

    Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Im- age style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016. 2

  8. [8]

    Alignflow: Cycle consistent learning from multiple domains via normalizing flows

    Aditya Grover, Christopher Chute, Rui Shu, Zhangjie Cao, and Stefano Ermon. Alignflow: Cycle consistent learning from multiple domains via normalizing flows. In Proceed- ings of the AAAI Conference on Artificial Intelligence, pages 4028–4035, 2020. 1

  9. [9]

    Accelerate: Training and inference at scale made simple, efficient and adaptable

    Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https: //github.com/huggingface/accelerate , 2022. 1

  10. [10]

    Dual contrastive learning for unsu- pervised image-to-image translation

    Junlin Han, Mehrdad Shoeiby, Lars Petersson, and Mo- hammad Ali Armin. Dual contrastive learning for unsu- pervised image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 746–755, 2021. 2

  11. [11]

    Neural style transfer: A review

    Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. Neural style transfer: A review. IEEE transactions on visualization and computer graphics , 26(11):3365–3385, 2019. 1

  12. [12]

    Diverse image-to-image translation via disentangled representations

    Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Diverse image-to-image translation via disentangled representations. In Proceed- ings of the European conference on computer vision (ECCV), pages 35–51, 2018. 2

  13. [13]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jian- wei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023. 2

  14. [14]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matt Le. Flow matching for generative mod- eling. arXiv preprint arXiv:2210.02747, 2022. 2

  15. [15]

    Flow Matching Guide and Code

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez- Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024. 2

  16. [16]

    Sdedit: Guided image synthesis and editing with stochastic differential equa- tions

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jia- jun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equa- tions. In International Conference on Learning Representa- tions, 2022. 5, 6

  17. [17]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI conference on artificial intelligence, pages 4296–4304, 2024. 2

  18. [18]

    Un- supervised medical image translation with adversarial diffu- sion models

    Muzaffer ¨Ozbey, Onat Dalmaz, Salman UH Dar, Hasan A Bedel, S ¸aban¨Ozturk, Alper G ¨ung¨or, and Tolga C ¸ ukur. Un- supervised medical image translation with adversarial diffu- sion models. IEEE Transactions on Medical Imaging , 42 (12):3524–3539, 2023. 1

  19. [19]

    One-step image translation with text-to-image models,

    Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024. 2, 4, 6

  20. [20]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 8

  21. [21]

    Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models,

    Hiroshi Sasaki, Chris G Willcocks, and Toby P Breckon. Unit-ddpm: Unpaired image translation with denois- ing diffusion probabilistic models. arXiv preprint arXiv:2104.05358, 2021. 2

  22. [22]

    Learning from simulated and unsupervised images through adversarial training

    Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2107–2116, 2017. 2

  23. [23]

    Improved techniques for training score-based generative models

    Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020. 2

  24. [24]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab- hishek Kumar, Stefano Ermon, and Ben Poole. Score-based 9 generative modeling through stochastic differential equa- tions. arXiv preprint arXiv:2011.13456, 2020. 2

  25. [25]

    Dual diffusion implicit bridges for image-to-image translation

    Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image transla- tion. arXiv preprint arXiv:2203.08382, 2022. 3

  26. [26]

    Texture Networks: Feed-forward Synthesis of Textures and Stylized Images

    Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture networks: Feed-forward syn- thesis of textures and stylized images. arXiv preprint arXiv:1603.03417, 2016. 2

  27. [27]

    A latent space of stochastic diffusion models for zero-shot image editing and guidance

    Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7378–7387, 2023. 3

  28. [28]

    Attention-aware multi-stroke style transfer

    Yuan Yao, Jianqiang Ren, Xuansong Xie, Weidong Liu, Yong-Jin Liu, and Jun Wang. Attention-aware multi-stroke style transfer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 1467– 1475, 2019. 2

  29. [29]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 2

  30. [30]

    Large scale image comple- tion via co-modulated generative adversarial networks.arXiv preprint arXiv:2103.10428, 2021

    Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image comple- tion via co-modulated generative adversarial networks.arXiv preprint arXiv:2103.10428, 2021. 2

  31. [31]

    Unpaired image-to-image translation using cycle- consistent adversarial networks

    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle- consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision , pages 2223– 2232, 2017. 1, 2

  32. [32]

    Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle- consistent adversarial networks, 2020. 2

  33. [33]

    Sean: Image synthesis with semantic region-adaptive nor- malization

    Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. Sean: Image synthesis with semantic region-adaptive nor- malization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5104–5113,

  34. [34]

    Appendix B contains empiric proof of the shared manifold assumption of Section 3

    2 10 MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation Supplementary Material The supplementary material is organized as follows: Ap- pendix A describes the implementation details of MedShift. Appendix B contains empiric proof of the shared manifold assumption of Section 3. A. Implementation Details The model was trained on a workstatio...