Leveraging Image Editing Foundation Models for Data-Efficient CT Metal Artifact Reduction
Pith reviewed 2026-05-10 18:41 UTC · model grok-4.3
The pith
Adapting a vision-language diffusion model via LoRA with multi-reference conditioning suppresses CT metal artifacts using only 16 to 128 paired training examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating metal artifact reduction as an in-context reasoning task and adapting a vision-language diffusion foundation model with LoRA plus multi-reference conditioning on clean exemplars from other subjects, the approach achieves effective artifact suppression and state-of-the-art performance on perceptual and radiological metrics using only 16 to 128 paired examples, two orders of magnitude fewer than conventional supervised methods.
What carries the argument
LoRA adaptation of a vision-language diffusion foundation model that receives the corrupted CT slice together with clean anatomical reference images from unrelated subjects to perform in-context restoration.
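As a rough illustration of the mechanism named above, the following sketch shows a generic LoRA update on a single linear layer: the base weight stays frozen and only a low-rank pair of matrices is trained. This is a minimal NumPy sketch of the standard LoRA formulation, not the paper's implementation; the layer sizes, rank, and scaling value are illustrative assumptions, and the function name `lora_forward` is hypothetical.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus the low-rank update (alpha / r) * B @ A."""
    r = A.shape[0]  # LoRA rank
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4                   # illustrative sizes, not taken from the paper
W = rng.normal(size=(d_out, d_in))           # frozen base weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init

x = rng.normal(size=(8, d_in))
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, alpha=8.0)

# With B initialised to zero, the adapted layer reproduces the frozen layer.
same_at_init = np.allclose(base_out, lora_out)

trainable = A.size + B.size   # r * (d_in + d_out) adapter parameters
full = W.size                 # d_in * d_out parameters in the frozen weight
```

Because the adapter adds only `r * (d_in + d_out)` trainable parameters per layer, it is plausible that a few dozen paired examples can steer the model without overwriting its pretrained priors, which is the intuition behind the claim.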
If this is right
- Artifact suppression reaches state-of-the-art levels on the AAPM CT-MAR benchmark for both perceptual and radiological-feature metrics.
- Training data requirements drop from thousands to 16-128 paired examples.
- Domain adaptation via LoRA prevents the foundation model from misinterpreting artifacts as unrelated natural objects.
- Multi-reference conditioning with clean exemplars from other patients enables category-specific anatomical inference.
- The adapted foundation model supplies an interpretable, data-efficient route to medical image reconstruction.
Where Pith is reading between the lines
- The same in-context conditioning strategy could be tested on other limited-data medical reconstruction problems such as low-dose CT or MRI denoising.
- Performance may depend on how closely the reference images match the anatomical category of the corrupted scan, suggesting a need for automated reference selection.
- If the approach scales, large unlabeled medical image collections could serve as reference banks without requiring new paired acquisitions.
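One way the automated reference selection mentioned above could be prototyped is to rank candidate clean slices by cosine similarity between feature vectors. The sketch below is an assumption for illustration only: the paper does not describe such a selector, the function `select_references` is hypothetical, and the toy vectors stand in for features that would in practice come from some image encoder.

```python
import numpy as np

def select_references(query_feat, bank_feats, k=3):
    """Return indices and scores of the k bank entries most cosine-similar to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    bank = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sims = bank @ q                     # cosine similarity per bank entry
    order = np.argsort(-sims)[:k]       # highest similarity first
    return order, sims[order]

# Toy feature bank (hypothetical): rows 0 and 1 resemble the query, rows 2 and 3 do not.
bank_feats = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query_feat = np.array([1.0, 0.05, 0.0])

idx, sims = select_references(query_feat, bank_feats, k=2)
```

Comparing restoration quality between the top-ranked and bottom-ranked references would be one direct test of whether anatomical match matters.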
Load-bearing premise
That clean anatomical exemplars from unrelated subjects supply sufficient category-specific context for the adapted model to correctly infer and restore the underlying anatomy without hallucinating new structures.
What would settle it
Running the model on the AAPM CT-MAR test set without the multi-reference conditioning or without LoRA domain adaptation and measuring whether streak artifacts are still interpreted as natural objects or whether quantitative and perceptual metrics fall below prior supervised baselines.
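The test described above amounts to a two-by-two ablation over LoRA adaptation and multi-reference conditioning. A minimal sketch of enumerating those configurations follows; the dictionary keys and the function name `ablation_grid` are hypothetical, and the evaluation step itself is deliberately left out because the source does not specify it.

```python
from itertools import product

def ablation_grid():
    """Enumerate every on/off combination of the two components under test."""
    configs = []
    for use_lora, use_refs in product([True, False], repeat=2):
        configs.append({"lora": use_lora, "multi_reference": use_refs})
    return configs

configs = ablation_grid()
# The first entry is the full method; the last is the unadapted model without references.
```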
Original abstract
Metal artifacts from high-attenuation implants severely degrade CT image quality, obscuring critical anatomical structures and posing a challenge for standard deep learning methods that require extensive paired training data. We propose a paradigm shift: reframing artifact reduction as an in-context reasoning task by adapting a general-purpose vision-language diffusion foundation model via parameter-efficient Low-Rank Adaptation (LoRA). By leveraging rich visual priors, our approach achieves effective artifact suppression with only 16 to 128 paired training examples, reducing data requirements by two orders of magnitude. Crucially, we demonstrate that domain adaptation is essential for hallucination mitigation; without it, foundation models interpret streak artifacts as erroneous natural objects (e.g., waffles or petri dishes). To ground the restoration, we propose a multi-reference conditioning strategy where clean anatomical exemplars from unrelated subjects are provided alongside the corrupted input, enabling the model to exploit category-specific context to infer uncorrupted anatomy. Extensive evaluation on the AAPM CT-MAR benchmark demonstrates that our method achieves state-of-the-art performance on perceptual and radiological-feature metrics. This work establishes that foundation models, when appropriately adapted, offer a scalable alternative for interpretable, data-efficient medical image reconstruction. Code is available at https://github.com/ahmetemirdagi/CT-EditMAR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes reframing CT metal artifact reduction (MAR) as an in-context reasoning task by adapting a vision-language diffusion foundation model via LoRA, combined with multi-reference conditioning using clean anatomical exemplars from unrelated subjects. It claims this enables effective artifact suppression and state-of-the-art performance on the AAPM CT-MAR benchmark using only 16 to 128 paired training examples (two orders of magnitude less data than standard methods), while showing that domain adaptation is essential to prevent the model from misinterpreting artifacts as natural objects.
Significance. If the central performance claims hold under rigorous validation, this work would be significant for demonstrating that foundation models can be adapted for data-scarce medical imaging tasks like CT reconstruction, potentially reducing reliance on large paired datasets. The public code release supports reproducibility and allows community verification of the empirical results.
major comments (1)
- [§3.2] §3.2 (multi-reference conditioning): The central data-efficiency claim depends on the assumption that clean exemplars from unrelated subjects supply usable category-specific priors for recovering patient-specific anatomy. Given substantial inter-patient variability in organ geometry, size, and position in CT, the manuscript provides no analysis, ablation, or patient-specific validation of reference selection/impact to show that the model recovers true geometry rather than blending or hallucinating structures from the references. This is load-bearing for the 16–128 example regime, where LoRA adaptation has limited opportunity to learn mappings.
minor comments (3)
- The abstract asserts state-of-the-art performance on 'perceptual and radiological-feature metrics' without naming the specific metrics, providing numerical values, or referencing the corresponding results table; this should be stated explicitly in the abstract or introduction.
- [Method] The method section should specify the exact foundation model (e.g., which Stable Diffusion variant or vision-language model) and the precise LoRA configuration (rank, target modules) for reproducibility.
- [Results] Figure captions in the results section could more explicitly describe the role of the provided references and highlight differences in artifact suppression across methods.
Simulated Author's Rebuttal
We thank the referee for their insightful review and for recognizing the potential significance of adapting foundation models to data-scarce medical imaging tasks. We address the major comment below and will revise the manuscript to incorporate additional analysis as suggested.
Referee: [§3.2] §3.2 (multi-reference conditioning): The central data-efficiency claim depends on the assumption that clean exemplars from unrelated subjects supply usable category-specific priors for recovering patient-specific anatomy. Given substantial inter-patient variability in organ geometry, size, and position in CT, the manuscript provides no analysis, ablation, or patient-specific validation of reference selection/impact to show that the model recovers true geometry rather than blending or hallucinating structures from the references. This is load-bearing for the 16–128 example regime, where LoRA adaptation has limited opportunity to learn mappings.
Authors: We thank the referee for this important observation. The multi-reference conditioning strategy is intended to exploit the foundation model's in-context reasoning by supplying clean anatomical exemplars as category-specific visual priors, enabling inference of uncorrupted patient anatomy even with minimal paired data. Our experiments indicate that this yields measurable gains over single-reference or no-reference baselines on the AAPM CT-MAR benchmark. Nevertheless, we agree that the current manuscript lacks a dedicated ablation and validation of reference selection and impact. In the revised version, we will add: (1) an ablation varying reference count and anatomical similarity (measured via feature-based metrics), (2) quantitative assessments of anatomical fidelity on non-artifact regions using available ground-truth structures, and (3) patient-specific case studies with visualizations to demonstrate recovery of true geometry rather than blending or hallucination. These additions will directly address inter-patient variability and strengthen the data-efficiency claims for the low-data regime.
Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper presents an empirical method: reframing metal artifact reduction as in-context reasoning via LoRA adaptation of a vision-language diffusion foundation model, combined with multi-reference conditioning using clean exemplars. Performance claims rest on AAPM CT-MAR benchmark evaluation and ablation studies showing data efficiency (16-128 examples) and the necessity of domain adaptation to avoid hallucinations. No equations, first-principles derivations, or predictions are offered that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central results are externally falsifiable via the public benchmark and code release, with no load-bearing self-citations or uniqueness theorems invoked.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: General-purpose vision-language diffusion models contain rich visual priors transferable to medical CT images after parameter-efficient adaptation.
- standard math: LoRA enables effective domain adaptation without catastrophic forgetting of the base model's capabilities.