Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

Haijun Yu; Hongbin Han; Jiazhou Wang; Long Yang; Song Ni; Weigang Hu; Xiaojie Yin; Yixing Huang; Zhenhao Li

arxiv: 2508.11211 · v3 · pith:IIULB3LNnew · submitted 2025-08-15 · 📡 eess.IV · cs.CV

Zhenhao Li, Song Ni, Long Yang, Xiaojie Yin, Haijun Yu, Jiazhou Wang, Hongbin Han, Weigang Hu, Yixing Huang

Pith reviewed 2026-05-18 23:35 UTC · model grok-4.3

classification 📡 eess.IV cs.CV

keywords CT field of view extension · Schrödinger Bridge · diffusion models · image-to-image mapping · truncated projections · medical image reconstruction · artifact reduction

The pith

An image-to-image Schrödinger Bridge learns direct stochastic mappings from limited-FOV to extended-FOV CT scans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that replacing noise-to-image diffusion with a direct bridge between paired limited and full field-of-view CT images produces lower reconstruction error and far faster inference. A reader would care because truncated CT projections currently force either incomplete anatomy or slow iterative fixes, and a method that finishes in under a second per slice could bring reliable FOV extension into everyday clinical workflows. The direct mapping also keeps the generative steps traceable, which helps preserve consistent anatomical structures instead of synthesizing them from random noise.

Core claim

The I²SB model learns a direct stochastic mapping between paired limited-FOV and extended-FOV CT images rather than synthesizing from pure Gaussian noise. This produces RMSE values of 49.8 HU on simulated noisy data and 152.0 HU on real data while completing reconstruction in a single step that takes 0.19 seconds per 2D slice, more than 700 times faster than conditional DDPM.

What carries the argument

The image-to-image Schrödinger Bridge, which learns a direct stochastic mapping between paired limited-FOV and extended-FOV images to replace iterative denoising from noise.
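
To make "direct mapping" concrete, here is a minimal sketch based on the general I²SB formulation, not on this paper's code, which this page does not reproduce. In that formulation, an intermediate state x_t given a clean image x_0 and a degraded image x_1 is Gaussian, with a mean that interpolates between the two according to the variance accumulated on either side of t. The network is trained to predict the scaled residual (x_t - x_0) / sigma_t, so one-step inference evaluates it once at t = 1. The function names, the flat-list image representation, and the `predict_eps` callable standing in for a trained network are illustrative assumptions.

```python
import math
import random


def bridge_sample(x0, x1, var_fwd, var_bwd, rng):
    """Draw x_t from the Gaussian bridge posterior q(x_t | x0, x1).

    var_fwd: variance accumulated from t=0 up to t (sigma_t squared).
    var_bwd: variance accumulated from t up to t=1 (bar-sigma_t squared).
    Images are flat lists of pixel values (an assumption for portability).
    """
    total = var_fwd + var_bwd
    w0 = var_bwd / total  # weight on the clean (extended-FOV) image
    w1 = var_fwd / total  # weight on the degraded (limited-FOV) image
    std = math.sqrt(var_fwd * var_bwd / total)
    return [w0 * a + w1 * b + std * rng.gauss(0.0, 1.0) for a, b in zip(x0, x1)]


def one_step_restore(x1, predict_eps, sigma_1):
    """One-step inference: x0_hat = x1 - sigma_1 * eps(x1, t=1).

    predict_eps is a hypothetical stand-in for a trained network that
    estimates (x_t - x0) / sigma_t from x_t and t.
    """
    eps = predict_eps(x1, 1.0)
    return [b - sigma_1 * e for b, e in zip(x1, eps)]


if __name__ == "__main__":
    clean = [0.0, 100.0, -50.0]     # toy "extended-FOV" pixels
    degraded = [10.0, 80.0, -20.0]  # toy "limited-FOV" pixels
    sigma_1 = 2.0

    # Oracle predictor: knows the clean image, so the residual is exact.
    def oracle(x, t):
        return [(b - a) / sigma_1 for a, b in zip(clean, x)]

    print(one_step_restore(degraded, oracle, sigma_1))
    print(bridge_sample(clean, degraded, 0.5, 0.5, random.Random(0)))
```

With an oracle predictor the one-step formula recovers x_0 exactly. A trained network only approximates the residual, which is where reconstruction error enters.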

If this is right

Reconstruction finishes in 0.19 seconds per 2D slice instead of minutes.
RMSE stays lower than cDDPM and patch-based diffusion on both simulated noisy and real data.
The traceable mapping improves anatomical consistency over noise-driven synthesis.
The speed-accuracy balance supports real-time or clinical deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same direct-mapping approach could address truncation artifacts in cone-beam CT or limited-angle tomography without new hardware.
Collecting paired data across multiple scanner models during training would likely improve robustness to real-world geometry differences.
Extending the one-step bridge to full 3D volumes would remove slice-wise inconsistencies that arise in current 2D processing.

Load-bearing premise

The method assumes paired limited-FOV and extended-FOV training images exist and that the learned mapping generalizes to unseen patient anatomies and scanner geometries without creating false structures.

What would settle it

Apply the trained model to real scans from a scanner model or patient population absent from training and measure whether RMSE exceeds 152 HU or new anatomical inconsistencies appear at FOV boundaries.
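
The check described above reduces to computing RMSE in Hounsfield units on held-out scans and comparing it with the reported figure. The sketch below is one way to do that, assuming images are available as flat sequences of HU values. The optional `mask` argument (for example, restricting evaluation to the region outside the original FOV) is a hypothetical addition, since this page does not say over which pixels the paper computes RMSE.

```python
import math

# Real-data RMSE reported in the paper's abstract, in HU.
REPORTED_REAL_RMSE_HU = 152.0


def rmse_hu(prediction, reference, mask=None):
    """Root-mean-square error between two images given as flat HU sequences.

    mask: optional sequence of booleans selecting which pixels to evaluate
    (hypothetical; the evaluated region is not specified on this page).
    """
    if len(prediction) != len(reference):
        raise ValueError("prediction and reference must have the same length")
    pairs = [
        (p, r)
        for i, (p, r) in enumerate(zip(prediction, reference))
        if mask is None or mask[i]
    ]
    if not pairs:
        raise ValueError("no pixels selected for evaluation")
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))


def exceeds_reported(rmse_value, threshold=REPORTED_REAL_RMSE_HU):
    """True when a measured RMSE is worse than the reported real-data value."""
    return rmse_value > threshold
```

A single aggregate RMSE would not reveal boundary hallucinations on its own, so a mask limited to the extended region is likely the more informative variant.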

Figures

Figures reproduced from arXiv: 2508.11211 by Haijun Yu, Hongbin Han, Jiazhou Wang, Long Yang, Song Ni, Weigang Hu, Xiaojie Yin, Yixing Huang, Zhenhao Li.

**Figure 2.** Results of two exemplary test slices in the noisy scenario. The first and third rows represent different slices in the test set, and the second and fourth …

**Figure 4.** The reference images were reconstructed using the fast …

**Figure 3.** Quantifying the uncertainty of reconstruction. (a) Ground truth, (b-d) sampling images with different random seed, (e) mean of the reconstruction, (f) …

**Figure 4.** Results of two exemplary test slices in the real data. The first and third rows represent different slices in the test set, and the second and fourth rows …

**Figure 5.** It is worth noting that residual noise remains in the …

**Figure 5.** Results of two exemplary test slices in the noise-free scenario. The first and third rows represent different slices in the test set, and the second and …

Original abstract

Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner's field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I²SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I²SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I²SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19 s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135 s) and surpassing DiffusionGAN (0.58 s), the second fastest. This combination of accuracy and efficiency indicates that I²SB has potential for real-time or clinical deployment.
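
The speedup figures follow directly from the per-slice timings quoted in the abstract: 135 s / 0.19 s is roughly 710, and 0.58 s / 0.19 s is roughly 3. The short sketch below reproduces that arithmetic and scales it to a whole volume. The 300-slice volume is a hypothetical size chosen for illustration, not a number from the paper.

```python
# Per-slice inference times in seconds, as quoted in the abstract.
SECONDS_PER_SLICE = {"I2SB": 0.19, "DiffusionGAN": 0.58, "cDDPM": 135.0}


def speedup(baseline: str, method: str = "I2SB") -> float:
    """How many times faster `method` is than `baseline` per slice."""
    return SECONDS_PER_SLICE[baseline] / SECONDS_PER_SLICE[method]


def volume_time_s(method: str, n_slices: int) -> float:
    """Total time for a volume processed slice by slice, in seconds."""
    return SECONDS_PER_SLICE[method] * n_slices


if __name__ == "__main__":
    n_slices = 300  # hypothetical volume size, for illustration only
    for name in SECONDS_PER_SLICE:
        total = volume_time_s(name, n_slices)
        print(f"{name}: {total:.0f} s for {n_slices} slices "
              f"({speedup(name):.1f}x the I2SB per-slice time)")
```

At that hypothetical volume size, the difference is about a minute for the one-step bridge versus more than eleven hours for cDDPM, which is the practical content of the "700-fold" claim.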

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies I²SB to CT FOV extension and gets one-step inference with reported RMSE gains over cDDPM, but the generalization claim rests on paired training data without strong checks on real unpaired cases.

The core contribution is taking the existing image-to-image Schrödinger Bridge and training it directly on paired limited-FOV and extended-FOV CT slices. This lets them skip the long iterative sampling of standard diffusion models and do the extension in a single forward pass. On their numbers, that produces 49.8 HU RMSE on simulated noisy data and 152 HU on real data, plus a 0.19 s per slice runtime that is more than 700 times faster than cDDPM. Those concrete speed and accuracy figures are the main practical takeaway if they hold up under scrutiny. The work is straightforward and the comparisons to patch-based diffusion and conditional DDPM are clearly laid out in the abstract.

The soft spot is the reliance on paired training data and the limited validation for truly unseen anatomies or scanner geometries. The abstract mentions that the direct mapping improves anatomical consistency, yet the stress-test note points out that both training and the real-data test set appear to use simulated truncations. Without an explicit check for hallucinated structures on unpaired clinical cases or cross-scanner tests, the clinical-deployment suggestion stays provisional.

The math itself looks standard for this class of model and the citation pattern is appropriate. Readers who work on efficient generative methods for medical imaging will find the speed numbers useful. The paper is coherent on its own terms and shows honest engagement with the diffusion literature, so it is worth sending to a serious referee even if revisions will be needed on the generalization experiments.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an image-to-image Schrödinger Bridge (I²SB) framework for CT field-of-view extension. It learns a direct stochastic mapping between paired limited-FOV and extended-FOV images rather than synthesizing from Gaussian noise, yielding reported RMSE values of 49.8 HU on simulated noisy data and 152.0 HU on real data, together with one-step inference at 0.19 s per 2D slice (over 700-fold speedup versus cDDPM).

Significance. If the quantitative gains and generalization hold, the work offers a practical route to real-time FOV extension in clinical CT, addressing truncation artifacts with both higher fidelity and orders-of-magnitude faster inference than iterative diffusion baselines. The direct-mapping formulation is a clear methodological strength that improves traceability over standard conditional diffusion models.

major comments (2)

[§4] §4 (Experiments and Results): The reported RMSE of 152.0 HU on 'real data' and the clinical-deployment claim rest on evaluation that uses simulated truncations for both training and test sets; no cross-scanner, cross-anatomy, or unpaired real-patient validation is described, leaving the assumption that the learned mapping generalizes without introducing hallucinations untested and load-bearing for the central performance claim.
[§3.2] §3.2 (I²SB formulation): While the one-step inference is presented as a direct stochastic mapping, the manuscript does not provide an ablation or theoretical argument showing that this mapping remains stable under distribution shift in scanner geometry or patient anatomy; the quantitative superiority therefore depends on an unverified generalization premise.

minor comments (2)

[Abstract] Abstract and §1: The LaTeX rendering 'Schrödinger' appears correctly, but ensure consistent use of the umlaut throughout the text and figure captions.
[Results] Table 1 or equivalent results table: Include standard deviations or statistical significance tests alongside the reported RMSE and timing values to strengthen the comparison with cDDPM and DiffusionGAN.
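
One lightweight way to act on the request for uncertainty estimates is a paired bootstrap over per-slice errors, sketched below. It assumes per-slice RMSE values are available for two methods on the same test slices, which this page does not confirm. The function name and the example numbers in the usage section are hypothetical. A percentile bootstrap is only one option, and a paired significance test would be an alternative.

```python
import random
import statistics


def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a minus b).

    errors_a, errors_b: per-slice errors for two methods on the SAME slices.
    Returns (mean_difference, ci_low, ci_high). A negative interval that
    excludes zero suggests method a has lower error than method b.
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("need two non-empty, equally long error lists")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample slices with replacement and record the mean difference each time.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), low, high


if __name__ == "__main__":
    # Hypothetical per-slice RMSE values in HU, for illustration only.
    method_a = [50.1, 48.7, 52.3, 49.0, 51.2, 47.9]
    method_b = [55.4, 53.0, 58.1, 52.2, 56.7, 54.3]
    print(paired_bootstrap_ci(method_a, method_b))
```

Pairing by slice matters here because slice difficulty varies a great deal, and an unpaired comparison would bury a consistent per-slice gain under that variation.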

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and have revised the manuscript to improve clarity, add supporting analysis, and acknowledge limitations where appropriate.

Referee: [§4] §4 (Experiments and Results): The reported RMSE of 152.0 HU on 'real data' and the clinical-deployment claim rest on evaluation that uses simulated truncations for both training and test sets; no cross-scanner, cross-anatomy, or unpaired real-patient validation is described, leaving the assumption that the learned mapping generalizes without introducing hallucinations untested and load-bearing for the central performance claim.

Authors: We acknowledge that the 'real data' experiments apply simulated truncations to actual clinical CT volumes, as paired ground-truth extended-FOV images from truly truncated clinical acquisitions are unavailable. This is standard practice for the task. In the revised manuscript we have clarified this setup in §4 and added an explicit limitations paragraph discussing the generalization premise. We have not performed cross-scanner or unpaired real-patient validation because such diverse paired datasets are not accessible to us at present; we therefore treat the current real-data results as preliminary evidence rather than definitive proof of clinical readiness.

revision: partial
Referee: [§3.2] §3.2 (I²SB formulation): While the one-step inference is presented as a direct stochastic mapping, the manuscript does not provide an ablation or theoretical argument showing that this mapping remains stable under distribution shift in scanner geometry or patient anatomy; the quantitative superiority therefore depends on an unverified generalization premise.

Authors: We appreciate this point. In the revised version we have added a short theoretical paragraph in §3.2 noting that the Schrödinger Bridge learns an optimal transport map between the paired marginals, which is expected to be more robust to moderate shifts than iterative noise-to-image diffusion. We have also included a new ablation (now Table 3) that perturbs test-set geometry and anatomy parameters and reports the resulting RMSE degradation, showing graceful rather than catastrophic failure. These additions directly address the stability concern.

revision: yes
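
The simulated-truncation setup mentioned in the first response can be illustrated generically as follows. This sketch assumes truncation is applied in the projection domain by discarding detector channels outside a centred window, which is a common way to mimic a small FOV. The paper's actual geometry, truncation width, and reconstruction step are not given on this page, so the function name and the `keep_fraction` parameter are hypothetical. The paired limited-FOV image would come from reconstructing the truncated sinogram, which is not shown here.

```python
def truncate_sinogram(sinogram, keep_fraction):
    """Zero out detector channels outside a centred window.

    sinogram: list of rows, one per projection angle; each row is a list
    of detector readings (all rows the same length).
    keep_fraction: fraction of detector channels retained, in (0, 1].
    Returns a new sinogram; the input is left unchanged.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_det = len(sinogram[0])
    n_keep = max(1, round(n_det * keep_fraction))
    start = (n_det - n_keep) // 2
    stop = start + n_keep
    return [
        [value if start <= j < stop else 0.0 for j, value in enumerate(row)]
        for row in sinogram
    ]


if __name__ == "__main__":
    # Toy sinogram: 2 projection angles, 8 detector channels.
    full = [[1.0] * 8 for _ in range(2)]
    for row in truncate_sinogram(full, 0.5):
        print(row)
```

Because the full sinogram is known, the untruncated reconstruction serves as ground truth. That is what makes paired training possible, and it is also why the referee's concern about genuinely truncated acquisitions remains open.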

standing simulated objections not resolved

Extensive multi-scanner or unpaired real truncated-patient validation would require new data collection beyond the scope of the current study.

Circularity Check

0 steps flagged

No circularity in the derivation or performance claims

The paper applies the existing I²SB framework to learn a direct stochastic mapping from paired limited-FOV and extended-FOV CT images via standard supervised training. Reported RMSE values (49.8 HU simulated, 152.0 HU real) and inference times are presented as empirical results from experiments on simulated and real data, not as outputs of a mathematical derivation that reduces to the training assumptions or fitted parameters by construction. No equations, self-definitional steps, or load-bearing self-citations appear in the abstract or described method that would make the central claims equivalent to their inputs. The approach is self-contained as a data-driven application of a known generative model.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on supervised training of a neural network on paired truncated and full-FOV CT images; this introduces a large number of fitted parameters whose values are determined by the training data rather than derived from first principles.

free parameters (2)

neural network weights
All parameters of the I²SB model are optimized on paired CT data; no count or specific values are given in the abstract.
training hyperparameters
Learning rate, batch size, and noise schedule parameters are chosen to fit the observed CT distributions.

axioms (1)

domain assumption: Paired limited-FOV and extended-FOV images exist and are representative of clinical distributions
The direct stochastic mapping in I²SB presupposes access to such aligned training pairs.

pith-pipeline@v0.9.0 · 5872 in / 1424 out tokens · 33667 ms · 2026-05-18T23:35:08.836586+00:00 · methodology
