Few-Shot Distribution-Aligned Flow Matching for Data Synthesis in Medical Image Segmentation

Aihua Ke; Bo Cai; Jian Luo; Jie Yang; Xiaosong Wang; Ziqi Ye

arXiv: 2604.02868 · v1 · submitted 2026-04-03 · eess.IV · cs.CV

Jie Yang, Ziqi Ye, Aihua Ke, Jian Luo, Bo Cai, Xiaosong Wang

Pith reviewed 2026-05-13 18:03 UTC · model grok-4.3

keywords: flow matching · medical image segmentation · data augmentation · distribution alignment · few-shot learning · generative models · image synthesis · differentiable reward

The pith

Flow matching model aligns generated medical images to target distributions using few-shot differentiable reward fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AlignFlow, a flow matching approach for synthesizing image-mask pairs to augment medical image segmentation datasets. It trains the model in two stages: first to fit existing training data for plausible images, then applies differentiable reward fine-tuning to align the outputs with the distribution of a small set of reference images from the target domain. This targets the problem of distribution shifts that hurt model performance in clinical use. Experiments show consistent gains of 3.5 to 4.0 percent in mean Dice score and 3.5 to 5.6 percent in mean IoU across multiple datasets. The method also includes a flow matching component for generating diverse masks to better cover regions of interest.

Core claim

AlignFlow divides flow matching training into an initial stage that learns to generate plausible images from the training distribution and a second stage that uses differentiable reward fine-tuning to shift generations toward the distribution of limited target reference samples, while a separate flow matching process enhances mask diversity for improved segmentation training.
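The two-stage split can be illustrated with a minimal sketch of a conditional flow matching loss on toy vectors. It assumes the common straight-line interpolation path x_t = (1 - t)·x_0 + t·x_1 with target velocity x_1 - x_0; this summary does not state which path or loss weighting AlignFlow uses. The function names (`fm_loss`, `stage2_loss`), the weight `lam`, and the scalar `reward_value` are hypothetical placeholders, not the authors' implementation.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(velocity_fn, data_batch, rng):
    """Stage 1 (assumed form): mean squared error between predicted and target velocity."""
    total, count = 0.0, 0
    for x1 in data_batch:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
        t = rng.random()                         # time drawn uniformly from [0, 1)
        pred = velocity_fn(interpolate(x0, x1, t), t)
        tgt = target_velocity(x0, x1)
        total += sum((p - q) ** 2 for p, q in zip(pred, tgt))
        count += len(x1)
    return total / count

def stage2_loss(fm_value, reward_value, lam=0.1):
    """Stage 2 (assumed form): keep the flow matching term and subtract a weighted reward."""
    return fm_value - lam * reward_value

# Toy usage with a placeholder "model" that always predicts zero velocity.
rng = random.Random(0)
zero_velocity = lambda xt, t: [0.0] * len(xt)
stage1 = fm_loss(zero_velocity, [[1.0, 2.0], [0.5, -0.5]], rng)
print(stage1, stage2_loss(stage1, reward_value=-0.3))
```

Keeping the flow matching term in the second stage mirrors the Figure 2 caption, which says the denoising loss and the alignment loss are optimized together so that the model retains its original generation capability.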

What carries the argument

Two-stage flow matching training combined with differentiable reward fine-tuning for distribution alignment in few-shot settings.

If this is right

Generated image-mask pairs improve downstream segmentation mDice by 3.5-4.0% over baselines.
mIoU scores rise by 3.5-5.6% across varied medical datasets and scenarios.
The approach remains effective with only a small number of reference images defining the target distribution.
Flow matching based mask generation increases diversity in regions of interest.
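For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of Dice and IoU for a single pair of binary masks, plus a plain average over pairs. The "m" prefix denotes a mean, but this summary does not say whether the paper averages over images, classes, or datasets, so the averaging here is an assumption, as is the convention of scoring two empty masks as a perfect match.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as nested lists of 0/1 values."""
    p = [v for row in pred for v in row]
    g = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, g) if a and b)
    p_sum = sum(1 for a in p if a)
    g_sum = sum(1 for b in g if b)
    union = p_sum + g_sum - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match (assumed convention)
    dice = 2.0 * inter / (p_sum + g_sum)
    iou = inter / union
    return dice, iou

def mean_dice_and_iou(pairs):
    """Average Dice and IoU over (pred, target) pairs; a simple per-pair mean."""
    scores = [dice_and_iou(p, g) for p, g in pairs]
    n = len(scores)
    return sum(d for d, _ in scores) / n, sum(i for _, i in scores) / n

# One predicted pixel overlaps the single ground-truth pixel; one is a false positive.
dice, iou = dice_and_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
print(dice, iou)
```

For a single mask pair the two scores are linked by Dice = 2·IoU / (1 + IoU), so gains in one generally track gains in the other, though the averaged percentages need not match.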

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This could lower the barrier for deploying segmentation models in new clinical settings with minimal new data collection.
If the reward signal generalizes, it might extend to aligning other generative models like diffusion or GANs in medical domains.
Testing on more diverse modalities such as CT or MRI could reveal if the alignment holds beyond the tested cases.

Load-bearing premise

That the differentiable reward fine-tuning successfully aligns generated images to the target distribution without introducing artifacts or collapsing diversity, with a small number of reference images providing a reliable signal.

What would settle it

Observing no performance gain or visible artifacts in generated images when applying the fine-tuning stage compared to the base flow matching model.

Figures

Figures reproduced from arXiv: 2604.02868 by Aihua Ke, Bo Cai, Jian Luo, Jie Yang, Xiaosong Wang, Ziqi Ye.

**Figure 1.** Illustration of the data distribution of images generated by AlignFlow that align with the target-domain distribution. The green dots represent the data points of images generated by AlignFlow, which are accurately scattered at the center of the target-domain data points.

**Figure 2.** (a) Illustration of the AlignFlow architecture. In stage 1, we optimize the denoising loss to enable the model to generate reasonable images; in stage 2, we simultaneously optimize the denoising loss and the alignment loss, allowing the model to align the generated images with reference images from the target domain while maintaining its original image generation capability. (b) Illustration of the mask sy…

**Figure 3.** Qualitative comparison on the REFUGE2 dataset. The source domain is Canon, and the target domains are annotated on the right side of each row.

In our implementation, to ensure that the number of samples in F_out and F_ref is consistent, we first compute the mean of F_out and F_ref before calculating the SKL. In summary, the expression for the reward function r^SKL_align is as follows: r^SKL_align(F_out, F_ref) = − …

**Figure 4.** t-SNE visualization of the data distributions of images generated by different methods. Per the paper, the t-SNE algorithm (Maaten & Hinton, 2008) is used to visualize these distributions.

**Figure 5.** Qualitative comparison on the FedPolyp dataset. The source domain is Canon, and the target domains are annotated on the right side of each row.

**Figure 6.** Qualitative comparison on the FedPolyp dataset using the same mask. Panel labels: Mask, (a) T2I-Adapter, (b) ControlNet-Diff, (c) ControlNet-FM, (d) Siamese-Diffusion, (e) AlignFlow (Ours); KOWA, TOPCON, Zeiss.

**Figure 7.** Qualitative comparison on the REFUGE2 dataset using the same mask.

Original abstract

Data heterogeneity hinders clinical deployment of medical image analysis models, and generative data augmentation helps mitigate this issue. However, recent diffusion-based methods that synthesize image-mask pairs often ignore distribution shifts between generated and real images across scenarios, and such mismatches can markedly degrade downstream performance. To address this issue, we propose AlignFlow, a flow matching model that aligns with the target reference image distribution via differentiable reward fine-tuning, and remains effective even when only a small number of reference images are provided. Specifically, we divide the training of the flow matching model into two stages: in the first stage, the model fits the training data to generate plausible images; then, we introduce a distribution alignment mechanism and employ a differentiable reward to steer the generated images toward the distribution of the given samples from the target domain. In addition, to enhance the diversity of generated masks, we also design a flow-matching-based mask generation module to complement the diversity in regions of interest. Extensive experiments demonstrate the effectiveness of our approach, i.e., performance improvement by 3.5-4.0% in mDice and 3.5-5.6% in mIoU across a variety of datasets and scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AlignFlow adds a two-stage flow matching setup with differentiable reward tuning to handle few-shot target alignment for medical image-mask synthesis, but the abstract leaves the robustness of that tuning step unproven.

The core move here is splitting flow matching into a first stage that learns to generate plausible image-mask pairs from source data, then a second stage that uses a differentiable reward to nudge outputs toward a handful of target-domain references. That addresses distribution shift in medical segmentation without requiring large target sets, and the added flow-based mask generator is a sensible touch to preserve ROI diversity. The reported 3.5-4% mDice and 3.5-5.6% mIoU lifts across datasets are the kind of numbers that would matter if they hold up under proper controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes AlignFlow, a two-stage flow-matching model for synthesizing image-mask pairs to augment medical image segmentation datasets under distribution shifts. Stage one fits the model to source training data; stage two applies differentiable reward fine-tuning to steer outputs toward a small set of target-domain reference images, supplemented by a separate flow-matching module for mask diversity. The authors claim this yields 3.5-4.0% gains in mDice and 3.5-5.6% in mIoU across multiple datasets and scenarios.

Significance. If the few-shot alignment step proves robust, the method could meaningfully improve generative augmentation for heterogeneous medical imaging data, where acquiring large target-domain sets is costly. The two-stage design and explicit mask-generation component are practical strengths that could translate to better downstream segmentation generalization in clinical settings.

major comments (2)

§3.2 (Differentiable Reward Fine-Tuning): The reward signal is described as steering generated images toward the target distribution from few references, yet no explicit formulation, loss weighting, or regularization against mode collapse is provided. Without these details it is impossible to verify that the optimization aligns to the underlying distribution rather than overfitting low-level statistics of the reference samples.
§4 (Experiments): The reported 3.5-5.6% mIoU improvements are presented without ablations on reference-set size, pre-/post-fine-tuning diversity metrics (e.g., FID, intra-class variance), statistical significance tests, or controls for artifact introduction. These omissions leave the causal link between the alignment stage and the metric gains unverified.

minor comments (2)

Abstract and §3: The phrase 'distribution alignment mechanism' is used without an accompanying equation or pseudocode reference, making the precise role of the reward term difficult to reconstruct.
§4: Table captions and axis labels should explicitly state the number of reference images used in each few-shot setting to allow direct comparison across scenarios.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We have addressed all major comments in the point-by-point responses below, with corresponding revisions to the manuscript.

Referee: §3.2 (Differentiable Reward Fine-Tuning): The reward signal is described as steering generated images toward the target distribution from few references, yet no explicit formulation, loss weighting, or regularization against mode collapse is provided. Without these details it is impossible to verify that the optimization aligns to the underlying distribution rather than overfitting low-level statistics of the reference samples.

Authors: We thank the referee for highlighting this gap. We have revised Section 3.2 to include the explicit formulation of the differentiable reward fine-tuning objective, the specific loss weighting scheme, and a regularization term designed to prevent mode collapse by encouraging diversity in the generated samples. These additions allow verification that the method aligns to the target distribution.
Revision: yes

Referee: §4 (Experiments): The reported 3.5-5.6% mIoU improvements are presented without ablations on reference-set size, pre-/post-fine-tuning diversity metrics (e.g., FID, intra-class variance), statistical significance tests, or controls for artifact introduction. These omissions leave the causal link between the alignment stage and the metric gains unverified.

Authors: We agree that these elements are important for validating the results. In the revised paper, we have added ablations on reference-set size, pre- and post-fine-tuning diversity metrics such as FID and intra-class variance, statistical significance tests, and controls for artifact introduction through additional qualitative and quantitative analysis. These new results strengthen the causal link between the alignment stage and the observed gains.
Revision: yes

Circularity Check

0 steps flagged

No circularity detected; two-stage alignment presented as independent mechanism

The abstract and description outline a two-stage process—initial flow-matching fit to training data, followed by separate differentiable reward fine-tuning for target distribution alignment—without any visible equations, self-citations, or reductions that equate the alignment output to its inputs by construction. No fitted parameters are renamed as predictions, no uniqueness theorems are imported from prior author work, and the performance gains are framed as empirical results rather than tautological consequences of the method definition. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a differentiable reward can be defined to measure and enforce distribution alignment from few samples without additional free parameters or mode collapse; no explicit free parameters, axioms, or invented entities are named in the abstract.

axioms (1)

Domain assumption: Differentiable reward fine-tuning can steer flow matching outputs to match a target distribution from few samples without degrading image quality or mask diversity.
Invoked in the description of the second training stage.

pith-pipeline@v0.9.0 · 5520 in / 1280 out tokens · 58909 ms · 2026-05-13T18:03:10.495714+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

we divide the training of the flow matching model into two stages: in the first stage, the model fits the training data to generate plausible images; then, we introduce a distribution alignment mechanism and employ differentiable reward to steer the generated images toward the distribution of the given samples from the target domain
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

we propose a reward function based on Maximum Mean Discrepancy (MMD) to measure the discrepancy between two image distributions effectively
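The quoted passage names Maximum Mean Discrepancy as the basis for a reward. The following is a minimal sketch of a biased squared-MMD estimate between two sets of feature vectors, turned into a reward by negation. The Gaussian (RBF) kernel, the bandwidth `sigma`, the function names, and the choice to operate on plain feature lists are all assumptions made for illustration; the passage does not specify the paper's kernel, feature extractor, or estimator.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian kernel between two feature vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, sigma) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def mmd_reward(generated, reference, sigma=1.0):
    """Higher reward when generated features sit closer to the reference features."""
    return -mmd2(generated, reference, sigma)

# Toy features: a small reference set, a nearby generated set, and a distant one.
reference = [[0.0, 0.0], [1.0, 0.0]]
close_set = [[0.1, 0.0], [0.9, 0.1]]
far_set = [[5.0, 5.0], [6.0, 5.0]]
print(mmd_reward(close_set, reference), mmd_reward(far_set, reference))
```

Because every operation here is smooth in the generated features, the same computation written in an autodiff framework would allow gradients of the reward to flow back into the generator, which is the property a differentiable reward needs.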

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024

Springer, 2017. Prabhudesai, M., Goyal, A., Pathak, D., and Fragkiadaki, K. Aligning text-to-image diffusion models with reward backpropagation. 2023. Prabhudesai, M., Mendonca, R., Qin, Z., Fragkiadaki, K., and Pathak, D. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024. Qi, C., Chen, J., Xu, G., Xu, Z., Lukasiewicz, ...

[2]

Denoising Diffusion Implicit Models

Springer, 2015. Silva, J., Histace, A., Romain, O., Dray, X., and Granado, B. Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. International journal of computer assisted radiology and surgery, 9(2): 283–293, 2014. Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.0...

[3]

To further validate the robustness of AlignFlow on different types of data, we also test its generated image quality on retinal fundus image datasets TOPCON and Zeiss, with results shown in Tables 7 and 8, respectively. Except for achieving suboptimal results in the SSIM metric on the Zeiss dataset, our method achieves the best performance in all other ca...
