
arxiv: 2504.13109 · v2 · pith:H7PXKXAQnew · submitted 2025-04-17 · 💻 cs.CV

UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

Pith reviewed 2026-05-22 18:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords flow matching · inversion · image editing · predictor-corrector · tuning-free · generative models · region-aware

The pith

Flow models support tuning-free inversion and region-aware editing through a predictor-corrector approach.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow matching models use straight-line trajectories that differ from the paths in diffusion models, making existing inversion and editing techniques ineffective. The paper develops Uni-Inv as a predictor-corrector inversion method tailored to these straight paths for precise image reconstruction. It then extends delayed injection to create Uni-Edit, a method for performing edits in specific regions while leaving other areas unchanged. This framework requires no tuning and applies to different flow models. Experiments show it achieves better results than adapted diffusion methods in efficiency and preservation of unedited content.

Core claim

The authors establish that a predictor-corrector-based framework for inversion and editing works effectively in flow models. Uni-Inv achieves accurate reconstruction by using prediction and correction steps along the model's trajectories. Uni-Edit applies the idea of delayed injection in a region-aware fashion to enable robust editing. The overall methodology is presented as tuning-free, model-agnostic, efficient, and capable of diverse edits with strong preservation of edit-irrelevant regions.

What carries the argument

A predictor-corrector inversion method that exploits the straight-line, non-crossing trajectories of flow models for accurate reconstruction and serves as the basis for delayed-injection editing.
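
To make the predictor-corrector idea concrete, here is a minimal Python sketch on a toy problem. It is not the paper's Uni-Inv: the velocity field, the function names, and the fixed-point corrector are illustrative assumptions, chosen only to show why correcting a predicted state can make inversion consistent with the sampler that later reconstructs the image.

```python
import numpy as np


def velocity(z, t):
    # Hypothetical smooth velocity field standing in for a learned v_theta(z, t).
    return np.tanh(z) + t * np.cos(z)


def sample(z1, steps):
    """Euler sampler from noise (t = 1) down to the image (t = 0)."""
    ts = np.linspace(1.0, 0.0, steps + 1)
    z = z1.copy()
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        z = z - (t_hi - t_lo) * velocity(z, t_hi)
    return z


def invert_euler(z0, steps):
    """Naive inversion: step forward in t with the velocity at the current point."""
    ts = np.linspace(0.0, 1.0, steps + 1)
    z = z0.copy()
    for t_lo, t_hi in zip(ts[:-1], ts[1:]):
        z = z + (t_hi - t_lo) * velocity(z, t_lo)
    return z


def invert_predictor_corrector(z0, steps, corrections=5):
    """Predict the next state, then correct it with the velocity at that state.

    The corrector is a fixed-point iteration for z_hi = z_lo + h * v(z_hi, t_hi),
    which is the exact inverse of one Euler sampling step.
    """
    ts = np.linspace(0.0, 1.0, steps + 1)
    z = z0.copy()
    for t_lo, t_hi in zip(ts[:-1], ts[1:]):
        h = t_hi - t_lo
        z_next = z + h * velocity(z, t_lo)            # predictor
        for _ in range(corrections):
            z_next = z + h * velocity(z_next, t_hi)   # corrector
        z = z_next
    return z


rng = np.random.default_rng(0)
z0 = rng.normal(size=8)   # toy "image"
steps = 10

err_euler = float(np.max(np.abs(sample(invert_euler(z0, steps), steps) - z0)))
err_pc = float(np.max(np.abs(sample(invert_predictor_corrector(z0, steps), steps) - z0)))
print(f"naive inversion error:              {err_euler:.2e}")
print(f"predictor-corrector inversion error: {err_pc:.2e}")
```

In this toy setting the corrected inversion reconstructs the input far more closely than the naive one at the same step count. Whether the same gap appears with a real flow model depends on the accuracy of its learned velocity field.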

If this is right

  • Accurate reconstruction becomes possible for images generated by flow models.
  • Diverse edits can be made while ensuring strong preservation of regions not involved in the edit.
  • The methods function without any tuning or model-specific adjustments.
  • Effective performance is maintained even in low-cost computational settings.
  • Generalizability is shown across various generative flow models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometry of trajectories in generative models may be central to the design of future inversion and editing techniques.
  • This could lead to simpler editing pipelines for flow-based systems in practical applications.
  • Extensions to other modalities like video generation might follow similar principles.

Load-bearing premise

The straight-line, non-crossing trajectories of flow models allow predictor-corrector inversion and delayed-injection editing to work accurately and robustly without post-hoc tuning or model-specific adjustments.
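
Delayed injection, as the Figure 2 caption describes it, keeps the source condition for the early denoising steps and switches to the edit condition at a middle timestep. A minimal Python sketch of that schedule follows. The function name, the `delay_rate` parameter (loosely modeled on the delay rate α mentioned in the figure text), and the string prompts are illustrative assumptions, and the region-aware masking that Uni-Edit adds is not modeled here.

```python
def delayed_injection_schedule(num_steps, delay_rate, source_cond, target_cond):
    """Return the condition to use at each denoising step.

    Hypothetical helper: the first round(delay_rate * num_steps) steps keep the
    source condition, and every later step uses the target (edit) condition.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= delay_rate <= 1.0:
        raise ValueError("delay_rate must lie in [0, 1]")
    switch_step = int(round(delay_rate * num_steps))
    return [source_cond] * switch_step + [target_cond] * (num_steps - switch_step)


schedule = delayed_injection_schedule(10, 0.4, "a dog", "a lion")
for step, cond in enumerate(schedule):
    print(step, cond)
```

A larger `delay_rate` keeps the source condition for longer, which in this sketch simply means fewer steps are driven by the edit prompt.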

What would settle it

Applying the Uni-Inv method to reconstruct an image and finding that the output differs substantially from the input in pixel accuracy or perceptual quality would challenge the claim of accurate reconstruction.
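
One way to run the pixel-accuracy side of this check is PSNR, which the paper's figure text also names as a preservation metric. The sketch below is a generic NumPy implementation under the assumption that images are float arrays scaled to [0, 1]; the function name and the toy arrays are illustrative, not taken from the paper's code.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    if reference.shape != reconstruction.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


image = np.zeros((4, 4, 3))   # toy stand-in for the input image
shifted = image + 0.1         # toy stand-in for an imperfect reconstruction
print(psnr(image, image))
print(psnr(image, shifted))
```

Higher PSNR means a closer reconstruction. A perceptual metric would still be needed for the perceptual-quality half of the test.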

Figures

Figures reproduced from arXiv: 2504.13109 by Biqing Huang, Guanlong Jiao, Kuan-Chieh Wang, Renjie Liao.

Figure 1: UniEdit-Flow for image inversion and editing. Our approach proposes a highly accurate and efficient, model-agnostic, training- and tuning-free sampling strategy for flow models to tackle image inversion and editing problems. Cluttered scenes are difficult for inversion and reconstruction, leading to failure results on various methods. Our Uni-Inv achieves exact reconstruction even in such complex situations…
Figure 2: Delayed injection, which retains the source condition during the early denoising steps and introduces the edit condition at a middle timestep (illustrated in the bottom part), is a widely used technique in diffusion-based editing (top row). However, when applied to flow models (second row), it is ineffective. While flow-based editing exhibits a mild tendency toward the target edit, it fails to produce suff…
Figure 3: An overview of our proposed Uni-Inv and Uni-Edit (bird → red bird). (a) indicates that vanilla flow inversion is incapable of both exact image inversion and controllable editing. (b) demonstrates our proposed Uni-Inv and Uni-Edit, which perform efficient and effective inversion and editing.
Figure 4: Per-step error of the velocities and samples of vanilla inversions. We first synthesize an image $Z_0$, then conduct vanilla inversion to get inverted noises $Z_1$ with per-step velocity of $v_\theta(\hat{Z}_{t_{i-1}}, t_{i-1})$ (♦) and $v_\theta(\hat{Z}_{t_{i-1}}, t_i)$ (■), respectively. We plot the per-step local error of samples ($\Delta Z$) and velocities ($\Delta v$). The right shows the visualization of various $Z_1$, while their border colors correspond to different…
Figure 5: Demonstration of various sampling-based image editing methods (dog → lion). Directly utilizing $c^T$ as the condition leads to undue editing. Leveraging delayed injection, which is widely used in diffusion-based methods, inevitably results in an inchoate performance when using deterministic models. Our Uni-Edit mitigates components obtained in early steps that are not conducive to editing, ultimately achieving…
Figure 6: Qualitative comparison on inversion & reconstruction. Our method ensures stable reconstruction results in both situations, with the description accessible (conditional) and inaccessible (unconditional), while taking into account both overall and detail consistency. …lay rate α, strikes a balance between preserving background details and achieving effective modifications, while simultaneously reducing inference…
Figure 7: Qualitative comparison on image editing. Our method consistently achieves more appropriate editing with better background preservation across various flow models. …Bench [30], which contains 700 images with 10 different editing types. To evaluate edit-irrelevant context preservation, we use structure distance [64], along with PSNR and SSIM for annotated unedited regions. The performance of the edits is ass…
Figure 8: Visualization of Uni-Edit process. The guidance mask of each denoising step is shown at the upper right of the image. We also demonstrate the "Sphinx" phenomenon that existing latent fusion approaches may cause at the lower left of the figure. …text description is absent.
Figure 9: Diverse application of Uni-Edit. The top is sketch to image, and the bottom is stroke to image. The left of an image pair is the source image, and the right is the editing result. …sults from the existing latent fusion method [14, 24], which uses masks to fuse inversion and edit latents. These results lead to unnatural, "Sphinx"-like outputs, highlighting the adaptability and efficiency of our approach…
read the original abstract

Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: https://uniedit-flow.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Uni-Inv, a predictor-corrector inversion method tailored to the straight-line trajectories of flow matching models for accurate reconstruction, and Uni-Edit, which adapts delayed injection for region-aware editing that preserves edit-irrelevant areas. It claims these techniques are tuning-free, model-agnostic, efficient, and effective, with extensive experiments across generative models demonstrating superiority even under low-cost settings.

Significance. If the central claims hold, the work would offer a practical, general-purpose framework for inversion and editing in flow-based generative models, which are emerging as strong alternatives to diffusion models. This could enable more reliable image manipulation applications without per-model calibration or high computational overhead, addressing a gap in current editing pipelines.

major comments (1)
  1. [§4, §3.2] §4 (Experiments) and §3.2 (Uni-Inv): The tuning-free and model-agnostic claims rest on using a single fixed set of hyperparameters (corrector iteration count, step schedule, and injection delay fraction) across multiple flow models. The reported quantitative results (e.g., reconstruction PSNR/SSIM and editing metrics) do not include an ablation or sensitivity analysis demonstrating that performance remains stable when these values are perturbed on architectures with differing velocity-field accuracies or training distributions; without this, the generalizability assertion is not fully load-bearing.
minor comments (2)
  1. [Figures 3-4] Figures 3 and 4: The qualitative editing examples would benefit from explicit annotation of the source and target regions, so that the preservation of edit-irrelevant areas can be verified at a glance.
  2. [§2] Related work section: The discussion of prior diffusion-based inversion methods (e.g., DDIM inversion) could more explicitly contrast the non-crossing property of flow trajectories with the stochasticity in diffusion paths.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The suggestion to bolster evidence for the tuning-free and model-agnostic properties is well-taken, and we address it directly below.

read point-by-point responses
  1. Referee: [§4, §3.2] §4 (Experiments) and §3.2 (Uni-Inv): The tuning-free and model-agnostic claims rest on using a single fixed set of hyperparameters (corrector iteration count, step schedule, and injection delay fraction) across multiple flow models. The reported quantitative results (e.g., reconstruction PSNR/SSIM and editing metrics) do not include an ablation or sensitivity analysis demonstrating that performance remains stable when these values are perturbed on architectures with differing velocity-field accuracies or training distributions; without this, the generalizability assertion is not fully load-bearing.

    Authors: We appreciate this observation. Our experiments already show that the same fixed hyperparameter set produces competitive reconstruction and editing results across several flow-based models without per-model tuning. Nevertheless, we agree that an explicit sensitivity analysis would make the generalizability claim more robust. In the revised version we will add an ablation that perturbs the corrector iteration count, step schedule, and injection delay fraction on models with different velocity-field characteristics and report the resulting changes in PSNR, SSIM, and editing metrics. This addition will directly address the concern while preserving the tuning-free nature of the proposed methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives Uni-Inv and Uni-Edit from the geometric properties of flow-matching trajectories (straight-line, non-crossing paths) by adapting predictor-corrector inversion and delayed-injection editing. These steps are presented as direct consequences of the flow ODE structure rather than quantities fitted to target editing outputs or defined circularly in terms of each other. No load-bearing self-citation chain, ansatz smuggling, or renaming of known results is required for the central claims; the tuning-free and model-agnostic assertions rest on the stated trajectory properties and are tested externally across models. The derivation remains self-contained against the independent mathematical features of flow models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that flow trajectories are straight and non-crossing, which is used to justify why diffusion-style inversion fails and why the new predictor-corrector works. No free parameters are introduced because the method is described as tuning-free. No new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption Flow matching models possess straight-line, non-crossing trajectories that differ from diffusion paths.
    Explicitly stated in the abstract as the reason existing diffusion methods are ineffective and new solutions are possible.
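
For reference, the straight-line assumption is usually written as follows in the rectified-flow formulation. This is the standard form, given here as an assumption about what the axiom refers to rather than as a quotation from the paper, with $Z_0$ the image and $Z_1$ the noise as in the Figure 4 caption.

```latex
\begin{align}
Z_t &= (1 - t)\, Z_0 + t\, Z_1, \qquad t \in [0, 1], \\
\frac{\mathrm{d} Z_t}{\mathrm{d} t} &= Z_1 - Z_0, \\
v_\theta(Z_t, t) &\approx \mathbb{E}\left[ Z_1 - Z_0 \mid Z_t \right].
\end{align}
```

Each conditional path between a fixed pair $(Z_0, Z_1)$ has constant velocity. The learned field $v_\theta$ only approximates the conditional expectation in the last line, so trajectories of a trained model need not be perfectly straight.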

pith-pipeline@v0.9.0 · 5703 in / 1183 out tokens · 28029 ms · 2026-05-22T18:49:10.310206+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    FlowAnchor stabilizes editing signals in flow-based inversion-free video editing via spatial-aware attention refinement and adaptive magnitude modulation for improved faithfulness and temporal coherence.

  2. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

    cs.CV 2026-04 unverdicted novelty 7.0

    Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

  3. Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

    cs.CV 2026-03 unverdicted novelty 7.0

    Dress-ED is the first large-scale benchmark unifying virtual try-on, try-off, and text-guided garment editing with 146k verified samples plus a multimodal diffusion baseline.

  4. StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    StreamGVE enables high-quality training-free video editing by converting the task to noise-to-data streaming generation with dual-branch fast sampling, self-attention bridges, cross-attention grounding, source-oriente...

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · cited by 4 Pith papers · 17 internal anchors

  1. [1]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.

  2. [2]

    Blended diffusion for text-driven editing of natural images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18208–18218, 2022.

  3. [3]

    Blended latent diffusion

    Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM transactions on graphics (TOG), 42(4):1–11, 2023.

  4. [4]

    Stable flow: Vital layers for training-free image editing

    Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, and Daniel Cohen-Or. Stable flow: Vital layers for training-free image editing. arXiv preprint arXiv:2411.14430, 2024.

  5. [5]

    Zigzag diffusion sampling: Diffusion models can self-improve via self-reflection

    Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, and Zeke Xie. Zigzag diffusion sampling: Diffusion models can self-improve via self-reflection. In The Thirteenth International Conference on Learning Representations, 2024.

  6. [6]

    Multidiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. 2023.

  7. [7]

    Sega: Instructing text-to-image models using semantic guidance

    Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, and Kristian Kersting. Sega: Instructing text-to-image models using semantic guidance. Advances in Neural Information Processing Systems, 36:25365–25389, 2023.

  8. [8]

    Ledits++: Limitless image editing using text-to-image models

    Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, and Apolinário Passos. Ledits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8861–8870, 2024.

  9. [9]

    Classifier-free guidance is a predictor-corrector

    Arwen Bradley and Preetum Nakkiran. Classifier-free guidance is a predictor-corrector. arXiv preprint arXiv:2408.09000, 2024.

  10. [10]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.

  11. [11]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570.

  12. [12]

    Fs-coco: Towards understanding of freehand sketches of common objects in context

    Pinaki Nath Chowdhury, Aneeshan Sain, Ayan Kumar Bhunia, Tao Xiang, Yulia Gryaditskaya, and Yi-Zhe Song. Fs-coco: Towards understanding of freehand sketches of common objects in context. In European conference on computer vision, pages 253–270. Springer, 2022.

  13. [13]

    Diffusion Posterior Sampling for General Noisy Inverse Problems

    Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. arXiv preprint arXiv:2209.14687, 2022.

  14. [14]

    Diffedit: Diffusion-based semantic image editing with mask guidance

    Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.

  15. [15]

    Fluxspace: Disentangled semantic editing in rectified flow transformers

    Yusuf Dalva, Kavana Venkatesh, and Pinar Yanardag. Fluxspace: Disentangled semantic editing in rectified flow transformers. arXiv preprint arXiv:2412.09611, 2024.

  16. [16]

    Fireflow: Fast inversion of rectified flow for image semantic editing

    Yingying Deng, Xiangyu He, Changwang Mei, Peisong Wang, and Fan Tang. Fireflow: Fast inversion of rectified flow for image semantic editing, 2024.

  17. [17]

    Ccd-3dr: Consistent conditioning in diffusion for single-image 3d reconstruction

    Yan Di, Chenyangguang Zhang, Pengyuan Wang, Guangyao Zhai, Ruida Zhang, Fabian Manhardt, Benjamin Busam, Xiangyang Ji, and Federico Tombari. Ccd-3dr: Consistent conditioning in diffusion for single-image 3d reconstruction. arXiv preprint arXiv:2308.07837, 2023.

  18. [18]

    Prompt tuning inversion for text-driven image editing using diffusion models

    Wenkai Dong, Song Xue, Xiaoyue Duan, and Shumin Han. Prompt tuning inversion for text-driven image editing using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7430–7440.

  19. [19]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning.

  20. [20]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

  21. [21]

    Renoise: Real image inversion through iterative noising

    Daniel Garibi, Or Patashnik, Andrey Voynov, Hadar Averbuch-Elor, and Daniel Cohen-Or. Renoise: Real image inversion through iterative noising, 2024.

  22. [22]

    Generative adversarial networks

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.

  23. [23]

    Improving tuning-free real image editing with proximal guidance

    Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, et al. Improving tuning-free real image editing with proximal guidance. arXiv preprint arXiv:2306.05414, 2023.

  24. [24]

    Proxedit: Improving tuning-free real image editing with proximal guidance

    Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, et al. Proxedit: Improving tuning-free real image editing with proximal guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4291–4301, 2024.

  25. [25]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

  26. [26]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.

  27. [27]

    An edit friendly ddpm noise space: Inversion and manipulations

    Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer Michaeli. An edit friendly ddpm noise space: Inversion and manipulations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12469–12478, 2024.

  28. [28]

    Hq-edit: A high-quality dataset for instruction-based image editing

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990, 2024.

  29. [29]

    Scope of validity of psnr in image/video quality assessment

    Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of psnr in image/video quality assessment. Electronics letters, 44(13):800–801, 2008.

  30. [30]

    Pnp inversion: Boosting diffusion-based editing with 3 lines of code

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. International Conference on Learning Representations (ICLR), 2024.

  31. [31]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.

  32. [32]

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.

  33. [33]

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  34. [34]

    Open-vocabulary object segmentation with diffusion models

    Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Open-vocabulary object segmentation with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7667–7676, 2023.

  35. [35]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  36. [36]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  37. [37]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023.

  38. [38]

    DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.

  39. [39]

    Pnp-flow: Plug-and-play image restoration with flow matching

    Ségolène Martin, Anne Gagneux, Paul Hagemann, and Gabriele Steidl. Pnp-flow: Plug-and-play image restoration with flow matching. arXiv preprint arXiv:2410.02423, 2024.

  40. [40]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6038–6047, 2023.

  41. [41]

    Zero-shot image-to-image translation

    Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH 2023, Los Angeles, CA, USA, August 6-10, 2023, pages 11:1–11:11. ACM, 2023.

  42. [42]

    Steering rectified flow models in the vector field for controlled image generation

    Maitreya Patel, Song Wen, Dimitris N. Metaxas, and Yezhou Yang. Steering rectified flow models in the vector field for controlled image generation. arXiv preprint arXiv:2412.00100, 2024.

  43. [43]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205.

  44. [44]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  45. [45]

    The 2017 DAVIS Challenge on Video Object Segmentation

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017.

  46. [46]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

  47. [47]

    Siminversion: A simple framework for inversion-based text-to-image editing

    Qi Qian, Haiyang Xu, Ming Yan, and Juhua Hu. Siminversion: A simple framework for inversion-based text-to-image editing. arXiv preprint arXiv:2409.10476, 2024.

  48. [48]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  49. [49]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

  50. [50]

    DreamGaussian4D: Generative 4D gaussian splatting

    Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142.

  51. [51]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

  52. [52]

    Semantic image inversion and editing using rectified stochastic differential equations

    Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792, 2024.

  53. [53]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. 2022.

  54. [54]

    Mediconfusion: Can you trust your ai radiologist? probing the reliability of multimodal medical foundation models

    Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi, and Mahdi Soltanolkotabi. Mediconfusion: Can you trust your ai radiologist? probing the reliability of multimodal medical foundation models. arXiv preprint arXiv:2409.15477, 2024.

  55. [55]

    Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning. In Pro- ceedings of ACL, 2018. 6, 5

  56. [56]

    Seededit: Align image re-generation to image editing

    Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024. 6, 4

  57. [57]

    Stochastic sampling from deterministic flow models

    Saurabh Singh and Ian Fischer. Stochastic sampling from deterministic flow models. arXiv preprint arXiv:2410.02217,

  58. [58]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 2, 4

  59. [59]

    Leveraging previous steps: A training-free fast solver for flow diffusion

    Kaiyu Song and Hanjiang Lai. Leveraging previous steps: A training-free fast solver for flow diffusion. arXiv preprint arXiv:2411.07627, 2024. 2

  60. [60]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020. 2, 4

  61. [61]

    Rectifid: Personalizing rectified flow with anchored classifier guidance

    Zhicheng Sun, Zhenhao Yang, Yang Jin, Haozhe Chi, Kun Xu, Liwei Chen, Hao Jiang, Yang Song, Kun Gai, and Yadong Mu. Rectifid: Personalizing rectified flow with anchored classifier guidance. Advances in Neural Information Processing Systems, 37:96993–97026, 2024. 3

  62. [62]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653,

  63. [63]

    Ledits: Real image editing with ddpm inversion and semantic guidance

    Linoy Tsaban and Apolinário Passos. Ledits: Real image editing with ddpm inversion and semantic guidance. arXiv preprint arXiv:2307.00522, 2023. 3

  64. [64]

    Splicing vit features for semantic appearance transfer

    Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10748–10757, 2022. 7

  65. [65]

    Plug-and-play diffusion features for text-driven image-to-image translation

    Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 1921–1930. IEEE, 2023. 2, 6, 7, 4

  66. [66]

    Edict: Exact diffusion inversion via coupled transformations

    Bram Wallace, Akash Gokul, and Nikhil Naik. Edict: Exact diffusion inversion via coupled transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22532–22541, 2023. 2

  67. [67]

    Belm: Bidirectional explicit linear multi-step sampler for exact inversion in diffusion models

    Fangyikang Wang, Hubery Yin, Yue-Jiang Dong, Huminhao Zhu, Hanbin Zhao, Hui Qian, Chen Li, et al. Belm: Bidirectional explicit linear multi-step sampler for exact inversion in diffusion models. Advances in Neural Information Processing Systems, 37:46118–46159, 2025. 2

  68. [68]

    Rectified diffusion: Straightness is not your need in rectified flow

    Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow. arXiv preprint arXiv:2410.07303,

  69. [69]

    Taming rectified flow for inversion and editing

    Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024. 2, 6, 7, 4

  70. [70]

    Moa: Mixture-of-attention for subject-context disentanglement in personalized image generation

    Kuan-Chieh Wang, Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, and Kfir Aberman. Moa: Mixture-of-attention for subject-context disentanglement in personalized image generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–12, 2024. 1

  71. [71]

    Mdp: A generalized framework for text-guided image editing by manipulating the diffusion path

    Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. Mdp: A generalized framework for text-guided image editing by manipulating the diffusion path. arXiv preprint arXiv:2303.16765, 2023. 3

  72. [72]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 6

  73. [73]

    Wan: Open and Advanced Large-Scale Video Generative Models

    WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang...

  74. [74]

    Omniedit: Building image editing generalist models through specialist supervision

    Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199, 2024. 6, 4

  75. [75]

    A latent space of stochastic diffusion models for zero-shot image editing and guidance

    Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7378–7387, 2023. 3

  76. [76]

    Uncovering the disentanglement capability in text-to-image diffusion models

    Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, and Shiyu Chang. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1900–1910, 2023. 3

  77. [77]

    Turboedit: Instant text-based image editing

    Zongze Wu, Nicholas Kolkin, Jonathan Brandt, Richard Zhang, and Eli Shechtman. Turboedit: Instant text-based image editing. In European Conference on Computer Vision, pages 365–381. Springer, 2024. 6, 4

  78. [78]

    Fastcomposer: Tuning-free multi-subject image generation with localized attention

    Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention. International Journal of Computer Vision, pages 1–20, 2024. 3

  79. [79]

    Inversion-free image editing with natural language

    Sihan Xu, Yidong Huang, Jiayi Pan, Ziqiao Ma, and Joyce Chai. Inversion-free image editing with natural language

  80. [80]

    Headrouter: A training-free image editing framework for mm-dits by adaptively routing attention heads

    Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, and Tong-Yee Lee. Headrouter: A training-free image editing framework for mm-dits by adaptively routing attention heads. arXiv preprint arXiv:2411.15034, 2024. 2
