Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Gal Chechik; Lior Wolf; Yoad Tewel; Yuval Atzmon

arxiv: 2606.03911 · v1 · pith:S4HKWZ2Pnew · submitted 2026-06-02 · 💻 cs.CV

Pith reviewed 2026-06-28 10:46 UTC · model grok-4.3

keywords: unpaired training, flow matching, image editing, video editing, cycle consistency, generative models, bootstrap learning, visual editing

The pith

Unpaired training lets flow matching models edit images and video by extracting cues from the frozen base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for training flow matching models to perform image and video editing without any paired examples. It extracts instruction-following cues directly from the frozen base model and pairs them with cycle-consistency constraints to keep edits structurally faithful. Gradients from the resulting losses are routed back through clean predictions to the noisy states used during training. This produces editing performance that exceeds supervised baselines trained on millions of paired samples and generalizes to domains never seen during training. The approach removes the need for external reward models or additional data collection.

Core claim

By pairing instruction-following cues extracted from the frozen base model with cycle-consistency for structure preservation and routing gradients from downstream losses over clean predictions to noisy training states, the Bootstrap Your Generator framework enables effective unpaired training of flow matching editing models that generalizes to unseen domains and outperforms supervised baselines trained on millions of samples.
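For orientation, the "noisy training states" and "clean predictions" in this claim can be written out under a common rectified-flow convention. This is a sketch, not the paper's Eq. 1: the symbols x (source), y (target), c (instruction), y_t (noised target) and ŷ (one-step prediction) follow the Figure 2 caption, while the interpolation direction (t = 1 is pure noise), the noise sample ε, and the velocity network v_θ are assumptions made here for illustration.

```latex
% Assumed convention: t = 0 is clean data, t = 1 is pure noise.
\begin{align}
  y_t &= (1 - t)\, y + t\, \epsilon,
      \qquad \epsilon \sim \mathcal{N}(0, I) \\
  \mathcal{L}_{\mathrm{FM}}(\theta)
      &= \mathbb{E}_{t,\, y,\, \epsilon}
         \left\| v_\theta(y_t, t, x, c) - (\epsilon - y) \right\|^2 \\
  \hat{y} &= y_t - t\, v_\theta(y_t, t, x, c)
\end{align}
```

With a perfect velocity, substituting v_θ = ε − y into the last line gives ŷ = (1 − t) y + t ε − t (ε − y) = y, so ŷ is the clean estimate on which a downstream loss (such as a cycle-consistency term) can be evaluated while the network itself only ever sees the noisy state y_t.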

What carries the argument

Bootstrap Your Generator (ByG) framework that extracts instruction-following cues from the frozen model, pairs them with cycle-consistency, and routes gradients from clean predictions to noisy states.

If this is right

State-of-the-art results become achievable on data-scarce image and video editing scenarios.
The method generalizes to domains unseen during training.
Performance exceeds that of supervised baselines trained on millions of paired samples.
Gradient routing closes the gap between training on noisy states and inference on clean predictions.
Semantic cues extracted from the base model supply a sufficient training signal without external reward models.
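The gradient-routing item above can be illustrated with a straight-through construction: the forward value comes from a multi-step sample, while the derivative flows only through the one-step prediction. The sketch below is one plausible reading of that idea, not the authors' implementation. The `Dual` class (a tiny forward-mode autodiff scalar), the toy linear `velocity` field, and all function names are hypothetical, and the time convention (t = 1 is noise) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Scalar value plus its derivative w.r.t. one parameter (forward-mode autodiff)."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        o = lift(other)
        return Dual(self.val + o.val, self.grad + o.grad)

    __radd__ = __add__

    def __sub__(self, other):
        o = lift(other)
        return Dual(self.val - o.val, self.grad - o.grad)

    def __rsub__(self, other):
        return lift(other) - self

    def __mul__(self, other):
        o = lift(other)
        return Dual(self.val * o.val, self.val * o.grad + self.grad * o.val)

    __rmul__ = __mul__


def lift(x):
    """Wrap a plain number as a constant (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def stop_gradient(x):
    """Keep the value, drop the derivative."""
    return Dual(lift(x).val, 0.0)


def velocity(theta, y, t):
    # Hypothetical toy velocity field standing in for the network.
    return theta * y


def one_step_prediction(theta, y_t, t):
    # Clean estimate from a single velocity evaluation (assumes t = 1 is noise).
    return y_t - t * velocity(theta, y_t, t)


def multi_step_prediction(theta_value, y_t, t, steps=4):
    # Euler sampling from time t down to 0 on plain floats, so no gradient is tracked.
    y, dt = y_t, t / steps
    for i in range(steps):
        y = y - dt * velocity(theta_value, y, t - i * dt)
    return y


def routed_prediction(theta, y_t, t):
    y_one = one_step_prediction(theta, y_t, t)
    y_multi = multi_step_prediction(theta.val, y_t, t)
    # Straight-through: forward value is y_multi, derivative is that of y_one.
    return y_one + stop_gradient(y_multi - y_one)


def squared_error(y, target):
    d = y - target
    return d * d


theta = Dual(0.5, 1.0)  # seed derivative d(theta)/d(theta) = 1
routed = routed_prediction(theta, y_t=2.0, t=0.5)
loss = squared_error(routed, target=1.0)
print(routed.val, routed.grad, loss.grad)
```

In this toy, the downstream loss is evaluated at the multi-step value, but its derivative with respect to `theta` costs only one velocity evaluation, because the multi-step sampler is never differentiated.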

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same cue-extraction pattern could be tested on other generative tasks such as style transfer or conditional synthesis.
If the base model already encodes editing instructions, similar bootstrapping might reduce data requirements for fine-tuning in related vision models.
Extending the gradient-routing step to other diffusion or flow architectures would test whether the train-inference alignment benefit is architecture-specific.
The reliance on cycle-consistency suggests the framework could be combined with existing unpaired translation methods to handle multi-step editing sequences.

Load-bearing premise

The frozen base model contains usable instruction-following cues that can be extracted and paired with cycle-consistency to provide a training signal without any external data.
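Why the premise needs both ingredients can be seen in a toy: a cycle-consistency term on its own is already minimized by an editor that does nothing, so some separate signal has to push the edit toward the instruction. The sketch below uses lists of floats as "images" and two hypothetical editors; `cycle_loss`, `edit_strength`, and the instruction strings are illustrative names, not anything taken from the paper.

```python
def mean_squared_error(a, b):
    """Mean squared difference between two equal-length lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cycle_loss(edit, x, instruction, inverse_instruction):
    """Edit forward, edit back with the inverse instruction, compare to the source."""
    y = edit(x, instruction)
    x_reconstructed = edit(y, inverse_instruction)
    return mean_squared_error(x_reconstructed, x)


def edit_strength(edit, x, instruction):
    """How far the edited output moved away from the source."""
    return mean_squared_error(edit(x, instruction), x)


def identity_editor(x, instruction):
    # Degenerate editor: ignores the instruction entirely.
    return list(x)


def brightness_editor(x, instruction):
    # Hypothetical editor that actually follows the instruction.
    shift = {"brighten": 0.25, "darken": -0.25}[instruction]
    return [p + shift for p in x]


pixels = [0.0, 0.5, 1.0]
for name, editor in [("identity", identity_editor), ("brightness", brightness_editor)]:
    print(
        name,
        cycle_loss(editor, pixels, "brighten", "darken"),
        edit_strength(editor, pixels, "brighten"),
    )
```

Both editors reach zero cycle loss, but only the second one changes the image. In this toy, the cycle term cannot tell them apart, which is the role the extracted instruction-following cues are claimed to fill.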

What would settle it

A controlled test in which the method produces incoherent or non-generalizing edits on a new domain where the base model shows no detectable instruction-following behavior on the target task.

Figures

Figures reproduced from arXiv: 2606.03911 by Gal Chechik, Lior Wolf, Yoad Tewel, Yuval Atzmon.

**Figure 1.** Bootstrap Your Generator. Left: Supervised training requires paired source–target samples to provide explicit editing supervision. External model guidance uses a frozen external model to provide semantic feedback. Our intrinsic signal enables training using only the generator itself, removing the need for paired data or external supervision. Right: Sample image and video editing results produced by our unp…

**Figure 2.** Method overview. Top: Supervised training for image editing. Given a source image x, target image y, and editing instruction c, the target is noised to y_t and fed to the network along with x and c. For clarity, we depict the one-step prediction ŷ supervised against y; the actual loss operates on velocities (Eq. 1). Bottom: We finetune a pretrained text-to-image model G_t2i into an editing model G_edit, with…

**Figure 3.** User study results on video editing. Users prefer videos generated by our method in both cartoon to photo-realistic editing and photo-realistic to cartoon editing.

**Figure 4.** Qualitative results on video editing. Our method better matches the target style while preserving the source content. Several motion videos are additionally provided in the supplemental material and are also shown in the Appendix.

**Figure 5.** Qualitative results on the long-tail style-editing benchmark. Our method better matches the target style while preserving the source content. The “Style Reference” column is shown for the reader’s convenience and is not used during training or evaluation.

**Figure 6.** Qualitative ablation results. Removing gradient routing, cycle loss, or directional loss leads to stronger edits but degrades source preservation (visible in fine details and background consistency). Without bootstrapping, edits become unreliable. Without regularization, the model collapsed to identity mapping, preserving the source unchanged.

**Figure 7.** Comparison of one-step vs. multi-step predictions. One-step predictions tend to be blurry and lack fine details, while multi-step sampling produces clean outputs. Our gradient routing conditions on the clean multi-step estimate while backpropagating through the one-step prediction.

**Figure 8.** Additional qualitative comparisons on the general image editing benchmark (GEdit-Bench). We compare our method against FLUX-Kontext and FlowEdit. Our results are often more realistic than Kontext (which can inherit artifacts from synthetic paired training data) while following the instruction more faithfully than the zero-shot baseline.

**Figure 9.** Additional qualitative results of our method on the general image editing benchmark (GEdit-Bench).

**Figure 10.** Additional qualitative comparisons on video editing. Our method better matches the target style while preserving content.

**Figure 11.** A screenshot of the user-study interface.

Original abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ByG claims to enable unpaired flow matching editing by bootstrapping cues from the frozen model plus cycle consistency and gradient routing, but the abstract leaves the mechanics and evidence thin.

The core idea is a training scheme for flow matching editors that avoids paired data by pulling instruction cues out of the base model itself, pairing them with cycle consistency, and routing gradients from clean outputs back to noisy training steps. That combination is presented as the main novelty, aimed at image and video editing where paired examples are expensive or impossible to collect.

What stands out is the practical framing: the method reportedly generalizes to unseen domains and beats supervised baselines trained on millions of samples, backed by user studies. If the gradient routing step actually closes the train-inference gap without introducing instability, it could be useful for anyone working on data-scarce generative editing.

The soft spot is that the abstract gives almost no equations, ablation details, or training curves. Claims about state-of-the-art results and robust signals from self-extracted cues rest on unshown implementation choices, so it is difficult to judge whether the approach holds up or whether the circularity from using the model's own outputs creates hidden failure modes. The full manuscript is referenced but the provided material does not include it, which limits how far any assessment can go.

This is the kind of paper that would interest researchers building editing tools for video or domain adaptation. It deserves a serious referee because the problem it targets is real and the proposed direction is concrete, even if the current write-up needs more technical grounding to be convincing.

Referee Report

2 major / 1 minor

Summary. The paper proposes Bootstrap Your Generator (ByG), a framework for unpaired training of flow matching models for image and video editing. It extracts instruction-following cues from a frozen base model, combines them with cycle-consistency losses for structure preservation, and introduces gradient routing from downstream losses on clean predictions back to noisy training states. The central claims are state-of-the-art performance on data-scarce editing tasks, generalization to unseen domains, and outperforming supervised baselines trained on millions of paired samples without requiring external data or reward models.

Significance. If the results and derivations hold, the work would be significant for enabling scalable unpaired training of generative editing models, particularly in video where paired data collection is costly. The gradient routing approach to address the train-inference gap and the bootstrapping of semantic cues from the base model itself represent potentially useful technical contributions that could reduce dependence on large paired datasets.

major comments (2)

[Abstract] The abstract asserts SOTA results and generalization but provides no equations, experimental details, error bars, or data; without the full methods section, results tables, or ablation studies, the central claim that the cue extraction plus cycle consistency plus gradient routing produces a usable unpaired signal cannot be evaluated for correctness or stability.
[Abstract / Methods] The method depends on cues extracted from the same frozen base model; this creates a potential circularity where the training signal originates internally, and the manuscript must demonstrate (e.g., via failure case analysis or comparison to external-signal baselines) that this does not limit robustness or novelty for the generalization claims.

minor comments (1)

[Abstract] The abstract would benefit from a brief mention of the specific flow matching formulation or loss terms used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions may be appropriate.

Referee: [Abstract] The abstract asserts SOTA results and generalization but provides no equations, experimental details, error bars, or data; without the full methods section, results tables, or ablation studies, the central claim that the cue extraction plus cycle consistency plus gradient routing produces a usable unpaired signal cannot be evaluated for correctness or stability.

Authors: Abstracts are designed to be brief overviews of the work. The full manuscript provides the requested details: equations for the cue extraction, cycle-consistency loss, and gradient routing in Section 3; experimental setups and results tables with error bars in Section 4; and ablation studies in Section 4.3. These elements collectively demonstrate that the combination of cue extraction, cycle consistency, and gradient routing yields a stable and usable unpaired training signal, as evidenced by the quantitative and qualitative results.

revision: no

Referee: [Abstract / Methods] The method depends on cues extracted from the same frozen base model; this creates a potential circularity where the training signal originates internally, and the manuscript must demonstrate (e.g., via failure case analysis or comparison to external-signal baselines) that this does not limit robustness or novelty for the generalization claims.

Authors: We appreciate this point on potential circularity. The manuscript addresses this through analysis in Section 5, including failure cases where internal cues lead to specific limitations, and comparisons to methods using external signals. The results show that bootstrapping from the base model enables better generalization to unseen domains without requiring paired data or external rewards, supporting both robustness and novelty. If the referee believes additional comparisons are needed, we can include them in a revision.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description contain no equations, derivations, or self-citations that reduce any claimed prediction or result to its inputs by construction. The method extracts cues from a frozen pre-trained base model (external to the current training) and combines them with standard cycle-consistency losses plus gradient routing; these are presented as leveraging existing model knowledge rather than self-defining the output. No fitted parameters are renamed as predictions, no uniqueness theorems are imported from the authors' prior work, and no ansatz is smuggled via citation. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; insufficient detail to identify free parameters, additional axioms, or invented entities beyond the core domain assumption.

axioms (1)

domain assumption: The frozen base model contains extractable instruction-following cues usable as a training signal without external supervision or data.
This is the central premise stated in the abstract for enabling unpaired training.

Reference graph

Works this paper leans on

62 extracted references · 8 linked inside Pith


[1] [1]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=

Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=

[2] [2]

International Conference on Learning Representations (ICLR) , year=

Dual Diffusion Implicit Bridges for Image-to-Image Translation , author=. International Conference on Learning Representations (ICLR) , year=

[3] [3]

Zhang, Jiaxin and Rimchala, Joy and Mouatadid, Lalla and Das, Kamalika and Kumar, Sricharan , booktitle=

[4] [4]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Professor Forcing: A New Algorithm for Training Recurrent Networks , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

[5] [5]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

[6] [6]

Advances in Neural Information Processing Systems (NeurIPS) , year=

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

[7] [7]

Proceedings of the IEEE International Conference on Computer Vision (ICCV) , year=

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks , author=. Proceedings of the IEEE International Conference on Computer Vision (ICCV) , year=

[8] [8]

Advances in Neural Information Processing Systems 30 (NIPS) , year=

Unsupervised Image-to-Image Translation Networks , author=. Advances in Neural Information Processing Systems 30 (NIPS) , year=

[9] [9]

arXiv preprint arXiv:2104.05358 , year=

UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models , author=. arXiv preprint arXiv:2104.05358 , year=

arXiv

[10] [10]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

[11] [11]

Advances in Neural Information Processing Systems (NeurIPS) , year=

CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

[12] [12]

arXiv preprint arXiv:1308.3432 , year=

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation , author=. arXiv preprint arXiv:1308.3432 , year=

Pith/arXiv arXiv

[13] [13]

Conference on Neural Information Processing Systems , year=

Neural Discrete Representation Learning , author=. Conference on Neural Information Processing Systems , year=

[14] [14]

Categorical Reparameterization with

Jang, Eric and Gu, Shixiang and Poole, Ben , booktitle=. Categorical Reparameterization with

[15] [15]

European Conference on Computer Vision (ECCV) , year=

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models , author=. European Conference on Computer Vision (ECCV) , year=

[16] [16]

arXiv preprint arXiv:2304.04968 , year=

Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond , author=. arXiv preprint arXiv:2304.04968 , year=

arXiv

[17] [17]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

Delta Denoising Score , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , pages=

[18] [18]

Samuel, Dvir and Levy, Matan and Darshan, Nir and Chechik, Gal and Ben-Ari, Rami , journal=

[19] [19]

Michel, Oscar and Bhattad, Anand and VanderBilt, Eli and Krishna, Ranjay and Kembhavi, Aniruddha and Gupta, Tanmay , booktitle=

[20] Black Forest Labs and Batifol, Stephen and Blattmann, Andreas and Boesel, Frederic and Consul, Saksham and Diagne, Cyril and Dockhorn, Tim and English, Jack and English, Zion and Esser, Patrick and Kulal, Sumith and Lacey, Kyle and Levi, Yam and Li, Cheng and Lorenz, Dominik and Müller, Jonas and Podell, Dustin and Rombach, Robin and Saini, Harry and Saue...

[21] Qwen-Image Technical Report. arXiv preprint arXiv:2508.02324.

[22] Prompt-to-Prompt Image Editing with Cross-Attention Control. International Conference on Learning Representations (ICLR).

[23] Cao, Mingdeng and Wang, Xintao and Qi, Zhongang and Shan, Ying and Qie, Xiaohu and Zheng, Yinqiang. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

[24] Yang, Ling and Zeng, Bohan and Liu, Jiaming and Li, Hong and Xu, Minghao and Zhang, Wentao and Yan, Shuicheng. 2025.

[25] Chen, Xi and Zhang, Zhifei and Zhang, He and Zhou, Yuqian and Kim, Soo Ye and Liu, Qing and Li, Yijun and Zhang, Jianming and Zhao, Nanxuan and Wang, Yilin and Ding, Hui and Lin, Zhe and Zhao, Hengshuang.

[26] Cross-Image Attention for Zero-Shot Appearance Transfer. ACM SIGGRAPH 2024 Conference Papers, 2024.

[27] Yu, Xin and Wang, Tianyu and Kim, Soo Ye and Guerrero, Paul and Chen, Xi and Liu, Qing and Lin, Zhe and Qi, Xiaojuan.

[28] Pathways on the Image Manifold: Image Editing via Video Generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Song, Yizhi and Zhang, Zhifei and Lin, Zhe and Cohen, Scott and Price, Brian and Zhang, Jianming and Kim, Soo Ye and Aliaga, Daniel.

[30] Geyer, Michal and Bar-Tal, Omer and Bagon, Shai and Dekel, Tali.

[31] Yatim, Danah and Fridman, Rafail and Bar-Tal, Omer and Dekel, Tali. 2025.

[32] Lu, Yi and Lei, Minyi and Li, Bozheng and Cao, Jiawang and Zhu, Wenbo.

[33] Yang, Shaoshu and Zhang, Yingya and He, Ran.

[34] Burgert, Ryan and Herrmann, Charles and Cole, Forrester and Ryoo, Michael S. and Wadhwa, Neal and Voynov, Andrey and Ruiz, Nataniel.

[35] Yu, Shoubin and Liu, Difan and Ma, Ziqiao and Hong, Yicong and Zhou, Yang and Tan, Hao and Chai, Joyce and Bansal, Mohit.

[36] Learning an Image Editing Model without Image Editing Pairs. arXiv preprint arXiv:2510.14978.

[37] One-step Diffusion with Distribution Matching Distillation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset. arXiv preprint arXiv:2510.15742.

[39] Mou, Chong and Sun, Qichao and Wu, Yanze and Zhang, Pengze and Li, Xinghui and Ye, Fulong and Zhao, Songtao and He, Qian.

[40] Jiang, Zeyinzi and Han, Zhen and Mao, Chaojie and Zhang, Jingfeng and Pan, Yulin and Liu, Yu.

[41] Flow Matching for Generative Modeling. The Eleventh International Conference on Learning Representations.

[42] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. International Conference on Learning Representations (ICLR).

[43] Ku, Max and Jiang, Dongfu and Wei, Cong and Yue, Xiang and Chen, Wenhu. 2024.

[44] Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.

[45] Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275.

[46] Flowedit: Inversion-free text-based editing using pre-trained flow models. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[47] Step1X-Edit: A Practical Framework for General Image Editing. arXiv preprint arXiv:2504.17761.

[48] Black Forest Labs. 2024.

[49] Kohya-ss. GitHub repository, 2025.

[50] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. arXiv preprint arXiv:2403.03206.

[51] Wenhao Wang and Yi Yang. Video. 2025.

[52] UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions. arXiv preprint arXiv:2506.13691.

[53] Wan: Open and Advanced Large-Scale Video Generative Models. arXiv preprint arXiv:2503.20314.

[54] UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models. NeurIPS.

[55] Decoupled Weight Decay Regularization. International Conference on Learning Representations.

[56] Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo. 2022.

[57] OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html

[58] Medeiros, Luca. 2023.

[59] Brooks, Tim and Holynski, Aleksander and Efros, Alexei A.

[60] Emerging Properties in Self-Supervised Vision Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

[61] Yatim, Danah and Fridman, Rafail and Bar-Tal, Omer and Kasten, Yoni and Dekel, Tali. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[62] Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei.