Learning an Image Editing Model without Image Editing Pairs

· 2025 · cs.CV · arXiv 2510.14978

2 Pith papers cite this work. Polarity classification is still indexing.



abstract

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
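One way to read the training setup described in the abstract is as a weighted sum of a VLM feedback term and a distribution matching term, both computed on the output of the unrolled few-step generator. The notation below is an assumption made for illustration: the symbols $G_\theta$, $K$, $\lambda$, and the exact form of each term do not appear on this page, so this is a sketch of the idea rather than the paper's stated loss.

```latex
\begin{aligned}
\hat{y} &= G_\theta^{(K)}(x, c, \epsilon)
  && \text{(few-step generator unrolled for $K$ steps; $K$ is assumed)} \\
\mathcal{L}(\theta) &= \mathbb{E}_{x,\,c,\,\epsilon}\!\left[\mathcal{L}_{\mathrm{VLM}}(\hat{y};\, x, c)\right]
  + \lambda\, \mathcal{L}_{\mathrm{DMD}}(\theta)
  && \text{(weight $\lambda$ is assumed)}
\end{aligned}
```

Here $x$ is the input image, $c$ the editing instruction, and $\epsilon$ the sampled noise. In this reading, $\mathcal{L}_{\mathrm{VLM}}$ is small when the VLM judges that $\hat{y}$ follows $c$ and preserves the unchanged content of $x$, and its gradient flows back through the VLM and all $K$ unrolled steps to $\theta$. The $\mathcal{L}_{\mathrm{DMD}}$ term stands for the distribution matching loss that keeps $\hat{y}$ on the image manifold learned by the pretrained model.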

representative citing papers

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · novelty 7.0 · 2 refs

VLMs formulate differentiable rewards from task-specific rules to enable test-time online LoRA optimization of VGMs, delivering 16.7-point gains on symbolic and general video reasoning benchmarks over VLM-as-solver and Best-of-N baselines.

Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

cs.CV · 2026-06-02 · unverdicted · novelty 5.0

ByG enables unpaired training of flow matching editing models by pairing self-extracted instruction-following cues with cycle-consistency and routing gradients from clean predictions to noisy states.

citing papers explorer


VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

cs.CV · 2026-06-01 · unverdicted · none · ref 74 · 2 links · internal anchor
Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

cs.CV · 2026-06-02 · unverdicted · none · ref 36 · internal anchor
