VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

Ahmadou Aidara; Jeremy Moebel; Jingyi Lu; Kai Han; Mengyu Wang; Yan Luo

arxiv: 2605.15661 · v1 · pith:Y34CN5DPnew · submitted 2026-05-15 · 💻 cs.CV · cs.AI

Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

Pith reviewed 2026-05-20 18:28 UTC · model grok-4.3


keywords: classifier-free guidance · velocity adaptive scale · image editing · image generation · training-free method · structural fidelity · diffusion sampling


The pith

Dynamically scaling classifier-free guidance by velocity alignment raises structural fidelity in editing and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard classifier-free guidance holds its scale fixed across the sampling trajectory, yet early steps are noise-heavy with weak signals while later steps fix structure and need firmer direction. The optimal strength also depends on whether the guided velocity aligns with or opposes the model's current flow. VAGS replaces the fixed scale with a training-free multiplier that blends a temporal term and the cosine similarity between task-relevant velocity fields. For editing this similarity tracks compatibility between source and target velocities; for generation it tracks unconditional versus conditional velocities. If the claim holds, models would gain better preservation of input structure during edits and higher overall quality in pure generation without any retraining or added passes.

Core claim

VAGS multiplies the nominal guidance scale by a bounded factor that combines a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing the factor uses alignment between source- and target-guided velocities so that edit strength reflects local compatibility between preservation and change. For generation the analogous factor uses alignment between unconditional and conditional velocities. The method needs no fine-tuning, auxiliary networks, or extra forward passes and reduces to ordinary fixed-scale CFG when the similarity term is ignored.
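As an illustration only, the multiplier described above can be sketched in plain Python. The functional form `1 + kappa * temporal_term * cos`, the clamp limits `lo` and `hi`, and every function name here are assumptions made for this sketch; this summary does not state the paper's exact formula. With `kappa = 0` the factor is 1, so the sketch falls back to fixed-scale CFG.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two flattened velocity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero when a vector is all zeros.
    return dot / max(norm_u * norm_v, eps)


def adaptive_scale(nominal_scale, v_a, v_b, temporal_term,
                   kappa=1.0, lo=0.5, hi=1.5):
    """Hypothetical velocity-adaptive scale: nominal scale times a clamped factor.

    Assumed form (not taken from the paper):
        factor = clamp(1 + kappa * temporal_term * cos(v_a, v_b), lo, hi)
    For editing, v_a / v_b would be source- and target-guided velocities;
    for generation, unconditional and conditional velocities.
    """
    cos = cosine_similarity(v_a, v_b)
    factor = 1.0 + kappa * temporal_term * cos
    factor = min(max(factor, lo), hi)
    return nominal_scale * factor


def guided_velocity(v_uncond, v_cond, scale):
    """Standard CFG combination: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]
```

Because both velocity fields are already computed for ordinary CFG at every step, a factor of this kind only adds a dot product and two norms, which is consistent with the claim that no extra forward passes are needed.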

What carries the argument

The Velocity-Adaptive Guidance Scale, a per-step multiplier on guidance strength computed from cosine similarity of the relevant velocity fields to decide when to strengthen or weaken the directional pull.

If this is right

Editing tasks show higher structural fidelity on PIE-Bench and DIV2K than fixed CFG or prior training-free variants.
Generation tasks achieve higher quality metrics on COCO17, CUB-200, and Flickr30K while preserving the same sampling budget.
Both editing and generation variants remain training-free and recover standard CFG as the special case of constant similarity.
No auxiliary networks or extra model evaluations are required at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same velocity-similarity signal might be tested on video or 3D generation pipelines to check whether the adaptation generalizes beyond 2D images.
High-guidance regimes that currently produce artifacts could be self-regulated by lowering the multiplier when alignment drops, reducing the need for manual scale tuning.
The bounded multiplier might be combined with other adaptive schedulers to handle different noise schedules or model families without redesign.

Load-bearing premise

The cosine similarity between the relevant velocity fields supplies a clean, artifact-free signal for raising or lowering guidance strength at each step.

What would settle it

On PIE-Bench or DIV2K, if VAGS produces lower structural similarity scores or higher perceptual error than the strongest fixed-scale CFG baseline under identical sampling settings, the performance advantage would be refuted.
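The test above amounts to a paired comparison on identical inputs and sampling settings. A minimal sketch of that bookkeeping follows; the function name, the returned keys, and the idea of reporting a win rate are assumptions of this example, and any scores passed in would have to come from an actual evaluation run.

```python
import statistics


def paired_comparison(method_scores, baseline_scores, higher_is_better=True):
    """Compare per-image scores of a method against a baseline on the same images.

    Returns the mean per-image gain (positive means the method is better)
    and the fraction of images on which the method strictly wins.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must cover the same images")
    if not method_scores:
        raise ValueError("at least one image is required")
    # Flip the sign for error-type metrics where lower is better.
    sign = 1.0 if higher_is_better else -1.0
    gains = [sign * (m - b) for m, b in zip(method_scores, baseline_scores)]
    return {
        "mean_gain": statistics.mean(gains),
        "win_rate": sum(g > 0 for g in gains) / len(gains),
    }
```

For a structural-similarity metric one would call it with `higher_is_better=True`; for a perceptual-error metric, with `higher_is_better=False`. A negative `mean_gain` against the strongest fixed-scale baseline would be the refuting outcome described above.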

Figures

Figures reproduced from arXiv: 2605.15661 by Ahmadou Aidara, Jeremy Moebel, Jingyi Lu, Kai Han, Mengyu Wang, Yan Luo.

**Figure 2.** Workflow of Velocity-Adaptive Guidance Scale (VAGS). VAGS replaces fixed CFG …

**Figure 3.** Qualitative editing comparison on PIE-Bench. Columns show, from left to right: Input, …

**Figure 4.** Editing trajectory analysis: per-step cosine similarity and adaptive guidance scale alongside …

**Figure 5.** Editing diagnostics on PIE-Bench. The cosine signal tracks the source/target velocity …

**Figure 6.** Qualitative generation comparison on COCO17 and CUB-200. VAGS-Gen improves …

**Figure 7.** Generation qualitative comparison: additional examples on COCO17 and CUB-200.

**Figure 8.** Generation trajectory analysis: per-step cosine similarity and adaptive guidance scale …

**Figure 9.** Progressive denoising on COCO17: per-step intermediate images for VAGS ( …

**Figure 10.** Generation trajectory analysis: additional per-prompt traces.

**Figure 11.** Generation diagnostics on COCO17. VAGS-Gen uses the unconditional/conditional …

**Figure 12.** Effect of modulation strength κ on generation metrics. Moderate modulation improves fidelity, while overly large values begin to suppress or amplify guidance too aggressively.

**Figure 13.** Effect of κ on PIE-Bench editing metrics.

Original abstract

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
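For orientation, fixed-scale CFG in velocity form is commonly written as in the first line below, and a per-step adaptive scale of the kind the abstract describes would replace the constant w with w·m_t for a bounded factor m_t. The symbols (v_θ, w, m_t, m_min, m_max) are notation chosen for this note, not necessarily the paper's.

```latex
% Fixed-scale CFG: unconditional velocity plus w times the conditional offset.
\[
\hat{v}_t \;=\; v_\theta(x_t, t, \varnothing)
  \;+\; w \,\bigl[\, v_\theta(x_t, t, c) - v_\theta(x_t, t, \varnothing) \,\bigr]
\]
% Adaptive variant: the scale is modulated per step by a bounded factor m_t.
\[
\hat{v}_t \;=\; v_\theta(x_t, t, \varnothing)
  \;+\; w\, m_t \,\bigl[\, v_\theta(x_t, t, c) - v_\theta(x_t, t, \varnothing) \,\bigr],
\qquad m_{\min} \le m_t \le m_{\max}
\]
```

Setting m_t = 1 at every step gives back the first line, which is the sense in which fixed CFG is a special case.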

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VAGS adapts CFG scale via velocity cosine alignment in a training-free way and posts consistent gains on editing and generation benchmarks, but the early-step noise issue is worth watching.


Hey, the core of this paper is a small algebraic change to classifier-free guidance: multiply the fixed scale by a bounded term that mixes a temporal factor with the cosine similarity of the relevant velocity fields at each ODE step. For editing it compares source-guided and target-guided velocities; for generation it uses unconditional versus conditional. Fixed CFG falls out as the constant-similarity case, and nothing extra is trained or run forward. That is the actual novelty relative to prior adaptive-guidance work that used attention or predictors instead.

The experiments back it up with steady lifts in structural and quality metrics on PIE-Bench, DIV2K, COCO17, CUB-200, and Flickr30K, all without post-hoc exclusions or hidden fitting. Public code helps too. The method is cheap to plug in and the gains look reproducible from the tables.

The soft spot is the premise that cosine similarity between those velocities stays a clean, artifact-free signal when the trajectory is still noise-dominated. In the first few steps the velocities can be erratic, so the multiplier might suppress guidance exactly when directional push is most useful or introduce small trajectory wobbles. The abstract does not show explicit checks for monotonic schedules or overshoot relative to fixed CFG, so that part of the argument rests on the empirical tables holding up rather than on a direct stability proof.

This paper is for people who already run flow-based or diffusion samplers for image editing and text-to-image work and want a low-effort knob tweak. A reader focused on sampling improvements or practical guidance would get immediate value from the code and the numbers. It is solid enough on its own terms to deserve a serious referee; the idea is clean, the implementation is transparent, and the results are consistent across datasets even if a deeper look at early-step behavior would tighten the claims.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Velocity-Adaptive Guidance Scale (VAGS), a training-free technique that replaces fixed classifier-free guidance (CFG) with a dynamic scale for flow-based image editing and generation. The nominal CFG scale is multiplied at each ODE step by a bounded factor that combines a temporal signal-level term with the cosine similarity between task-relevant velocity fields (source- vs. target-guided velocities for inversion-free editing; unconditional vs. conditional velocities for generation). Fixed CFG is recovered when the similarity term is constant. Experiments report consistent gains in structural fidelity and perceptual quality on PIE-Bench and DIV2K (editing) and on COCO17, CUB-200, and Flickr30K (generation) relative to fixed CFG and recent training-free baselines.

Significance. If the reported metric improvements hold under the velocity-alignment premise, the work supplies a lightweight, zero-extra-forward-pass refinement to a ubiquitous control mechanism in diffusion/flow models. The method is parameter-light (only the bounded-multiplier bounds), recovers the standard baseline by construction, and is accompanied by public code, which aids reproducibility. It directly targets the mismatch between a constant guidance strength and the varying semantic content of velocity fields across the trajectory.

major comments (1)

[Method (definition of bounded multiplier and velocity-similarity term)] The central claim that VAGS improves fidelity without introducing new failure modes rests on the premise that cosine similarity between the relevant velocity pairs supplies a reliable, artifact-free local compatibility signal. This premise is invoked when the bounded multiplier is defined. In early ODE steps the velocities are noise-dominated; low or fluctuating cosine values could suppress guidance precisely when directional commitment is most needed or introduce trajectory jitter. The manuscript should supply either (a) an analysis showing that the resulting per-step scale schedule remains monotonic or bounded relative to fixed CFG or (b) an ablation that isolates early-step behavior and confirms absence of overshoot/undershoot.
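The analysis requested in (a) could start from a simple diagnostic over a recorded per-step scale schedule. The sketch below is one hypothetical way to do that; the function name, the dictionary keys, and the default ratio limits `lo` and `hi` are assumptions of this example rather than anything specified in the manuscript.

```python
def schedule_diagnostics(scales, nominal_scale, lo=0.5, hi=1.5):
    """Summarise a per-step guidance schedule relative to a fixed nominal scale.

    scales: effective guidance scale recorded at each sampling step.
    lo, hi: assumed limits on the ratio scale / nominal_scale.
    """
    if not scales:
        raise ValueError("schedule must contain at least one step")
    ratios = [s / nominal_scale for s in scales]
    # Step-to-step changes; large values indicate jitter in the schedule.
    diffs = [b - a for a, b in zip(scales, scales[1:])]
    return {
        "bounded": all(lo <= r <= hi for r in ratios),
        "monotonic": all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs),
        "max_jump": max((abs(d) for d in diffs), default=0.0),
        "min_ratio": min(ratios),
        "max_ratio": max(ratios),
    }
```

Running this over many trajectories and reporting how often `bounded` and `monotonic` hold, together with the distribution of `max_jump`, would give a direct view of the early-step behavior the comment asks about.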

minor comments (2)

[Abstract] The abstract states that VAGS 'consistently improves' over 'recent training-free guidance variants' but does not name the specific baselines; listing them explicitly would improve readability.
[Experiments / Results tables] Tables reporting metric gains should include standard deviations across runs or seeds to allow readers to judge the stability of the observed improvements.
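The second minor comment comes down to reporting each table cell as mean ± standard deviation over seeds. A minimal sketch using only the standard library is given below; the helper names are hypothetical, and the inputs are meant to be one metric value per seed from real runs.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) for one metric across seeds."""
    if not values:
        raise ValueError("need at least one run")
    mean = statistics.mean(values)
    # Sample standard deviation needs at least two runs; report 0.0 otherwise.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def format_mean_std(values, digits=3):
    """Format a table cell as 'mean ± std' with a fixed number of decimals."""
    mean, std = summarize_runs(values)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"
```

A gain that is smaller than the reported spread would then be visible at a glance, which is the point of the request.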

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. The comment on the bounded multiplier and early-step behavior of the velocity-similarity term is well-taken; we respond below and will strengthen the manuscript accordingly.


Referee: [Method (definition of bounded multiplier and velocity-similarity term)] The central claim that VAGS improves fidelity without introducing new failure modes rests on the premise that cosine similarity between the relevant velocity pairs supplies a reliable, artifact-free local compatibility signal. This premise is invoked when the bounded multiplier is defined. In early ODE steps the velocities are noise-dominated; low or fluctuating cosine values could suppress guidance precisely when directional commitment is most needed or introduce trajectory jitter. The manuscript should supply either (a) an analysis showing that the resulting per-step scale schedule remains monotonic or bounded relative to fixed CFG or (b) an ablation that isolates early-step behavior and confirms absence of overshoot/undershoot.

Authors: We agree that early ODE steps merit explicit verification because velocities are initially noise-dominated. The VAGS multiplier is deliberately bounded (clamped between fixed lower and upper limits) and combines the cosine-similarity term with an explicit temporal signal-level term that grows with denoising progress; this construction prevents the similarity signal from driving the effective scale below a minimum threshold or introducing large per-step jumps. Fixed CFG is recovered exactly when the similarity term is constant, so any deviation is controlled. To address the request directly, we will add (i) plots of per-step effective scale schedules across representative trajectories and (ii) an early-step ablation that freezes the multiplier after the first 20% of steps and compares fidelity/jitter metrics against the full VAGS schedule. These additions will confirm that the schedule remains bounded relative to fixed CFG and does not produce overshoot or new artifacts.

Revision: yes
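The ablation in (ii) can be read in more than one way. The sketch below takes one interpretation, stated as an assumption: the adaptive multipliers are used for the first 20% of steps and the last adaptive value is then held for the remainder. The function name and the rounding rule for the cutoff are choices of this example.

```python
def freeze_after_fraction(multipliers, fraction=0.2):
    """Keep adaptive multipliers for the first `fraction` of steps, then hold.

    multipliers: per-step adaptive factors from a full adaptive run.
    Returns a schedule of the same length in which every step after the
    cutoff reuses the multiplier observed at the cutoff step.
    """
    n = len(multipliers)
    if n == 0:
        return []
    # At least one adaptive step is always kept.
    cutoff = max(1, min(n, round(fraction * n)))
    held = multipliers[cutoff - 1]
    return list(multipliers[:cutoff]) + [held] * (n - cutoff)
```

Comparing fidelity and jitter metrics between the frozen schedule and the full adaptive schedule would isolate how much of any gain comes from adaptation after the early steps.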

Circularity Check

0 steps flagged

No significant circularity detected in VAGS derivation


The paper defines VAGS by constructing a bounded multiplier that combines a temporal signal-level term with the cosine similarity of velocity fields (source/target for editing or unconditional/conditional for generation) and applies it to the nominal CFG scale. This functional form is computed directly from the diffusion model's intermediate velocity outputs at each ODE step and does not reduce by construction to any fitted parameter, renamed empirical pattern, or self-citation chain. Fixed CFG is recovered explicitly as the constant-similarity special case, but the variable adaptation itself is presented as an independent, training-free modification without auxiliary networks or extra passes. The central performance claims rest on empirical results across PIE-Bench, DIV2K, COCO17, CUB-200, and Flickr30K rather than definitional equivalence to the inputs. No load-bearing step in the described method equates to its own premises via self-definition or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the standard velocity-field formulation of flow-based samplers and the algebraic definition of cosine similarity; no new physical entities or unverified lemmas are introduced. One bounded multiplier is defined but its exact functional form is not a fitted constant in the reported experiments.

free parameters (1)

bounded multiplier bounds
The paper defines a bounded factor that multiplies the nominal scale; the precise upper and lower limits are chosen once and held fixed across experiments.

axioms (1)

Domain assumption: cosine similarity between velocity fields is a meaningful local compatibility measure.
Invoked when the adaptation factor is constructed from the dot-product term in the method description.
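For reference, the cosine similarity this assumption relies on is the normalised dot product of the two velocity fields, treated as flattened vectors. The symbols u and v are notation chosen for this note.

```latex
\[
\cos(u, v) \;=\; \frac{\langle u, v \rangle}{\lVert u \rVert_2 \, \lVert v \rVert_2},
\qquad -1 \le \cos(u, v) \le 1
\]
```

The range follows from the Cauchy-Schwarz inequality, so any factor built from this term and a bounded temporal term is itself bounded before clamping. The expression is undefined when either vector is zero, which an implementation would need to guard against.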



Paper excerpts

... and positive in the clean regime (σ_i > 1/2), this signed factor passes through zero at σ_i = 1/2, where λ_i recovers the nominal scale exactly. Its sign reflects a qualitative shift in the sampler's task across the trajectory. Early steps perform basin selection: from a near-isotropic latent the sampler must commit to a region of the data manifold, and the u...

... is used to generate captions from the generated images, enabling evaluation with the caption-based metrics. VAGS-Gen yields consistent FID and IS gains, with the largest improvement on fine-grained generation. Generation κ ablation. The optimal κ on each dataset is unchanged under scaling (κ=1.0 on COCO17 and CUB-200, κ=0.9 on Flickr30K), and the same c...
