
arxiv: 2509.22244 · v6 · pith:JUJAIULEnew · submitted 2025-09-26 · 💻 cs.CV

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Pith reviewed 2026-05-21 22:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · image editing · text-guided editing · one-step inversion · attention mechanisms · real-time editing · background preservation

The pith

FlashEdit achieves precise text-guided image editing in under 0.2 seconds by using one-step inversion with cycle consistency and attention controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlashEdit to solve the high latency problem in diffusion-based image editing while keeping edits localized and accurate. It relies on a cycle-consistent one-step inversion to generate suitable starting latents quickly instead of running many denoising steps. A background shield protects unchanged areas through self-attention changes, and a sparsified cross-attention step reduces unwanted semantic spread. If these components work as described, editing becomes practical for interactive use without trading away quality.

Core claim

The method combines Cycle-Consistent One-Step Inversion to align latents on the manifold in a single step, Background Shield to intervene in structural self-attention for non-edited region preservation, and Sparsified Spatial Cross-Attention to limit semantic leakage, delivering edits in under 0.2 seconds with over 150 times speedup versus multi-step DDIM inversion on PIE-Bench.
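The two headline numbers are linked by a simple ratio: speedup is baseline latency divided by method latency. The sketch below only illustrates that arithmetic. The 30-second baseline is a hypothetical value chosen to sit exactly at the two quoted thresholds; this page does not report the measured baseline latency.

```python
def speedup(baseline_seconds: float, fast_seconds: float) -> float:
    """Speedup factor of the fast method relative to the baseline."""
    if baseline_seconds <= 0 or fast_seconds <= 0:
        raise ValueError("latencies must be positive")
    return baseline_seconds / fast_seconds

# Hypothetical latencies, chosen only to match the quoted thresholds.
baseline_s = 30.0   # assumed multi-step DDIM editing time per image
flashedit_s = 0.2   # the "under 0.2 seconds" bound from the claim

print(f"{speedup(baseline_s, flashedit_s):.0f}x")
```

Because the claim pairs an upper bound on latency with a lower bound on speedup, the product of the two thresholds is not itself a bound on the baseline; the measured values are needed for that.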

What carries the argument

Cycle-Consistent One-Step Inversion (COSI) pipeline that enforces manifold alignment via cycle consistency, together with Background Shield for structural self-attention intervention and Sparsified Spatial Cross-Attention for semantic focus.
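As a rough illustration of the cross-attention part, the sketch below restricts one image location's attention to a chosen subset of text tokens, so tokens outside the subset receive exactly zero weight. It is a minimal sketch, not the paper's implementation: the function names, the example scores, and the `keep` set are all hypothetical, and this page does not say how the paper selects the relevant tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsified_cross_attention(scores, keep):
    """Attention weights over text tokens, restricted to the indices in `keep`.

    Scores of tokens outside `keep` are set to -inf before the softmax,
    so those tokens get zero weight (an assumed form of sparsification).
    """
    if not keep:
        raise ValueError("keep must contain at least one token index")
    masked = [s if i in keep else float("-inf") for i, s in enumerate(scores)]
    return softmax(masked)

# Hypothetical query-key scores for one image location.
tokens = ["a", "red", "car", "on", "road"]
scores = [0.1, 2.0, 1.5, 0.2, 1.0]

dense = softmax(scores)                               # standard attention
sparse = sparsified_cross_attention(scores, {1, 2})   # only "red" and "car"

for tok, d, s in zip(tokens, dense, sparse):
    print(f"{tok:>5}: dense={d:.3f} sparse={s:.3f}")
```

In the dense case every token, including "road", pulls some weight; in the sparse case the weight is shared only among the kept tokens, which is one way to picture reduced semantic leakage.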

If this is right

  • Editing latency drops enough to support interactive applications such as live photo retouching.
  • Non-edited image regions retain structural details more reliably through the self-attention shield.
  • Semantic leakage is reduced so that text instructions affect only the targeted object or region.
  • The overall pipeline runs on consumer hardware without specialized accelerators.
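The background-preservation bullet can be pictured with a mask-gated blend: tokens inside the edit region take features from the edited pass, and all other tokens reuse features from the source pass. This is an assumed simplification for illustration only. The page does not describe how BG-Shield actually intervenes in self-attention, and `shield_background` and its inputs are hypothetical names.

```python
def shield_background(source_feats, edited_feats, edit_mask):
    """Per-token blend: edited features inside the mask, source features outside.

    source_feats, edited_feats: per-token feature values (floats here for brevity)
    edit_mask: per-token flags, truthy where the edit is allowed to apply
    """
    if not (len(source_feats) == len(edited_feats) == len(edit_mask)):
        raise ValueError("all inputs must have the same number of tokens")
    return [e if m else s for s, e, m in zip(source_feats, edited_feats, edit_mask)]

# Hypothetical four-token example: only the last two tokens are editable.
source = [0.10, 0.20, 0.30, 0.40]
edited = [0.15, 0.25, 0.90, 0.80]
mask = [0, 0, 1, 1]

print(shield_background(source, edited, mask))
```

Under this simplification the background tokens are bit-identical to the source pass, which is the strongest form of preservation; a real attention-level intervention may only approximate it.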

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same one-step inversion idea could be tested on video sequences to enable frame-by-frame editing without accumulating drift.
  • The three components might be combined with other inversion-free editing methods to further cut compute.
  • If the speed holds on larger models, it could change the cost structure of batch image editing services.

Load-bearing premise

The one-step cycle-consistent inversion produces latents close enough to the data manifold that no further multi-step refinement is needed for good edit quality.
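Why this premise needs checking can be seen in a toy setting. The sketch below uses a hypothetical one-dimensional linear "inversion" and its exact inverse as the "generator"; both are stand-ins, not the paper's networks. Any choice of `scale` gives a perfect image cycle, yet only one choice yields latents with unit variance, as a standard normal prior would require.

```python
import random
import statistics

def invert(x, scale):
    """Toy one-step inversion: image value -> latent (hypothetical linear map)."""
    return scale * x

def generate(z, scale):
    """Toy one-step generator: latent -> image value, the exact inverse of invert()."""
    return z / scale

def cycle_loss(xs, scale):
    """Mean squared error between each x and generate(invert(x))."""
    errors = [(generate(invert(x, scale), scale) - x) ** 2 for x in xs]
    return statistics.fmean(errors)

def latent_std(xs, scale):
    """Population standard deviation of the inverted latents."""
    return statistics.pstdev([invert(x, scale) for x in xs])

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in "images"

for scale in (1.0, 4.0):
    print(f"scale={scale}: cycle_loss={cycle_loss(xs, scale):.3g}, "
          f"latent_std={latent_std(xs, scale):.2f}")
```

Both settings reconstruct the inputs essentially perfectly, but the second produces latents roughly four times too wide for a unit-variance prior. Cycle consistency alone therefore does not pin down where the latents land, which is why the premise has to be established empirically or with an additional constraint.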

What would settle it

Direct comparison on PIE-Bench showing that the one-step COSI version produces visibly lower edit accuracy or background preservation than standard multi-step DDIM inversion under identical prompts and masks.
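One ingredient of such a comparison is a background-only fidelity score. The sketch below computes PSNR restricted to pixels outside the edit mask, so two methods can be scored on the same image and mask. It is an illustrative helper under stated assumptions: flat lists stand in for images, the pixel values are made up, and the exact metric definitions used on PIE-Bench are not given on this page.

```python
import math

def background_psnr(source, edited, edit_mask, max_value=255.0):
    """PSNR in dB over pixels where edit_mask is falsy (the background)."""
    sq_errs = [(s - e) ** 2 for s, e, m in zip(source, edited, edit_mask) if not m]
    if not sq_errs:
        raise ValueError("mask leaves no background pixels to score")
    mse = sum(sq_errs) / len(sq_errs)
    if mse == 0:
        return math.inf  # background is untouched
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical pixel values; the last two pixels are inside the edit region.
source = [10, 20, 30, 40]
mask = [0, 0, 1, 1]
one_step = [12, 20, 200, 90]     # small drift in the background
multi_step = [10, 20, 180, 95]   # background unchanged

print("one-step  :", background_psnr(source, one_step, mask))
print("multi-step:", background_psnr(source, multi_step, mask))
```

Scoring both pipelines with the same function, prompts, and masks keeps the comparison paired, so any gap reflects the inversion method rather than the evaluation setup.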

Figures

Figures reproduced from arXiv: 2509.22244 by Haotong Qin, Junyi Wu, Xiaokang Yang, Yulun Zhang, Zhiteng Li.

Figure 1
Figure 1: FlashEdit produces superior visual results for text-guided image editing, addressing background instability and semantic entanglement with an over 150× speedup against DDIM [36] + P2P [10].
Figure 2
Figure 2: Overview of our One-Step Inversion-and-Editing framework, which introduces a direct image conditioning branch, trained via a two-stage “Anchor-and-Refine” strategy that uses direct supervision for synthetic data (Stage 1) and a teacher-student objective for real images (Stage 2).
Figure 3
Figure 3: Illustration of our Background Shield (BG-Shield) mechanism. The top of the figure illustrates the problem of background inconsistency in standard editing, while the bottom details the pipeline of our method designed to solve it.
Figure 4
Figure 4: Illustration of our Sparsified Spatial Cross-Attention (SSCA) method resolving semantic entanglement. The top row demonstrates how standard attention fails on precise edits, resulting in edit attenuation and attribute leakage. The bottom row details our SSCA mechanism, which prevents this by computing attention only over a subset of relevant text tokens to ensure a clean edit.
Figure 5
Figure 5: Qualitative comparison of editing results. Each row corresponds to a unique editing task, with the source image displayed in the …
Original abstract

Text-guided image editing with diffusion models has achieved remarkable quality but often suffers from prohibitive latency. We introduce FlashEdit, a real-time localized image editing framework for the standard inversion-based editing setting. Its efficiency and precision stem from three key innovations: (1) a Cycle-Consistent One-Step Inversion (COSI) pipeline that encourages manifold-aligned one-step inversion through cycle consistency; (2) a Background Shield (BG-Shield) technique that improves preservation of non-edited regions via structural self-attention intervention; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that promotes precise edits by suppressing semantic leakage. Experiments on PIE-Bench demonstrate a strong preservation-efficiency trade-off, with edits completed in under 0.2 seconds and an over 150× speedup over DDIM-based multi-step editing. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FlashEdit, a real-time framework for localized text-guided image editing using diffusion models in the inversion-based setting. The approach decouples speed, structure, and semantics through three innovations: (1) Cycle-Consistent One-Step Inversion (COSI), which uses cycle consistency to encourage manifold-aligned latents for fast inversion; (2) Background Shield (BG-Shield), which uses structural self-attention intervention to preserve non-edited regions; and (3) Sparsified Spatial Cross-Attention (SSCA), which suppresses semantic leakage for precise localized edits. On PIE-Bench, the method completes edits in under 0.2 seconds, an over 150× speedup over standard DDIM-based multi-step editing, while maintaining a strong preservation-efficiency trade-off.

Significance. If the central claims hold, particularly that COSI enables quality-preserving one-step edits, this would be a significant contribution toward practical real-time diffusion-based image editing. The planned public code release supports reproducibility and is a strength.

major comments (2)
  1. [Methods (COSI pipeline description)] The justification for COSI (described in the methods as the source of both speed and precision) relies on the assumption that cycle consistency produces manifold-aligned latents suitable for accurate text-conditioned localized editing. Cycle consistency enforces reconstruction fidelity but does not automatically guarantee the latent properties needed to avoid quality degradation in editing without multi-step refinement; this assumption is load-bearing for the claim that the over-150× speedup carries no unacknowledged quality cost.
  2. [Experiments and results section] The PIE-Bench experiments report strong preservation-efficiency results but omit error bars, full baseline implementation details, and ablations on free parameters (attention suppression threshold in SSCA, cycle-consistency loss weight). Without these, the robustness of the reported trade-off cannot be fully assessed.
minor comments (1)
  1. Notation for attention mechanisms in BG-Shield and SSCA could be made more explicit to aid readers not deeply familiar with inversion-based editing pipelines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing our response and indicating planned revisions to strengthen the paper.

Point-by-point responses
  1. Referee: The justification for COSI (described in the methods as the source of both speed and precision) relies on the assumption that cycle consistency produces manifold-aligned latents suitable for accurate text-conditioned localized editing. Cycle consistency enforces reconstruction fidelity but does not automatically guarantee the latent properties needed to avoid quality degradation in editing without multi-step refinement; this assumption is load-bearing for the claim that the over-150× speedup carries no unacknowledged quality cost.

    Authors: We thank the referee for this observation. Cycle consistency is used in COSI to encourage latents that remain close to the data manifold after one-step inversion, which our empirical results on PIE-Bench support through strong preservation metrics and visual quality comparable to multi-step baselines. However, we acknowledge that the current manuscript provides primarily empirical justification rather than a deeper theoretical analysis of the latent properties. In the revised version, we will expand the methods section with additional discussion on the role of cycle consistency in promoting manifold alignment, including references to related analyses of diffusion latent spaces, and add an ablation study isolating the effect of the cycle-consistency term on editing quality and speed. revision: partial

  2. Referee: The PIE-Bench experiments report strong preservation-efficiency results but omit error bars, full baseline implementation details, and ablations on free parameters (attention suppression threshold in SSCA, cycle-consistency loss weight). Without these, the robustness of the reported trade-off cannot be fully assessed.

    Authors: We agree that these elements are necessary for a complete evaluation of robustness. In the revised manuscript, we will add error bars (computed over multiple random seeds) to all quantitative results in the experiments section. We will also include detailed implementation specifications for all baselines and introduce new ablation studies on the attention suppression threshold in SSCA and the cycle-consistency loss weight. These additions will appear in the main paper or supplementary material to allow readers to assess the sensitivity of the reported trade-off. revision: yes
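The error bars promised in the second response amount to reporting a mean and a sample standard deviation across seeds. A minimal sketch follows; the per-seed scores are hypothetical placeholders, not results from the paper, and `mean_and_std` is an illustrative helper name.

```python
import statistics

def mean_and_std(per_seed_scores):
    """Mean and sample standard deviation of one metric across random seeds."""
    if len(per_seed_scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.fmean(per_seed_scores), statistics.stdev(per_seed_scores)

# Hypothetical metric values from five seeds; not results from the paper.
scores = [27.9, 28.3, 28.1, 27.8, 28.4]
mean, std = mean_and_std(scores)
print(f"{mean:.2f} ± {std:.2f} (n={len(scores)})")
```

Reporting the seed count alongside the spread lets readers judge whether a gap between two methods is larger than run-to-run variation.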

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper frames FlashEdit as a set of extensions to existing inversion-based diffusion editing pipelines, with COSI, BG-Shield, and SSCA introduced as concrete algorithmic interventions whose effects are then measured on PIE-Bench. The reported 150× speedup and sub-0.2 s latency are empirical outcomes of these interventions rather than quantities derived by construction from fitted hyperparameters or self-referential definitions. Cycle consistency is presented as a training objective that encourages (not defines) manifold alignment; the alignment claim is not used to tautologically justify the speed or precision results. No equation or self-citation chain reduces the central performance claims to the inputs by construction, satisfying the criteria for a non-circular derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that standard diffusion inversion can be reduced to one step while preserving editability through cycle consistency and attention masking; no new physical entities are introduced.

free parameters (2)
  • attention suppression threshold in SSCA
    Likely tuned to control semantic leakage; exact value not stated in abstract.
  • cycle-consistency loss weight
    Hyper-parameter balancing reconstruction and edit fidelity in COSI.
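An ablation over these two free parameters can be laid out as a small grid of configurations. The sketch below enumerates such a grid. The candidate values and the key names `ssca_threshold` and `cycle_weight` are hypothetical, since the page does not state the values used in the paper.

```python
from itertools import product

# Hypothetical candidate values; the paper's actual settings are not stated here.
ssca_thresholds = [0.1, 0.3, 0.5]
cycle_weights = [0.5, 1.0, 2.0]

def ablation_grid(thresholds, weights):
    """Every combination of SSCA threshold and cycle-consistency loss weight."""
    return [
        {"ssca_threshold": t, "cycle_weight": w}
        for t, w in product(thresholds, weights)
    ]

grid = ablation_grid(ssca_thresholds, cycle_weights)
print(len(grid), "configurations; first:", grid[0])
```

Each configuration would then be trained or evaluated separately, with results reported per cell so sensitivity to either parameter is visible.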
axioms (2)
  • domain assumption One-step inversion can be made manifold-aligned via cycle consistency
    Invoked as the foundation of the COSI pipeline.
  • domain assumption Structural self-attention intervention leaves non-edited regions unchanged
    Core premise of BG-Shield.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    GHOST applies geometry-hierarchical online token eviction with hierarchical scoring, privilege protection, and layer-wise budget allocation to halve KV cache size while maintaining reconstruction quality and achieving...

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023. 1

  2. [2]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In ICCV, 2023. 2, 7, 8

  3. [3]

    Dove: Efficient one-step diffusion model for real-world video super-resolution

    Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, and Yulun Zhang. Dove: Efficient one-step diffusion model for real-world video super-resolution. In NeurIPS, 2025. 4, 7

  4. [4]

    Diffedit: Diffusion-based semantic image editing with mask guidance

    Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022. 3

  5. [5]

    Turboedit: Text-based image editing using few-step diffusion models

    Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, and Daniel Cohen-Or. Turboedit: Text-based image editing using few-step diffusion models. In SIGGRAPH Asia, 2024. 2, 7, 8

  6. [6]

    Prompt tuning inversion for text-driven image editing using diffusion models

    Wenkai Dong, Song Xue, Xiaoyue Duan, and Shumin Han. Prompt tuning inversion for text-driven image editing using diffusion models. In ICCV, 2023. 1

  7. [7]

    Alphaedit: Null-space constrained knowledge editing for language models

    Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, and Tat-Seng Chua. Alphaedit: Null-space constrained knowledge editing for language models. arXiv preprint arXiv:2410.02355, 2024. 2

  8. [8]

    Renoise: Real image inversion through iterative noising

    Daniel Garibi, Or Patashnik, Andrey Voynov, Hadar Averbuch-Elor, and Daniel Cohen-Or. Renoise: Real image inversion through iterative noising. In ECCV, 2024. 7, 8

  9. [9]

    Commoncanvas: Open diffusion models trained on creative-commons images

    Aaron Gokaslan, A Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, and Volodymyr Kuleshov. Commoncanvas: Open diffusion models trained on creative-commons images. In CVPR, 2024. 7

  10. [10]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 1, 3, 7, 8

  11. [11]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 2

  12. [12]

    Scope of validity of psnr in image/video quality assessment

    Q. Huynh-Thu and M. Ghanbari. Scope of validity of psnr in image/video quality assessment. Electronics Letters, 2008. 7

  13. [13]

    Direct inversion: Boosting diffusion-based editing with 3 lines of code

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506,

  14. [14]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  15. [15]

    Flowedit: Inversion-free text-based editing using pre-trained flow models

    Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024. 2

  16. [16]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024. 2

  17. [17]

    Large language model inference acceleration: A comprehensive hardware perspective

    Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, et al. Large language model inference acceleration: A comprehensive hardware perspective. arXiv preprint arXiv:2410.04466,

  18. [18]

    Fast and efficient 2-bit llm inference on gpu: 2/4/16-bit in a weight matrix with asynchronous dequantization

    Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang, Jun Liu, Yaoxiu Lian, and Guohao Dai. Fast and efficient 2-bit llm inference on gpu: 2/4/16-bit in a weight matrix with asynchronous dequantization. In ICCAD, 2024. 2

  19. [19]

    Radial attention: O(n log n) sparse attention with energy decay for long video generation

    Xingyang Li*, Muyang Li*, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, and Song Han. Radial attention: O(n log n) sparse attention with energy decay for long video generation. arXiv preprint arXiv:2506.19852, 2025. 2

  20. [20]

    Dvd-quant: Data-free video diffusion transformers quantization

    Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Linghe Kong, Guihai Chen, Yulun Zhang, and Xiaokang Yang. Dvd-quant: Data-free video diffusion transformers quantization. arXiv preprint arXiv:2505.18663, 2025. 3

  21. [21]

    Arb-llm: Alternating refined binarizations for large language models

    Zhiteng Li, Xianglong Yan, Tianao Zhang, Haotong Qin, Dong Xie, Jiang Tian, Zhongchao Shi, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Arb-llm: Alternating refined binarizations for large language models. In ICLR, 2025. 2

  22. [23]

    From reusing to forecasting: Accelerating diffusion models with taylorseers

    Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers. arXiv preprint arXiv:2503.06923, 2025

  23. [24]

    Deepcache: Accelerating diffusion models for free

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. In CVPR, 2024. 3

  24. [25]

    Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models

    Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. In WACV, 2025. 2

  25. [26]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023. 7, 8

  26. [27]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023. 2

  27. [28]

    Swiftedit: Lightning fast text-guided image editing via one-step diffusion

    Trong-Tung Nguyen, Quang Nguyen, Khoi Nguyen, Anh Tran, and Cuong Pham. Swiftedit: Lightning fast text-guided image editing via one-step diffusion. In CVPR, 2025. 2, 4, 7, 8

  28. [29]

    Specdiff: Accelerating diffusion model inference with self-speculation

    Jiayi Pan, Jiaming Xu, Yongkang Zhou, and Guohao Dai. Specdiff: Accelerating diffusion model inference with self-speculation. arXiv preprint arXiv:2509.13848, 2025. 2

  29. [30]

    Zero-shot image-to-image translation

    Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In SIGGRAPH, 2023. 7, 8

  30. [31]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023. 2

  31. [32]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 7

  32. [33]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752, 2021. 2

  33. [34]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2

  34. [36]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 1, 3

  35. [37]

    Moma: Multimodal llm adapter for fast personalized image generation

    Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, and Xiao Yang. Moma: Multimodal llm adapter for fast personalized image generation. In ECCV, 2024. 7

  36. [38]

    Journeydb: A benchmark for generative image understanding

    Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. NeurIPS, 2023. 7

  37. [39]

    Specprune-vla: Accelerating vision-language-action models via action-aware self-speculative pruning

    Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, and Guohao Dai. Specprune-vla: Accelerating vision-language-action models via action-aware self-speculative pruning. arXiv preprint arXiv:2509.05614, 2025. 2

  38. [40]

    OSDFace: One-step diffusion model for face restoration

    Jingkai Wang, Jue Gong, Lin Zhang, Zheng Chen, Xing Liu, Hong Gu, Yutong Liu, Yulun Zhang, and Xiaokang Yang. OSDFace: One-step diffusion model for face restoration. In CVPR, 2025. 7

  39. [41]

    Image editing with diffusion models: A survey

    Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Xiaoming Wei, and Enhua Wu. Image editing with diffusion models: A survey. arXiv preprint arXiv:2504.13226, 2025. 2

  40. [42]

    High-fidelity gan inversion for image attribute editing

    Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity gan inversion for image attribute editing. In CVPR, 2022. 3

  41. [43]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004. 7

  42. [44]

    Quantcache: Adaptive importance-guided quantization with hierarchical latent and layer caching for video generation

    Junyi Wu, Zhiteng Li, Zheng Hui, Yulun Zhang, Linghe Kong, and Xiaokang Yang. Quantcache: Adaptive importance-guided quantization with hierarchical latent and layer caching for video generation. In ICCV, 2025. 3

  43. [45]

    Specee: Accelerating large language model inference with speculative early exiting

    Jiaming Xu, Jiayi Pan, Yongkang Zhou, Siming Chen, Jinhao Li, Yaoxiu Lian, Junyi Wu, and Guohao Dai. Specee: Accelerating large language model inference with speculative early exiting. In ISCA, 2025. 2

  44. [46]

    Headrouter: A training-free image editing framework for mm-dits by adaptively routing attention heads

    Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Xiaoyu Kong, Jintao Li, Oliver Deussen, and Tong-Yee Lee. Headrouter: A training-free image editing framework for mm-dits by adaptively routing attention heads. arXiv preprint arXiv:2411.15034, 2024. 2

  45. [47]

    Recalkv: Low-rank kv cache compression via head reordering and offline calibration

    Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, and Xiaokang Yang. Recalkv: Low-rank kv cache compression via head reordering and offline calibration. arXiv preprint arXiv:2505.24357, 2025. 2

  46. [48]

    Progressive binarization with semi-structured pruning for llms

    Xianglong Yan, Tianao Zhang, Zhiteng Li, and Yulun Zhang. Progressive binarization with semi-structured pruning for llms. arXiv preprint arXiv:2502.01705, 2025. 2

  47. [49]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721,

  48. [50]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024. 4

  49. [51]

    Plug-and-play image restoration with deep denoiser prior

    Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. TPAMI, 2021. 3, 7, 8

  50. [52]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023. 7

  51. [53]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 7

  52. [54]

    In-domain gan inversion for real image editing

    Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. In ECCV, 2020. 3

  53. [55]

    Generative visual manipulation on the natural image manifold

    Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016. 3