pith. machine review for the scientific record.

arxiv: 2504.20690 · v3 · submitted 2025-04-29 · 💻 cs.CV

Recognition: 1 theorem link

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 16:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords instructional image editing · diffusion transformers · in-context learning · parameter-efficient fine-tuning · image editing · generative models · vision language models

The pith

Large Diffusion Transformers perform precise instructional image editing via in-context generation without major retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large-scale Diffusion Transformers already contain the comprehension and generation capabilities needed for high-quality instruction-based image editing. It demonstrates this through an in-context editing approach that requires no architectural changes, only minimal parameter-efficient fine-tuning, and an early filtering step at inference time using vision-language models. This combination delivers state-of-the-art results while using just 0.1 percent of the training data and 1 percent of the trainable parameters compared with earlier fine-tuning methods. The work therefore resolves the longstanding tradeoff between editing precision and computational cost in a scalable way.

Core claim

ICEdit shows that an in-context editing paradigm, applied directly to large Diffusion Transformers, enables precise modifications from natural language instructions by embedding the source image and editing prompt together in the generation process. Combined with lightweight fine-tuning on a tiny dataset and an inference-time selection of promising noise samples via VLMs, the method reaches superior editing performance without altering the underlying DiT architecture.
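
The inference-time selection step can be pictured with a minimal sketch, assuming hypothetical callables denoise_step, decode_preview, and vlm_score (none of these names come from the paper): several initial noises are partially denoised, a VLM scores the cheap previews against the instruction, and only the winner is denoised to completion.

    import torch

    def early_filter_sample(denoise_step, decode_preview, vlm_score,
                            instruction, src_image,
                            n_seeds=8, filter_steps=4, total_steps=50):
        # Draw candidate initial noises (the latent shape is illustrative).
        candidates = [torch.randn(1, 16, 64, 64) for _ in range(n_seeds)]

        # Cheap partial denoising for every candidate.
        previews = []
        for x in candidates:
            for t in range(filter_steps):
                x = denoise_step(x, t, instruction, src_image)
            previews.append(x)

        # The VLM ranks previews by how well they match the instruction.
        scores = [vlm_score(decode_preview(x), instruction, src_image)
                  for x in previews]
        best = previews[scores.index(max(scores))]

        # Only the selected sample is denoised to completion.
        for t in range(filter_steps, total_steps):
            best = denoise_step(best, t, instruction, src_image)
        return best

The saving comes from the loop structure: n_seeds × filter_steps cheap steps plus one full trajectory, instead of n_seeds full trajectories.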

What carries the argument

The in-context editing paradigm that feeds the original image together with the editing instruction into the DiT's generation process as a single prompt.
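
One way to picture this framing, as a hedged sketch rather than the paper's implementation: the edit is posed as a two-panel "diptych" generation in which the left panel is fixed to the source image and only the right panel is denoised. The generate callable, the masking interface, and the latent shapes below are assumptions.

    import torch

    def make_in_context_prompt(instruction: str) -> str:
        # The instruction is embedded in a single text prompt describing
        # both panels, so comprehension and generation share one context.
        return ("A diptych with two side-by-side images of the same scene. "
                "The right image is the same as the left, but " + instruction)

    def in_context_edit(generate, src_latent: torch.Tensor, instruction: str):
        # src_latent: (C, H, W) latent encoding of the source image.
        c, h, w = src_latent.shape
        canvas = torch.zeros(c, h, 2 * w)
        canvas[:, :, :w] = src_latent   # left panel: source, kept fixed
        mask = torch.zeros(1, h, 2 * w)
        mask[:, :, w:] = 1.0            # right panel: to be generated
        out = generate(prompt=make_in_context_prompt(instruction),
                       init_latent=canvas, mask=mask)
        return out[:, :, w:]            # the edited image is the right panel

No architectural change is required because the DiT only ever sees a wider canvas and a longer prompt.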

If this is right

  • Instructional editing becomes feasible for users who lack access to massive compute or datasets.
  • Quality improves further by scaling the base DiT size rather than the fine-tuning data.
  • Inference cost drops because only high-quality noise samples are processed to completion.
  • The same in-context strategy can be applied to other generation tasks inside the same model family.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Existing large generative models outside the DiT family could adopt similar in-context framing for editing tasks.
  • Extending the approach to video or 3D data would test whether the same efficiency gains hold in higher-dimensional domains.
  • Real-time editing interfaces become practical if the early filter step is accelerated with smaller VLMs.

Load-bearing premise

Large Diffusion Transformers already understand natural language editing instructions well enough to apply them accurately when the task is presented in an in-context format.

What would settle it

A controlled test on a held-out DiT model showing that the in-context method fails to outperform strong training-free baselines on diverse editing instructions would falsify the central claim.

read the original abstract

Instruction-based image editing enables precise modifications via natural language prompts, but existing methods face a precision-efficiency tradeoff: fine-tuning demands massive datasets (>10M) and computational resources, while training-free approaches suffer from weak instruction comprehension. We address this by proposing ICEdit, which leverages the inherent comprehension and generation abilities of large-scale Diffusion Transformers (DiTs) through three key innovations: (1) An in-context editing paradigm without architectural modifications; (2) Minimal parameter-efficient fine-tuning for quality improvement; (3) Early Filter Inference-Time Scaling, which uses VLMs to select high-quality noise samples for efficiency. Experiments show that ICEdit achieves state-of-the-art editing performance with only 0.1% of the training data and 1% trainable parameters compared to previous methods. Our approach establishes a new paradigm for balancing precision and efficiency in instructional image editing. Codes and demos can be found in https://river-zhang.github.io/ICEdit-gh-pages/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes ICEdit, an instructional image editing approach that applies an in-context generation paradigm to large-scale Diffusion Transformers without any architectural changes, augments it with minimal parameter-efficient fine-tuning, and introduces Early Filter Inference-Time Scaling via VLMs to select high-quality noise samples. It claims state-of-the-art editing performance while using only 0.1% of the training data and 1% trainable parameters relative to prior methods.

Significance. If the reported gains can be isolated to the in-context paradigm and minimal PEFT rather than differences in the underlying DiT backbone, the work would demonstrate a practical route to high-precision editing with dramatically reduced data and compute, shifting the precision-efficiency tradeoff in the field.

major comments (1)
  1. [Section 4] Section 4 and the experimental tables: the SOTA and efficiency claims (0.1% data, 1% trainable parameters) rest on comparisons against prior editing methods, yet the text does not explicitly confirm that every baseline was re-implemented and evaluated on the identical pretrained DiT weights and model scale used for ICEdit. Without this verification, the performance delta cannot be attributed solely to the proposed paradigm.
minor comments (1)
  1. [Abstract] Abstract: the claim of state-of-the-art performance is stated without any quantitative metrics, specific baselines, or key numbers; inserting one or two headline results (e.g., from the main table) would make the abstract self-contained.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the major comment regarding experimental comparisons below and will revise the manuscript to improve clarity.

read point-by-point responses
  1. Referee: Section 4 and the experimental tables: the SOTA and efficiency claims (0.1% data, 1% trainable parameters) rest on comparisons against prior editing methods, yet the text does not explicitly confirm that every baseline was re-implemented and evaluated on the identical pretrained DiT weights and model scale used for ICEdit. Without this verification, the performance delta cannot be attributed solely to the proposed paradigm.

    Authors: We agree that explicit confirmation is necessary to isolate the contribution of the in-context paradigm. All baselines were re-implemented and evaluated using the exact same pretrained DiT weights and model scale as ICEdit (the large-scale DiT backbone without architectural modifications). This ensures fair comparison under identical backbone conditions. However, we acknowledge that the manuscript text does not state this explicitly. We will revise Section 4 and the experimental setup description to include a clear statement confirming that every baseline uses the identical pretrained DiT weights and scale, allowing the performance gains to be attributed to the proposed in-context generation and minimal PEFT. revision: yes
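
The fixed-backbone protocol the rebuttal promises to document can be stated compactly. In the sketch below, load_dit, the per-method constructors, and edit_score are hypothetical stand-ins, not the paper's code.

    def compare_on_shared_backbone(load_dit, methods, benchmark, edit_score):
        # One set of pretrained DiT weights, loaded once and shared.
        backbone = load_dit("shared-pretrained-weights")
        results = {}
        for name, build in methods.items():
            editor = build(backbone)  # every method wraps identical weights
            scores = [edit_score(editor(image, instr), image, instr)
                      for image, instr in benchmark]
            results[name] = sum(scores) / len(scores)
        return results

Under this protocol, any score delta between ICEdit and a baseline is attributable to the editing strategy rather than to backbone scale.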

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external comparisons without self-referential reductions

full rationale

The paper describes an in-context editing paradigm applied to existing large-scale Diffusion Transformers, using minimal PEFT and an inference-time filter, with no equations, derivations, or fitted parameters presented as predictions. The central claims (SOTA performance at 0.1% data and 1% trainable parameters) are supported by experimental tables comparing against prior methods; these are not reduced by construction to quantities defined within the paper itself. No self-citations serve as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are merely renamed. Baseline fairness concerns affect validity but do not create circularity under the specified criteria.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that large DiTs already possess sufficient in-context comprehension for editing tasks; the only explicit free parameter is the fraction of weights left trainable under PEFT, and no invented entities are introduced in the abstract.

free parameters (1)
  • trainable parameters fraction
    The method uses 1% trainable parameters via parameter-efficient fine-tuning, implying a small set of adjusted weights whose exact selection is not detailed.
axioms (1)
  • domain assumption Large-scale Diffusion Transformers possess inherent comprehension and generation abilities that support in-context instructional editing without architectural changes.
    Invoked to justify the no-modification paradigm and minimal fine-tuning approach.
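
As a back-of-envelope illustration of how a roughly 1% trainable fraction can fall out of LoRA-style PEFT (the dimensions and rank here are assumptions for illustration, not the paper's configuration):

    def lora_fraction(d_model: int, n_layers: int, n_proj: int, rank: int) -> float:
        # Frozen weight count vs. the two low-rank factors per projection.
        base = n_layers * n_proj * d_model * d_model
        lora = n_layers * n_proj * 2 * d_model * rank
        return lora / base

    # e.g. a 3072-dim DiT with 38 blocks and ~6 large projections per block:
    print(f"{lora_fraction(3072, 38, 6, 16):.2%}")  # -> 1.04%

The fraction reduces to 2·rank/d_model, so it is set by the adapter rank relative to model width rather than by layer count.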

pith-pipeline@v0.9.0 · 5475 in / 1184 out tokens · 40545 ms · 2026-05-16T16:04:06.609653+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.

  2. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  3. Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset

    cs.CV 2026-04 unverdicted novelty 7.0

    A new 100k triplet dataset and in-context diffusion framework ICTone enable state-of-the-art tone style transfer by jointly conditioning on content and reference images with scorer-based reward learning.

  4. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  5. A Unified and Controllable Framework for Layered Image Generation with Visual Effects

    cs.CV 2026-01 unverdicted novelty 7.0

    LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.

  6. InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

    cs.CV 2025-12 unverdicted novelty 7.0

    InstructMoLE replaces per-token routing with instruction-guided global routing for mixture-of-low-rank-experts in diffusion transformers and adds an output-space orthogonality loss to improve multi-conditional image g...

  7. Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.

  8. Meta-CoT: Enhancing Granularity and Generalization in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.

  9. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

  10. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    cs.GR 2025-06 unverdicted novelty 6.0

    FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.

  11. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  12. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

  13. FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...

  14. LongCat-Image Technical Report

    cs.CV 2025-12 unverdicted novelty 5.0

    LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.

  15. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  16. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  17. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  18. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  19. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 18 Pith papers · 13 internal anchors

  1. [1]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  2. [2]

    Magicbrush: A manually annotated dataset for instruction-guided image editing

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024

  3. [3]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023

  4. [4]

    Ultraedit: Instruction-based fine-grained image editing at scale

    Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2025

  5. [5]

    Guiding instruction-based image editing via multimodal large language models

    Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023

  6. [6]

    Ace: All-round creator and editor following instructions via diffusion transformer

    Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, and Jingren Zhou. Ace: All-round creator and editor following instructions via diffusion transformer. arXiv preprint arXiv:2410.00086, 2024

  7. [7]

    Ace++: Instruction-based image creation and editing via context-aware content filling

    Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, and Jingren Zhou. Ace++: Instruction-based image creation and editing via context-aware content filling. arXiv preprint arXiv:2501.02487, 2025

  8. [8]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, and Daxin Jiang. Step1x-edit: A practical framework for general image editing. arXiv pre...

  9. [9]

    Smartedit: Exploring complex instruction-based image editing with multimodal large language models

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8362–8371, 2024

  10. [10]

    Insightedit: Towards better instruction following for image editing

    Yingjing Xu, Jie Kong, Jiazhi Wang, Xiao Pan, Bo Lin, and Qiang Liu. Insightedit: Towards better instruction following for image editing. arXiv preprint arXiv:2411.17323, 2024

  11. [11]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. CoRR, abs/2208.01626, 2022

  12. [12]

    SDEdit: Guided image synthesis and editing with stochastic differential equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022

  13. [13]

    Taming rectified flow for inversion and editing

    Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. arXiv preprint arXiv:2411.04746, 2024

  14. [14]

    Stable flow: Vital layers for training-free image editing

    Omri Avrahami, Or Patashnik, Ohad Fried, Egor Nemchinov, Kfir Aberman, Dani Lischinski, and Daniel Cohen-Or. Stable flow: Vital layers for training-free image editing, 2024

  15. [15]

    Flowedit: Inversion-free text-based editing using pre-trained flow models

    Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. Flowedit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024

  16. [16]

    Direct inversion: Boosting diffusion-based editing with 3 lines of code

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023

  17. [17]

    Kv-edit: Training-free image editing for precise background preservation

    Tianrui Zhu, Shiyi Zhang, Jiawei Shao, and Yansong Tang. Kv-edit: Training-free image editing for precise background preservation, 2025

  18. [18]

    Pnp inversion: Boosting diffusion-based editing with 3 lines of code

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. International Conference on Learning Representations (ICLR), 2024

  19. [19]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, October 2023

  20. [20]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  21. [21]

    PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023

  22. [22]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024

  23. [23]

    Flux: Official inference repository for flux.1 models

    Black Forest Labs. Flux: Official inference repository for flux.1 models, 2024. Accessed: 2024-11-12

  24. [24]

    Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator

    Chaehun Shin, Jooyoung Choi, Heeseung Kim, and Sungroh Yoon. Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator. arXiv preprint arXiv:2411.15466, 2024

  25. [25]

    Easycontrol: Adding efficient and flexible control for diffusion transformer

    Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, and Jiaming Liu. Easycontrol: Adding efficient and flexible control for diffusion transformer. arXiv preprint arXiv:2503.07027, 2025

  26. [26]

    In-context lora for diffusion transformers

    Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775, 2024

  27. [27]

    Ominicontrol: Minimal and universal control for diffusion transformer

    Tan Zhenxiong, Liu Songhua, Yang Xingyi, Xue Qiaochu, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098, 2024

  28. [28]

    Less-to-more generalization: Unlocking more controllability by in-context generation

    Shaojin Wu, Mengqi Huang, Wenxu Wu, Yufeng Cheng, Fei Ding, and Qian He. Less-to-more generalization: Unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160, 2025

  29. [29]

    Inference-time scaling for diffusion models beyond scaling denoising steps

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025

  30. [30]

    Golden noise for diffusion models: A learning framework

    Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. arXiv preprint arXiv:2411.09502, 2024

  31. [31]

    Adaptive mixtures of local experts

    Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991

  32. [32]

    Null-text inversion for editing real images using guided diffusion models

    Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023

  33. [33]

    Plug-and-play diffusion features for text-driven image-to-image translation

    Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation, 2022

  34. [34]

    Omniedit: Building image editing generalist models through specialist supervision

    Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. arXiv preprint arXiv:2411.07199, 2024

  35. [35]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. arXiv preprint arXiv:2411.15738, 2024

  36. [36]

    Flux.1-dev-edit-v0

    Sayak Paul. Flux.1-dev-edit-v0, 2025. Accessed: 2025-02-21

  37. [37]

    Group diffusion transformers are unsupervised multitask learners

    Lianghua Huang, Wei Wang, Zhi-Fan Wu, Huanzhang Dou, Yupeng Shi, Yutong Feng, Chen Liang, Yu Liu, and Jingren Zhou. Group diffusion transformers are unsupervised multitask learners. arXiv preprint arXiv:2410.15027, 2024

  38. [38]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021

  39. [39]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

  40. [40]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  41. [41]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  42. [42]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  43. [43]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  44. [44]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024

  45. [45]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  46. [46]

    Clip-fields: Weakly supervised semantic fields for robotic memory

    Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022

  47. [47]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7514–7528. Association for Computation...

  48. [48]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  49. [49]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021

  50. [50]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  51. [51]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. arXiv preprint arXiv:2312.14867, 2023

  52. [52]

    Seededit: Align image re-generation to image editing

    Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024

  53. [53]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

  54. [54]

    Multi-modal attention for speech emotion recognition

    Zexu Pan, Zhaojie Luo, Jichen Yang, and Haizhou Li. Multi-modal attention for speech emotion recognition. arXiv preprint arXiv:2009.04107, 2020

  55. [55]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  56. [56]

    Prodigy: An expeditiously adaptive parameter-free learner

    Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. In Forty-first International Conference on Machine Learning, 2024

  57. [57]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022

  58. [58]

    Instruction-based image manipulation by watching how things move

    Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. Instruction-based image manipulation by watching how things move. arXiv preprint arXiv:2412.12087, 2024

  59. [59]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  60. [60]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025