pith. machine review for the scientific record.

arxiv: 2604.10954 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing · diffusion models · bounding box guidance · fine-grained editing · background preservation · region-based editing · instruction compliance

The pith

FineEdit injects bounding box guidance at multiple levels in diffusion models to localize edits precisely while keeping backgrounds unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion image editing models often change more than intended because text prompts alone cannot specify exact regions, forcing global regeneration that alters backgrounds. FineEdit addresses this by supplying bounding boxes as explicit spatial cues and feeding them into the model at several stages so the diffusion process targets only the chosen area. The authors built a 1.2-million-pair dataset with precise box annotations and a 1,000-image benchmark spanning ten subjects to train and test the approach. On their benchmark the method beats leading open models at following instructions and preserving untouched regions, with similar gains on other public tests.

Core claim

Injecting bounding box conditions at multiple levels during diffusion enables the model to localize the editing target accurately, thereby preserving background consistency without global regeneration.

What carries the argument

Multi-level bounding box injection, which supplies spatial box coordinates to the diffusion network at several processing stages to constrain changes to the indicated region.
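As a rough picture of what multi-level injection can mean in practice, here is a minimal PyTorch sketch, assuming the box is rasterized to a binary mask, resized to each feature resolution, and fused into the denoiser's intermediate features through a small learned projection per level. The names (`BoxInjector`, `box_to_mask`, the level dimensions) are illustrative and not taken from the paper.

```python
# Illustrative sketch only: one plausible form of multi-level bounding-box
# injection for a diffusion denoiser. Assumptions (not from the paper): the box
# is rasterized to a binary mask, resized to each feature resolution, and added
# to the features through a small learned 1x1 projection per level.
import torch
import torch.nn as nn
import torch.nn.functional as F

def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box into a [1, H, W] binary mask."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(1, height, width)
    mask[:, int(y0 * height):int(y1 * height), int(x0 * width):int(x1 * width)] = 1.0
    return mask

class BoxInjector(nn.Module):
    """One learned projection per feature level; adds a box-conditioned bias."""
    def __init__(self, level_dims):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(1, c, kernel_size=1) for c in level_dims])

    def forward(self, features, mask):
        # features: list of [B, C_i, H_i, W_i] tensors from the denoising backbone
        # mask: [B, 1, H, W] rasterized bounding box
        out = []
        for feat, proj in zip(features, self.proj):
            m = F.interpolate(mask, size=feat.shape[-2:], mode="nearest")
            out.append(feat + proj(m))  # inject the spatial condition at this level
        return out

# Toy usage with three feature levels of a hypothetical backbone.
mask = box_to_mask((0.25, 0.30, 0.60, 0.80), 64, 64).unsqueeze(0)      # [1, 1, 64, 64]
features = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
injected = BoxInjector([64, 128, 256])(features, mask)
print([f.shape for f in injected])
```

Adding a projected mask keeps the backbone's channel layout unchanged, which is one common way to retrofit a spatial condition; the paper may instead concatenate the mask with the input or route it through attention layers.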

If this is right

  • Users gain a direct way to mark exact edit locations instead of describing them in ambiguous text.
  • Background regions remain untouched because the model no longer regenerates the entire image.
  • Training on 1.2 million annotated pairs produces stronger generalization across different editing scenarios.
  • The same multi-level injection pattern can support related region-based tasks such as object removal or replacement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Drawing tools in consumer editors could adopt box input as a simpler alternative to prompt refinement.
  • The dataset construction pipeline might be reused to create training data for mask- or point-guided editing.
  • Extending the injection to video frames could enable spatially controlled video editing without full re-synthesis.

Load-bearing premise

That bounding box signals can be added at multiple diffusion stages without creating artifacts or requiring per-image tuning, and that the 1.2-million-pair dataset plus the ten-subject benchmark reflect typical user editing needs.

What would settle it

A controlled test on diverse editing tasks outside FineEdit-Bench, checking whether FineEdit produces more background changes or lower instruction compliance than current open-source baselines.

Figures

Figures reproduced from arXiv: 2604.10954 by Haohang Xu, Lin Liu, Qi Tian, Rong Cong, Xiaopeng Zhang, Zhibo Zhang.

Figure 1. Comparison between the SOTA closed-source model Nano-Banana Pro and our proposed FineEdit (left: input; …).
Figure 2. Overview of the FineEdit framework, which includes two training stages: (a) the pre-training stage establishes multi-level spatial priors; …
Figure 3. Unified visual instructions for diverse editing settings. Three representative configurations are showcased, including localized style transfer; …
Figure 4. The U-GAF pipeline, which consists of four synergistic stages for high-quality data synthesis: (1) Data Curation, (2) Data Annotation, (3) Edit Generation, and (4) Data Refinement.
Figure 5. Information on the proposed FineEdit dataset: (a) train …
Figure 6. Comparison on the FineEdit-1k evaluation bench.
Figure 7. Extended qualitative results on FineEdit-1k.
Figure 8. Qualitative results across diverse editing tasks.
Figure 9. Comparison of human evaluation win rates between FineEdit and competing baselines.
read the original abstract

Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FineEdit, a diffusion-based image editing approach that injects bounding box guidance at multiple levels to improve target localization and background consistency over prompt-only methods. It contributes the FineEdit-1.2M dataset (1.2 million editing pairs with precise bbox annotations) and FineEdit-Bench (1,000 images across 10 subjects) for evaluating region-based edits. The central claim is that FineEdit significantly outperforms open-source baselines such as Qwen-Image-Edit and LongCat-Image-Edit on instruction compliance and background preservation on FineEdit-Bench, with further gains shown on GEdit and ImgEdit Bench.

Significance. If the performance claims are supported under controlled conditions, the work advances controllable diffusion editing by showing the utility of explicit spatial cues for fine-grained tasks. The large-scale dataset and specialized benchmark constitute clear community resources that could standardize evaluation of localization and preservation in editing models.

major comments (1)
  1. Experiments section (FineEdit-Bench and open-benchmark evaluations): the reported outperformance over Qwen-Image-Edit and LongCat-Image-Edit does not specify whether the baselines received equivalent bounding-box conditioning (via adapted prompts, masks, or other spatial input derived from the same boxes). Without this, gains in instruction compliance and background preservation cannot be isolated to the proposed multi-level injection rather than the mere presence of an additional spatial modality.
minor comments (2)
  1. Abstract: the claim of 'significantly outperforms' would be strengthened by naming the concrete metrics (e.g., CLIP similarity, LPIPS, or human preference scores) used for instruction compliance and background preservation (see the sketch after these comments).
  2. Method section: a diagram or pseudocode illustrating the multi-level bbox injection (e.g., which layers or attention modules receive the condition) would improve clarity and reproducibility.
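To make the metric question concrete, one plausible split is to score instruction compliance inside the box (e.g., with a CLIP-style similarity) and background preservation outside it. Below is a minimal sketch of the latter, assuming float images in [0, 1] and using masked PSNR as the preservation proxy; the paper's actual metrics are not specified in the text above.

```python
# Hedged sketch of region-split evaluation (illustrative; the paper's metrics are
# not given in the text above). Background preservation is scored only outside the
# edit box; instruction compliance would be scored inside it (e.g., with a
# CLIP-style similarity, omitted here).
import torch

def background_psnr(src, edited, box):
    """PSNR between source and edited images over pixels outside the edit box.

    src, edited: [3, H, W] float tensors in [0, 1].
    box: normalized (x0, y0, x1, y1) marking the intended edit region.
    """
    _, h, w = src.shape
    x0, y0, x1, y1 = box
    keep = torch.ones(h, w, dtype=torch.bool)
    keep[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = False  # drop edit region
    mse = (src - edited)[:, keep].pow(2).mean().clamp_min(1e-10)
    return 10.0 * torch.log10(1.0 / mse)

# Toy check: an edit confined to the box leaves the background PSNR at its ceiling.
src = torch.rand(3, 256, 256)
edited = src.clone()
edited[:, 80:200, 60:150] += 0.1   # simulated edit inside the box
print(background_psnr(src, edited, (60 / 256, 80 / 256, 150 / 256, 200 / 256)))
```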

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will revise the paper to improve clarity on the experimental setup.

read point-by-point responses
  1. Referee: Experiments section (FineEdit-Bench and open-benchmark evaluations): the reported outperformance over Qwen-Image-Edit and LongCat-Image-Edit does not specify whether the baselines received equivalent bounding-box conditioning (via adapted prompts, masks, or other spatial input derived from the same boxes). Without this, gains in instruction compliance and background preservation cannot be isolated to the proposed multi-level injection rather than the mere presence of an additional spatial modality.

    Authors: We appreciate the referee for identifying this ambiguity in the experimental description. The manuscript does not explicitly state the conditioning provided to the baselines. Qwen-Image-Edit and LongCat-Image-Edit are open-source models designed for prompt-based editing and do not natively accept bounding box inputs. They were therefore evaluated using only the provided textual instructions, without any spatial information derived from the bounding boxes in FineEdit-Bench or the other benchmarks. This reflects their standard usage and allows us to demonstrate the advantage of adding explicit multi-level bounding box guidance. The performance gains in instruction compliance and background preservation are intended to highlight the benefit of our spatial injection mechanism over conventional prompt-only approaches. We acknowledge that the comparison is not fully controlled for input modality and that adapting the baselines to accept boxes would require non-trivial modifications outside the scope of this work. We will revise the Experiments section to explicitly describe the baseline evaluation protocol, note the differing input modalities, and discuss this as a limitation of the current comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical method and benchmark claims are self-contained

full rationale

The paper introduces a multi-level bounding-box injection technique for diffusion editing, a new 1.2M-pair dataset with box annotations, and FineEdit-Bench for evaluation. All load-bearing claims are empirical (outperformance on instruction compliance and background preservation versus Qwen-Image-Edit and LongCat-Image-Edit). No equations, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the provided text. The method is presented as an architectural choice trained on the new data; results are benchmark comparisons rather than first-principles derivations that reduce to their own inputs by construction. This is the expected non-finding for a standard applied CV paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the central claim rests on empirical performance of a neural architecture whose internal parameters are learned rather than hand-specified.

pith-pipeline@v0.9.0 · 5547 in / 1065 out tokens · 82496 ms · 2026-05-10T16:09:41.592387+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

75 extracted references · 33 canonical work pages · 19 internal anchors

  1. [1] Alibaba Cloud PAI Team: Z-Image-Turbo-Fun-Controlnet-Union-2.1. https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.1 (2025), accessed 2026-01-12
  2. [2] Avrahami, O., Lischinski, D., Fried, O.: Blended latent diffusion. ACM Transactions on Graphics (TOG) 42, 1–11 (2023)
  3. [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
  4. [4] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  5. [5] Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402 (2023)
  6. [6] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22560–22570 (2023)
  7. [7] Chen, J., Huang, Y., Lv, T., Cui, L., Chen, Q., Wei, F.: TextDiffuser: Diffusion models as text painters. arXiv preprint arXiv:2305.10855 (2023)
  8. [8] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
  9. [9] Ge, Y., Zhao, S., Li, C., Ge, Y., Shan, Y.: Seed-Data-Edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007 (2024)
  10. [10] Gupta, A., Dollár, P., Girshick, R.B.: LVIS: A dataset for large vocabulary instance segmentation. arXiv preprint arXiv:1908.03195 (2019), http://arxiv.org/abs/1908.03195
  11. [11] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-Prompt image editing with cross attention control. In: International Conference on Learning Representations (2022)
  12. [12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
  13. [13] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations
  14. [14] Huang, L., Chen, K., Chu, W., Liu, J., et al.: Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778 (2023)
  15. [15] Huberman-Spiegelglas, I., Kulikov, V., Michaeli, T.: An edit friendly DDPM noise space: Inversion and manipulations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12469–12478 (2024)
  16. [16] Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990 (2024)
  17. [17] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: European Conference on Computer Vision, pp. 150–168, Springer (2024)
  18. [18] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
  19. [19] Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: FlowEdit: Inversion-free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19721–19730 (2025)
  20. [20] Kuprashevich, M., Alekseenko, G., Tolstykh, I., Fedorov, G., Suleimanov, B., Dokholyan, V., Gordeev, A.: NoHumansRequired: Autonomous high-quality image editing triplet mining (2025)
  21. [21] Labs, B.F.: FLUX. https://github.com/black-forest-labs/flux (2024)
  22. [23] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
  23. [24] Li, Y., Bian, Y., Ju, X., Zhang, Z., Zhuang, J., Shan, Y., Zou, Y., Xu, Q.: BrushEdit: All-in-one image inpainting and editing (2024)
  24. [25] Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: GLIGEN: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521 (2023)
  25. [26] Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: UniWorld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)
  26. [27] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312 (2014), http://arxiv.org/abs/1405.0312
  27. [28] Liu, K., Zhu, Z., Li, C., Liu, H., Zeng, H., Hou, J.: PrefPaint: Aligning image inpainting diffusion model with human preference. NeurIPS (2024)
  28. [29] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)
  29. [30] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471 (2022)
  30. [31] Luo, X., Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., Zhang, J., Liu, D., Liu, Z.: EditScore: Unlocking online RL for image editing via high-fidelity reward modeling (2025), https://arxiv.org/abs/2509.23909
  31. [32] Ma, J., Jiang, J., Zhou, H., Ji, X., et al.: Subject-Diffusion: Open-domain personalized text-to-image generation without test-time fine-tuning. ACM SIGGRAPH (2024)
  32. [33] Ma, J., Zhu, X., Pan, Z., Peng, Q., Guo, X., Chen, C., Lu, H.: X2Edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning (2025), https://arxiv.org/abs/2508.07607
  33. [34] Manukyan, H., Sargsyan, A., Atanyan, B., Wang, Z., Navasardyan, S., Shi, H.: HD-Painter: High-resolution and prompt-faithful text-guided image inpainting with diffusion models. ICLR (2025)
  34. [35] Matsubara, O., Team, D.T.A.: Megalith-10M: A dataset of 10 million public-domain photographs. https://huggingface.co/datasets/madebyollin/megalith-10m (2024); CC0/Flickr-Commons images, Florence-2 captions available in the megalith-10m-florence2 variant
  35. [36] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022)
  36. [37] Mou, C., Wang, X., Xie, L., Zhang, Y., Qi, Z., Shan, Y., Qie, X.: T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: AAAI Conference on Artificial Intelligence (2024)
  37. [38] OpenAI: Image generation API. https://openai.com/index/image-generation-api/ (2025)
  38. [39] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
  39. [40] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
  40. [41] Samuel, D., Meiri, B., Maron, H., Tewel, Y., Darshan, N., Avidan, S., Chechik, G., Ben-Ari, R.: Lightning-fast image inversion and editing for text-to-image diffusion models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)
  41. [42] Schuhmann, C., Beaumont, R., Vencu, R., et al., Jitsev, J.: LAION-Aesthetics v2 6+: A 6 million-image aesthetic filtered subset of LAION-5B. https://laion.ai/blog/laion-aesthetic/ (2022); subset derived from LAION-5B (arXiv:2210.08402)
  42. [43] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)
  43. [44] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565
  44. [45] Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu Edit: Precise image editing via recognition and generation tasks. In: CVPR (2024), https://api.semanticscholar.org/CorpusID:265221391
  45. [46] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
  46. [47] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  47. [48] Team, M.L., Ma, H., Tan, H., Huang, J., Wu, J., He, J.Y., Gao, L., Xiao, S., Wei, X., Ma, X., Cai, X., Guan, Y., Hu, J.: LongCat-Image technical report. arXiv preprint arXiv:2512.07584 (2025)
  48. [49] Team, S.I., Qiao, C., Hui, C., Li, C., Wang, C., Song, D., Zhang, J., Li, J., Xiang, Q., Wang, R., Sun, S., Zhu, W., Tang, X., Hu, Y., Chen, Y., Huang, Y., Duan, Y., Chen, Z., Guo, Z.: FireRed-Image-Edit-1.0 technical report. arXiv preprint arXiv:2602.13344 (2026)
  49. [50] Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1921–1930 (2023)
  50. [51] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
  51. [52] Wang, Y., Cao, C., Yu, J., Fan, K., Xue, X., Fu, Y.: Towards enhanced image inpainting: Mitigating unwanted object insertion and preserving color consistency. In: CVPR (2025)
  52. [53] Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: GPT-Image-Edit-1.5M: A million-scale, GPT-generated image dataset (2025), https://arxiv.org/abs/2507.21033
  53. [54] Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: OmniEdit: Building image editing generalist models through specialist supervision. ICLR (2025)
  54. [55] Wei, H., Xu, B., Liu, H., Wu, C., Liu, J., Peng, Y., Wang, P., Liu, Z., He, J., Xietian, Y., Tang, C., Wang, Z., Wei, Y., Hu, L., Jiang, B., Li, W., He, Y., Liu, Y., Song, X., Li, E., Zhou, Y.: Skywork UniPic 2.0: Building Kontext model with online RL for unified multimodal model (2025), https://arxiv.org/abs/2509.04548
  55. [56] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
  56. [57] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
  57. [58] Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: OmniGen: Unified image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 13294–13304 (2025)
  58. [59] Xie, S., Zhang, Z., Lin, Z., Hinz, T., Zhang, K.: SmartBrush: Text and shape guided object inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22428–22437 (2023)
  59. [60] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
  60. [61] Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: ImgEdit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275 (2025)
  61. [62] Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: AnyEdit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 26125–26135 (2025)
  62. [63] Shen, Y., Yuan, J., et al.: Follow-Your-Preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082 (2025)
  63. [64] Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023)
  64. [65] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
  65. [66] Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: In-Context Edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690 (2025)
  66. [67] Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: UltraEdit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems 37, 3058–3093 (2024)
  67. [68] Zhao, S., Chen, D., Chen, Y.C., Bao, J., et al.: Uni-ControlNet: All-in-one control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
  68. [69] Zheng, K., Chen, H., Ye, H., Wang, H., Zhang, Q., Jiang, K., Su, H., Ermon, S., Zhu, J., Liu, M.Y.: DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117 (2025)
  69. [70] Zhipu AI Team: GLM-Image: Open-sourced image generation model. https://z.ai/blog/glm-image (2026), accessed 2026-01-14
  70. [71] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)
  71. [72] Zong, Z., Song, G., Liu, Y.: DETRs with collaborative hybrid assignments training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6748–6758 (2023)
  72. [73] Internal anchor (supplementary material): Extended qualitative results on FineEdit-1k. Additional qualitative comparisons between FineEdit, Qwen-Image-Edit, and other baselines beyond Figure 6 of the main text.
  73. [74] Internal anchor (supplementary material): Extended evaluation on the FineEdit-1k benchmark. Broader comparisons than Table 1 of the main text, including classical image editing baselines and the recent FireRed-Image-Edit [49].
  74. [75] Internal anchor (supplementary material): Robustness to bounding box variations. Editing performance across a wide range of bbox scales, plus a sensitivity analysis under spatial perturbations of the boxes.
  75. [76] Internal anchor (supplementary material): Human evaluation. Independent blind trials on the FineEdit-1k benchmark with ten volunteers, comparing FineEdit side by side with Qwen-Image-Edit [56], Qwen-Image-Edit-2509, and Longcat-Image-Edit [48].