pith. machine review for the scientific record.

arxiv: 2604.10954 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing · diffusion models · bounding box guidance · fine-grained editing · background preservation · region-based editing · instruction compliance

The pith

FineEdit injects bounding box guidance at multiple levels in diffusion models to localize edits precisely while keeping backgrounds unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion image editing models often change more than intended because text prompts alone cannot specify exact regions, forcing global regeneration that alters backgrounds. FineEdit addresses this by supplying bounding boxes as explicit spatial cues and feeding them into the model at several stages so the diffusion process targets only the chosen area. The authors built a 1.2-million-pair dataset with precise box annotations and a 1,000-image benchmark spanning ten subjects to train and test the approach. On their benchmark the method beats leading open models at following instructions and preserving untouched regions, with similar gains on other public tests.

Core claim

Injecting bounding box conditions at multiple levels during diffusion enables the model to localize the editing target accurately, thereby preserving background consistency without global regeneration.

What carries the argument

Multi-level bounding box injection, which supplies spatial box coordinates to the diffusion network at several processing stages to constrain changes to the indicated region.
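As a rough picture of what multi-level injection can mean in practice, here is a minimal PyTorch sketch, assuming the box is rasterized to a binary mask, resized to each feature resolution, and fused into the denoiser's intermediate features through a small learned projection per level. The names (`BoxInjector`, `box_to_mask`, the level dimensions) are illustrative and not taken from the paper.

```python
# Illustrative sketch only: one plausible form of multi-level bounding-box
# injection for a diffusion denoiser. Assumptions (not from the paper): the box
# is rasterized to a binary mask, resized to each feature resolution, and added
# to the features through a small learned 1x1 projection per level.
import torch
import torch.nn as nn
import torch.nn.functional as F

def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box into a [1, H, W] binary mask."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(1, height, width)
    mask[:, int(y0 * height):int(y1 * height), int(x0 * width):int(x1 * width)] = 1.0
    return mask

class BoxInjector(nn.Module):
    """One learned projection per feature level; adds a box-conditioned bias."""
    def __init__(self, level_dims):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(1, c, kernel_size=1) for c in level_dims])

    def forward(self, features, mask):
        # features: list of [B, C_i, H_i, W_i] tensors from the denoising backbone
        # mask: [B, 1, H, W] rasterized bounding box
        out = []
        for feat, proj in zip(features, self.proj):
            m = F.interpolate(mask, size=feat.shape[-2:], mode="nearest")
            out.append(feat + proj(m))  # inject the spatial condition at this level
        return out

# Toy usage with three feature levels of a hypothetical backbone.
mask = box_to_mask((0.25, 0.30, 0.60, 0.80), 64, 64).unsqueeze(0)      # [1, 1, 64, 64]
features = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
injected = BoxInjector([64, 128, 256])(features, mask)
print([f.shape for f in injected])
```

Adding a projected mask keeps the backbone's channel layout unchanged, which is one common way to retrofit a spatial condition; the paper may instead concatenate the mask with the input or route it through attention layers.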

If this is right

  • Users gain a direct way to mark exact edit locations instead of describing them in ambiguous text.
  • Background regions remain untouched because the model no longer regenerates the entire image.
  • Training on 1.2 million annotated pairs produces stronger generalization across different editing scenarios.
  • The same multi-level injection pattern can support related region-based tasks such as object removal or replacement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Drawing tools in consumer editors could adopt box input as a simpler alternative to prompt refinement.
  • The dataset construction pipeline might be reused to create training data for mask- or point-guided editing.
  • Extending the injection to video frames could enable spatially controlled video editing without full re-synthesis.

Load-bearing premise

That bounding box signals can be added at multiple diffusion stages without creating artifacts or requiring per-image tuning, and that the 1.2-million-pair dataset plus the ten-subject benchmark reflect typical user editing needs.

What would settle it

A controlled test on diverse editing tasks outside FineEdit-Bench, checking whether FineEdit produces more background changes or lower instruction compliance than current open-source baselines.

Figures

Figures reproduced from arXiv: 2604.10954 by Haohang Xu, Lin Liu, Qi Tian, Rong Cong, Xiaopeng Zhang, Zhibo Zhang.

Figure 1. Comparison between the SOTA closed-source model Nano-Banana Pro and our proposed FineEdit (left: input; …).
Figure 2. Overview of the FineEdit framework, which includes two training stages: (a) the pre-training stage establishes multi-level spatial priors; …
Figure 3. Unified visual instructions for diverse editing settings. Three representative configurations are showcased, including localized style transfer; …
Figure 4. The U-GAF pipeline, which consists of four synergistic stages for high-quality data synthesis: (1) Data Curation, (2) Data Annotation, (3) Edit Generation, and (4) Data Refinement.
Figure 5. Information on the proposed FineEdit dataset: (a) train …
Figure 6. Comparison on the FineEdit-1k evaluation bench.
Figure 7. Extended qualitative results on FineEdit-1k.
Figure 8. Qualitative results across diverse editing tasks.
Figure 9. Comparison of human evaluation win rates between FineEdit and competing baselines.
read the original abstract

Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces FineEdit, a diffusion-based image editing approach that injects bounding box guidance at multiple levels to improve target localization and background consistency over prompt-only methods. It contributes the FineEdit-1.2M dataset (1.2 million editing pairs with precise bbox annotations) and FineEdit-Bench (1,000 images across 10 subjects) for evaluating region-based edits. The central claim is that FineEdit significantly outperforms open-source baselines such as Qwen-Image-Edit and LongCat-Image-Edit on instruction compliance and background preservation on FineEdit-Bench, with further gains shown on GEdit and ImgEdit Bench.

Significance. If the performance claims are supported under controlled conditions, the work advances controllable diffusion editing by showing the utility of explicit spatial cues for fine-grained tasks. The large-scale dataset and specialized benchmark constitute clear community resources that could standardize evaluation of localization and preservation in editing models.

major comments (1)
  1. Experiments section (FineEdit-Bench and open-benchmark evaluations): the reported outperformance over Qwen-Image-Edit and LongCat-Image-Edit does not specify whether the baselines received equivalent bounding-box conditioning (via adapted prompts, masks, or other spatial input derived from the same boxes). Without this, gains in instruction compliance and background preservation cannot be isolated to the proposed multi-level injection rather than the mere presence of an additional spatial modality.
minor comments (2)
  1. Abstract: the claim of 'significantly outperforms' would be strengthened by naming the concrete metrics (e.g., CLIP similarity, LPIPS, or human preference scores) used for instruction compliance and background preservation (see the sketch after these comments).
  2. Method section: a diagram or pseudocode illustrating the multi-level bbox injection (e.g., which layers or attention modules receive the condition) would improve clarity and reproducibility.
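To make the metric question concrete, one plausible split is to score instruction compliance inside the box (e.g., with a CLIP-style similarity) and background preservation outside it. Below is a minimal sketch of the latter, assuming float images in [0, 1] and using masked PSNR as the preservation proxy; the paper's actual metrics are not specified in the text above.

```python
# Hedged sketch of region-split evaluation (illustrative; the paper's metrics are
# not given in the text above). Background preservation is scored only outside the
# edit box; instruction compliance would be scored inside it (e.g., with a
# CLIP-style similarity, omitted here).
import torch

def background_psnr(src, edited, box):
    """PSNR between source and edited images over pixels outside the edit box.

    src, edited: [3, H, W] float tensors in [0, 1].
    box: normalized (x0, y0, x1, y1) marking the intended edit region.
    """
    _, h, w = src.shape
    x0, y0, x1, y1 = box
    keep = torch.ones(h, w, dtype=torch.bool)
    keep[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = False  # drop edit region
    mse = (src - edited)[:, keep].pow(2).mean().clamp_min(1e-10)
    return 10.0 * torch.log10(1.0 / mse)

# Toy check: an edit confined to the box leaves the background PSNR at its ceiling.
src = torch.rand(3, 256, 256)
edited = src.clone()
edited[:, 80:200, 60:150] += 0.1   # simulated edit inside the box
print(background_psnr(src, edited, (60 / 256, 80 / 256, 150 / 256, 200 / 256)))
```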

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will revise the paper to improve clarity on the experimental setup.

read point-by-point responses
  1. Referee: Experiments section (FineEdit-Bench and open-benchmark evaluations): the reported outperformance over Qwen-Image-Edit and LongCat-Image-Edit does not specify whether the baselines received equivalent bounding-box conditioning (via adapted prompts, masks, or other spatial input derived from the same boxes). Without this, gains in instruction compliance and background preservation cannot be isolated to the proposed multi-level injection rather than the mere presence of an additional spatial modality.

    Authors: We appreciate the referee for identifying this ambiguity in the experimental description. The manuscript does not explicitly state the conditioning provided to the baselines. Qwen-Image-Edit and LongCat-Image-Edit are open-source models designed for prompt-based editing and do not natively accept bounding box inputs. They were therefore evaluated using only the provided textual instructions, without any spatial information derived from the bounding boxes in FineEdit-Bench or the other benchmarks. This reflects their standard usage and allows us to demonstrate the advantage of adding explicit multi-level bounding box guidance. The performance gains in instruction compliance and background preservation are intended to highlight the benefit of our spatial injection mechanism over conventional prompt-only approaches. We acknowledge that the comparison is not fully controlled for input modality and that adapting the baselines to accept boxes would require non-trivial modifications outside the scope of this work. We will revise the Experiments section to explicitly describe the baseline evaluation protocol, note the differing input modalities, and discuss this as a limitation of the current comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical method and benchmark claims are self-contained

full rationale

The paper introduces a multi-level bounding-box injection technique for diffusion editing, a new 1.2M-pair dataset with box annotations, and FineEdit-Bench for evaluation. All load-bearing claims are empirical (outperformance on instruction compliance and background preservation versus Qwen-Image-Edit and LongCat-Image-Edit). No equations, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the provided text. The method is presented as an architectural choice trained on the new data; results are benchmark comparisons rather than first-principles derivations that reduce to their own inputs by construction. This is the expected non-finding for a standard applied CV paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the central claim rests on empirical performance of a neural architecture whose internal parameters are learned rather than hand-specified.

pith-pipeline@v0.9.0 · 5547 in / 1065 out tokens · 82496 ms · 2026-05-10T16:09:41.592387+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

75 extracted references · 33 canonical work pages · 19 internal anchors

  1. [1] Alibaba Cloud PAI Team: Z-Image-Turbo-Fun-Controlnet-Union-2.1. https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.1 (2025), accessed 2026-01-12
  2. [2] Avrahami, O., Lischinski, D., Fried, O.: Blended latent diffusion. ACM Transactions on Graphics (TOG) 42, 1–11 (2023)
  3. [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
  4. [4] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  5. [5] Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402 (2023)
  6. [6] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22560–22570 (2023)
  7. [7] Chen, J., Huang, Y., Lv, T., Cui, L., Chen, Q., Wei, F.: TextDiffuser: Diffusion models as text painters. arXiv preprint arXiv:2305.10855 (2023)
  8. [8] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
  9. [9] Ge, Y., Zhao, S., Li, C., Ge, Y., Shan, Y.: Seed-Data-Edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007 (2024)
  10. [10] Gupta, A., Dollár, P., Girshick, R.B.: LVIS: A dataset for large vocabulary instance segmentation. arXiv preprint arXiv:1908.03195 (2019), http://arxiv.org/abs/1908.03195
  11. [11] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-Prompt image editing with cross attention control. In: International Conference on Learning Representations (2022)
  12. [12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
  13. [13] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations
  14. [14] Huang, L., Chen, K., Chu, W., Liu, J., et al.: Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778 (2023)
  15. [15] Huberman-Spiegelglas, I., Kulikov, V., Michaeli, T.: An edit friendly DDPM noise space: Inversion and manipulations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12469–12478 (2024)
  16. [16] Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990 (2024)
  17. [17] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: European Conference on Computer Vision, pp. 150–168, Springer (2024)
  18. [18] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
  19. [19] Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: FlowEdit: Inversion-free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19721–19730 (2025)
  20. [20] Kuprashevich, M., Alekseenko, G., Tolstykh, I., Fedorov, G., Suleimanov, B., Dokholyan, V., Gordeev, A.: NoHumansRequired: Autonomous high-quality image editing triplet mining (2025)
  21. [21] Labs, B.F.: FLUX. https://github.com/black-forest-labs/flux (2024)
  22. [23] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
  23. [24] Li, Y., Bian, Y., Ju, X., Zhang, Z., Zhuang, J., Shan, Y., Zou, Y., Xu, Q.: BrushEdit: All-in-one image inpainting and editing (2024)
  24. [25] Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: GLIGEN: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521 (2023)
  25. [26] Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: UniWorld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)
  26. [27] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312 (2014), http://arxiv.org/abs/1405.0312
  27. [28] Liu, K., Zhu, Z., Li, C., Liu, H., Zeng, H., Hou, J.: PrefPaint: Aligning image inpainting diffusion model with human preference. NeurIPS (2024)
  28. [29] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)
  29. [30] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471 (2022)
  30. [31] Luo, X., Wang, J., Wu, C., Xiao, S., Jiang, X., Lian, D., Zhang, J., Liu, D., Liu, Z.: EditScore: Unlocking online RL for image editing via high-fidelity reward modeling (2025), https://arxiv.org/abs/2509.23909
  31. [32] Ma, J., Jiang, J., Zhou, H., Ji, X., et al.: Subject-Diffusion: Open-domain personalized text-to-image generation without test-time fine-tuning. ACM SIGGRAPH (2024)
  32. [33] Ma, J., Zhu, X., Pan, Z., Peng, Q., Guo, X., Chen, C., Lu, H.: X2Edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning (2025), https://arxiv.org/abs/2508.07607
  33. [34] Manukyan, H., Sargsyan, A., Atanyan, B., Wang, Z., Navasardyan, S., Shi, H.: HD-Painter: High-resolution and prompt-faithful text-guided image inpainting with diffusion models. ICLR (2025)
  34. [35] Matsubara, O., Team, D.T.A.: Megalith-10M: A dataset of 10 million public-domain photographs. https://huggingface.co/datasets/madebyollin/megalith-10m (2024); CC0/Flickr-Commons images, Florence-2 captions available in the megalith-10m-florence2 variant
  35. [36] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022)
  36. [37] Mou, C., Wang, X., Xie, L., Zhang, Y., Qi, Z., Shan, Y., Qie, X.: T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: AAAI Conference on Artificial Intelligence (2024)
  37. [38] OpenAI: Image generation API. https://openai.com/index/image-generation-api/ (2025)
  38. [39] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
  39. [40] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
  40. [41] Samuel, D., Meiri, B., Maron, H., Tewel, Y., Darshan, N., Avidan, S., Chechik, G., Ben-Ari, R.: Lightning-fast image inversion and editing for text-to-image diffusion models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)
  41. [42] Schuhmann, C., Beaumont, R., Vencu, R., et al., Jitsev, J.: LAION-Aesthetics v2 6+: A 6 million-image aesthetic filtered subset of LAION-5B. https://laion.ai/blog/laion-aesthetic/ (2022); subset derived from LAION-5B (arXiv:2210.08402)
  42. [43] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)
  43. [44] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565
  44. [45] Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu Edit: Precise image editing via recognition and generation tasks. In: CVPR (2024), https://api.semanticscholar.org/CorpusID:265221391
  45. [46] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
  46. [47] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  47. [48] Team, M.L., Ma, H., Tan, H., Huang, J., Wu, J., He, J.Y., Gao, L., Xiao, S., Wei, X., Ma, X., Cai, X., Guan, Y., Hu, J.: LongCat-Image technical report. arXiv preprint arXiv:2512.07584 (2025)
  48. [49] Team, S.I., Qiao, C., Hui, C., Li, C., Wang, C., Song, D., Zhang, J., Li, J., Xiang, Q., Wang, R., Sun, S., Zhu, W., Tang, X., Hu, Y., Chen, Y., Huang, Y., Duan, Y., Chen, Z., Guo, Z.: FireRed-Image-Edit-1.0 technical report. arXiv preprint arXiv:2602.13344 (2026)
  49. [50] Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1921–1930 (2023)
  50. [51] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
  51. [52] Wang, Y., Cao, C., Yu, J., Fan, K., Xue, X., Fu, Y.: Towards enhanced image inpainting: Mitigating unwanted object insertion and preserving color consistency. In: CVPR (2025)
  52. [53] Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: GPT-Image-Edit-1.5M: A million-scale, GPT-generated image dataset (2025), https://arxiv.org/abs/2507.21033
  53. [54] Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: OmniEdit: Building image editing generalist models through specialist supervision. ICLR (2025)
  54. [55] Wei, H., Xu, B., Liu, H., Wu, C., Liu, J., Peng, Y., Wang, P., Liu, Z., He, J., Xietian, Y., Tang, C., Wang, Z., Wei, Y., Hu, L., Jiang, B., Li, W., He, Y., Liu, Y., Song, X., Li, E., Zhou, Y.: Skywork UniPic 2.0: Building Kontext model with online RL for unified multimodal model (2025), https://arxiv.org/abs/2509.04548
  55. [56] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
  56. [57] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
  57. [58] Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: OmniGen: Unified image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 13294–13304 (2025)
  58. [59] Xie, S., Zhang, Z., Lin, Z., Hinz, T., Zhang, K.: SmartBrush: Text and shape guided object inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22428–22437 (2023)
  59. [60] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
  60. [61] Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: ImgEdit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275 (2025)
  61. [62] Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: AnyEdit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 26125–26135 (2025)
  62. [63] Shen, Y., Yuan, J., et al.: Follow-Your-Preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082 (2025)
  63. [64] Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023)
  64. [65] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847 (2023)
  65. [66] Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: In-Context Edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690 (2025)
  66. [67] Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: UltraEdit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems 37, 3058–3093 (2024)
  67. [68] Zhao, S., Chen, D., Chen, Y.C., Bao, J., et al.: Uni-ControlNet: All-in-one control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
  68. [69] Zheng, K., Chen, H., Ye, H., Wang, H., Zhang, Q., Jiang, K., Su, H., Ermon, S., Zhu, J., Liu, M.Y.: DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117 (2025)
  69. [70] Zhipu AI Team: GLM-Image: Open-sourced image generation model. https://z.ai/blog/glm-image (2026), accessed 2026-01-14
  70. [71] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)
  71. [72] Zong, Z., Song, G., Liu, Y.: DETRs with collaborative hybrid assignments training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6748–6758 (2023)
  72. [73] Internal anchor (supplementary material): Extended qualitative results on FineEdit-1k. Additional qualitative comparisons between FineEdit, Qwen-Image-Edit, and other baselines beyond Figure 6 of the main text.
  73. [74] Internal anchor (supplementary material): Extended evaluation on the FineEdit-1k benchmark. Broader comparisons than Table 1 of the main text, including classical image editing baselines and the recent FireRed-Image-Edit [49].
  74. [75] Internal anchor (supplementary material): Robustness to bounding box variations. Editing performance across a wide range of bbox scales, plus a sensitivity analysis under spatial perturbations of the boxes.
  75. [76] Internal anchor (supplementary material): Human evaluation. Independent blind trials on the FineEdit-1k benchmark with ten volunteers, comparing FineEdit side by side with Qwen-Image-Edit [56], Qwen-Image-Edit-2509, and Longcat-Image-Edit [48].