FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
Pith reviewed 2026-05-10 16:09 UTC · model grok-4.3
The pith
FineEdit injects bounding box guidance at multiple levels in diffusion models to localize edits precisely while keeping backgrounds unchanged.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Injecting bounding box conditions at multiple levels during diffusion enables the model to localize the editing target accurately, thereby preserving background consistency without global regeneration.
What carries the argument
Multi-level bounding box injection, which supplies spatial box coordinates to the diffusion network at several processing stages to constrain changes to the indicated region.
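The paper's exact injection sites are not specified here, so the following is a minimal sketch of the general pattern only: rasterize a normalized (x0, y0, x1, y1) box into one mask per feature resolution, so that each stage of the denoiser can receive a spatially aligned condition. All names and the resolution set are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bbox_to_mask(bbox, size):
    """Rasterize a normalized (x0, y0, x1, y1) box into a size x size binary mask."""
    x0, y0, x1, y1 = bbox
    mask = np.zeros((size, size), dtype=np.float32)
    mask[int(y0 * size):int(y1 * size), int(x0 * size):int(x1 * size)] = 1.0
    return mask

def multi_level_conditions(bbox, levels=(64, 32, 16, 8)):
    """One mask per feature resolution; each mask would be added to (or
    concatenated with) the denoiser's feature map at the matching stage."""
    return {s: bbox_to_mask(bbox, s) for s in levels}

# A centered box produces a mask at every level of the hypothetical backbone.
conds = multi_level_conditions((0.25, 0.25, 0.75, 0.75))
```

The point of supplying the mask at several resolutions rather than once is that both coarse layout decisions and fine texture decisions see the same spatial constraint.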
If this is right
- Users gain a direct way to mark exact edit locations instead of describing them in ambiguous text.
- Background regions remain untouched because the model no longer regenerates the entire image.
- Training on 1.2 million annotated pairs produces stronger generalization across different editing scenarios.
- The same multi-level injection pattern can support related region-based tasks such as object removal or replacement.
Where Pith is reading between the lines
- Drawing tools in consumer editors could adopt box input as a simpler alternative to prompt refinement.
- The dataset construction pipeline might be reused to create training data for mask- or point-guided editing.
- Extending the injection to video frames could enable spatially controlled video editing without full re-synthesis.
Load-bearing premise
That bounding box signals can be added at multiple diffusion stages without creating artifacts or requiring per-image tuning, and that the 1.2-million-pair dataset plus the ten-subject benchmark reflect typical user editing needs.
What would settle it
A controlled test on diverse editing tasks outside FineEdit-Bench: the claim fails if FineEdit there produces more background changes or lower instruction compliance than current open-source baselines.
Original abstract
Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FineEdit, a diffusion-based image editing approach that injects bounding box guidance at multiple levels to improve target localization and background consistency over prompt-only methods. It contributes the FineEdit-1.2M dataset (1.2 million editing pairs with precise bbox annotations) and FineEdit-Bench (1,000 images across 10 subjects) for evaluating region-based edits. The central claim is that FineEdit significantly outperforms open-source baselines such as Qwen-Image-Edit and LongCat-Image-Edit on instruction compliance and background preservation on FineEdit-Bench, with further gains shown on GEdit and ImgEdit Bench.
Significance. If the performance claims are supported under controlled conditions, the work advances controllable diffusion editing by showing the utility of explicit spatial cues for fine-grained tasks. The large-scale dataset and specialized benchmark constitute clear community resources that could standardize evaluation of localization and preservation in editing models.
major comments (1)
- [Experiments section (FineEdit-Bench and open-benchmark evaluations)] The reported outperformance over Qwen-Image-Edit and LongCat-Image-Edit does not specify whether the baselines received equivalent bounding-box conditioning (via adapted prompts, masks, or other spatial input derived from the same boxes). Without this, gains in instruction compliance and background preservation cannot be isolated to the proposed multi-level injection rather than to the mere presence of an additional spatial modality.
minor comments (2)
- [Abstract] The claim of 'significantly outperforms' would be strengthened by naming the concrete metrics (e.g., CLIP similarity, LPIPS, or human preference scores) used for instruction compliance and background preservation.
- [Method section] A diagram or pseudocode illustrating the multi-level bbox injection (e.g., which layers or attention modules receive the condition) would improve clarity and reproducibility.
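The metrics the first minor comment asks for can be made concrete with a masked pixel error: background preservation is naturally measured only over the region the box excludes. A minimal sketch with numpy, assuming normalized boxes and HxWxC uint8 images; the function name is illustrative, not the paper's.

```python
import numpy as np

def background_mse(src, edited, bbox):
    """Mean squared pixel error over the region OUTSIDE the edit box.
    Lower means better background preservation. bbox = (x0, y0, x1, y1), normalized."""
    h, w = src.shape[:2]
    x0, y0, x1, y1 = bbox
    keep = np.ones((h, w), dtype=bool)
    keep[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = False  # drop box interior
    diff = (src.astype(np.float64) - edited.astype(np.float64)) ** 2
    return diff[keep].mean()  # averaged over kept pixels and channels
```

A perceptual metric such as LPIPS could replace the squared error under the same masking scheme; the masking, not the distance, is what isolates the background.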
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will revise the paper to improve clarity on the experimental setup.
Point-by-point responses
Referee: Experiments section (FineEdit-Bench and open-benchmark evaluations): the reported outperformance over Qwen-Image-Edit and LongCat-Image-Edit does not specify whether the baselines received equivalent bounding-box conditioning (via adapted prompts, masks, or other spatial input derived from the same boxes). Without this, gains in instruction compliance and background preservation cannot be isolated to the proposed multi-level injection rather than the mere presence of an additional spatial modality.
Authors: We thank the referee for identifying this ambiguity in the experimental description. The manuscript does not explicitly state the conditioning provided to the baselines. Qwen-Image-Edit and LongCat-Image-Edit are open-source models designed for prompt-based editing and do not natively accept bounding box inputs. They were therefore evaluated using only the provided textual instructions, without any spatial information derived from the bounding boxes in FineEdit-Bench or the other benchmarks. This reflects their standard usage and allows us to demonstrate the advantage of adding explicit multi-level bounding box guidance. The performance gains in instruction compliance and background preservation are intended to highlight the benefit of our spatial injection mechanism over conventional prompt-only approaches. We acknowledge that the comparison is not fully controlled for input modality and that adapting the baselines to accept boxes would require non-trivial modifications outside the scope of this work. We will revise the Experiments section to explicitly describe the baseline evaluation protocol, note the differing input modalities, and discuss this as a limitation of the current comparisons.
Revision: yes
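One way to run the controlled comparison the rebuttal concedes is missing: fold the same box into the text instruction for the prompt-only baselines, so every model receives the spatial information in the modality it accepts. This protocol is hypothetical and does not appear in the paper; the function below is a sketch of the idea.

```python
def box_to_prompt(instruction, bbox):
    """Append normalized box coordinates to a textual edit instruction,
    giving prompt-only baselines the same spatial cue (hypothetical protocol)."""
    x0, y0, x1, y1 = bbox
    return (f"{instruction} Apply the edit only inside the normalized "
            f"bounding box ({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}); "
            f"leave everything outside it unchanged.")

prompt = box_to_prompt("replace the cat with a dog", (0.1, 0.2, 0.5, 0.8))
```

Whether a prompt-based editor can exploit coordinates rendered as text is itself an empirical question, but the comparison would at least equalize the information given to each system.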
Circularity Check
No circularity detected; empirical method and benchmark claims are self-contained
Full rationale
The paper introduces a multi-level bounding-box injection technique for diffusion editing, a new 1.2M-pair dataset with box annotations, and FineEdit-Bench for evaluation. All load-bearing claims are empirical (outperformance on instruction compliance and background preservation versus Qwen-Image-Edit and LongCat-Image-Edit). No equations, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the provided text. The method is presented as an architectural choice trained on the new data; results are benchmark comparisons rather than first-principles derivations that reduce to their own inputs by construction. This is the expected non-finding for a standard applied CV paper.
Reference graph
Works this paper leans on
- [1] Alibaba Cloud PAI Team: Z-Image-Turbo-Fun-Controlnet-Union-2.1. https://huggingface.co/alibaba-pai/Z-Image-Turbo-Fun-Controlnet-Union-2.1 (2025), accessed 2026-01-12
- [2] Avrahami, O., Lischinski, D., Fried, O.: Blended latent diffusion. ACM Transactions on Graphics (TOG) 42, 1–11 (2023)
- [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
- [4] Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
- [5] Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18392–18402 (2023)
- [6] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22560–22570 (2023)
- [7] Chen, J., Huang, Y., Lv, T., Cui, L., Chen, Q., Wei, F.: TextDiffuser: Diffusion models as text painters. arXiv preprint arXiv:2305.10855 (2023)
- [8] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
- [9] Ge, Y., Zhao, S., Li, C., Ge, Y., Shan, Y.: Seed-Data-Edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007 (2024)
- [10] Gupta, A., Dollár, P., Girshick, R.B.: LVIS: A dataset for large vocabulary instance segmentation. CoRR abs/1908.03195 (2019), http://arxiv.org/abs/1908.03195
- [11] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. In: International Conference on Learning Representations (2022)
- [12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [13] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)
- [14] Huang, L., Chen, K., Chu, W., Liu, J., et al.: Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778 (2023)
- [15] Huberman-Spiegelglas, I., Kulikov, V., Michaeli, T.: An edit friendly DDPM noise space: Inversion and manipulations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12469–12478 (2024)
- [16] Hui, M., Yang, S., Zhao, B., Shi, Y., Wang, H., Wang, P., Zhou, Y., Xie, C.: HQ-Edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990 (2024)
- [17] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: European Conference on Computer Vision. pp. 150–168. Springer (2024)
- [18] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
- [19] Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: FlowEdit: Inversion-free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19721–19730 (2025)
- [20] Kuprashevich, M., Alekseenko, G., Tolstykh, I., Fedorov, G., Suleimanov, B., Dokholyan, V., Gordeev, A.: NoHumansRequired: Autonomous high-quality image editing triplet mining (2025)
- [21] Labs, B.F.: FLUX. https://github.com/black-forest-labs/flux (2024)
- [23] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
- [24] Li, Y., Bian, Y., Ju, X., Zhang, Z., Zhuang, J., Shan, Y., Zou, Y., Xu, Q.: BrushEdit: All-in-one image inpainting and editing (2024)
- [25] Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: GLIGEN: Open-set grounded text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22511–22521 (2023)
- [26] Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: UniWorld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)
- [27] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. CoRR abs/1405.0312 (2014), http://arxiv.org/abs/1405.0312
- [28] Liu, K., Zhu, Z., Li, C., Liu, H., Zeng, H., Hou, J.: PrefPaint: Aligning image inpainting diffusion model with human preference. NeurIPS (2024)
- [29] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)
- [30] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: Inpainting using denoising diffusion probabilistic models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11461–11471 (2022)
- [31]
- [32] Ma, J., Jiang, J., Zhou, H., Ji, X., et al.: Subject-Diffusion: Open-domain personalized text-to-image generation without test-time fine-tuning. ACM SIGGRAPH (2024)
- [33] Ma, J., Zhu, X., Pan, Z., Peng, Q., Guo, X., Chen, C., Lu, H.: X2Edit: Revisiting arbitrary-instruction image editing through self-constructed data and task-aware representation learning (2025), https://arxiv.org/abs/2508.07607
- [34] Manukyan, H., Sargsyan, A., Atanyan, B., Wang, Z., Navasardyan, S., Shi, H.: HD-Painter: High-resolution and prompt-faithful text-guided image inpainting with diffusion models. ICLR (2025)
- [35] Matsubara, O., Team, D.T.A.: Megalith-10M: A dataset of 10 million public-domain photographs. https://huggingface.co/datasets/madebyollin/megalith-10m (2024), CC0/Flickr-Commons images; Florence-2 captions available in the *megalith-10m-florence2* variant
- [36] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022)
- [37] Mou, C., Wang, X., Xie, L., Zhang, Y., Qi, Z., Shan, Y., Qie, X.: T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: AAAI Conference on Artificial Intelligence (2024)
- [38] OpenAI: Image generation API. https://openai.com/index/image-generation-api/ (2025)
- [39] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
- [40] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
- [41] Samuel, D., Meiri, B., Maron, H., Tewel, Y., Darshan, N., Avidan, S., Chechik, G., Ben-Ari, R.: Lightning-fast image inversion and editing for text-to-image diffusion models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025)
- [42] Schuhmann, C., Beaumont, R., Vencu, R., …, Jitsev, J.: LAION-Aesthetics v2 6+: A 6 million-image aesthetic filtered subset of LAION-5B. https://laion.ai/blog/laion-aesthetic/ (2022), subset derived from LAION-5B (arXiv:2210.08402)
- [43] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)
- [44] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Gurevych, I., Miyao, Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565. Association for Computational Linguistics (2018)
- [45] Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu Edit: Precise image editing via recognition and generation tasks. In: CVPR (2024), https://api.semanticscholar.org/CorpusID:265221391
- [46] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
- [47] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
- [48] Team, M.L., Ma, H., Tan, H., Huang, J., Wu, J., He, J.Y., Gao, L., Xiao, S., Wei, X., Ma, X., Cai, X., Guan, Y., Hu, J.: LongCat-Image technical report. arXiv preprint arXiv:2512.07584 (2025)
- [49] Team, S.I., Qiao, C., Hui, C., Li, C., Wang, C., Song, D., Zhang, J., Li, J., Xiang, Q., Wang, R., Sun, S., Zhu, W., Tang, X., Hu, Y., Chen, Y., Huang, Y., Duan, Y., Chen, Z., Guo, Z.: FireRed-Image-Edit-1.0 technical report. arXiv preprint arXiv:2602.13344 (2026)
- [50] Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1921–1930 (2023)
- [51] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
- [52] Wang, Y., Cao, C., Yu, J., Fan, K., Xue, X., Fu, Y.: Towards enhanced image inpainting: Mitigating unwanted object insertion and preserving color consistency. In: CVPR (2025)
- [53]
- [54] Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: OmniEdit: Building image editing generalist models through specialist supervision. ICLR (2025)
- [55] Wei, H., Xu, B., Liu, H., Wu, C., Liu, J., Peng, Y., Wang, P., Liu, Z., He, J., Xietian, Y., Tang, C., Wang, Z., Wei, Y., Hu, L., Jiang, B., Li, W., He, Y., Liu, Y., Song, X., Li, E., Zhou, Y.: Skywork UniPic 2.0: Building Kontext model with online RL for unified multimodal model (2025), https://arxiv.org/abs/2509.04548
- [56] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)
- [57] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)
- [58] Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: OmniGen: Unified image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13294–13304 (2025)
- [59] Xie, S., Zhang, Z., Lin, Z., Hinz, T., Zhang, K.: SmartBrush: Text and shape guided object inpainting with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22428–22437 (2023)
- [60] Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
- [61] Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: ImgEdit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275 (2025)
- [62] Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: AnyEdit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26125–26135 (2025)
- [63] Shen, Y., Yuan, J., et al.: Follow-your-preference: Towards preference-aligned image inpainting. arXiv preprint arXiv:2509.23082 (2025)
- [64] Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023)
- [65] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
- [66] Zhang, Z., Xie, J., Lu, Y., Yang, Z., Yang, Y.: In-Context Edit: Enabling instructional image editing with in-context generation in large scale diffusion transformer. arXiv preprint arXiv:2504.20690 (2025)
- [67] Zhao, H., Ma, X.S., Chen, L., Si, S., Wu, R., An, K., Yu, P., Zhang, M., Li, Q., Chang, B.: UltraEdit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems 37, 3058–3093 (2024)
- [68] Zhao, S., Chen, D., Chen, Y.C., Bao, J., et al.: Uni-ControlNet: All-in-one control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
- [69] Zheng, K., Chen, H., Ye, H., Wang, H., Zhang, Q., Jiang, K., Su, H., Ermon, S., Zhu, J., Liu, M.Y.: DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117 (2025)
- [70] Zhipu AI Team: GLM-Image: Open-sourced image generation model. https://z.ai/blog/glm-image (2026), accessed 2026-01-14
- [71] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)
- [72] Zong, Z., Song, G., Liu, Y.: DETRs with collaborative hybrid assignments training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 6748–6758 (2023)
Supplementary material
- Extended qualitative results: while Figure 6 in the main text provides initial visualizations on the FineEdit-1k benchmark, the supplementary offers a more comprehensive suite of results, including additional qualitative comparisons between FineEdit, Qwen-Image-Edit…
- Extended evaluation on the FineEdit-1k benchmark: while Table 1 in the main manuscript primarily compares FineEdit with Flux-based methods and several recently released open-source SOTA models, the supplementary provides a broader evaluation, including comparisons with classical image editing baselines and the recent FireRed-Image-Edit [49]…
- Robustness to bounding box variations: the supplementary analyzes editing performance across a wide range of bbox scales and conducts a sensitivity analysis by introducing spatial perturbations to the bbox…
- Human evaluation: a side-by-side blind comparison between FineEdit and competitive baselines, including Qwen-Image-Edit [56], Qwen-Image-Edit-2509, and Longcat-Image-Edit [48], conducted as independent blind trials on the FineEdit-1k benchmark with ten human volunteers…