GMO-E$^2$DIT: Grounded Multi-Operation Editing for E-Commerce Images

Cheng Wang; Jingling Fu; Junshi Huang; Lichen Ma; Luohang Liu; Shaojie Guo; Xiaoan Liu; Xiaolong Fu; Xinyuan Shan; Yan Li; Yu He; Zipeng Guo

arxiv: 2607.00920 · v1 · pith:WS465DT3new · submitted 2026-07-01 · 💻 cs.CV

Zipeng Guo, Xiaoan Liu, Lichen Ma, Cheng Wang, Yu He, Xiaolong Fu, Jingling Fu, Xinyuan Shan, Shaojie Guo, Luohang Liu, Junshi Huang, Yan Li

Pith reviewed 2026-07-02 13:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords: e-commerce image editing, multi-operation editing, vision-language model, agentic framework, grounded editing, reflection loop, instruction-driven editing, EComEditBench

The pith

A VLM agent builds region-grounded edit plans and a reflection loop executes them iteratively to handle multi-step e-commerce image edits from vague instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an agentic system that separates the task of deciding what edits to perform from the act of generating the edited pixels. A vision-language model first turns an ambiguous instruction into a sequence of operations tied to specific image regions. These operations are then carried out one by one using masks, with an inspection step that checks the current image, keeps any successful changes, and retries or corrects the rest. This structure is meant to avoid the partial failures common when a single model tries to resolve intent, locate regions, and synthesize changes all at once. The authors also supply a data pipeline and a benchmark to train and measure such systems.

Core claim

GMO-E²DIT couples a Vision-Language Model agent with a mask-conditioned image editor. The agent constructs a region-grounded edit agenda from underspecified instructions, decoupling cognitive reasoning from generative rendering. Sub-programs are executed via operation-aware masks and references inside a reflection-driven loop that inspects intermediate results, preserves safe partial progress, retries unfinished operations, and recovers from errors.
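
The control flow described in this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `EditOp`, `run_agenda`, the `edit_fn` and `inspect_fn` callables, the bounding-box region format, and the retry budget are all hypothetical. The image is treated as an opaque value so that any mask-conditioned editor and any inspector could be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical region format


@dataclass
class EditOp:
    """One agenda entry: an operation tied to an image region (illustrative)."""
    kind: str            # e.g. "remove", "insert"; names are assumptions
    region: Box
    prompt: str = ""
    done: bool = False
    attempts: int = 0


def run_agenda(
    image: Any,
    agenda: List[EditOp],
    edit_fn: Callable[[Any, EditOp], Any],
    inspect_fn: Callable[[Any, Any, EditOp], bool],
    max_attempts: int = 3,
) -> Tuple[Any, List[EditOp]]:
    """Run operations one by one, committing only results that pass inspection."""
    current = image
    for op in agenda:
        while not op.done and op.attempts < max_attempts:
            op.attempts += 1
            candidate = edit_fn(current, op)        # masked edit attempt
            if inspect_fn(current, candidate, op):  # reflection step
                current = candidate                 # keep the successful change
                op.done = True
            # On failure `current` is left untouched, so earlier edits survive.
    unfinished = [op for op in agenda if not op.done]
    return current, unfinished
```

The design choice worth noting is that a failed attempt never overwrites `current`, which is one simple way to read "preserves safe partial progress"; the paper's loop may also re-plan or correct earlier steps, which this sketch does not model.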

What carries the argument

The VLM agent constructing a region-grounded edit agenda, combined with a reflection-driven loop that inspects and corrects intermediate edit results.

If this is right

Instruction accuracy and edit fidelity exceed those of existing one-shot baselines on multi-operation tasks.
Safe partial progress is retained even when some operations initially fail.
Cognitive planning and pixel synthesis can be handled by separate modules without loss of overall performance.
A unified data pipeline can supply aligned supervision for planning, execution, and reflection stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of planning from rendering could extend to other tasks that require sequential localized changes, such as video or 3D model editing.
Performance gains would likely scale with improvements in the underlying vision-language model's ability to produce accurate agendas.
Commercial use would still benefit from optional human review for edits where mistakes carry high cost.
The introduced benchmark offers a concrete way to compare future multi-operation editors on instruction following and content preservation.
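
To make "content preservation" concrete, here is a rough pixel-level proxy: the fraction of pixels outside the requested edit regions that changed. It is an assumption-laden sketch and not how EComEditBench scores results; the function name `outside_change_ratio`, the grayscale nested-list image format, and the exclusive-corner box convention are all hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale rows; illustrative format only
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with x1 and y1 exclusive


def outside_change_ratio(src: Image, out: Image, boxes: List[Box], tol: int = 0) -> float:
    """Fraction of pixels outside all edit boxes that differ by more than tol."""
    changed = 0
    total = 0
    for y, (row_src, row_out) in enumerate(zip(src, out)):
        for x, (a, b) in enumerate(zip(row_src, row_out)):
            inside = any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in boxes)
            if inside:
                continue  # pixel belongs to a requested edit region
            total += 1
            if abs(a - b) > tol:
                changed += 1
    return changed / total if total else 0.0
```

A value near zero suggests the editor left the rest of the image alone; a real evaluation would likely need a perceptual tolerance rather than an exact pixel comparison.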

Load-bearing premise

The vision-language model can reliably turn vague instructions into correct region-specific operation sequences and the reflection loop can detect and fix errors without introducing new ones.

What would settle it

A collection of underspecified e-commerce edit instructions where the VLM produces wrong region assignments or the reflection step leaves visible errors or undoes prior correct changes.

Figures

Figures reproduced from arXiv: 2607.00920 by Cheng Wang, Jingling Fu, Junshi Huang, Lichen Ma, Luohang Liu, Shaojie Guo, Xiaoan Liu, Xiaolong Fu, Xinyuan Shan, Yan Li, Yu He, Zipeng Guo.

**Figure 2.** Overview of our framework. Training involves three stages: 1) SFT and rule-based GRPO …

**Figure 3.** Overview of the data synthesis pipeline. Stage 1 extracts structured visual and textual assets, …

**Figure 4.** Overview of training and test data. (a) Sample counts for single-task and multi-task instructions in training. (b) Evaluation set composition by instruction complexity, highlighting benchmark diversity.

Unified Synthesis of Image Editing Data. We synthesize training samples by integrating three core components: the raw product image, a generatively-inpainted background, and structured elements extracte…

**Figure 5.** Qualitative comparison on image editing tasks. Highlighted regions indicate areas where …

**Figure 6.** User study between our method and closed-source models regarding Instruction Following and Consistency

**Figure 7.** More visual comparison with closed-source methods across multiple tasks.

**Figure 8.** Comparison of single-turn vs. multi-turn instruction execution. (Example 1)

**Figure 9.** Comparison of single-turn vs. multi-turn instruction execution. (Example 2)

**Figure 10.** Comparison of single-turn vs. multi-turn instruction execution. (Example 3)

**Figure 11.** Comparison of single-turn vs. multi-turn instruction execution. (Example 4)

Original abstract

Real-world e-commerce image editing often requires multiple, localized, and auditable operations rather than global restyling. This compositional nature poses a dual challenge: models must precisely apply all requested edits to the correct regions while preserving unmodified content, even under ambiguous instructions. Existing one-shot editors conflate intent resolution, spatial grounding, and synthesis into a single step, frequently resulting in partial execution failures, which is unacceptable for commercial scenarios. To address this, we introduce GMO-E$^2$DIT, an agentic editing framework that couples a Vision-Language Model (VLM) with a mask-conditioned image editor to tackle structured multi-turn task completion. Given an underspecified instruction, the VLM agent constructs a region-grounded edit agenda, effectively decoupling cognitive reasoning from generative rendering. The framework then executes sub-programs via operation-aware masks and references, utilizing a reflection-driven loop to inspect intermediate results and determine the subsequent state. This iterative mechanism reliably preserves safe partial progress, retries unfinished operations, and recovers from errors. Furthermore, we develop a unified data pipeline providing aligned supervision for planning, execution, and reflection, alongside EComEditBench, a comprehensive benchmark for instruction-driven evaluation. Extensive experiments demonstrate that GMO-E$^2$DIT achieves competitive performance compared to strong closed-source models, yielding superior instruction accuracy and edit fidelity over existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sketches a sensible agentic split between VLM planning and mask-based rendering with a reflection loop, but the abstract's performance claims have no numbers or setup details to back them up.

The main takeaway is a framework that lets a VLM build a region-grounded edit plan first, then hands off to a mask-conditioned editor and uses a reflection step to check and retry. That decoupling plus the loop for handling partial progress is the concrete addition over one-shot editors.

It targets a real commercial pain point where instructions are vague and edits must stay localized without messing up the rest of the image. Building EComEditBench and a data pipeline that supplies supervision for planning, execution, and reflection shows they thought through the full pipeline rather than just the model.

The abstract states it reaches competitive results against closed-source models and beats baselines on accuracy and fidelity. But there are no metrics, no description of how EComEditBench was built or what its difficulty spread looks like, and no failure cases for the reflection loop. That makes it impossible to judge whether the claimed edge is real or tied to unstated choices in the test set.

This is for applied teams working on retail image tools who need auditable multi-step edits. If the full experiments section has proper baselines, released benchmark data, and clear ablation on the reflection component, the work is worth engaging. Otherwise the central claim stays untestable from the text available.

I would send it for review once the quantitative results are in place to check whether they actually support the superiority statements.

Referee Report

2 major / 2 minor

Summary. The paper introduces GMO-E²DIT, an agentic framework for multi-operation e-commerce image editing. A VLM constructs a region-grounded edit agenda from underspecified instructions, decoupling reasoning from rendering; sub-programs are executed via operation-aware masks and references, with a reflection-driven loop that inspects intermediates, preserves partial progress, retries, and recovers from errors. The authors also contribute a unified data pipeline for planning/execution/reflection supervision and EComEditBench for instruction-driven evaluation. The central claim is that GMO-E²DIT achieves competitive performance with strong closed-source models while delivering superior instruction accuracy and edit fidelity over existing baselines.

Significance. If the performance claims hold, the work would be significant for commercial image-editing pipelines by enabling auditable, compositional edits under ambiguous instructions. Explicit strengths include the introduction of EComEditBench as a new benchmark and the unified data pipeline that supplies aligned supervision for the three stages; these could serve as reusable resources for the community even if the specific agentic loop requires further validation.

major comments (2)

[Abstract] The claim that GMO-E²DIT 'achieves competitive performance... yielding superior instruction accuracy and edit fidelity' is presented without any quantitative metrics, tables, baseline names, or error analysis, rendering the central empirical claim unverifiable from the manuscript.
[Experiments] No description is given of EComEditBench construction, its difficulty distribution, the comparison protocol for closed-source models, or failure-mode analysis of the reflection loop, all of which are load-bearing for assessing whether the reported edge is real or an artifact of benchmark design.

minor comments (2)

[Method] The term 'operation-aware masks' is used without an explicit definition or reference to its construction on first appearance.
[Figures] Figure captions could more explicitly link visual examples to the corresponding agenda steps and reflection outcomes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve transparency and verifiability of the claims.

Referee: [Abstract] The claim that GMO-E²DIT 'achieves competitive performance... yielding superior instruction accuracy and edit fidelity' is presented without any quantitative metrics, tables, baseline names, or error analysis, rendering the central empirical claim unverifiable from the manuscript.

Authors: We agree that the abstract presents the performance claim at a high level without specific numbers or references. The manuscript body includes quantitative tables and baseline comparisons in the Experiments section. To address the concern, we will revise the abstract to incorporate key quantitative highlights (e.g., specific accuracy and fidelity metrics) along with pointers to the relevant tables and sections.

revision: yes

Referee: [Experiments] No description is given of EComEditBench construction, its difficulty distribution, the comparison protocol for closed-source models, or failure-mode analysis of the reflection loop, all of which are load-bearing for assessing whether the reported edge is real or an artifact of benchmark design.

Authors: We acknowledge these details are insufficiently elaborated in the current manuscript. While EComEditBench and the data pipeline are introduced, the construction process, difficulty distribution, closed-source comparison protocol, and reflection-loop failure-mode analysis are not fully described. We will expand the Experiments section to provide these specifics, enabling readers to evaluate the benchmark and results more rigorously.

revision: yes

Circularity Check

0 steps flagged

No equations, derivations, or self-referential reductions present in framework description.

The paper introduces GMO-E²DIT as a new agentic framework that decouples VLM-based agenda construction from mask-conditioned editing and adds a reflection loop. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations are described. The construction is presented as an original system design with an accompanying data pipeline and benchmark; performance claims are empirical rather than derived from prior inputs by construction. This matches the default case of a self-contained descriptive paper with no circularity in any claimed derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities.

[48]

edit_fidelity: How well does the edited result satisfy the requested edit inside the target regions? For insertion tasks, also judge whether the inserted content matches the patch image.

background_preservation: Outside the requested edit regions, is the edited image unchanged compared with the source image?

Instructions. Read the benchmark instruction carefully. In the precise protocol, examine each bounding box and reference region indices in the reasoning. In the fuzzy protocol, identify the intended edit from the natural-language reque...
Unless otherwise specified, all experiments are conducted on the same constructed training dataset and evaluated on the held-out EComEditBench split. All experiments are conducted on 16 NVIDIA B200 GPUs.

E More Experiment Results

E.1 Multi-turn Editing

To evaluate our multi-turn joint RL and reflection-driven mechanism, we conduct experiments on a test se...
