pith. sign in

arxiv: 2507.01908 · v3 · pith:664KFPVSnew · submitted 2025-07-02 · 💻 cs.CV

Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

Pith reviewed 2026-05-19 05:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editingvisual reasoningdiffusion modelsmultimodal large language modelshypothetical instructionsReason50Kinstruction-based editing
0
0 comments X

The pith

ReasonBrain edits images from implicit hypothetical instructions by reasoning across physical, temporal, causal, and story scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets instruction-based image editing for complex implicit hypothetical instructions that demand inference of user intent and plausible visual outcomes, going beyond simple explicit commands like adding or swapping objects. Existing approaches lack support for such reasoning and fine-grained detail handling, so the authors introduce the Reason50K dataset of over 50K samples spanning four reasoning types along with the ReasonBrain framework. ReasonBrain pairs multimodal large language models for generating editing guidance with diffusion-based image synthesis, augmented by a Fine-grained Reasoning Cue Extraction module and a Cross-Modal Enhancer to retain detailed semantics. If the approach holds, AI editors could process natural, ambiguous user requests without requiring step-by-step specifications. Experiments indicate stronger results on reasoning tasks and zero-shot transfer to conventional editing.

Core claim

ReasonBrain reasons over and executes implicit hypothetical instructions for image editing by using MLLMs to produce editing guidance and a diffusion model for synthesis, with the Fine-grained Reasoning Cue Extraction module capturing detailed visual and textual semantics and the Cross-Modal Enhancer enabling rich interactions between those cues and MLLM features; this combination supports four reasoning scenarios and yields better performance than baselines on the Reason50K dataset while generalizing to standard IIE tasks.

What carries the argument

The Fine-grained Reasoning Cue Extraction (FRCE) module paired with the Cross-Modal Enhancer (CME), which together extract and preserve fine-grained visual and textual semantics to enable reasoning over implicit hypothetical instructions.

If this is right

  • Handles complex edits that require inferring user intent and plausible visual changes without explicit details.
  • Delivers stronger results than prior methods on physical, temporal, causal, and story reasoning scenarios.
  • Generalizes in zero-shot fashion to conventional instruction-based image editing tasks.
  • Supplies a large-scale dataset that enables training and evaluation of reasoning-aware editing models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cue-extraction and cross-modal interaction pattern could be tested on video sequences to support temporal reasoning in motion edits.
  • Design tools might adopt this style of guidance generation to interpret vague creative briefs from users.
  • Combining the approach with additional signals such as depth maps or segmentation masks could further reduce semantic drift in multi-step inferences.

Load-bearing premise

The Fine-grained Reasoning Cue Extraction module together with the Cross-Modal Enhancer successfully captures and preserves the detailed visual and textual semantics required to support implicit hypothetical instruction reasoning without semantic loss.

What would settle it

A test suite of hypothetical instructions where the model produces edits that omit or contradict key inferred elements, such as failing to apply causal consequences or temporal changes implied but not explicitly stated in the prompt.

Figures

Figures reproduced from arXiv: 2507.01908 by Chaoyi Wang, Chengjie Wang, Jiangning Zhang, Qingdong He, Xiangtai Li, Xiaobin Hu, Xueqin Chen, Yabiao Wang, Yanjie Pan, Zhenye Gan.

Figure 1
Figure 1. Figure 1: Current efforts fail to handle hypothetical instructions, producing incorrect results, while [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Reasoning scenarios in Reason50K. The percentages in parentheses indicate the proportion [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The overall framework of ReasonBrain. Given an input image [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Network design of the (a) ID Controller and (b) Cross-Modal Enhancer. Cross-Modal Enhancer. To compensate for the potential loss of visual and textual details in Vˆ , we introduce a Cross-Modal Enhancer (CME). The CME consists of a visual-oriented enhancer and a textual-oriented enhancer, both implemented using the same bidirectional inter￾action mechanism. Specifically, each enhancer comprises five hybrid… view at source ↗
Figure 5
Figure 5. Figure 5: Results on ReasonEdit, EditWorld, and Reason50K for ReasonBrain and selected SOTA [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison on Reason50K between ReasonBrain and selected SOTA meth [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison of ablation variants in ReasonBrain. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison of patch and re￾gion branches in VRCB. Impact of Each Visual Branch in VRCB. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
read the original abstract

Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features. Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Reason50K, a new dataset of over 50K samples spanning Physical, Temporal, Causal, and Story reasoning scenarios for hypothetical instruction-based image editing. It proposes ReasonBrain, a framework that combines MLLMs for editing guidance generation with a diffusion model for synthesis, augmented by a Fine-grained Reasoning Cue Extraction (FRCE) module and a Cross-Modal Enhancer (CME) to capture and preserve detailed semantics for implicit reasoning. The central empirical claim is that ReasonBrain outperforms state-of-the-art baselines on reasoning-aware editing tasks while showing strong zero-shot generalization to conventional IIE benchmarks.

Significance. If the quantitative results and ablations hold, the work meaningfully extends instruction-based image editing beyond explicit operations to implicit hypothetical instructions that require inferring user intent and plausible visual changes. The new dataset addresses a clear gap in existing benchmarks, and the architectural additions (FRCE + CME) target semantic preservation in a principled way. Public release of the dataset and code is a clear strength for reproducibility and follow-on research.

major comments (2)
  1. [§5.2, Table 2] §5.2, Table 2: The reported outperformance on the four reasoning scenarios (e.g., higher CLIP similarity and user preference scores) is central to the main claim, yet the paper provides no statistical significance tests, standard deviations across runs, or details on how the baselines were adapted to handle implicit hypothetical instructions; this weakens the reliability of the 'consistently outperforms' assertion.
  2. [§3.3] §3.3, CME description: The claim that the Cross-Modal Enhancer mitigates semantic loss between FRCE cues and MLLM features is load-bearing for the generalization results, but the exact conditioning mechanism into the diffusion U-Net (e.g., cross-attention layers or feature fusion equations) is underspecified, making it impossible to verify that the module actually preserves the fine-grained details required for causal and story reasoning.
minor comments (2)
  1. [Figure 3, §4.1] Figure 3 caption and §4.1: The dataset construction pipeline diagram is helpful but lacks explicit counts or percentages for how many samples fall into each of the four reasoning categories, which would aid readers in assessing balance.
  2. [Related Work] Related Work section: Several recent diffusion-based editing papers (e.g., post-2023 works on instruction tuning) are cited but not compared in the experimental tables; adding a brief discussion of why they were omitted from the zero-shot evaluation would improve context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and positive assessment of the work's significance. We address each major comment below, indicating planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§5.2, Table 2] §5.2, Table 2: The reported outperformance on the four reasoning scenarios (e.g., higher CLIP similarity and user preference scores) is central to the main claim, yet the paper provides no statistical significance tests, standard deviations across runs, or details on how the baselines were adapted to handle implicit hypothetical instructions; this weakens the reliability of the 'consistently outperforms' assertion.

    Authors: We agree that statistical tests and variance reporting would strengthen the reliability of the empirical claims. In the revised version, we will add standard deviations computed over multiple random seeds for all metrics in Table 2 and include paired t-test p-values to establish statistical significance of the reported improvements. We will also expand Section 5.2 with a dedicated paragraph detailing the exact adaptation procedure for each baseline (prompt reformulation to accommodate implicit instructions, any additional fine-tuning steps, and hyperparameter settings), ensuring reproducibility. revision: yes

  2. Referee: [§3.3] §3.3, CME description: The claim that the Cross-Modal Enhancer mitigates semantic loss between FRCE cues and MLLM features is load-bearing for the generalization results, but the exact conditioning mechanism into the diffusion U-Net (e.g., cross-attention layers or feature fusion equations) is underspecified, making it impossible to verify that the module actually preserves the fine-grained details required for causal and story reasoning.

    Authors: We acknowledge that the conditioning details of the Cross-Modal Enhancer (CME) in §3.3 are currently high-level. The CME performs cross-attention where FRCE-derived cues act as keys/values and MLLM features serve as queries; the resulting attended representation is then injected into the diffusion U-Net via additional cross-attention layers at multiple resolutions. We will revise §3.3 to include the precise mathematical formulation of the fusion operation and the layer-wise conditioning equations, together with a supplementary diagram, so that readers can verify how fine-grained semantics are preserved for causal and story reasoning. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical contribution: a newly collected dataset (Reason50K) spanning four reasoning scenarios and a new architecture (ReasonBrain) that combines MLLM guidance generation with a diffusion model, augmented by the FRCE module for cue extraction and the CME for cross-modal interaction. The central claims consist of experimental outperformance on reasoning tasks plus zero-shot generalization to standard IIE, evaluated against external baselines on the new dataset. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described framework. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on standard assumptions of diffusion models and MLLMs plus two newly introduced modules whose effectiveness is asserted rather than derived from first principles.

axioms (2)
  • domain assumption Multimodal large language models can generate reliable editing guidance from implicit hypothetical instructions when supplied with fine-grained visual and textual cues.
    Invoked in the description of ReasonBrain's use of MLLMs for editing guidance generation.
  • domain assumption Diffusion models can synthesize images that faithfully reflect the reasoned editing guidance without introducing artifacts that invalidate the reasoning.
    Implicit in the pipeline that feeds MLLM output into the diffusion model for final synthesis.
invented entities (2)
  • Fine-grained Reasoning Cue Extraction (FRCE) module no independent evidence
    purpose: Capture detailed visual and textual semantics essential for supporting instruction reasoning
    New component introduced to address lack of fine-grained detail extraction in prior architectures.
  • Cross-Modal Enhancer (CME) no independent evidence
    purpose: Enable rich interactions between fine-grained cues and MLLM-derived features to mitigate semantic loss
    New component introduced to reduce semantic loss between modalities.

pith-pipeline@v0.9.0 · 5830 in / 1559 out tokens · 36190 ms · 2026-05-19T05:43:57.457344+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 7 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Humanedit: A high-quality human-rewarded dataset for instruction-based image editing.arXiv preprint arXiv:2412.04280, 2024

    Jinbin Bai, Wei Chow, Ling Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, and Shuicheng Yan. Humanedit: A high-quality human-rewarded dataset for instruction-based image editing.arXiv preprint arXiv:2412.04280,

  3. [3]

    Dynamiccontrol: Adaptive condition selection for improved text-to-image generation.arXiv preprint arXiv:2412.03255,

    Qingdong He, Jinlong Peng, Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Yong Liu, Yabiao Wang, Chengjie Wang, Xiangtai Li, et al. Dynamiccontrol: Adaptive condition selection for improved text-to-image generation.arXiv preprint arXiv:2412.03255,

  4. [4]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685,

  5. [5]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

    ISSN 1046-8188. doi: 10.1145/3703155. URLhttps://doi.org/ 10.1145/3703155. Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, et al. Smartedit: Exploring complex instruction-based image editing with multimodal large language models. InProceedings of the IEEE/CVF Confer- ence on ...

  6. [6]

    Hq-edit: A high-quality dataset for instruction-based image editing.arXiv preprint arXiv:2404.09990,

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing.arXiv preprint arXiv:2404.09990,

  7. [7]

    Ying Jin, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, and Dahua Lin

    URLhttps: //openreview.net/forum?id=rsZwwjYHuD. Ying Jin, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, and Dahua Lin. Reasonpix2pix: instruction reasoning dataset for advanced image editing.arXiv preprint arXiv:2405.11190,

  8. [8]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Doll ´ar, and Ross Girshick. Segment anything.arXiv:2304.02643,

  9. [9]

    Brushedit: All-in-one image inpainting and editing.arXiv preprint arXiv:2412.10316,

    Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Yuexian Zou, and Qiang Xu. Brushedit: All-in-one image inpainting and editing.arXiv preprint arXiv:2412.10316,

  10. [10]

    Pixwizard: Versatile image-to-image visual assis- 15 tant with open-language instructions,

    Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Huan Teng, Junlin Xie, Yu Qiao, Peng Gao, et al. Pixwizard: Versatile image-to-image visual assistant with open- language instructions.arXiv preprint arXiv:2409.15278,

  11. [11]

    Decoupled Weight Decay Regularization

    Chang Liu, Xiangtai Li, and Henghui Ding. Referring image editing: Object-level image editing via referring expressions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13128–13138, 2024a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing sy...

  12. [12]

    Phybench: A physical commonsense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802,

    Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tian- shuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models.arXiv preprint arXiv:2406.11802,

  13. [13]

    Instruction-guided editing controls for images and multimedia: A survey in llm era.arXiv preprint arXiv:2411.09955, 2024a

    Thanh Tam Nguyen, Zhao Ren, Trinh Pham, Thanh Trung Huynh, Phi Le Nguyen, Hongzhi Yin, and Quoc Viet Hung Nguyen. Instruction-guided editing controls for images and multimedia: A survey in llm era.arXiv preprint arXiv:2411.09955, 2024a. 11 Trong-Tung Nguyen, Duc-Anh Nguyen, Anh Tran, and Cuong Pham. Flexedit: Flexible and con- trollable diffusion-based ob...

  14. [14]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with clip latents.arXiv preprint arXiv:2204.06125, 1(2):3,

  15. [15]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714,

  16. [16]

    Smartfreeedit: Mask-free spatial-aware image editing with complex instruction understanding.arXiv preprint arXiv:2504.12704,

    Qianqian Sun, Jixiang Luo, Dell Zhang, and Xuelong Li. Smartfreeedit: Mask-free spatial-aware image editing with complex instruction understanding.arXiv preprint arXiv:2504.12704,

  17. [17]

    Mige: A unified framework for multimodal instruction-based image generation and editing

    URLhttps://openreview. net/forum?id=zGb4WgCW5i. Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, and Huawei Shen. Mige: A unified framework for multimodal instruction-based image generation and editing.arXiv preprint arXiv:2502.21291,

  18. [18]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

  19. [19]

    Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal rea- soning.arXiv preprint arXiv:2401.06805, 2024a. Yuhan Wang, Siwei Yang, Bingchen Zha...

  20. [20]

    12 Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu

    URLhttps:// arxiv.org/abs/2507.21033. 12 Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing.Advances in Neural Information Processing Systems, 37: 128374–128395, 2024b. Ling Yang, Bohan Zeng, Jiaming Liu, Hong Li, Minghao Xu, Wentao Zhang, and Shuicheng Yan. Editworld: Simulatin...

  21. [21]

    Complex- edit: Cot-like instruction generation for complexity-controllable image editing benchmark.arXiv preprint arXiv:2504.13143,

    Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, and Cihang Xie. Complex- edit: Cot-like instruction generation for complexity-controllable image editing benchmark.arXiv preprint arXiv:2504.13143,

  22. [22]

    Anyedit: Mastering unified high-quality image editing for any idea.arXiv preprint arXiv:2411.15738,

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea.arXiv preprint arXiv:2411.15738,

  23. [23]

    Fireedit: Fine-grained instruction-based image editing via region-aware vision language model.arXiv preprint arXiv:2503.19839,

    Jun Zhou, Jiahao Li, Zunnan Xu, Hanhui Li, Yiji Cheng, Fa-Ting Hong, Qin Lin, Qinglin Lu, and Xiaodan Liang. Fireedit: Fine-grained instruction-based image editing via region-aware vision language model.arXiv preprint arXiv:2503.19839,

  24. [24]

    Each initial instruction is then rewritten into a hypothetical form using prompt-based rewriting with GPT (Achiam et al., 2023)

    to generate target images along with their initial instructions. Each initial instruction is then rewritten into a hypothetical form using prompt-based rewriting with GPT (Achiam et al., 2023). In parallel, we use SpaCy 2 to perform named entity recognition (NER) on the initial instruction to extract candidate objects for source image generation. These ca...

  25. [25]

    The objective is defined as: LMLLM =− rX i=1 logp {θ∪θLoRA} ([IMGi]|IA(E I(I)), RV , RT ,E T (H),[IMG 1],

    for efficient adaptation. The objective is defined as: LMLLM =− rX i=1 logp {θ∪θLoRA} ([IMGi]|IA(E I(I)), RV , RT ,E T (H),[IMG 1], . . . ,[IMGi−1]), (A1) whereθ LoRA denotes the trainable parameters introduced by LoRA. This loss minimizes the negative log-likelihood of predicting each learnable token[IMG i]conditioned on the fine-grained features and pre...

  26. [26]

    To evaluate generaliza- tion on conventional understanding scenarios, we further test on theMagicBrush Test Set(Zhang et al.,

    andEditWorld(Yang et al., 2024). To evaluate generaliza- tion on conventional understanding scenarios, we further test on theMagicBrush Test Set(Zhang et al.,

  27. [27]

    and theEmu Edit Test Set(Sheynin et al., 2024). Metrics: To evaluate performance under reasoning scenarios, we adopt three metrics:CLIP Score(Radford et al., 2021),MLLM Score(Yang et al., 2024), andInstruction Alignment (Ins- Align)(Huang et al., 2024). Here, CLIP Score measures the semantic similarity between the edited image and the expected output text...

  28. [28]

    fine-tuning, with rank and alpha of 8 and 16, respectively. Fol- lowing (Huang et al., 2024; Fu et al., 2024), we expand the original LLM vocabulary with 32 new tokens, and the QFormer is composed of 6 transformer layers and 77 learnable query tokens. For the base editing model, we implement it with Flux (Labs,

  29. [29]

    Models for other qualitative results are implemented using SD series (CompVis, 2022; SimianLuo, 2024; Rombach et al., 2022a; AI,

    using FLUX.1-dev, which consists of 12B parameters. Models for other qualitative results are implemented using SD series (CompVis, 2022; SimianLuo, 2024; Rombach et al., 2022a; AI,

  30. [30]

    This indicates that even in instruction-based reasoning scenarios, a lightweight MLLM can not only accelerate inference but also effectively identify the editing target and execute precise edits, supported by its strong reasoning ability and rich world knowledge. 15 Method Inference Time (s) InstructPix2Pix 26 MagicBrush 28 MGIE 37 SmartEdit 33 UltraEdit ...

  31. [31]

    and GPT-Image-Edit (Wang et al., 2025)), relying primarily on synthetic data. While synthetic datasets provide semantic clarity and high visual quality, they may not fully capture the complexities, ar- tifacts, and temporal dynamics inherent in real-world video data. Collecting high-quality datasets from real-world video sources remains particularly chall...