pith. machine review for the scientific record.

arxiv: 2603.01586 · v3 · submitted 2026-03-02 · 💻 cs.CV

Recognition: no theorem link

InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords image editing · spatial reasoning · visual grounding · chain-of-thought · multimodal reasoning · fine-grained editing · bounding box generation

The pith

InterCoG interleaves text-only spatial reasoning with visual grounding to achieve precise edits in complex multi-entity scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InterCoG as a framework that addresses fine-grained image editing when targets are not visually obvious and spatial relations among multiple objects matter. It begins with object position reasoning conducted entirely in text that incorporates spatial details to identify the target location and identity. This step is followed by visual grounding that produces bounding boxes and masks, after which the editing instruction is rewritten to match the intended result. Two auxiliary training modules enforce localization accuracy and reasoning consistency, and the approach is tested on a newly built dataset of 45,000 grounding-oriented editing samples.

Core claim

InterCoG is a text-vision interleaved chain-of-grounding framework that first performs object position reasoning solely within text containing spatial relation details to deduce the location and identity of the editing target, then conducts visual grounding by generating bounding boxes and masks in pixel space, and finally rewrites the editing description to specify intended outcomes. Multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment modules are added to improve spatial accuracy and interpretability. Experiments on GroundEdit-45K and GroundEdit-Bench show gains in precise edits under spatially intricate and multi-entity conditions.
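
The two auxiliary modules are named only at this level of detail, so the following is a minimal sketch of how such objectives are commonly combined with a main editing loss, assuming a mask-decoding head and pooled features from the text-reasoning and visual-grounding stages; the function name, feature shapes, and loss weights are hypothetical, not the paper's implementation.

```python
# Hedged sketch: one way the main editing loss could be combined with the two
# auxiliary objectives (grounding reconstruction and reasoning alignment).
# Shapes, names, and weights are illustrative, not the authors' code.
import torch
import torch.nn.functional as F

def total_loss(edit_loss: torch.Tensor,
               pred_mask_logits: torch.Tensor,       # (B, 1, H, W) from a mask-decoding head
               gt_mask: torch.Tensor,                # (B, 1, H, W) binary target mask
               text_reasoning_feat: torch.Tensor,    # (B, D) pooled text-stage features
               visual_grounding_feat: torch.Tensor,  # (B, D) pooled grounding-stage features
               w_recon: float = 1.0,
               w_align: float = 0.5) -> torch.Tensor:
    # Multimodal grounding reconstruction: supervise pixel-space mask prediction.
    recon = F.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask.float())
    # Multimodal grounding reasoning alignment: pull the text-stage and
    # vision-stage representations toward agreement.
    align = 1.0 - F.cosine_similarity(text_reasoning_feat, visual_grounding_feat, dim=-1).mean()
    return edit_loss + w_recon * recon + w_align * align
```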

What carries the argument

The interleaved chain-of-grounding process that sequences text-only spatial position reasoning, followed by generation of bounding boxes and masks for visual grounding, and ends with rewriting of the edit description.
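
A minimal sketch of that three-stage sequence, written against a hypothetical multimodal model interface; the method names `reason_in_text`, `ground`, `rewrite_instruction`, and `edit` are placeholders, not the paper's API.

```python
# Hedged sketch of the interleaved chain-of-grounding sequence; `model` stands
# in for whatever multimodal backbone actually implements these calls.
from dataclasses import dataclass

@dataclass
class Grounding:
    box: tuple    # (x1, y1, x2, y2) in pixel coordinates
    mask: object  # per-pixel mask for the selected target

def interleaved_chain_of_grounding(model, image, instruction: str):
    # Stage 1: text-only position reasoning. Deduce the target's identity and
    # location from spatial relations in the instruction, before any grounding.
    rationale = model.reason_in_text(
        prompt=f"Instruction: {instruction}\n"
               f"Name the edit target and where it sits relative to other objects.")

    # Stage 2: visual grounding. Emit a bounding box and mask in pixel space
    # for the target named by the textual rationale.
    grounding: Grounding = model.ground(image=image, rationale=rationale)

    # Stage 3: instruction rewriting. Restate the edit so it specifies the
    # intended outcome for the grounded region only.
    rewritten = model.rewrite_instruction(
        instruction=instruction, rationale=rationale, grounding=grounding)

    # The final edit is conditioned on the rewritten instruction plus the
    # grounding cues, rather than on the raw user instruction alone.
    return model.edit(image=image, instruction=rewritten, grounding=grounding)
```

The ordering is the design choice the paper defends: the model commits to a target in language before any pixel-level operation is performed.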

If this is right

  • Fine-grained edits become possible in scenes where targets lack visual salience and must be located through spatial relations described in text.
  • The auxiliary supervision modules improve both pixel-level localization accuracy and the interpretability of the reasoning steps.
  • A dataset of 45K samples with detailed reasoning annotations can be used to train models that follow the same interleaved text-then-vision sequence (a sketch of one such annotated sample appears after this list).
  • Rewriting the edit description after grounding produces instructions that more closely match the spatial outcome intended by the user.
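
For the dataset bullet above, a guess at what one grounding-oriented training sample might contain; the values are placeholders and the field names are inferred from the construction pipeline and the CoT-generation prompts the paper describes (scene description, target localization, editing description, post-edit result), not the released schema.

```python
# Hedged sketch of one GroundEdit-45K record; everything below is a placeholder,
# not the released schema. The reasoning sub-fields mirror the four-part CoT the
# paper's generation prompts describe.
sample = {
    "source_image": "images/000123_src.jpg",
    "edited_image": "images/000123_tgt.jpg",      # context-preserving local edit
    "editing_type": "Subject Addition",           # one of 8 local editing categories
    "instruction": "Add a hat to the mother who is leading two children.",
    "reasoning": {
        "scene_description": "...",               # what the image shows overall
        "target_localization": "...",             # category, appearance, position cues
        "editing_description": "...",             # what changes, without touching other objects
        "post_edit_description": "...",           # what the result should look like
    },
    "grounding": {
        "bbox": [412, 96, 551, 388],              # (x1, y1, x2, y2) in pixels, illustrative
        "mask": "masks/000123.png",
    },
}
```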

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same text-first reasoning step could be applied to other multimodal tasks that require locating objects from relational language before acting on an image.
  • If text reasoning proves reliable across domains, early stages of editing pipelines might avoid loading the full image until after the target is identified.
  • The method suggests a general pattern for reducing visual ambiguity by front-loading spatial deduction in language before committing to pixel operations.

Load-bearing premise

Object position reasoning performed solely in text without any visual input can reliably identify the correct target location and identity in complex real-world scenes.

What would settle it

A test set of images with multiple similar objects where the text spatial reasoning step selects the wrong entity, causing the subsequent visual grounding and edit to be applied to an unintended region.
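
A minimal sketch of how such a stress test could be scored, assuming each benchmark item carries a ground-truth box for the intended target and the full text-reasoning-plus-grounding stack is wrapped in a single `predict_box` callable; the item schema and threshold are illustrative, not the paper's evaluation code.

```python
# Hedged sketch of scoring the proposed stress test: does the text-reasoning +
# grounding stack land on the intended entity? `predict_box` and the item
# schema are placeholders.
from typing import Callable, Iterable

Box = tuple  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def referent_selection_accuracy(items: Iterable[dict],
                                predict_box: Callable[[dict], Box],
                                iou_threshold: float = 0.5) -> float:
    """Fraction of ambiguous items where the grounded box overlaps the intended target."""
    hits = total = 0
    for item in items:  # each item: {"image": ..., "instruction": ..., "gt_box": (x1, y1, x2, y2)}
        total += 1
        if iou(predict_box(item), item["gt_box"]) >= iou_threshold:
            hits += 1
    return hits / total if total else 0.0
```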

Figures

Figures reproduced from arXiv: 2603.01586 by Chunwei Wang, Fan Li, Hao Wu, Mingwen Shao, Wangmeng Zuo, Yecong Wan.

Figure 1. InterCoG, a framework for spatially precise image editing in complex scenes via interleaved chain-of-grounding reasoning: it first conducts position reasoning (textual grounding), then highlights bounding boxes and masks on the image (visual grounding), and finally rewrites the editing description to produce the final result.
Figure 2. GroundEdit-45K dataset construction pipeline and statistics. Left: the pipeline consists of three steps: (1) target selection and visual grounding generation; (2) instruction and text reasoning generation; (3) context-preserving local target editing. Right: the dataset contains 45K samples covering 8 categories of local editing types.
Figure 3. Overview of the InterCoG framework. Left: text-vision interleaved chain-of-grounding reasoning that interprets and locates user-intended targets and formulates editing descriptions. Right: the proposed text-vision and vision-vision reasoning alignment schemes.
Figure 4. Qualitative comparisons on GroundEdit-Bench, particularly in multi-entity and fine-grained reasoning scenarios.
Figure 5. Visualization of the interleaved localization chain-of-thought reasoning: InterCoG first interprets user-intended referential targets via textual reasoning, then highlights the object via bounding boxes and masks in pixel space.
Figure 8. Visual comparison with and without the proposed multimodal grounding reasoning alignment.
Figure 6. Visual comparison between text-only grounding reasoning and interleaved chain-of-grounding reasoning.
Figure 7. Visual comparison with and without the proposed multimodal grounding reconstruction supervision.
Figure 9. Multi-object editing with the proposed interleaved chain-of-grounding paradigm.
Figure 10. Failure cases with ambiguous position and boundary definition.
Figures 11-18. Qualitative comparisons on GroundEdit-Bench.
Figure 19. Prompts employed for trivial-sample filtering as well as for instruction and CoT generation.
Figure 20. Prompts employed for instruction rewriting and quality verification.
read the original abstract

Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes InterCoG, a text-vision interleaved chain-of-grounding framework for fine-grained image editing in complex multi-entity scenes. The method first performs object position reasoning solely in text (using the editing instruction and spatial-relation details) to deduce target identity and location, then generates bounding boxes and masks for visual grounding, and finally rewrites the editing description. Two auxiliary training modules (multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment) are introduced to improve localization accuracy and interpretability. The authors also release GroundEdit-45K (45K grounding-oriented editing samples with reasoning annotations) and GroundEdit-Bench for evaluation, claiming extensive experiments demonstrate superiority for spatially intricate edits.

Significance. If the central claims hold, InterCoG would advance unified image-editing models by making spatial reasoning explicit and interleaved rather than implicit, particularly for non-salient targets in multi-object scenes. The new dataset and benchmark could serve as useful resources for future grounding-aware editing research.

major comments (2)
  1. [§3.2] The text-only position reasoning step deduces target identity and location solely from textual spatial relations before any visual input. In multi-entity scenes this step is load-bearing for the superiority claim, yet the manuscript provides no quantitative evaluation of reasoning accuracy (e.g., percentage of cases where the LLM selects the correct referent when multiple objects satisfy the same relational description) or analysis of failure modes under spatial ambiguity.
  2. [Experiments] The claim, made in both the experiments section and the abstract, that 'extensive experiments substantiate the superiority' is not supported by any reported metrics, baselines, or error breakdowns in the provided description. Quantitative results (e.g., IoU for grounding, edit success rate, comparison tables against prior unified editors) are required to substantiate performance gains on spatially intricate scenes.
minor comments (2)
  1. [Abstract] The abstract would benefit from one or two concrete quantitative highlights (e.g., average improvement on GroundEdit-Bench) rather than only qualitative assertions.
  2. [§3] Notation for the interleaved reasoning stages (text reasoning → box/mask generation → description rewrite) should be formalized with consistent symbols or a diagram to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the need for stronger validation of the text-only reasoning component and more explicit quantitative support for our claims. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§3.2] The text-only position reasoning step deduces target identity and location solely from textual spatial relations before any visual input. In multi-entity scenes this step is load-bearing for the superiority claim, yet the manuscript provides no quantitative evaluation of reasoning accuracy (e.g., percentage of cases where the LLM selects the correct referent when multiple objects satisfy the same relational description) or analysis of failure modes under spatial ambiguity.

    Authors: We agree that a direct quantitative evaluation of the text-only position reasoning accuracy would strengthen the paper, particularly for multi-entity scenes with spatial ambiguity. In the revised manuscript, we will add an analysis on a subset of GroundEdit-Bench samples featuring relational ambiguities, reporting referent selection accuracy and discussing failure modes such as insufficient textual cues or ambiguous spatial relations. revision: yes

  2. Referee: [Experiments] The claim, made in both the experiments section and the abstract, that 'extensive experiments substantiate the superiority' is not supported by any reported metrics, baselines, or error breakdowns in the provided description. Quantitative results (e.g., IoU for grounding, edit success rate, comparison tables against prior unified editors) are required to substantiate performance gains on spatially intricate scenes.

    Authors: We acknowledge that the superiority claims require more explicit quantitative backing with metrics and breakdowns. In the revision, we will expand the experiments section to include detailed tables with IoU for grounding, edit success rates, comparisons against prior unified editors, and error analysis focused on spatially intricate multi-entity scenes. revision: yes
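
For context on what such tables would report, here is a minimal sketch of two aggregate numbers, assuming per-sample 1-5 judge scores of the kind the paper's evaluation prompts describe (region edit accuracy, visual quality) and access to the source image, edited output, and edit mask; field names and thresholds are illustrative, not the authors' protocol.

```python
# Hedged sketch of two aggregate numbers such tables could report: an edit
# success rate from per-sample judge scores, and PSNR restricted to pixels the
# edit was supposed to leave alone. Field names and thresholds are illustrative.
import numpy as np

def background_psnr(src: np.ndarray, out: np.ndarray, mask: np.ndarray) -> float:
    """PSNR over pixels where mask == 0, i.e., regions meant to stay untouched."""
    keep = mask == 0
    diff = src[keep].astype(np.float64) - out[keep].astype(np.float64)
    mse = float(np.mean(diff ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def edit_success_rate(samples: list,
                      min_accuracy: float = 4.0,
                      min_quality: float = 4.0) -> float:
    """Fraction of samples whose 1-5 judge scores clear both thresholds."""
    ok = sum(1 for s in samples
             if s["region_edit_accuracy"] >= min_accuracy
             and s["visual_quality"] >= min_quality)
    return ok / len(samples) if samples else 0.0
```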

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces an original interleaved text-vision reasoning framework, auxiliary multimodal training modules, and a newly constructed GroundEdit-45K dataset with annotations. Central claims rest on empirical superiority demonstrated via experiments on complex scenes rather than any self-definitional reduction, fitted-parameter renaming, or load-bearing self-citation chains. No equations or steps in the abstract or described method reduce the output to inputs by construction; the text-only position reasoning is presented as an independent design choice supported by new supervision signals.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on the abstract; no explicit free parameters, axioms, or invented entities beyond the proposed framework itself are detailed. The approach assumes standard multimodal grounding techniques work when interleaved with text reasoning.

axioms (1)
  • domain assumption Existing vision-language models can perform accurate bounding box and mask generation once targets are identified in text
    Invoked implicitly in the visual grounding stage described in the abstract.
invented entities (1)
  • InterCoG interleaved reasoning framework (no independent evidence)
    purpose: To achieve spatially precise edits via text-first position reasoning followed by visual grounding
    Newly introduced method without independent external validation in the abstract

pith-pipeline@v0.9.0 · 5526 in / 1262 out tokens · 26594 ms · 2026-05-15T18:32:40.902121+00:00 · methodology

discussion (0)

