arxiv: 2509.25866 · v2 · submitted 2025-09-30 · 💻 cs.CV

DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning

Pith reviewed 2026-05-18 13:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal reasoning · vision-language models · visual thoughts · interleaved image-text reasoning · visual embedding space · tool-free reasoning · image manipulation

The pith

DeepSketcher enables vision-language models to generate visual thoughts by operating directly in the visual embedding space instead of using external tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to show that vision-language models can internalize visual manipulation: the model generates edited images as intermediate thoughts during reasoning and uses them without calling outside tools. To do this, they build a dataset of 31k reasoning trajectories, which the paper describes as accurately annotated, and train a model that works directly in the visual embedding space. A sympathetic reader would care because this could make iterative, image-based reasoning cheaper and more flexible than pipelines that render and re-encode every edit.

Core claim

The central discovery is a model that performs interleaved image-text reasoning and natively generates visual thoughts by operating directly in the visual embedding space. This is supported by a new dataset containing 31k reasoning trajectories with tool calls and resulting edited images, allowing the model to avoid repeated re-encoding of generated images and enabling tool-free thinking with images.

What carries the argument

Operating directly in the visual embedding space to generate visual thoughts during reasoning.
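
To make the contrast concrete, here is a toy sketch of the two control flows the abstract describes: a tool route that edits pixels and re-encodes the result at every step, and an internal route that edits embeddings in place. Everything in it (`ToyEncoder`, `tool_edit`, `embedding_edit`, the "brighten" edit) is a hypothetical stand-in chosen for illustration. It is not the paper's architecture and says nothing about how faithful a learned embedding editor would be.

```python
from typing import List, Tuple

Embedding = List[float]  # toy stand-in for one visual token embedding


class ToyEncoder:
    """Hypothetical vision encoder that counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def encode(self, pixels: List[float]) -> List[Embedding]:
        self.calls += 1
        return [[p, 1.0 - p] for p in pixels]  # one toy embedding per pixel


def tool_edit(pixels: List[float]) -> List[float]:
    """Placeholder for an external drawing tool: brighten every pixel."""
    return [min(1.0, p + 0.5) for p in pixels]


def embedding_edit(embeddings: List[Embedding]) -> List[Embedding]:
    """Placeholder embedding editor: the same brightening, applied to embeddings."""
    return [[min(1.0, e[0] + 0.5), max(0.0, e[1] - 0.5)] for e in embeddings]


def run_tool_route(pixels: List[float], num_edits: int) -> Tuple[List[List[Embedding]], int]:
    """Edit pixels with the tool, then re-encode the edited image each step."""
    encoder = ToyEncoder()
    context = [encoder.encode(pixels)]
    for _ in range(num_edits):
        pixels = tool_edit(pixels)
        context.append(encoder.encode(pixels))
    return context, encoder.calls


def run_internal_route(pixels: List[float], num_edits: int) -> Tuple[List[List[Embedding]], int]:
    """Encode once, then produce each visual thought directly from embeddings."""
    encoder = ToyEncoder()
    context = [encoder.encode(pixels)]
    for _ in range(num_edits):
        context.append(embedding_edit(context[-1]))
    return context, encoder.calls


tool_context, tool_calls = run_tool_route([0.0, 0.25, 0.75], num_edits=2)
internal_context, internal_calls = run_internal_route([0.0, 0.25, 0.75], num_edits=2)
```

In this toy the two routes yield identical contexts because the embedding edit was hand-written to mirror the pixel edit exactly; the only difference is the number of encoder calls. Whether a trained editor achieves comparable agreement is the open question raised under the load-bearing premise.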

Load-bearing premise

That manipulations done directly in the visual embedding space can match the accuracy and fidelity of those performed by external visual tools without losing key information or adding artifacts.

What would settle it

A side-by-side evaluation of the visual outputs and reasoning outcomes produced by the internal model versus those from external tools on identical manipulation tasks from the benchmark.
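
One way such a side-by-side comparison could be tallied is sketched below. It assumes each benchmark item yields an embedding and a final answer from both the internal model and the tool pipeline; the record keys (`internal_emb`, `tool_emb`, `internal_answer`, `tool_answer`, `gold`) and the choice of cosine similarity are illustrative assumptions, not a protocol from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def side_by_side(records: List[Dict]) -> Dict[str, float]:
    """Aggregate embedding similarity and answer-level outcomes per route."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    return {
        # How close the internal visual thought is to the re-encoded tool output.
        "mean_cosine": sum(cosine(r["internal_emb"], r["tool_emb"]) for r in records) / n,
        # Accuracy of each route against the gold answer.
        "internal_accuracy": sum(r["internal_answer"] == r["gold"] for r in records) / n,
        "tool_accuracy": sum(r["tool_answer"] == r["gold"] for r in records) / n,
        # How often the two routes reach the same final answer.
        "answer_agreement": sum(r["internal_answer"] == r["tool_answer"] for r in records) / n,
    }


# Made-up records purely to exercise the function.
demo_records = [
    {"internal_emb": [1.0, 0.0], "tool_emb": [1.0, 0.0],
     "internal_answer": "A", "tool_answer": "A", "gold": "A"},
    {"internal_emb": [1.0, 0.0], "tool_emb": [0.0, 1.0],
     "internal_answer": "B", "tool_answer": "C", "gold": "C"},
]
report = side_by_side(demo_records)
```

Reporting similarity and accuracy separately matters here: high answer agreement with low embedding similarity would suggest the model succeeds without faithfully reproducing the edit.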

Figures

Figures reproduced from arXiv: 2509.25866 by Chi Zhang, Haibo Qiu, Jing Zhang, Lin Ma, Qiming Zhang, Zhixiong Zeng.

Figure 1
Figure 1: In code space (right), edits are specified through rendering code, offering precision and …
Figure 3
Figure 3: Wordcloud of visual manipulations.
  Rank  Category                 Count   Share (%)
  1     Labeling/Annotation      12,340  20.9
  2     Highlighting             10,437  17.7
  3     Color Operations          7,383  12.5
  4     Circle Drawing            6,942  11.8
  5     Line Drawing              6,919  11.7
  6     Point Marking             3,924   6.6
  7     Area/Region Operations    2,641   4.5
  8     Shape Drawing             2,549   4.3
  9     Others                    5,919  10.0
        Total                    59,054  100.0
Figure 2
Figure 2: Disciplinary coverage of our dataset
Figure 4
Figure 4: Architecture of the proposed DeepSketcher model. A query …
Figure 5
Figure 5: Difference map visualizations. Each example shows the input image (left), the program …
Figure 6
Figure 6: Overview of img2code pipeline
Figure 7
Figure 7: Overview of the DeepSketcher data curation pipeline. We first construct a dataset of VQA …
Appendix C: Training details
We adopt Qwen2.5-VL-7B (Bai et al., 2025) as the base model. Our implementation is built on LLaMA-Factory (Zheng et al., 2024). The training is carried out in three stages: first, the intermediate tool-calling model is trained on the seed data for 5 epochs with a learning rate of 5 × 10⁻⁶; next, the embedding editor is trained on the full dataset for 10 epochs with a learning rate of …
Figure 8
Figure 8: Difference map visualization from public benchmarks. (a) Alignment between attention …
Figure 9
Figure 9: The Solver LLM starting prompt.
Writing assistance. LLMs were utilized to support the preparation and refinement of this manuscript. Their assistance covered tasks such as proofreading for grammatical accuracy, improving sentence flow and clarity, and rephrasing passages to enhance readability. All generated text was carefully reviewed, assessed, and revised by the authors to ensure the accuracy, consiste…
Figure 10
Figure 10: The prompt template when Solver LLM receives updated visual information.
Figure 11
Figure 11: The prompt template for Code Editor LLM.
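
The share column in the Figure 3 table can be recomputed from the counts. The short check below does that. It assumes the "Others" row corresponds to the count 5,919, which is the value that makes the rows sum to the stated total of 59,054; the extracted caption is ambiguous on that row, so treat this as a consistency check rather than a correction from the paper.

```python
# Counts per manipulation category, as listed in the Figure 3 table.
counts = {
    "Labeling/Annotation": 12_340,
    "Highlighting": 10_437,
    "Color Operations": 7_383,
    "Circle Drawing": 6_942,
    "Line Drawing": 6_919,
    "Point Marking": 3_924,
    "Area/Region Operations": 2_641,
    "Shape Drawing": 2_549,
    "Others": 5_919,  # assumption: the count that matches the stated total
}

total = sum(counts.values())

# Share of each category in percent, rounded to one decimal as in the table.
shares = {name: round(100 * count / total, 1) for name, count in counts.items()}

for name, share in shares.items():
    print(f"{name:<24} {counts[name]:>7,} {share:>6.1f}")
print(f"{'Total':<24} {total:>7,}")
```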
Original abstract

The "thinking with images" paradigm represents a pivotal shift in the reasoning of Vision Language Models (VLMs), moving from text-dominant chain-of-thought to image-interactive reasoning. By invoking visual tools or generating intermediate visual representations, VLMs can iteratively attend to fine-grained regions, enabling deeper image understanding and more faithful multimodal reasoning. As an emerging paradigm, however, it still leaves substantial room for exploration in data construction accuracy, structural design, and broader application scenarios, which offer rich opportunities for advancing multimodal reasoning. To further advance this line of work, we present DeepSketcher, a comprehensive suite comprising both an image-text interleaved dataset and a self-contained model. The dataset contains 31k chain-of-thought (CoT) reasoning trajectories with diverse tool calls and resulting edited images, covering a wide range of data types and manipulation instructions with high annotation accuracy. Building on this resource, we design a model that performs interleaved image-text reasoning and natively generates "visual thoughts" by operating directly in the visual embedding space, rather than invoking external tools and repeatedly re-encoding generated images. This design enables tool-free and more flexible "thinking with images". Extensive experiments on multimodal reasoning benchmarks demonstrate strong performance, validating both the utility of the dataset and the effectiveness of the model design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DeepSketcher, a dataset of 31k image-text interleaved CoT reasoning trajectories involving diverse tool calls and resulting edited images (with claimed high annotation accuracy across data types and manipulation instructions), together with a self-contained VLM that performs interleaved image-text reasoning by natively generating visual thoughts through direct operations in the visual embedding space rather than external tools and repeated re-encoding.

Significance. If the central claims hold, the work would meaningfully advance the 'thinking with images' paradigm by demonstrating that visual manipulation can be internalized for tool-free, more flexible multimodal reasoning. The scale of the curated trajectory dataset and the self-contained model design represent concrete contributions that could support further research in visual CoT; credit is due for attempting to move beyond repeated external tool invocation.

major comments (2)
  1. [Model Design and Training] The core claim that operations performed directly in the visual embedding space faithfully replicate the accuracy and effects of the external manipulation tools used to generate the 31k dataset trajectories (without artifacts or loss of critical spatial/high-frequency information) is load-bearing for the asserted tool-free advantage, yet no quantitative fidelity metrics, pixel-level comparisons, or ablation studies contrasting internal visual thoughts against the ground-truth edited images appear to be reported. Visual embeddings are compressed representations that typically discard details essential for precise edits such as cropping or region drawing; without such verification the interleaved reasoning may rely on approximate rather than faithful visual simulation.
  2. [Experiments and Evaluation] The abstract asserts 'high annotation accuracy' for the 31k trajectories and 'strong performance' on multimodal reasoning benchmarks, but the provided text supplies no quantitative results, error bars, dataset construction details, ablation studies, or specific benchmark numbers to substantiate these assertions. This absence makes it impossible to evaluate whether the reported gains are robust or whether they reduce to the quality of the external-tool-generated supervision.
minor comments (1)
  1. [Abstract and Introduction] Notation for 'visual thoughts' and the precise mechanism of embedding-space operations could be clarified with a diagram or pseudocode to distinguish them from standard VLM image generation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We provide detailed responses to each major comment and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Model Design and Training] The core claim that operations performed directly in the visual embedding space faithfully replicate the accuracy and effects of the external manipulation tools used to generate the 31k dataset trajectories (without artifacts or loss of critical spatial/high-frequency information) is load-bearing for the asserted tool-free advantage, yet no quantitative fidelity metrics, pixel-level comparisons, or ablation studies contrasting internal visual thoughts against the ground-truth edited images appear to be reported. Visual embeddings are compressed representations that typically discard details essential for precise edits such as cropping or region drawing; without such verification the interleaved reasoning may rely on approximate rather than faithful visual simulation.

    Authors: We agree that direct verification of fidelity is important for supporting the tool-free claim. The current manuscript does not report explicit quantitative fidelity metrics (e.g., PSNR, SSIM, or perceptual distances) or pixel-level ablations comparing internally generated visual thoughts to the external-tool ground truth. We will add these analyses and corresponding ablation studies in the revised manuscript to demonstrate preservation of spatial and high-frequency details. revision: yes

  2. Referee: [Experiments and Evaluation] The abstract asserts 'high annotation accuracy' for the 31k trajectories and 'strong performance' on multimodal reasoning benchmarks, but the provided text supplies no quantitative results, error bars, dataset construction details, ablation studies, or specific benchmark numbers to substantiate these assertions. This absence makes it impossible to evaluate whether the reported gains are robust or whether they reduce to the quality of the external-tool-generated supervision.

    Authors: We acknowledge that the current manuscript version does not present the quantitative results, error bars, ablation studies, or dataset construction details with sufficient prominence or completeness in the main text. We will revise the paper to include specific benchmark numbers with error bars, expanded ablation studies, and detailed dataset construction and annotation accuracy verification procedures. revision: yes
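
Of the fidelity metrics named in the first response, PSNR is the simplest to state: PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared pixel error against the reference and MAX is the largest possible pixel value. A minimal pure-Python sketch follows. Applying it to this paper presupposes that an internally generated visual thought can be decoded back to pixels, which the text above does not describe, so that step is an assumption.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def psnr(reference: Image, candidate: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; math.inf when the images are identical."""
    ref = [p for row in reference for p in row]
    cand = [p for row in candidate for p in row]
    if len(ref) != len(cand) or not ref:
        raise ValueError("images must be non-empty and have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, cand)) / len(ref)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Tiny made-up images: the tool-edited reference and a slightly different candidate.
tool_edited = [[0.0, 128.0], [255.0, 64.0]]
decoded_thought = [[0.0, 130.0], [250.0, 64.0]]
score_db = psnr(tool_edited, decoded_thought)
```

Higher is better, and identical images give an infinite score. PSNR is insensitive to structure, which is presumably why the response lists SSIM and perceptual distances alongside it.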

Circularity Check

0 steps flagged

No circularity: derivation relies on external dataset construction and independent benchmark evaluation

Full rationale

The paper constructs a 31k-trajectory dataset using external visual manipulation tools to produce edited images and CoT trajectories, then trains a model to perform interleaved reasoning by operating directly in the visual embedding space. Strong performance is reported on external multimodal reasoning benchmarks. No equations, fitted parameters, or self-citations are shown that reduce the claimed tool-free advantage or benchmark results to quantities defined by the model's own outputs or prior self-referential work. The chain is self-contained against independent test sets and does not exhibit self-definitional, fitted-input, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the unverified accuracy of the 31k dataset annotations and the untested equivalence between embedding-space operations and external tool effects; no explicit free parameters, standard mathematical axioms, or independently evidenced invented entities are detailed in the abstract.

invented entities (1)
  • visual thoughts generated in visual embedding space · no independent evidence
    purpose: To replace external tool calls with internal, tool-free visual manipulation during reasoning
    Introduced as the core mechanism for native interleaved image-text reasoning without re-encoding steps.

pith-pipeline@v0.9.0 · 5766 in / 1283 out tokens · 39196 ms · 2026-05-18T13:04:27.245841+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SketchVLM: Vision language models can annotate images to explain thoughts and guide users

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

  2. Mull-Tokens: Modality-Agnostic Latent Thinking

    cs.CV · 2025-12 · unverdicted · novelty 6.0

    Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
