SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

· 2026 · cs.CV · arXiv 2604.04911

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.

citation-role summary

extension 1

citation-polarity summary

extend 1

representative citing papers

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

cs.CV · 2026-04-30 · unverdicted · novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

cs.GR · 2026-05-05 · unverdicted · novelty 4.0 · 2 refs

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

citing papers explorer

Showing 3 of 3 citing papers.

MindEdit-Bench: Benchmarking Object-Level Counterfactual Spatial Reasoning in VLMs from In-the-Wild Photos cs.CV · 2026-07-01 · unverdicted · none · ref 58 · internal anchor
MindEdit-Bench introduces six spatial reasoning tasks from 120 private indoor photo triplets, with two new counterfactual editing tasks where VLMs score 8-31% against 81-97% human accuracy.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling cs.CV · 2026-04-30 · unverdicted · none · ref 87 · internal anchor
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR · 2026-05-05 · unverdicted · none · ref 92 · 2 links · internal anchor
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer