pith. machine review for the scientific record.

arxiv: 2604.04911 · v2 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links


SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial image editing · fine-grained manipulation · image benchmark · synthetic dataset · viewpoint reconstruction · geometric fidelity · Blender rendering · camera control

The pith

SpatialEdit provides a benchmark, a 500k-image synthetic dataset, and a baseline model for fine-grained spatial image editing; the model outperforms prior methods on geometry-driven tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatialEdit-Bench to evaluate image edits that involve precise object layout changes and camera viewpoint shifts, using metrics that check both visual realism and geometric accuracy through viewpoint reconstruction and framing analysis. It creates SpatialEdit-500k, a large-scale synthetic dataset rendered with a controllable Blender pipeline that supplies exact ground-truth transformations for object-centric and camera-centric operations across varied backgrounds. Building on this data, the authors train SpatialEdit-16B, which matches existing models on general editing but substantially improves results on spatial manipulation benchmarks.
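
Neither the summary nor the abstract spells out how viewpoint reconstruction is scored. A minimal sketch of one standard formulation, assuming the benchmark compares an estimated camera pose for the edited image against the Blender ground truth: rotation error as a geodesic angle, translation error as a Euclidean distance. The function names and example values are illustrative assumptions, not the paper's API.

```python
import numpy as np

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance between two 3x3 rotation matrices, in degrees."""
    R_rel = R_est.T @ R_gt                      # relative rotation
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def translation_error(t_est: np.ndarray, t_gt: np.ndarray) -> float:
    """Euclidean distance between camera centers, in scene units."""
    return float(np.linalg.norm(t_est - t_gt))

# Example: an estimated pose that is 5 degrees of yaw off the ground truth.
yaw = np.radians(5.0)
R_gt = np.eye(3)
R_est = np.array([[ np.cos(yaw), 0.0, np.sin(yaw)],
                  [ 0.0,         1.0, 0.0        ],
                  [-np.sin(yaw), 0.0, np.cos(yaw)]])
print(rotation_error_deg(R_est, R_gt))  # ~5.0
```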

Core claim

SpatialEdit-Bench jointly measures perceptual plausibility and geometric fidelity for spatial editing, while SpatialEdit-500k supplies scalable training data with precise ground-truth transformations generated by a controllable Blender pipeline; the resulting SpatialEdit-16B model achieves competitive performance on general editing tasks and substantially outperforms prior methods on spatial manipulation tasks.
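
The phrase "precise ground-truth transformations" has a simple mechanical reading: when the camera transform is set programmatically before rendering, the transformation label attached to a source/target pair is exact by construction. A minimal sketch in Blender's Python API, where the object name, angles, and output paths are hypothetical placeholders rather than the paper's actual pipeline:

```python
# Run inside Blender. Sketch of rendering a source/target pair with an
# exactly known camera transformation (assumed setup, not the paper's).
import math
import bpy

scene = bpy.context.scene
cam = bpy.data.objects["Camera"]   # assumes the default camera object
scene.camera = cam

def render(path: str) -> None:
    scene.render.filepath = path
    bpy.ops.render.render(write_still=True)

# Source view.
cam.location = (0.0, -6.0, 1.5)
cam.rotation_euler = (math.radians(80.0), 0.0, 0.0)  # pitch, roll, yaw
render("/tmp/source.png")

# Target view: a 15-degree yaw change. Because the transform is scripted,
# the pair's ground-truth label is known exactly, with no annotation noise.
cam.rotation_euler.z += math.radians(15.0)
render("/tmp/target.png")
```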

What carries the argument

SpatialEdit-Bench, which evaluates spatial edits by combining perceptual plausibility checks with geometric fidelity measured through viewpoint reconstruction and framing analysis on synthetic data with known transformations.
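
The framing-analysis half of the metric is not defined in the text above. For object-level tasks specified via bounding boxes, one plausible ingredient is an intersection-over-union check between the user-specified target box and where the edited object actually lands. The sketch below makes that assumption concrete; the detector producing detected_box is left abstract, and all coordinates are placeholders.

```python
def box_iou(a: tuple, b: tuple) -> float:
    """IoU of two (x1, y1, x2, y2) boxes in pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical example: the instruction asked for the object in this box...
instructed_box = (100, 120, 300, 320)
# ...and a detector localized the edited object here.
detected_box = (110, 115, 305, 330)
print(box_iou(instructed_box, detected_box))  # high IoU = faithful framing
```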

If this is right

  • The benchmark enables direct comparison of models on object layout and camera viewpoint changes using consistent geometric ground truth.
  • The 500k dataset removes the data bottleneck that previously limited training of spatial editing models.
  • Models trained on this data can be expected to handle both object-centric and camera-centric operations more reliably than before.
  • The benchmark separates general editing performance from spatial-specific performance, revealing where current methods fall short.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the synthetic-to-real gap proves small, this pipeline could be extended to generate training data for other geometry-aware vision tasks such as novel view synthesis.
  • The framing analysis component might be adapted to evaluate composition quality in generative models beyond editing.
  • Combining the benchmark with real user preference studies could test whether the geometric metrics align with human judgments of edit quality.

Load-bearing premise

The synthetic Blender-generated images and the viewpoint-plus-framing metrics serve as adequate stand-ins for how well models will perform on real-world fine-grained spatial edits.

What would settle it

Run the SpatialEdit-16B model on a held-out set of real photographs containing manually verified object movements and camera shifts, then measure whether the drop in viewpoint reconstruction accuracy and framing scores exceeds the gains reported on the synthetic benchmark.
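
Once per-image scores exist on both domains, that test reduces to a one-line comparison. A minimal sketch, assuming metric scores where higher is better; all numbers below are placeholders, not reported results.

```python
import statistics

def transfer_gap_exceeds_gain(synthetic_scores: list[float],
                              real_scores: list[float],
                              reported_gain: float) -> bool:
    """True if the synthetic-to-real drop exceeds the gain claimed on the
    synthetic benchmark, i.e. the advance may not survive the domain shift."""
    drop = statistics.mean(synthetic_scores) - statistics.mean(real_scores)
    return drop > reported_gain

# Placeholder numbers purely for illustration.
synthetic = [0.82, 0.79, 0.85, 0.80]  # e.g. viewpoint-reconstruction scores
real = [0.70, 0.66, 0.72, 0.69]       # same metric on real photographs
print(transfer_gap_exceeds_gain(synthetic, real, reported_gain=0.08))
# True: the ~0.12 drop would outweigh the claimed 0.08 gain.
```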

Figures

Figures reproduced from arXiv: 2604.04911 by Haokun Lin, Haoyang Huang, Lin Song, Nan Duan, Nan Jiang, Tianhe Ren, Wei Huang, Wenbo Li, Wenhu Zhang, Xiaojuan Qi, Xiu Li, Yicheng Xiao, Yukang Chen.

Figure 1: Illustration for image spatial editing. It comprises two components: (1) camera-centric view manipulation, including pitch, yaw, and zoom transformations; and (2) single-object manipulation, encompassing object rotation while preserving the background, as well as translation and scaling of objects specified via user-defined bounding boxes. view at source ↗
Figure 2: Statistics of SpatialEdit-500k. (a) Distribution of camera-level data across seven sub-tasks in outdoor and indoor scenes, where Y, P, and D denote Yaw, Pitch, and Distance, respectively. (b) Aspect ratio distribution of bounding boxes for the moving task at the object level. (c) Object category statistics across the entire dataset. view at source ↗
Figure 3: SpatialEdit-500k data generation pipeline. view at source ↗
Figure 4: Overview of SpatialEdit. (1) Pre-train on open-source editing datasets [53] and proprietary internal data, explicitly excluding spatial editing samples; (2) specialize in the image spatial editing scenario with LoRA post-tuning on the curated dataset, improving transformation control while preserving general priors. view at source ↗
Figure 5: Comparison of camera view manipulation across various methods. view at source ↗
Figure 6: Comparison of object-level manipulation across various methods. view at source ↗
Figure 7: Serving as an enhancement tool for single-view reconstruction. view at source ↗
Figure 8: Comparison of object-level manipulation across various methods. view at source ↗
Figure 9: Comparison of camera-level manipulation across various methods. view at source ↗
read the original abstract

Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SpatialEdit-Bench, a benchmark for fine-grained image spatial editing that jointly evaluates perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. It constructs SpatialEdit-500k, a 500k-image synthetic dataset generated via a controllable Blender pipeline providing ground-truth object- and camera-centric transformations, and presents SpatialEdit-16B, a baseline model trained on this data that achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources are to be released publicly.

Significance. If the results hold, the work would be significant for providing the first dedicated benchmark and large-scale controllable synthetic data for spatial editing, an area where current models lack fine-grained geometric control. The explicit ground-truth pipeline and public release are clear strengths that could enable reproducible progress.

major comments (1)
  1. [Experiments] The central performance claim—that SpatialEdit-16B substantially outperforms priors specifically on spatial tasks—rests entirely on evaluations inside SpatialEdit-Bench, whose data are synthetic Blender renders. No experiments or analysis address transfer to real photographs (e.g., domain gap under natural lighting, texture variation, or sensor noise), which is load-bearing for the broader assertion of a useful advance in fine-grained spatial editing. (Experiments / Results section)
minor comments (2)
  1. [Abstract] The abstract states that the benchmark 'jointly measures perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis' but supplies no equations, implementation details, or pseudocode for these metrics, making it difficult to assess how geometric fidelity is quantified.
  2. [Abstract] The abstract claims 'competitive performance on general editing' and 'substantially outperforming' without any numerical results, tables, or baselines; the full manuscript should ensure all quantitative claims are accompanied by explicit numbers, standard deviations, and statistical tests.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential impact of the benchmark and dataset. We respond to the major comment below.

read point-by-point responses
  1. Referee: [Experiments] The central performance claim—that SpatialEdit-16B substantially outperforms priors specifically on spatial tasks—rests entirely on evaluations inside SpatialEdit-Bench, whose data are synthetic Blender renders. No experiments or analysis address transfer to real photographs (e.g., domain gap under natural lighting, texture variation, or sensor noise), which is load-bearing for the broader assertion of a useful advance in fine-grained spatial editing. (Experiments / Results section)

    Authors: We agree that the absence of real-image transfer experiments limits the strength of claims about immediate practical utility. Our evaluations are intentionally restricted to SpatialEdit-Bench because the benchmark's primary value lies in enabling precise, ground-truth geometric metrics (viewpoint reconstruction and framing analysis) that cannot be obtained reliably on real photographs without additional supervision or assumptions. The performance claims in the paper are therefore scoped to this controlled synthetic setting, where SpatialEdit-16B demonstrates clear advantages on spatial tasks. To address the referee's concern, we will revise the Experiments section to add a dedicated subsection on domain considerations. This will include qualitative examples of the model applied to real photographs (sourced from public datasets), explicit discussion of expected gaps due to lighting, textures, and noise, and a statement in the conclusion framing real-world generalization as an important direction for future work. These additions will not change the core quantitative results but will better contextualize the scope of the contribution. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark metrics are grounded in external Blender ground truth

full rationale

The paper introduces SpatialEdit-Bench with viewpoint reconstruction and framing metrics, SpatialEdit-500k synthetic data from a controllable Blender pipeline supplying explicit ground-truth transformations, and SpatialEdit-16B trained on that data. Performance claims compare model outputs against these independent ground truths rather than reducing, by construction, to self-defined quantities or fitted parameters. No self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the provided text; the evaluation chain remains externally falsifiable via the rendering engine.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that controllable synthetic rendering can stand in for real photographic spatial editing and that the two proposed metrics adequately capture fine-grained geometric fidelity.

axioms (1)
  • domain assumption: Blender-generated images with systematic camera trajectories and object placements provide a valid proxy for real-world spatial editing evaluation and training.
    Central to both the benchmark and the 500k dataset construction.

pith-pipeline@v0.9.0 · 5499 in / 1271 out tokens · 37720 ms · 2026-05-10T19:20:11.765416+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  2. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR · 2026-05 · unverdicted · novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

Reference graph

Works this paper leans on

64 extracted references · 32 canonical work pages · cited by 2 Pith papers · 16 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. arXiv preprint arXiv:2411.18673 (2024)

  3. [3]

    ReCamMaster: Camera-Controlled Generative Rendering from a Single Video

    Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [6]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., et al.: Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025)

  6. [7]

    GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

    Bian, W., Huang, Z., Shi, X., Li, Y., Wang, F.Y., Li, H.: Gs-dit: Advancing video generation with pseudo 4d gaussian fields through efficient dense 3d point tracking. arXiv preprint arXiv:2501.02690 (2025)

  7. [8]

    InstructPix2Pix: Learning to Follow Image Editing Instructions

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: CVPR (2023)

  8. [9]

    HunyuanImage 3.0 Technical Report

    Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

  9. [10]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  10. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  11. [12]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  12. [13]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  13. [14]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

  14. [15]

    Seed-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

    Ge, Y., Zhao, S., Li, C., Ge, Y., Shan, Y.: Seed-data-edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007 (2024)

  15. [16]

    Gemini 2.5 Flash & 2.5 Flash Image Model Card

    Google: Gemini 2.5 Flash & 2.5 Flash Image model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf (2025)

  16. [17]

    Introducing Veo 3

    Google: Introducing Veo 3, our video generation model with expanded creative controls, including native audio and extended videos. https://deepmind.google/models/veo/ (2025)

  17. [18]

    Kubric: A Scalable Dataset Generator

    Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., Kipf, T., Kundu, A., Lagun, D., Laradji, I., Liu, H.T.D., Meyer, H., Miao, Y., Nowrouzezahrai, D., Oztireli, C., Pot, E., Radwan, N., Rebain, D., Sabour, S., Sajjadi, M.S.M., Sela, M., Sitzmann, V., Stone, A., Sun, D., Vora, S., et al.: Kubric: A scalable dataset generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  18. [19]

    Diffusion as Shader: 3D-Aware Video Diffusion for Versatile Video Generation Control

    Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., et al.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. arXiv preprint arXiv:2501.03847 (2025)

  19. [20]

    Multiple View Geometry in Computer Vision

    Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press (2003)

  20. [21]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  21. [22]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  22. [23]

    Blender Foundations: The Essential Guide to Learning Blender 2.5

    Hess, R.: Blender foundations: The essential guide to learning blender 2.5. Routledge (2013)

  23. [24]

    In-Context LoRA for Diffusion Transformers

    Huang, L., Wang, W., Wu, Z.F., Shi, Y., Dou, H., Liang, C., Feng, Y., Liu, Y., Zhou, J.: In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775 (2024)

  24. [25]

    CoTracker: It Is Better to Track Together

    Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: European Conference on Computer Vision. pp. 18–35. Springer (2024)

  25. [26]

    Kling

    Kling: Kling. https://kling.kuaishou.com/en (2024). Accessed Sept. 30, 2024

  26. [27]

    Collaborative Video Diffusion: Consistent Multi-Video Generation with Camera Control

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. arXiv preprint arXiv:2405.17414 (2024)

  27. [28]

    The Hungarian Method for the Assignment Problem

    Kuhn, H.W.: The hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), 83–97 (1955)

  28. [29]

    BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

    Li, D., Li, J., Hoi, S.: Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. NeurIPS (2023)

  29. [30]

    UniWorld-V2: Reinforce Image Editing with Diffusion Negative-Aware Finetuning and MLLM Implicit Feedback

    Li, Z., Liu, Z., Zhang, Q., Lin, B., Wu, F., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., et al.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025)

  30. [31]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

  31. [32]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

  32. [33]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  33. [34]

    CamCloneMaster: Enabling Reference-Based Camera Control for Video Generation

    Luo, Y., Shi, X., Bai, J., Xia, M., Xue, T., Wang, X., Wan, P., Zhang, D., Gai, K.: Camclonemaster: Enabling reference-based camera control for video generation. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–10 (2025)

  34. [35]

    GPT-4o Image Generation

    OpenAI: Gpt-4o image generation. https://openai.com/index/introducing-4o-image-generation/ (2025)

  35. [36]

    GPT-Image-1

    OpenAI: Gpt-image-1. https://openai.com/index/introducing-4o-image-generation/ (2025)

  36. [37]

    Scalable Diffusion Models with Transformers

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)

  37. [38]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  38. [39]

    DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)

  39. [41]

    Seedream 4.0: Toward Next-Generation Multimodal Image Generation

    Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)

  40. [42]

    Emu Edit: Precise Image Editing via Recognition and Generation Tasks

    Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089 (2023)

  41. [43]

    Pinhole Camera Model

    Sturm, P.: Pinhole camera model. In: Computer Vision: A Reference Guide, pp. 983–986. Springer (2021)

  42. [45]

    OminiControl: Minimal and Universal Control for Diffusion Transformer

    Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098 (2024)

  43. [46]

    LongCat-Image Technical Report

    Team, M.L., Ma, H., Tan, H., Huang, J., Wu, J., He, J.Y., Gao, L., Xiao, S., Wei, X., Ma, X., et al.: Longcat-image technical report. arXiv preprint arXiv:2512.07584 (2025)

  44. [47]

    Advancing Open-Source World Models

    Team, R., Gao, Z., Wang, Q., Zeng, Y., Zhu, J., Cheng, K.L., Li, Y., Wang, H., Xu, Y., Ma, S., et al.: Advancing open-source world models. arXiv preprint arXiv:2601.20540 (2026)

  45. [48]

    TextVerse: A Streamlit Web Application for Advanced Analysis of PDF and Image Files with and without Language Models

    Thippeswamy, B., Ramachandra, H., Rohan, S., et al.: Textverse: A streamlit web application for advanced analysis of pdf and image files with and without language models. In: 2024 Asia Pacific Conference on Innovation in Technology (APCIT). pp. 1–6. IEEE (2024)

  46. [49]

    Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

    Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision. pp. 313–331. Springer (2024)

  47. [50]

    Vidu: AI Video Generator

    Vidu Team: Vidu: AI video generator. https://www.vidu.cn/ (2024)

  48. [51]

    YOLOv10: Real-Time End-to-End Object Detection

    Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., et al.: Yolov10: Real-time end-to-end object detection. Advances in Neural Information Processing Systems 37, 107984–108011 (2024)

  49. [52]

    VGGT: Visual Geometry Grounded Transformer

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  50. [53]

    GPT-Image-Edit-1.5M: A Million-Scale, GPT-Generated Image Dataset

    Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt-image-edit-1.5m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025)

  51. [54]

    MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  52. [55]

    OmniEdit: Building Image Editing Generalist Models through Specialist Supervision

    Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: Omniedit: Building image editing generalist models through specialist supervision. In: ICLR (2024)

  53. [56]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  54. [57]

    OmniGen2: Exploration to Advanced Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)

  55. [58]

    DreamVE: Unified Instruction-Based Image and Video Editing

    Xia, B., Liu, J., Zhang, Y., Peng, B., Chu, R., Wang, Y., Wu, X., Yu, B., Jia, J.: Dreamve: Unified instruction-based image and video editing. arXiv preprint arXiv:2508.06080 (2025)

  56. [59]

    LLMGA: Multimodal Large Language Model Based Generation Assistant

    Xia, B., Wang, S., Tao, Y., Wang, Y., Jia, J.: Llmga: Multimodal large language model based generation assistant. In: ECCV (2024)

  57. [60]

    DreamOmni: Unified Image Generation and Editing

    Xia, B., Zhang, Y., Li, J., Wang, C., Wang, Y., Wu, X., Yu, B., Jia, J.: Dreamomni: Unified image generation and editing. In: CVPR (2025)

  58. [61]

    OmniGen: Unified Image Generation

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)

  59. [62]

    MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

    Xiao, Y., Song, L., Chen, Y., Luo, Y., Chen, Y., Gan, Y., Huang, W., Li, X., Qi, X., Shan, Y.: Mindomni: Unleashing reasoning generation in vision language models with rgpo. arXiv preprint arXiv:2505.13031 (2025)

  60. [63]

    SpatialTracker: Tracking Any 2D Pixels in 3D Space

    Xiao, Y., Wang, Q., Zhang, S., Xue, N., Peng, S., Shen, Y., Zhou, X.: Spatialtracker: Tracking any 2d pixels in 3d space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20406–20417 (2024)

  61. [64]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  62. [65]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275 (2025)

  63. [66]

    ReCapture: Generative Video Camera Controls for User-Provided Videos Using Masked Video Fine-Tuning

    Zhang, D.J., Paiss, R., Zada, S., Karnad, N., Jacobs, D.E., Pritch, Y., Mosseri, I., Shou, M.Z., Wadhwa, N., Ruiz, N.: Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. arXiv preprint arXiv:2411.05003 (2024)

  64. [67]

    MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

    Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023)