pith. machine review for the scientific record.

arxiv: 2604.04911 · v2 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links


SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial image editing · fine-grained manipulation · image benchmark · synthetic dataset · viewpoint reconstruction · geometric fidelity · Blender rendering · camera control

The pith

SpatialEdit provides a benchmark, a 500k-image synthetic dataset, and a baseline model for fine-grained spatial image editing; the model outperforms prior methods on geometry-driven tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatialEdit-Bench to evaluate image edits that involve precise object layout changes and camera viewpoint shifts, using metrics that check both visual realism and geometric accuracy through viewpoint reconstruction and framing analysis. It creates SpatialEdit-500k, a large-scale synthetic dataset rendered with a controllable Blender pipeline that supplies exact ground-truth transformations for object-centric and camera-centric operations across varied backgrounds. Building on this data, the authors train SpatialEdit-16B, which matches existing models on general editing but substantially improves results on spatial manipulation benchmarks.
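
Neither the summary nor the abstract spells out how viewpoint reconstruction is scored. A minimal sketch of one standard formulation, assuming the benchmark compares an estimated camera pose for the edited image against the Blender ground truth: rotation error as a geodesic angle, translation error as a Euclidean distance. The function names and example values are illustrative assumptions, not the paper's API.

```python
import numpy as np

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic distance between two 3x3 rotation matrices, in degrees."""
    R_rel = R_est.T @ R_gt                      # relative rotation
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def translation_error(t_est: np.ndarray, t_gt: np.ndarray) -> float:
    """Euclidean distance between camera centers, in scene units."""
    return float(np.linalg.norm(t_est - t_gt))

# Example: an estimated pose that is 5 degrees of yaw off the ground truth.
yaw = np.radians(5.0)
R_gt = np.eye(3)
R_est = np.array([[ np.cos(yaw), 0.0, np.sin(yaw)],
                  [ 0.0,         1.0, 0.0        ],
                  [-np.sin(yaw), 0.0, np.cos(yaw)]])
print(rotation_error_deg(R_est, R_gt))  # ~5.0
```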

Core claim

SpatialEdit-Bench jointly measures perceptual plausibility and geometric fidelity for spatial editing, while SpatialEdit-500k supplies scalable training data with precise ground-truth transformations generated by a controllable Blender pipeline; the resulting SpatialEdit-16B model achieves competitive performance on general editing tasks and substantially outperforms prior methods on spatial manipulation tasks.
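
The phrase "precise ground-truth transformations" has a simple mechanical reading: when the camera transform is set programmatically before rendering, the transformation label attached to a source/target pair is exact by construction. A minimal sketch in Blender's Python API, where the object name, angles, and output paths are hypothetical placeholders rather than the paper's actual pipeline:

```python
# Run inside Blender. Sketch of rendering a source/target pair with an
# exactly known camera transformation (assumed setup, not the paper's).
import math
import bpy

scene = bpy.context.scene
cam = bpy.data.objects["Camera"]   # assumes the default camera object
scene.camera = cam

def render(path: str) -> None:
    scene.render.filepath = path
    bpy.ops.render.render(write_still=True)

# Source view.
cam.location = (0.0, -6.0, 1.5)
cam.rotation_euler = (math.radians(80.0), 0.0, 0.0)  # pitch, roll, yaw
render("/tmp/source.png")

# Target view: a 15-degree yaw change. Because the transform is scripted,
# the pair's ground-truth label is known exactly, with no annotation noise.
cam.rotation_euler.z += math.radians(15.0)
render("/tmp/target.png")
```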

What carries the argument

SpatialEdit-Bench, which evaluates spatial edits by combining perceptual plausibility checks with geometric fidelity measured through viewpoint reconstruction and framing analysis on synthetic data with known transformations.
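
The framing-analysis half of the metric is not defined in the text above. For object-level tasks specified via bounding boxes, one plausible ingredient is an intersection-over-union check between the user-specified target box and where the edited object actually lands. The sketch below makes that assumption concrete; the detector producing detected_box is left abstract, and all coordinates are placeholders.

```python
def box_iou(a: tuple, b: tuple) -> float:
    """IoU of two (x1, y1, x2, y2) boxes in pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical example: the instruction asked for the object in this box...
instructed_box = (100, 120, 300, 320)
# ...and a detector localized the edited object here.
detected_box = (110, 115, 305, 330)
print(box_iou(instructed_box, detected_box))  # high IoU = faithful framing
```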

If this is right

  • The benchmark enables direct comparison of models on object layout and camera viewpoint changes using consistent geometric ground truth.
  • The 500k dataset removes the data bottleneck that previously limited training of spatial editing models.
  • Models trained on this data can be expected to handle both object-centric and camera-centric operations more reliably than before.
  • The benchmark separates general editing performance from spatial-specific performance, revealing where current methods fall short.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the synthetic-to-real gap proves small, this pipeline could be extended to generate training data for other geometry-aware vision tasks such as novel view synthesis.
  • The framing analysis component might be adapted to evaluate composition quality in generative models beyond editing.
  • Combining the benchmark with real user preference studies could test whether the geometric metrics align with human judgments of edit quality.

Load-bearing premise

The synthetic Blender-generated images and the viewpoint-plus-framing metrics serve as adequate stand-ins for how well models will perform on real-world fine-grained spatial edits.

What would settle it

Run the SpatialEdit-16B model on a held-out set of real photographs containing manually verified object movements and camera shifts, then measure whether the drop in viewpoint reconstruction accuracy and framing scores exceeds the gains reported on the synthetic benchmark.
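
Once per-image scores exist on both domains, that test reduces to a one-line comparison. A minimal sketch, assuming metric scores where higher is better; all numbers below are placeholders, not reported results.

```python
import statistics

def transfer_gap_exceeds_gain(synthetic_scores: list[float],
                              real_scores: list[float],
                              reported_gain: float) -> bool:
    """True if the synthetic-to-real drop exceeds the gain claimed on the
    synthetic benchmark, i.e. the advance may not survive the domain shift."""
    drop = statistics.mean(synthetic_scores) - statistics.mean(real_scores)
    return drop > reported_gain

# Placeholder numbers purely for illustration.
synthetic = [0.82, 0.79, 0.85, 0.80]  # e.g. viewpoint-reconstruction scores
real = [0.70, 0.66, 0.72, 0.69]       # same metric on real photographs
print(transfer_gap_exceeds_gain(synthetic, real, reported_gain=0.08))
# True: the ~0.12 drop would outweigh the claimed 0.08 gain.
```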

Figures

Figures reproduced from arXiv: 2604.04911 by Haokun Lin, Haoyang Huang, Lin Song, Nan Duan, Nan Jiang, Tianhe Ren, Wei Huang, Wenbo Li, Wenhu Zhang, Xiaojuan Qi, Xiu Li, Yicheng Xiao, Yukang Chen.

Figure 1: Illustration for image spatial editing. It comprises two components: (1) camera-centric view manipulation, including pitch, yaw, and zoom transformations; and (2) single-object manipulation, encompassing object rotation while preserving the background, as well as translation and scaling of objects specified via user-defined bounding boxes. view at source ↗
Figure 2: Statistics of SpatialEdit-500k. (a) Distribution of camera-level data across seven sub-tasks in outdoor and indoor scenes, where Y, P, and D denote Yaw, Pitch, and Distance, respectively. (b) Aspect ratio distribution of bounding boxes for the moving task at the object level. (c) Object category statistics across the entire dataset. view at source ↗
Figure 3: SpatialEdit-500k data generation pipeline. view at source ↗
Figure 4: Overview of SpatialEdit. (1) Pre-train on open-source editing datasets [53] and proprietary internal data, explicitly excluding spatial editing samples; (2) specialize in the image spatial editing scenario with LoRA post-tuning on the curated dataset, improving transformation control while preserving general priors. view at source ↗
Figure 5: Comparison of camera view manipulation across various methods. view at source ↗
Figure 6: Comparison of object-level manipulation across various methods. view at source ↗
Figure 7: Serving as an enhancement tool for single-view reconstruction. view at source ↗
Figure 8: Comparison of object-level manipulation across various methods. view at source ↗
Figure 9: Comparison of camera-level manipulation across various methods. view at source ↗
read the original abstract

Image spatial editing performs geometry-driven transformations, allowing precise control over object layout and camera viewpoints. Current models are insufficient for fine-grained spatial manipulations, motivating a dedicated assessment suite. Our contributions are listed: (i) We introduce SpatialEdit-Bench, a complete benchmark that evaluates spatial editing by jointly measuring perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. (ii) To address the data bottleneck for scalable training, we construct SpatialEdit-500k, a synthetic dataset generated with a controllable Blender pipeline that renders objects across diverse backgrounds and systematic camera trajectories, providing precise ground-truth transformations for both object- and camera-centric operations. (iii) Building on this data, we develop SpatialEdit-16B, a baseline model for fine-grained spatial editing. Our method achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources will be made public at https://github.com/EasonXiao-888/SpatialEdit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SpatialEdit-Bench, a benchmark for fine-grained image spatial editing that jointly evaluates perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis. It constructs SpatialEdit-500k, a 500k-image synthetic dataset generated via a controllable Blender pipeline providing ground-truth object- and camera-centric transformations, and presents SpatialEdit-16B, a baseline model trained on this data that achieves competitive performance on general editing while substantially outperforming prior methods on spatial manipulation tasks. All resources are to be released publicly.

Significance. If the results hold, the work would be significant for providing the first dedicated benchmark and large-scale controllable synthetic data for spatial editing, an area where current models lack fine-grained geometric control. The explicit ground-truth pipeline and public release are clear strengths that could enable reproducible progress.

major comments (1)
  1. [Experiments] The central performance claim—that SpatialEdit-16B substantially outperforms priors specifically on spatial tasks—rests entirely on evaluations inside SpatialEdit-Bench, whose data are synthetic Blender renders. No experiments or analysis address transfer to real photographs (e.g., domain gap under natural lighting, texture variation, or sensor noise), which is load-bearing for the broader assertion of a useful advance in fine-grained spatial editing. (Experiments / Results section)
minor comments (2)
  1. [Abstract] The abstract states that the benchmark 'jointly measures perceptual plausibility and geometric fidelity via viewpoint reconstruction and framing analysis' but supplies no equations, implementation details, or pseudocode for these metrics, making it difficult to assess how geometric fidelity is quantified.
  2. [Abstract] The abstract claims 'competitive performance on general editing' and 'substantially outperforming' without any numerical results, tables, or baselines; the full manuscript should ensure all quantitative claims are accompanied by explicit numbers, standard deviations, and statistical tests.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential impact of the benchmark and dataset. We respond to the major comment below.

read point-by-point responses
  1. Referee: [Experiments] The central performance claim—that SpatialEdit-16B substantially outperforms priors specifically on spatial tasks—rests entirely on evaluations inside SpatialEdit-Bench, whose data are synthetic Blender renders. No experiments or analysis address transfer to real photographs (e.g., domain gap under natural lighting, texture variation, or sensor noise), which is load-bearing for the broader assertion of a useful advance in fine-grained spatial editing. (Experiments / Results section)

    Authors: We agree that the absence of real-image transfer experiments limits the strength of claims about immediate practical utility. Our evaluations are intentionally restricted to SpatialEdit-Bench because the benchmark's primary value lies in enabling precise, ground-truth geometric metrics (viewpoint reconstruction and framing analysis) that cannot be obtained reliably on real photographs without additional supervision or assumptions. The performance claims in the paper are therefore scoped to this controlled synthetic setting, where SpatialEdit-16B demonstrates clear advantages on spatial tasks. To address the referee's concern, we will revise the Experiments section to add a dedicated subsection on domain considerations. This will include qualitative examples of the model applied to real photographs (sourced from public datasets), explicit discussion of expected gaps due to lighting, textures, and noise, and a statement in the conclusion framing real-world generalization as an important direction for future work. These additions will not change the core quantitative results but will better contextualize the scope of the contribution. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark metrics are grounded in external Blender ground truth

full rationale

The paper introduces SpatialEdit-Bench with viewpoint reconstruction and framing metrics, SpatialEdit-500k synthetic data from a controllable Blender pipeline supplying explicit ground-truth transformations, and SpatialEdit-16B trained on that data. Performance claims compare model outputs against these independent ground truths rather than reducing, by construction, to self-defined quantities or fitted parameters. No self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the provided text; the evaluation chain remains externally falsifiable via the rendering engine.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that controllable synthetic rendering can stand in for real photographic spatial editing and that the two proposed metrics adequately capture fine-grained geometric fidelity.

axioms (1)
  • domain assumption: Blender-generated images with systematic camera trajectories and object placements provide a valid proxy for real-world spatial editing evaluation and training.
    Central to both the benchmark and the 500k dataset construction.

pith-pipeline@v0.9.0 · 5499 in / 1271 out tokens · 37720 ms · 2026-05-10T19:20:11.765416+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  2. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR · 2026-05 · unverdicted · novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

Reference graph

Works this paper leans on

64 extracted references · 32 canonical work pages · cited by 2 Pith papers · 16 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. arXiv preprint arXiv:2411.18673 (2024)

  3. [3]

    ReCamMaster: Camera-Controlled Generative Rendering from a Single Video

    Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [6]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., et al.: Flux.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025)

  6. [7]

    GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

    Bian, W., Huang, Z., Shi, X., Li, Y., Wang, F.Y., Li, H.: Gs-dit: Advancing video generation with pseudo 4d gaussian fields through efficient dense 3d point tracking. arXiv preprint arXiv:2501.02690 (2025)

  7. [8]

    InstructPix2Pix: Learning to Follow Image Editing Instructions

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: CVPR (2023)

  8. [9]

    HunyuanImage 3.0 Technical Report

    Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

  9. [10]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025)

  10. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  11. [12]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)

  12. [13]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

  13. [14]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)

  14. [15]

    Seed-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

    Ge, Y., Zhao, S., Li, C., Ge, Y., Shan, Y.: Seed-data-edit technical report: A hybrid dataset for instructional image editing. arXiv preprint arXiv:2405.04007 (2024)

  15. [16]

    Gemini 2.5 Flash & 2.5 Flash Image Model Card

    Google: Gemini 2.5 Flash & 2.5 Flash Image model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-2-5-Flash-Model-Card.pdf (2025)

  16. [17]

    Introducing Veo 3

    Google: Introducing Veo 3, our video generation model with expanded creative controls, including native audio and extended videos. https://deepmind.google/models/veo/ (2025)

  17. [18]

    Kubric: A Scalable Dataset Generator

    Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., Kipf, T., Kundu, A., Lagun, D., Laradji, I., Liu, H.T.D., Meyer, H., Miao, Y., Nowrouzezahrai, D., Oztireli, C., Pot, E., Radwan, N., Rebain, D., Sabour, S., Sajjadi, M.S.M., Sela, M., Sitzmann, V., Stone, A., Sun, D., Vora, S., et al.: Kubric: A scalable dataset generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  18. [19]

    Diffusion as Shader: 3D-Aware Video Diffusion for Versatile Video Generation Control

    Gu, Z., Yan, R., Lu, J., Li, P., Dou, Z., Si, C., Dong, Z., Liu, Q., Lin, C., Liu, Z., et al.: Diffusion as shader: 3d-aware video diffusion for versatile video generation control. arXiv preprint arXiv:2501.03847 (2025)

  19. [20]

    Multiple View Geometry in Computer Vision

    Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press (2003)

  20. [21]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  21. [22]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  22. [23]

    Blender Foundations: The Essential Guide to Learning Blender 2.5

    Hess, R.: Blender foundations: The essential guide to learning blender 2.5. Routledge (2013)

  23. [24]

    In-Context LoRA for Diffusion Transformers

    Huang, L., Wang, W., Wu, Z.F., Shi, Y., Dou, H., Liang, C., Feng, Y., Liu, Y., Zhou, J.: In-context lora for diffusion transformers. arXiv preprint arXiv:2410.23775 (2024)

  24. [25]

    CoTracker: It Is Better to Track Together

    Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: European Conference on Computer Vision. pp. 18–35. Springer (2024)

  25. [26]

    Kling

    Kling: Kling. https://kling.kuaishou.com/en (2024). Accessed Sept. 30, 2024

  26. [27]

    Collaborative Video Diffusion: Consistent Multi-Video Generation with Camera Control

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. arXiv preprint arXiv:2405.17414 (2024)

  27. [28]

    The Hungarian Method for the Assignment Problem

    Kuhn, H.W.: The hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), 83–97 (1955)

  28. [29]

    BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

    Li, D., Li, J., Hoi, S.: Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. NeurIPS (2023)

  29. [30]

    UniWorld-V2: Reinforce Image Editing with Diffusion Negative-Aware Finetuning and MLLM Implicit Feedback

    Li, Z., Liu, Z., Zhang, Q., Lin, B., Wu, F., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., et al.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025)

  30. [31]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

  31. [32]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

  32. [33]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  33. [34]

    CamCloneMaster: Enabling Reference-Based Camera Control for Video Generation

    Luo, Y., Shi, X., Bai, J., Xia, M., Xue, T., Wang, X., Wan, P., Zhang, D., Gai, K.: Camclonemaster: Enabling reference-based camera control for video generation. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers. pp. 1–10 (2025)

  34. [35]

    GPT-4o Image Generation

    OpenAI: Gpt-4o image generation. https://openai.com/index/introducing-4o-image-generation/ (2025)

  35. [36]

    GPT-Image-1

    OpenAI: Gpt-image-1. https://openai.com/index/introducing-4o-image-generation/ (2025)

  36. [37]

    Scalable Diffusion Models with Transformers

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: ICCV (2023)

  37. [38]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  38. [39]

    DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)

  39. [41]

    Seedream 4.0: Toward Next-Generation Multimodal Image Generation

    Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)

  40. [42]

    Emu Edit: Precise Image Editing via Recognition and Generation Tasks

    Sheynin, S., Polyak, A., Singer, U., Kirstain, Y., Zohar, A., Ashual, O., Parikh, D., Taigman, Y.: Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089 (2023)

  41. [43]

    Pinhole Camera Model

    Sturm, P.: Pinhole camera model. In: Computer Vision: A Reference Guide, pp. 983–986. Springer (2021)

  42. [45]

    OminiControl: Minimal and Universal Control for Diffusion Transformer

    Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. arXiv preprint arXiv:2411.15098 (2024)

  43. [46]

    LongCat-Image Technical Report

    Team, M.L., Ma, H., Tan, H., Huang, J., Wu, J., He, J.Y., Gao, L., Xiao, S., Wei, X., Ma, X., et al.: Longcat-image technical report. arXiv preprint arXiv:2512.07584 (2025)

  44. [47]

    Advancing Open-Source World Models

    Team, R., Gao, Z., Wang, Q., Zeng, Y., Zhu, J., Cheng, K.L., Li, Y., Wang, H., Xu, Y., Ma, S., et al.: Advancing open-source world models. arXiv preprint arXiv:2601.20540 (2026)

  45. [48]

    TextVerse: A Streamlit Web Application for Advanced Analysis of PDF and Image Files with and without Language Models

    Thippeswamy, B., Ramachandra, H., Rohan, S., et al.: Textverse: A streamlit web application for advanced analysis of pdf and image files with and without language models. In: 2024 Asia Pacific Conference on Innovation in Technology (APCIT). pp. 1–6. IEEE (2024)

  46. [49]

    Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

    Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision. pp. 313–331. Springer (2024)

  47. [50]

    Vidu: AI Video Generator

    Vidu Team: Vidu: AI video generator. https://www.vidu.cn/ (2024)

  48. [51]

    YOLOv10: Real-Time End-to-End Object Detection

    Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., et al.: Yolov10: Real-time end-to-end object detection. Advances in Neural Information Processing Systems 37, 107984–108011 (2024)

  49. [52]

    VGGT: Visual Geometry Grounded Transformer

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  50. [53]

    GPT-Image-Edit-1.5M: A Million-Scale, GPT-Generated Image Dataset

    Wang, Y., Yang, S., Zhao, B., Zhang, L., Liu, Q., Zhou, Y., Xie, C.: Gpt-image-edit-1.5m: A million-scale, gpt-generated image dataset. arXiv preprint arXiv:2507.21033 (2025)

  51. [54]

    MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  52. [55]

    OmniEdit: Building Image Editing Generalist Models through Specialist Supervision

    Wei, C., Xiong, Z., Ren, W., Du, X., Zhang, G., Chen, W.: Omniedit: Building image editing generalist models through specialist supervision. In: ICLR (2024)

  53. [56]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.M., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  54. [57]

    OmniGen2: Exploration to Advanced Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025)

  55. [58]

    DreamVE: Unified Instruction-Based Image and Video Editing

    Xia, B., Liu, J., Zhang, Y., Peng, B., Chu, R., Wang, Y., Wu, X., Yu, B., Jia, J.: Dreamve: Unified instruction-based image and video editing. arXiv preprint arXiv:2508.06080 (2025)

  56. [59]

    LLMGA: Multimodal Large Language Model Based Generation Assistant

    Xia, B., Wang, S., Tao, Y., Wang, Y., Jia, J.: Llmga: Multimodal large language model based generation assistant. In: ECCV (2024)

  57. [60]

    DreamOmni: Unified Image Generation and Editing

    Xia, B., Zhang, Y., Li, J., Wang, C., Wang, Y., Wu, X., Yu, B., Jia, J.: Dreamomni: Unified image generation and editing. In: CVPR (2025)

  58. [61]

    OmniGen: Unified Image Generation

    Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., Liu, Z.: Omnigen: Unified image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13294–13304 (2025)

  59. [62]

    MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

    Xiao, Y., Song, L., Chen, Y., Luo, Y., Chen, Y., Gan, Y., Huang, W., Li, X., Qi, X., Shan, Y.: Mindomni: Unleashing reasoning generation in vision language models with rgpo. arXiv preprint arXiv:2505.13031 (2025)

  60. [63]

    SpatialTracker: Tracking Any 2D Pixels in 3D Space

    Xiao, Y., Wang, Q., Zhang, S., Xue, N., Peng, S., Shen, Y., Zhou, X.: Spatialtracker: Tracking any 2d pixels in 3d space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20406–20417 (2024)

  61. [64]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  62. [65]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275 (2025)

  63. [66]

    ReCapture: Generative Video Camera Controls for User-Provided Videos Using Masked Video Fine-Tuning

    Zhang, D.J., Paiss, R., Zada, S., Karnad, N., Jacobs, D.E., Pritch, Y., Mosseri, I., Shou, M.Z., Wadhwa, N., Ruiz, N.: Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. arXiv preprint arXiv:2411.05003 (2024)

  64. [67]

    MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

    Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023)