pith. machine review for the scientific record.

arxiv: 2604.07230 · v2 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image editing · 3D geometric simulation · object manipulation · physical accuracy · generative models · depth supervision · RealManip-10K dataset · ManipEval benchmark

The pith

PhyEdit achieves physically accurate object manipulation in image editing by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing visual generative models often fail at precise spatial manipulations, such as correct object scaling and positioning, because they lack explicit mechanisms for incorporating 3D geometry and perspective projection. To address this, PhyEdit injects a plug-and-play 3D prior derived from geometric simulation to supply contextual visual guidance during editing. The framework pairs this prior with joint supervision from both 2D images and 3D data to raise physical accuracy and manipulation consistency. The authors support the approach with RealManip-10K, a real-world dataset of paired images and depth annotations, plus the ManipEval benchmark, which scores 3D spatial control and geometric consistency. Experiments indicate the method exceeds prior approaches, including closed-source models, on these metrics.
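
To make the geometric premise concrete, here is a minimal pinhole-projection sketch (not from the paper; the focal length, object size, and depths are illustrative assumptions) of why a commanded move in depth fixes the 2D rescaling an edit must apply, which is exactly the constraint the paper says current generative editors violate.

```python
# Minimal pinhole-projection sketch (illustrative; not the paper's code).
# Under perspective projection an object's image footprint scales as f / Z,
# so a commanded move in depth fixes the 2D rescaling the edit must apply.

def projected_width_px(real_width_m: float, depth_m: float, focal_px: float) -> float:
    """On-image width (pixels) of an object of given metric width at a given depth."""
    return focal_px * real_width_m / depth_m

def expected_rescale(depth_before_m: float, depth_after_m: float) -> float:
    """Factor by which the object's footprint should change after a depth move."""
    return depth_before_m / depth_after_m

if __name__ == "__main__":
    f_px = 1200.0        # assumed focal length in pixels
    width_m = 0.40       # assumed object width in metres
    z0, z1 = 2.0, 4.0    # assumed depths before and after the edit
    print(projected_width_px(width_m, z0, f_px))  # 240.0 px at 2 m
    print(projected_width_px(width_m, z1, f_px))  # 120.0 px at 4 m
    print(expected_rescale(z0, z1))               # 0.5: half the apparent width
```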

Core claim

PhyEdit is an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D-3D supervision, the method effectively improves physical accuracy and manipulation consistency. The work is backed by a new paired dataset and a multi-dimensional evaluation benchmark.

What carries the argument

The plug-and-play 3D prior obtained from explicit geometric simulation, which supplies 3D-aware visual guidance integrated through joint 2D-3D supervision.
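
A minimal sketch of what joint 2D-3D supervision could look like in a standard training loop, assuming the edited image is penalized against a ground-truth frame and a predicted (or rendered) depth map is penalized against annotated depth; the L1 losses, the validity mask, and the weighting `lambda_depth` are assumptions for illustration, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def joint_2d_3d_loss(pred_img: torch.Tensor,
                     gt_img: torch.Tensor,
                     pred_depth: torch.Tensor,
                     gt_depth: torch.Tensor,
                     depth_valid: torch.Tensor,
                     lambda_depth: float = 0.5) -> torch.Tensor:
    """Combine a 2D image term with a 3D depth term (illustrative weighting).

    pred_img, gt_img:      (B, 3, H, W) edited image and ground-truth frame
    pred_depth, gt_depth:  (B, 1, H, W) predicted and annotated depth
    depth_valid:           (B, 1, H, W) mask of pixels with usable depth labels
    """
    loss_2d = F.l1_loss(pred_img, gt_img)
    # Mask the depth term so pixels without annotations contribute no gradient.
    depth_err = torch.abs(pred_depth - gt_depth) * depth_valid
    loss_3d = depth_err.sum() / depth_valid.sum().clamp(min=1.0)
    return loss_2d + lambda_depth * loss_3d
```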

If this is right

  • Physically correct scaling and positioning of objects after editing
  • Higher consistency in manipulation results across varied inputs
  • Improved 3D geometric accuracy measurable on the ManipEval benchmark (one plausible form of such a check is sketched after this list)
  • Outperformance relative to existing generative editing methods including closed-source systems
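
As referenced above, one plausible form of a 3D spatial-control check, assuming the geometric simulation yields an expected 2D footprint for the manipulated object and a detector recovers its footprint in the edited image; the box-IoU scoring below is an illustrative stand-in, not ManipEval's actual metric.

```python
def box_iou(box_a: tuple, box_b: tuple) -> float:
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1) in pixels."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    return inter / (area_a + area_b - inter + 1e-9)

def spatial_control_score(expected_box: tuple, detected_box: tuple) -> float:
    """Agreement between the simulated target footprint and the edited result.

    expected_box: footprint the geometric simulation predicts for the commanded
                  position and scale (assumed available from the 3D prior)
    detected_box: footprint of the object detected in the edited image
                  (e.g. from any off-the-shelf detector)
    """
    return box_iou(expected_box, detected_box)
```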

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 3D guidance mechanism could be applied frame-by-frame to produce physically consistent edits in video sequences
  • The same prior injection might support more reliable object insertion in augmented reality without per-scene tuning
  • The approach suggests a route to parameter-free spatial corrections in other generative tasks that currently suffer from perspective errors

Load-bearing premise

Explicit geometric simulation can be injected as reliable contextual guidance without introducing new inconsistencies or requiring scene-specific calibration that the pipeline does not address.
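
One plausible reading of "plug-and-play" injection, sketched under the assumption that the geometric simulation is rendered to image-aligned maps and simply concatenated with the editor's input channels, so that nothing scene-specific is fitted at edit time; the channel layout and the `edit_model` interface are hypothetical, not the paper's architecture.

```python
import torch

def inject_3d_prior(source_img: torch.Tensor,
                    prior_rgb: torch.Tensor,
                    prior_depth: torch.Tensor,
                    edit_model) -> torch.Tensor:
    """Hypothetical plug-and-play conditioning.

    The geometric simulation is rendered into image-aligned maps and stacked
    with the source image, so no per-scene calibration enters beyond the
    simulation itself.

    source_img:  (B, 3, H, W) image to edit
    prior_rgb:   (B, 3, H, W) simulated preview of the manipulated object
    prior_depth: (B, 1, H, W) simulated depth for the target configuration
    edit_model:  any callable that accepts (B, 7, H, W) conditioning (assumed)
    """
    conditioning = torch.cat([source_img, prior_rgb, prior_depth], dim=1)
    return edit_model(conditioning)
```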

What would settle it

A set of edited images from real scenes in which the manipulated object's depth map and perspective projection deviate from the expected physical outcome for its new position and scale.

Figures

Figures reproduced from arXiv: 2604.07230 by Dewei Zhou, Fan Ma, Ruihang Xu, Xiaolong Shen, Yi Yang.

Figure 1: Comparison of image editing results on three physical scenarios between our PhyEdit and Nano Banana Pro …
Figure 2: Overview of PhyEdit. User and GT inputs are first processed via the 3D transformation module. The resulting …
Figure 3: Samples from RealManip-10K. The objects are annotated with …
Figure 4: Dataset Construction Pipeline. The workflow consists of four key stages: data source filtering, camera-static clip …
Figure 5: Qualitative comparison on ManipEval. The top two rows highlight manipulation accuracy and geometric consistency; …
Figure 6: Continuous object manipulation along a trajectory. Given the initial image and a trajectory, PhyEdit performs …
Figure 7: Architectures of different depth supervision methods …
Figure 9: More qualitative results on ManipEval. For baseline methods, misplaced objects and objects still left at the source are …
Original abstract

Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D-3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PhyEdit, an image editing framework that incorporates explicit geometric simulation as plug-and-play 3D-aware contextual guidance for object manipulation. Combined with joint 2D-3D supervision, the approach is claimed to improve physical accuracy and manipulation consistency. The authors contribute the RealManip-10K real-world dataset with paired images and depth annotations, plus the ManipEval benchmark with multi-dimensional metrics for 3D spatial control and geometric consistency. Experiments report outperformance over baselines and closed-source models in geometric accuracy and consistency.

Significance. If the central claims hold, this work offers a practical route to injecting 3D geometric priors into generative editing pipelines, addressing a clear limitation in current visual models for precise spatial manipulations. The RealManip-10K dataset and ManipEval benchmark are concrete, reusable contributions that can support future research on physically grounded editing. The emphasis on real-world paired data and multi-metric evaluation is a strength.

major comments (2)
  1. Method section (plug-and-play 3D prior description): The central claim that explicit geometric simulation (depth maps and perspective projection) can be injected as reliable contextual guidance without per-scene calibration or new inconsistencies is load-bearing. Real-world depth estimation is inherently noisy, yet the manuscript provides no robustness analysis or explicit handling of alignment errors between the prior and input image; this risks overstating gains on RealManip-10K if mismatches are masked rather than resolved by joint supervision.
  2. Experiments section: The reported outperformance lacks accompanying ablation studies isolating the contribution of the 3D prior versus joint 2D-3D supervision, and no error analysis or failure-case quantification is referenced. Without these, it is impossible to confirm that the physical accuracy improvements are attributable to the proposed mechanism rather than other pipeline choices.
minor comments (2)
  1. Abstract: The statement that the method 'outperforms existing methods' would be strengthened by including at least one key quantitative metric (e.g., average improvement on ManipEval) to give readers an immediate sense of effect size.
  2. Figure captions and visualizations: Side-by-side comparisons of input, edited output, and ground-truth depth or projected geometry would better illustrate the claimed geometric consistency gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of PhyEdit, the RealManip-10K dataset, and the ManipEval benchmark. We address each major comment below and describe the revisions planned for the manuscript.

Point-by-point responses
  1. Referee: Method section (plug-and-play 3D prior description): The central claim that explicit geometric simulation (depth maps and perspective projection) can be injected as reliable contextual guidance without per-scene calibration or new inconsistencies is load-bearing. Real-world depth estimation is inherently noisy, yet the manuscript provides no robustness analysis or explicit handling of alignment errors between the prior and input image; this risks overstating gains on RealManip-10K if mismatches are masked rather than resolved by joint supervision.

    Authors: We agree that robustness to noisy depth estimates is critical for validating the plug-and-play claim. Although the joint 2D-3D supervision is intended to help the model reconcile minor misalignments, the manuscript would be strengthened by explicit analysis. In the revised version, we will add a dedicated robustness subsection with quantitative experiments that introduce controlled noise and alignment perturbations to the depth priors, measuring effects on editing accuracy and consistency; one form such a protocol could take is sketched after these responses. revision: yes

  2. Referee: Experiments section: The reported outperformance lacks accompanying ablation studies isolating the contribution of the 3D prior versus joint 2D-3D supervision, and no error analysis or failure-case quantification is referenced. Without these, it is impossible to confirm that the physical accuracy improvements are attributable to the proposed mechanism rather than other pipeline choices.

    Authors: We appreciate this observation. While the original experiments include baseline comparisons, dedicated ablations isolating the 3D prior from joint supervision were not presented. We will revise the Experiments section to include two new ablation studies: one that removes the 3D geometric guidance while retaining joint 2D-3D supervision, and another that uses only the 3D prior. We will also add an error analysis subsection that quantifies failure cases, such as those involving large depth errors or complex scenes, to better attribute performance gains to the proposed components. revision: yes
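
For the robustness experiments promised in response 1, a minimal sketch of a controlled perturbation protocol for the depth prior, assuming the prior is a per-pixel metric depth map; the noise levels, the multiplicative bias, and the pixel-shift misalignment are illustrative choices, and `evaluate_edit` in the usage comment is a hypothetical evaluation hook, not part of the paper.

```python
import numpy as np

def perturb_depth_prior(depth: np.ndarray,
                        noise_std: float = 0.05,
                        scale_bias: float = 1.0,
                        shift_px: int = 0,
                        rng=None) -> np.ndarray:
    """Apply controlled corruptions to a depth prior before it is injected.

    depth:      (H, W) depth map in metres
    noise_std:  std of additive Gaussian noise, as a fraction of each depth value
    scale_bias: multiplicative depth bias (e.g. 1.1 simulates a 10% over-estimate)
    shift_px:   horizontal misalignment between the prior and the input image
    """
    rng = rng or np.random.default_rng(0)
    noisy = depth * scale_bias + rng.normal(0.0, noise_std, depth.shape) * depth
    if shift_px:
        noisy = np.roll(noisy, shift_px, axis=1)  # crude alignment error
    return noisy

# Usage: sweep perturbation levels and re-run the editing metric at each point.
# for std in (0.0, 0.02, 0.05, 0.10):
#     corrupted = perturb_depth_prior(gt_depth, noise_std=std)
#     score = evaluate_edit(model, image, corrupted)  # hypothetical evaluation hook
```

For the ablations promised in response 2, a sketch of how the two toggles could be encoded so that each run isolates one component; the flag names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical toggles for the two planned ablations."""
    use_3d_prior: bool = True           # inject the simulated geometric guidance
    use_depth_supervision: bool = True  # include the 3D term in the training loss

ABLATIONS = {
    "full":        AblationConfig(True, True),
    "no_3d_prior": AblationConfig(False, True),  # joint supervision only
    "prior_only":  AblationConfig(True, False),  # 3D guidance without depth loss
}
```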

Circularity Check

0 steps flagged

No circularity; method claims rest on external benchmarks and explicit geometric priors without self-referential reduction.

Full rationale

The paper introduces PhyEdit as a framework that injects explicit geometric simulation (depth maps, perspective projection) as plug-and-play 3D guidance, trained with joint 2D-3D supervision, and evaluates on the newly introduced RealManip-10K dataset plus ManipEval metrics. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs (e.g., no scale fitted from data then called a prediction of the same scale). The central claim of improved physical accuracy is supported by comparative experiments against baselines rather than by any self-definitional loop or self-citation chain that forbids alternatives. The 3D prior is described as external simulation, not derived from the model's own outputs, keeping the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the 3D prior is described as plug-and-play geometric simulation, treated as standard computer-vision tooling.

pith-pipeline@v0.9.0 · 5486 in / 993 out tokens · 24436 ms · 2026-05-10T18:50:00.216568+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    A new framework called ERR decomposes UHD image restoration into three frequency stages with specialized sub-networks and introduces the LSUHDIR benchmark dataset of over 82,000 images.

Reference graph

Works this paper leans on

63 extracted references · 32 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, and Matthias Grundmann. 2020. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations. arXiv:2012.09988 [cs.CV] https://arxiv.org/abs/2012.09988

  2. [2]

    Harry G. Barrow, Jay M. Tenenbaum, Robert C. Bolles, and Helen C. Wolf. 1977. Parametric Correspondence and Chamfer Matching: Two New Techniques for Image Matching. In International Joint Conference on Artificial Intelligence

  3. [3]

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. 2023. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. ArXiv abs/2310.10639 (2023)

  4. [4]

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. 2025. Perception Encoder: The best visual embeddings are not at the output of the network. arXiv (2025)

  5. [5]

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. 2024. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning

  6. [6]

    ByteDance Seed Team. 2026. Deeper Thinking, More Accurate Generation | Introducing Seedream 5.0 Lite. https://seed.bytedance.com/en/blog/deeper-thinking-more-accurate-generation-introducing-seedream-5-0-lite

  7. [7]

    Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. 2024. Instruction-based Image Manipulation by Watching How Things Move. arXiv:2412.12087 [cs.CV] https://arxiv.org/abs/2412.12087

  8. [8]

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Lilian...

  9. [9]

    Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. 2025. ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions. arXiv preprint arXiv:2506.03107 (2025)

  10. [10]

    Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. 2023. GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting. arXiv:2311.14521 [cs.CV]

  11. [11]

    Google DeepMind. 2025. Gemini 3 Pro Image – Nano Banana Pro Model Card. https://deepmind.google/models/gemini-image/pro. Accessed: March 2026

  12. [12]

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. 2023. Objaverse-XL: A Universe of 10M+ 3D Objects. arXiv preprint arXiv:2307.05663 (2023)

  13. [13]

    Zheng-Peng Duan, Jiawei Zhang, Siyu Liu, Zheng Lin, Chun-Le Guo, Dongqing Zou, Jimmy Ren, and Chongyi Li. 2025. A Diffusion-Based Framework for Occluded Object Movement. AAAI

  14. [14]

    David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In Neural Information Processing Systems

  15. [15]

    Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'96). 373–382

  16. [16]

    Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, and Haibin Ling. 2020. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. arXiv:2009.03465 [cs.CV] https://arxiv.org/abs/2009.03465

  17. [17]

    Gunnar Farnebäck. 2003. Two-Frame Motion Estimation Based on Polynomial Expansion. In Image Analysis, Josef Bigun and Tomas Gustavsson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 363–370

  18. [18]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

  19. [19]

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. 2021. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 5 (2021), 1562–1577. doi:10.1109/TPAMI.2019.2957464

  20. [20]

     Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. 2025. Vid2World: Crafting Video Diffusion Models to Interactive World Models. arXiv preprint arXiv:2505.14357 (2025)

  21. [21]

     Huang, J

  22. [22]

    Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, and Di Niu. 2025. PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation. In Proceedings of the AAAI Conference on Artificial Intelligence

  23. [23]

    Pengfei Jiang, Mingbao Lin, and Fei Chao. 2024. Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing. arXiv:2407.17847 [cs.CV] https://arxiv.org/abs/2407.17847

  24. [24]

    Glenn Jocher and Jing Qiu. 2026. Ultralytics YOLO26. https://github.com/ultralytics/ultralytics

  25. [25]

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. 2025. How Far Is Video Generation from World Model: A Physical Law Perspective. In International Conference on Machine Learning. PMLR, 28991–29017

  26. [26]

    Black Forest Labs. 2025. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2

  27. [27]

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. 2025. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  28. [28]

    Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, Yi Jin, and Jinjin Zheng. 2024. FreeDrag: Feature dragging for reliable point-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6860–6870

  29. [29]

    Ilya Loshchilov and Frank Hutter. 2017. Decoupled Weight Decay Regularization. In International Conference on Learning Representations

  30. [30]

    David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. In International Journal of Computer Vision, Vol. 60. 91–110. doi:10.1023/B:VISI.0000029664.99615.94

  31. [31]

    Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, and Tanmay Gupta. 2023. OBJECT 3DIT: Language-guided 3D-aware Image Editing. arXiv:2307.11073 [cs.CV] https://arxiv.org/abs/2307.11073

  32. [32]

    Matthias Müller, Adel Bibi, Silvio Giancola, Salman Al-Subaihi, and Bernard Ghanem. 2018. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. arXiv:1803.10794 [cs.CV] https://arxiv.org/abs/1803.10794

  33. [33]

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. 2024. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. arXiv preprint arXiv:2407.02371 (2024)

  34. [34]

    OpenAI. 2025. The new ChatGPT Images is here. https://openai.com/index/new-chatgpt-images-is-here

  35. [35]

    Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. 2023. Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold. In ACM SIGGRAPH 2023 Conference Proceedings

  36. [36]

    Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, and Niloy J. Mitra. 2024. Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D. CVPR (2024)

  37. [37]

    Ye Pang. 2025. Image Generation as a Visual Planner for Robotic Manipulation. arXiv:2512.00532 [cs.CV] https://arxiv.org/abs/2512.00532

  38. [38]

    William Peebles and Saining Xie. 2022. Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748 (2022)

  39. [39]

    Qwen Team. 2025. Qwen-Image-Edit-2511: Improve Consistency. https://qwen.ai/blog?id=qwen-image-edit-2511

  40. [40]

    Qwen Team. 2026. Qwen-Image-2.0: Professional infographics, exquisite photorealism. https://qwen.ai/blog?id=qwen-image-2.0

  41. [41]

    Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

  42. [42]

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision Transformers for Dense Prediction. ArXiv preprint (2021)

  43. [43]

    Rahul Sajnani, Jeroen Vanbaar, Jie Min, Kapil Katyal, and Srinath Sridhar. 2025. GeoDiffuser: Geometry-Based Image Editing with Diffusion Models. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

  44. [44]

    Yujun Shi, Jun Hao Liew, Hanshu Yan, Vincent YF Tan, and Jiashi Feng. 2024. LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos. arXiv preprint arXiv:2405.13722 (2024)

  45. [45]

    Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent YF Tan, and Song Bai. 2023. DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing. arXiv preprint arXiv:2306.14435 (2023)

  46. [46]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  47. [47]

    Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. 2024. VidGen-1M: A Large-Scale Dataset for Text-to-Video Generation. arXiv preprint arXiv:2408.02629 (2024)

  48. [48]

    SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. 2025. SAM 3D: 3Dfy Anything in Images. (2025...

  49. [49]

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. 2025. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  50. [50]

    Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, and Dong Yu. 2025. MotionEdit: Benchmarking and Learning Motion-Centric Image Editing. arXiv preprint arXiv:2512.10284 (2025)

  51. [51]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  52. [52]

     Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. 2025. π³: Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347 (2025)

  53. [53]

    Zihan Wang, Songlin Li, Lingyan Hao, Xinyu Hu, and Bowen Song. 2024. What You See Is What Matters: A Novel Visual and Physics-Based Metric for Evaluating Video Generation Quality. arXiv:2411.13609 [cs.CV] https://arxiv.org/abs/2411.13609

  54. [54]

    Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. 2022. FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling. Proceedings of European Conference of Computer Vision (ECCV) (2022)

  55. [55]

    Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, and Huan Ling. 2025. ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation. arXiv preprint arXiv:2510.04290 (2025)

  56. [56]

    Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A. Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey Allen, and Thomas Kipf. 2024. Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models. In Advances in Neural Information Processing Systems

  57. [57]

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. 2025. Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14483–14494

  58. [58]

    Xin Yu, Tianyu Wang, Soo Ye Kim, Paul Guerrero, Xi Chen, Qing Liu, Zhe Lin, and Xiaojuan Qi. 2025. ObjectMover: Generative Object Movement with Video Prior. arXiv:2503.08037 [cs.GR] https://arxiv.org/abs/2503.08037

  59. [59]

    Qihang Zhang, Yinghao Xu, Chaoyang Wang, Hsin-Ying Lee, Gordon Wetzstein, Bolei Zhou, and Ceyuan Yang. 2024. 3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting. In arXiv

  60. [60]

    Zewei Zhang, Huan Liu, Jun Chen, and Xiangyu Xu. 2025. GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models. In The Thirteenth International Conference on Learning Representations

  61. [61]

    Ruisi Zhao, Zechuan Zhang, Zongxin Yang, and Yi Yang. 2025. 3D Object Manipulation in a Single Image using Generative Models. arXiv:2501.12935 [cs.CV] https://arxiv.org/abs/2501.12935

  62. [62]

    Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. 2020. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proceedings of the AAAI Conference on Artificial Intelligence 34, 07 (Apr. 2020), 12993–13000. doi:10.1609/aaai.v34i07.6999

  63. [63]

     Chaoran Zhu, Hengyi Wang, Yik Lung Pang, and Changjae Oh. 2025. LaVA-Man: Learning Visual Action Representations for Robot Manipulation. arXiv:2508.19391 [cs.RO] https://arxiv.org/abs/2508.19391