pith. machine review for the scientific record.

arxiv: 2604.07230 · v2 · submitted 2026-04-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords: image editing · 3D geometric simulation · object manipulation · physical accuracy · generative models · depth supervision · RealManip-10K dataset · ManipEval benchmark

The pith

PhyEdit achieves physically accurate object manipulation in image editing by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing visual generative models often fail at precise spatial manipulations, such as correct object scaling and positioning, because they lack explicit mechanisms for incorporating 3D geometry and perspective projection. To address this, PhyEdit injects a plug-and-play 3D prior derived from geometric simulation to supply contextual visual guidance during editing. The framework pairs this prior with joint supervision from both 2D images and 3D data to raise physical accuracy and manipulation consistency. The authors support the approach with RealManip-10K, a real-world dataset of paired images and depth annotations, plus the ManipEval benchmark, which scores 3D spatial control and geometric consistency. Experiments indicate the method exceeds prior approaches, including closed-source models, on these metrics.
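
To make the geometric premise concrete, here is a minimal pinhole-projection sketch (not from the paper; the focal length, object size, and depths are illustrative assumptions) of why a commanded move in depth fixes the 2D rescaling an edit must apply, which is exactly the constraint the paper says current generative editors violate.

```python
# Minimal pinhole-projection sketch (illustrative; not the paper's code).
# Under perspective projection an object's image footprint scales as f / Z,
# so a commanded move in depth fixes the 2D rescaling the edit must apply.

def projected_width_px(real_width_m: float, depth_m: float, focal_px: float) -> float:
    """On-image width (pixels) of an object of given metric width at a given depth."""
    return focal_px * real_width_m / depth_m

def expected_rescale(depth_before_m: float, depth_after_m: float) -> float:
    """Factor by which the object's footprint should change after a depth move."""
    return depth_before_m / depth_after_m

if __name__ == "__main__":
    f_px = 1200.0        # assumed focal length in pixels
    width_m = 0.40       # assumed object width in metres
    z0, z1 = 2.0, 4.0    # assumed depths before and after the edit
    print(projected_width_px(width_m, z0, f_px))  # 240.0 px at 2 m
    print(projected_width_px(width_m, z1, f_px))  # 120.0 px at 4 m
    print(expected_rescale(z0, z1))               # 0.5: half the apparent width
```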

Core claim

PhyEdit is an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D-3D supervision, the method effectively improves physical accuracy and manipulation consistency. The work is backed by a new paired dataset and a multi-dimensional evaluation benchmark.

What carries the argument

The plug-and-play 3D prior obtained from explicit geometric simulation, which supplies 3D-aware visual guidance integrated through joint 2D-3D supervision.
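
A minimal sketch of what joint 2D-3D supervision could look like in a standard training loop, assuming the edited image is penalized against a ground-truth frame and a predicted (or rendered) depth map is penalized against annotated depth; the L1 losses, the validity mask, and the weighting `lambda_depth` are assumptions for illustration, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def joint_2d_3d_loss(pred_img: torch.Tensor,
                     gt_img: torch.Tensor,
                     pred_depth: torch.Tensor,
                     gt_depth: torch.Tensor,
                     depth_valid: torch.Tensor,
                     lambda_depth: float = 0.5) -> torch.Tensor:
    """Combine a 2D image term with a 3D depth term (illustrative weighting).

    pred_img, gt_img:      (B, 3, H, W) edited image and ground-truth frame
    pred_depth, gt_depth:  (B, 1, H, W) predicted and annotated depth
    depth_valid:           (B, 1, H, W) mask of pixels with usable depth labels
    """
    loss_2d = F.l1_loss(pred_img, gt_img)
    # Mask the depth term so pixels without annotations contribute no gradient.
    depth_err = torch.abs(pred_depth - gt_depth) * depth_valid
    loss_3d = depth_err.sum() / depth_valid.sum().clamp(min=1.0)
    return loss_2d + lambda_depth * loss_3d
```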

If this is right

  • Physically correct scaling and positioning of objects after editing
  • Higher consistency in manipulation results across varied inputs
  • Improved 3D geometric accuracy measurable on the ManipEval benchmark (one plausible form of such a check is sketched after this list)
  • Outperformance relative to existing generative editing methods including closed-source systems
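
As referenced above, one plausible form of a 3D spatial-control check, assuming the geometric simulation yields an expected 2D footprint for the manipulated object and a detector recovers its footprint in the edited image; the box-IoU scoring below is an illustrative stand-in, not ManipEval's actual metric.

```python
def box_iou(box_a: tuple, box_b: tuple) -> float:
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1) in pixels."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    return inter / (area_a + area_b - inter + 1e-9)

def spatial_control_score(expected_box: tuple, detected_box: tuple) -> float:
    """Agreement between the simulated target footprint and the edited result.

    expected_box: footprint the geometric simulation predicts for the commanded
                  position and scale (assumed available from the 3D prior)
    detected_box: footprint of the object detected in the edited image
                  (e.g. from any off-the-shelf detector)
    """
    return box_iou(expected_box, detected_box)
```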

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 3D guidance mechanism could be applied frame-by-frame to produce physically consistent edits in video sequences
  • The same prior injection might support more reliable object insertion in augmented reality without per-scene tuning
  • The approach suggests a route to parameter-free spatial corrections in other generative tasks that currently suffer from perspective errors

Load-bearing premise

Explicit geometric simulation can be injected as reliable contextual guidance without introducing new inconsistencies or requiring scene-specific calibration that the pipeline does not address.
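
One plausible reading of "plug-and-play" injection, sketched under the assumption that the geometric simulation is rendered to image-aligned maps and simply concatenated with the editor's input channels, so that nothing scene-specific is fitted at edit time; the channel layout and the `edit_model` interface are hypothetical, not the paper's architecture.

```python
import torch

def inject_3d_prior(source_img: torch.Tensor,
                    prior_rgb: torch.Tensor,
                    prior_depth: torch.Tensor,
                    edit_model) -> torch.Tensor:
    """Hypothetical plug-and-play conditioning.

    The geometric simulation is rendered into image-aligned maps and stacked
    with the source image, so no per-scene calibration enters beyond the
    simulation itself.

    source_img:  (B, 3, H, W) image to edit
    prior_rgb:   (B, 3, H, W) simulated preview of the manipulated object
    prior_depth: (B, 1, H, W) simulated depth for the target configuration
    edit_model:  any callable that accepts (B, 7, H, W) conditioning (assumed)
    """
    conditioning = torch.cat([source_img, prior_rgb, prior_depth], dim=1)
    return edit_model(conditioning)
```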

What would settle it

A set of edited images from real scenes in which the manipulated object's depth map and perspective projection deviate from the expected physical outcome for its new position and scale.

Figures

Figures reproduced from arXiv: 2604.07230 by Dewei Zhou, Fan Ma, Ruihang Xu, Xiaolong Shen, Yi Yang.

Figure 1: Comparison of image editing results on three physical scenarios between our PhyEdit and Nano Banana Pro …
Figure 2: Overview of PhyEdit. User and GT inputs are first processed via the 3D transformation module. The resulting …
Figure 3: Samples from RealManip-10K. The objects are annotated with …
Figure 4: Dataset Construction Pipeline. The workflow consists of four key stages: data source filtering, camera-static clip …
Figure 5: Qualitative comparison on ManipEval. The top two rows highlight manipulation accuracy and geometric consistency; …
Figure 6: Continuous object manipulation along a trajectory. Given the initial image and a trajectory, PhyEdit performs …
Figure 7: Architectures of different depth supervision methods …
Figure 9: More qualitative results on ManipEval. For baseline methods, misplaced objects and objects still left at the source are …
Original abstract

Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D-3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PhyEdit, an image editing framework that incorporates explicit geometric simulation as plug-and-play 3D-aware contextual guidance for object manipulation. Combined with joint 2D-3D supervision, the approach is claimed to improve physical accuracy and manipulation consistency. The authors contribute the RealManip-10K real-world dataset with paired images and depth annotations, plus the ManipEval benchmark with multi-dimensional metrics for 3D spatial control and geometric consistency. Experiments report outperformance over baselines and closed-source models in geometric accuracy and consistency.

Significance. If the central claims hold, this work offers a practical route to injecting 3D geometric priors into generative editing pipelines, addressing a clear limitation in current visual models for precise spatial manipulations. The RealManip-10K dataset and ManipEval benchmark are concrete, reusable contributions that can support future research on physically grounded editing. The emphasis on real-world paired data and multi-metric evaluation is a strength.

major comments (2)
  1. Method section (plug-and-play 3D prior description): The central claim that explicit geometric simulation (depth maps and perspective projection) can be injected as reliable contextual guidance without per-scene calibration or new inconsistencies is load-bearing. Real-world depth estimation is inherently noisy, yet the manuscript provides no robustness analysis or explicit handling of alignment errors between the prior and input image; this risks overstating gains on RealManip-10K if mismatches are masked rather than resolved by joint supervision.
  2. Experiments section: The reported outperformance lacks accompanying ablation studies isolating the contribution of the 3D prior versus joint 2D-3D supervision, and no error analysis or failure-case quantification is referenced. Without these, it is impossible to confirm that the physical accuracy improvements are attributable to the proposed mechanism rather than other pipeline choices.
minor comments (2)
  1. Abstract: The statement that the method 'outperforms existing methods' would be strengthened by including at least one key quantitative metric (e.g., average improvement on ManipEval) to give readers an immediate sense of effect size.
  2. Figure captions and visualizations: Side-by-side comparisons of input, edited output, and ground-truth depth or projected geometry would better illustrate the claimed geometric consistency gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of PhyEdit, the RealManip-10K dataset, and the ManipEval benchmark. We address each major comment below and describe the revisions planned for the manuscript.

Point-by-point responses
  1. Referee: Method section (plug-and-play 3D prior description): The central claim that explicit geometric simulation (depth maps and perspective projection) can be injected as reliable contextual guidance without per-scene calibration or new inconsistencies is load-bearing. Real-world depth estimation is inherently noisy, yet the manuscript provides no robustness analysis or explicit handling of alignment errors between the prior and input image; this risks overstating gains on RealManip-10K if mismatches are masked rather than resolved by joint supervision.

    Authors: We agree that robustness to noisy depth estimates is critical for validating the plug-and-play claim. Although the joint 2D-3D supervision is intended to help the model reconcile minor misalignments, the manuscript would be strengthened by explicit analysis. In the revised version, we will add a dedicated robustness subsection with quantitative experiments that introduce controlled noise and alignment perturbations to the depth priors, measuring effects on editing accuracy and consistency; one form such a protocol could take is sketched after these responses. revision: yes

  2. Referee: Experiments section: The reported outperformance lacks accompanying ablation studies isolating the contribution of the 3D prior versus joint 2D-3D supervision, and no error analysis or failure-case quantification is referenced. Without these, it is impossible to confirm that the physical accuracy improvements are attributable to the proposed mechanism rather than other pipeline choices.

    Authors: We appreciate this observation. While the original experiments include baseline comparisons, dedicated ablations isolating the 3D prior from joint supervision were not presented. We will revise the Experiments section to include two new ablation studies: one that removes the 3D geometric guidance while retaining joint 2D-3D supervision, and another that uses only the 3D prior. We will also add an error analysis subsection that quantifies failure cases, such as those involving large depth errors or complex scenes, to better attribute performance gains to the proposed components. revision: yes
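
For the robustness experiments promised in response 1, a minimal sketch of a controlled perturbation protocol for the depth prior, assuming the prior is a per-pixel metric depth map; the noise levels, the multiplicative bias, and the pixel-shift misalignment are illustrative choices, and `evaluate_edit` in the usage comment is a hypothetical evaluation hook, not part of the paper.

```python
import numpy as np

def perturb_depth_prior(depth: np.ndarray,
                        noise_std: float = 0.05,
                        scale_bias: float = 1.0,
                        shift_px: int = 0,
                        rng=None) -> np.ndarray:
    """Apply controlled corruptions to a depth prior before it is injected.

    depth:      (H, W) depth map in metres
    noise_std:  std of additive Gaussian noise, as a fraction of each depth value
    scale_bias: multiplicative depth bias (e.g. 1.1 simulates a 10% over-estimate)
    shift_px:   horizontal misalignment between the prior and the input image
    """
    rng = rng or np.random.default_rng(0)
    noisy = depth * scale_bias + rng.normal(0.0, noise_std, depth.shape) * depth
    if shift_px:
        noisy = np.roll(noisy, shift_px, axis=1)  # crude alignment error
    return noisy

# Usage: sweep perturbation levels and re-run the editing metric at each point.
# for std in (0.0, 0.02, 0.05, 0.10):
#     corrupted = perturb_depth_prior(gt_depth, noise_std=std)
#     score = evaluate_edit(model, image, corrupted)  # hypothetical evaluation hook
```

For the ablations promised in response 2, a sketch of how the two toggles could be encoded so that each run isolates one component; the flag names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical toggles for the two planned ablations."""
    use_3d_prior: bool = True           # inject the simulated geometric guidance
    use_depth_supervision: bool = True  # include the 3D term in the training loss

ABLATIONS = {
    "full":        AblationConfig(True, True),
    "no_3d_prior": AblationConfig(False, True),  # joint supervision only
    "prior_only":  AblationConfig(True, False),  # 3D guidance without depth loss
}
```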

Circularity Check

0 steps flagged

No circularity; method claims rest on external benchmarks and explicit geometric priors without self-referential reduction.

Full rationale

The paper introduces PhyEdit as a framework that injects explicit geometric simulation (depth maps, perspective projection) as plug-and-play 3D guidance, trained with joint 2D-3D supervision, and evaluates on the newly introduced RealManip-10K dataset plus ManipEval metrics. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs (e.g., no scale fitted from data then called a prediction of the same scale). The central claim of improved physical accuracy is supported by comparative experiments against baselines rather than by any self-definitional loop or self-citation chain that forbids alternatives. The 3D prior is described as external simulation, not derived from the model's own outputs, keeping the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the 3D prior is described as plug-and-play geometric simulation, treated as standard computer-vision tooling.

pith-pipeline@v0.9.0 · 5486 in / 993 out tokens · 24436 ms · 2026-05-10T18:50:00.216568+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark

    cs.CV · 2026-04 · unverdicted · novelty 7.0

    A new framework called ERR decomposes UHD image restoration into three frequency stages with specialized sub-networks and introduces the LSUHDIR benchmark dataset of over 82,000 images.

Reference graph

Works this paper leans on

63 extracted references · 32 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, and Matthias Grundmann. 2020. Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations. arXiv:2012.09988 [cs.CV] https://arxiv.org/abs/2012.09988

  2. [2]

    Harry G. Barrow, Jay M. Tenenbaum, Robert C. Bolles, and Helen C. Wolf. 1977. Parametric Correspondence and Chamfer Matching: Two New Techniques for Image Matching. In International Joint Conference on Artificial Intelligence

  3. [3]

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. 2023. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. ArXiv abs/2310.10639 (2023)

  4. [4]

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. 2025. Perception Encoder: The best visual embeddings are not at the output of the network. arXiv (2025)

  5. [5]

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. 2024. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning

  6. [6]

    ByteDance Seed Team. 2026. Deeper Thinking, More Accurate Generation | Introducing Seedream 5.0 Lite. https://seed.bytedance.com/en/blog/deeper-thinking-more-accurate-generation-introducing-seedream-5-0-lite

  7. [7]

    Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. 2024. Instruction-based Image Manipulation by Watching How Things Move. arXiv:2412.12087 [cs.CV] https://arxiv.org/abs/2412.12087

  8. [8]

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Lilian...

  9. [9]

    Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. 2025. ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions. arXiv preprint arXiv:2506.03107 (2025)

  10. [10]

    Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. 2023. GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting. arXiv:2311.14521 [cs.CV]

  11. [11]

    Google DeepMind. 2025. Gemini 3 Pro Image – Nano Banana Pro Model Card. https://deepmind.google/models/gemini-image/pro. Accessed: March 2026

  12. [12]

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. 2023. Objaverse-XL: A Universe of 10M+ 3D Objects. arXiv preprint arXiv:2307.05663 (2023)

  13. [13]

    Zheng-Peng Duan, Jiawei Zhang, Siyu Liu, Zheng Lin, Chun-Le Guo, Dongqing Zou, Jimmy Ren, and Chongyi Li. 2025. A Diffusion-Based Framework for Occluded Object Movement. AAAI

  14. [14]

    David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In Neural Information Processing Systems

  15. [15]

    Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'96). 373–382

  16. [16]

    Heng Fan, Hexin Bai, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Harshit, Mingzhen Huang, Juehuan Liu, Yong Xu, Chunyuan Liao, Lin Yuan, and Haibin Ling. 2020. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. arXiv:2009.03465 [cs.CV] https://arxiv.org/abs/2009.03465

  17. [17]

    Gunnar Farnebäck. 2003. Two-Frame Motion Estimation Based on Polynomial Expansion. In Image Analysis, Josef Bigun and Tomas Gustavsson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 363–370

  18. [18]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

  19. [19]

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. 2021. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 5 (2021), 1562–1577. doi:10.1109/TPAMI.2019.2957464

  20. [20]

     Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. 2025. Vid2World: Crafting Video Diffusion Models to Interactive World Models. arXiv preprint arXiv:2505.14357 (2025)

  21. [21]

     Huang, J

  22. [22]

    Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, and Di Niu. 2025. PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation. In Proceedings of the AAAI Conference on Artificial Intelligence

  23. [23]

    Pengfei Jiang, Mingbao Lin, and Fei Chao. 2024. Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing. arXiv:2407.17847 [cs.CV] https://arxiv.org/abs/2407.17847

  24. [24]

    Glenn Jocher and Jing Qiu. 2026. Ultralytics YOLO26. https://github.com/ultralytics/ultralytics

  25. [25]

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. 2025. How Far Is Video Generation from World Model: A Physical Law Perspective. In International Conference on Machine Learning. PMLR, 28991–29017

  26. [26]

    Black Forest Labs. 2025. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2

  27. [27]

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. 2025. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  28. [28]

    Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, Yi Jin, and Jinjin Zheng. 2024. FreeDrag: Feature dragging for reliable point-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6860–6870

  29. [29]

    Ilya Loshchilov and Frank Hutter. 2017. Decoupled Weight Decay Regularization. In International Conference on Learning Representations

  30. [30]

    David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. In International Journal of Computer Vision, Vol. 60. 91–110. doi:10.1023/B:VISI.0000029664.99615.94

  31. [31]

    Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, and Tanmay Gupta. 2023. OBJECT 3DIT: Language-guided 3D-aware Image Editing. arXiv:2307.11073 [cs.CV] https://arxiv.org/abs/2307.11073

  32. [32]

    Matthias Müller, Adel Bibi, Silvio Giancola, Salman Al-Subaihi, and Bernard Ghanem. 2018. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. arXiv:1803.10794 [cs.CV] https://arxiv.org/abs/1803.10794

  33. [33]

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. 2024. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. arXiv preprint arXiv:2407.02371 (2024)

  34. [34]

    OpenAI. 2025. The new ChatGPT Images is here. https://openai.com/index/new-chatgpt-images-is-here

  35. [35]

    Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. 2023. Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold. In ACM SIGGRAPH 2023 Conference Proceedings

  36. [36]

    Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, and Niloy J. Mitra. 2024. Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D. CVPR (2024)

  37. [37]

    Ye Pang. 2025. Image Generation as a Visual Planner for Robotic Manipulation. arXiv:2512.00532 [cs.CV] https://arxiv.org/abs/2512.00532

  38. [38]

    William Peebles and Saining Xie. 2022. Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748 (2022)

  39. [39]

    Qwen Team. 2025. Qwen-Image-Edit-2511: Improve Consistency. https://qwen.ai/blog?id=qwen-image-edit-2511

  40. [40]

    Qwen Team. 2026. Qwen-Image-2.0: Professional infographics, exquisite photorealism. https://qwen.ai/blog?id=qwen-image-2.0

  41. [41]

    Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

  42. [42]

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision Transformers for Dense Prediction. ArXiv preprint (2021)

  43. [43]

    Rahul Sajnani, Jeroen Vanbaar, Jie Min, Kapil Katyal, and Srinath Sridhar. 2025. GeoDiffuser: Geometry-Based Image Editing with Diffusion Models. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

  44. [44]

    Yujun Shi, Jun Hao Liew, Hanshu Yan, Vincent YF Tan, and Jiashi Feng. 2024. LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos. arXiv preprint arXiv:2405.13722 (2024)

  45. [45]

    Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent YF Tan, and Song Bai. 2023. DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing. arXiv preprint arXiv:2306.14435 (2023)

  46. [46]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  47. [47]

    Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. 2024. VidGen-1M: A Large-Scale Dataset for Text-to-Video Generation. arXiv preprint arXiv:2408.02629 (2024)

  48. [48]

    SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. 2025. SAM 3D: 3Dfy Anything in Images. (2025...

  49. [49]

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. 2025. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  50. [50]

    Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, and Dong Yu. 2025. MotionEdit: Benchmarking and Learning Motion-Centric Image Editing. arXiv preprint arXiv:2512.10284 (2025)

  51. [51]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  52. [52]

     Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. 2025. π³: Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347 (2025)

  53. [53]

    Zihan Wang, Songlin Li, Lingyan Hao, Xinyu Hu, and Bowen Song. 2024. What You See Is What Matters: A Novel Visual and Physics-Based Metric for Evaluating Video Generation Quality. arXiv:2411.13609 [cs.CV] https://arxiv.org/abs/2411.13609

  54. [54]

    Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. 2022. FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling. Proceedings of European Conference of Computer Vision (ECCV) (2022)

  55. [55]

    Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, and Huan Ling. 2025. ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation. arXiv preprint arXiv:2510.04290 (2025)

  56. [56]

    Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A. Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey Allen, and Thomas Kipf. 2024. Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models. In Advances in Neural Information Processing Systems

  57. [57]

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. 2025. Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14483–14494

  58. [58]

    Xin Yu, Tianyu Wang, Soo Ye Kim, Paul Guerrero, Xi Chen, Qing Liu, Zhe Lin, and Xiaojuan Qi. 2025. ObjectMover: Generative Object Movement with Video Prior. arXiv:2503.08037 [cs.GR] https://arxiv.org/abs/2503.08037

  59. [59]

    Qihang Zhang, Yinghao Xu, Chaoyang Wang, Hsin-Ying Lee, Gordon Wetzstein, Bolei Zhou, and Ceyuan Yang. 2024. 3DitScene: Editing Any Scene via Language-guided Disentangled Gaussian Splatting. In arXiv

  60. [60]

    Zewei Zhang, Huan Liu, Jun Chen, and Xiangyu Xu. 2025. GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models. In The Thirteenth International Conference on Learning Representations

  61. [61]

    Ruisi Zhao, Zechuan Zhang, Zongxin Yang, and Yi Yang. 2025. 3D Object Manipulation in a Single Image using Generative Models. arXiv:2501.12935 [cs.CV] https://arxiv.org/abs/2501.12935

  62. [62]

    Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. 2020. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Proceedings of the AAAI Conference on Artificial Intelligence 34, 07 (Apr. 2020), 12993–13000. doi:10.1609/aaai.v34i07.6999

  63. [63]

     Chaoran Zhu, Hengyi Wang, Yik Lung Pang, and Changjae Oh. 2025. LaVA-Man: Learning Visual Action Representations for Robot Manipulation. arXiv:2508.19391 [cs.RO] https://arxiv.org/abs/2508.19391