arxiv: 2606.26551 · v2 · pith:HAOMG7SCnew · submitted 2026-06-25 · 💻 cs.CV

PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

Pith reviewed 2026-06-29 05:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords physics-aware image editing · benchmark dataset · video generation · test-time scaling · real-world instances · physics reasoning · image manipulation · latent reduction

The pith

A new benchmark of video-derived cases reveals major gaps in the physics reasoning of current image editing models on real-world scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhyEditBench to test whether instruction-based image editing models can handle physical dynamics in authentic settings. It assembles 238 high-resolution real-world instances drawn from videos across a taxonomy of four primary classes and twelve subclasses, plus thirty-five synthetic anti-physics cases. Evaluation of existing state-of-the-art methods reveals clear shortfalls in respecting physical constraints. The authors also introduce PhyWorld, a training-free method that applies test-time scaling and latent reduction drawn from video generation, which outperforms comparable approaches and indicates that video processes can function as a reasoning aid for editing tasks.

Core claim

PhyEditBench supplies a hierarchical taxonomy with four primary classes and twelve subclasses, populated by 238 real instances extracted from videos to reflect genuine physical dynamics together with 35 synthetic anti-physics examples. Empirical tests demonstrate that current SOTA editing methods possess substantial limitations in physics-based reasoning. PhyWorld, which employs test-time scaling and a latent reduction strategy derived from video generation, outperforms comparable models and thereby suggests that the video generation process can serve as an effective reasoning mechanism for image editing.

What carries the argument

The PhyEditBench benchmark of video-extracted instances, organized under a hierarchical physics taxonomy, paired with the PhyWorld baseline, which repurposes video generation via test-time scaling and latent reduction.

If this is right

  • SOTA editing methods possess substantial limitations in physics-based reasoning on real scenarios.
  • Video generation processes can serve as a reasoning mechanism for image editing tasks.
  • PhyWorld outperforms comparable models on the benchmark using only test-time scaling and latent reduction.
  • Benchmarks for image editing must incorporate explicit evaluation of physical dynamics rather than visual consistency alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editing pipelines could embed short video-simulation steps to improve physical consistency without full retraining.
  • The taxonomy of classes and subclasses could guide targeted improvements in specific failure modes such as object trajectories or interactions.
  • Future benchmarks might add quantitative metrics for violation severity to distinguish minor from catastrophic physics errors.

Load-bearing premise

The 238 instances pulled from videos are taken to capture authentic physical dynamics in a representative way for testing editing models.

What would settle it

An independent human rating of physical plausibility on the 238 benchmark edits, where PhyWorld shows no consistent advantage over existing SOTA methods.

Figures

Figures reproduced from arXiv: 2606.26551 by Chaoyue Meng, Qi Fan, Shaofeng Zhang, Shaokang He, Shengbin Guo, Shengpeng Xiao, Xunzhi Xiang.

Figure 1: High-quality, high-resolution, real-world examples from PhyEditBench, which encompasses a diverse range of complex physical processes.
Figure 2: (a) shows our benchmark taxonomy and data volume. (b) illustrates the data construction pipeline.
This profound temporal and physical understanding presents a novel pathway for solving complex image editing tasks. Instead of treating editing as a static, single-step pixel transformation, recent state-of-the-art approaches have begun to formulate image editing as a temporal generation process. Methods such …
Figure 3: (a) shows the form of normal data points, including pictures, editing instructions, explanations, and invariants. (b) depicts the data point form of anti-physics, including original images, editing instructions that violate physics, and expected phenomena. (c) illustrates our benchmark scoring pipeline.
Fluid Dynamics. This class focuses on complex liquid motion and multi-phase phenomena, where realistic…
Figure 4: Overview of the proposed PhyWorld pipeline. The editing process begins by initializing multiple Gaussian noise samples. During generation, a latent reduction strategy dynamically drops intermediate frames to compress the sequence and improve efficiency. Finally, a Video Reward Model evaluates all generated candidates, selecting the optimal sequence whose final frame is then extracted as the editing resul…
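The pipeline described in the Figure 4 caption can be sketched roughly as below. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_denoise`, `toy_reward`, and `reduce_latents`, along with every parameter value (candidate count, frame count, step count, and the steps at which frames are dropped), are hypothetical stand-ins for the video generator, the Video Reward Model, and the latent reduction schedule, none of which are specified on this page.

```python
import random


def init_latents(num_frames, dim, rng):
    """Draw one candidate: a sequence of Gaussian-noise frame latents."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


def toy_denoise(latents, step, total_steps):
    """Hypothetical stand-in for one denoising step: shrinks every value."""
    scale = 1.0 - 1.0 / (total_steps - step + 1)
    return [[v * scale for v in frame] for frame in latents]


def reduce_latents(latents, keep_every=2):
    """Drop intermediate frames, always keeping the first and last frame."""
    last = len(latents) - 1
    return [f for i, f in enumerate(latents) if i % keep_every == 0 or i == last]


def toy_reward(latents):
    """Hypothetical stand-in for a video reward model: prefers smooth sequences."""
    total = 0.0
    for prev, curr in zip(latents, latents[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, curr))
    return -total


def edit_via_video(num_candidates=4, num_frames=9, dim=4, steps=6,
                   reduce_at=(2, 4), seed=0):
    """Best-of-N selection over candidate sequences; returns the final frame."""
    rng = random.Random(seed)
    best_score, best_seq = None, None
    for _ in range(num_candidates):
        latents = init_latents(num_frames, dim, rng)
        for step in range(steps):
            latents = toy_denoise(latents, step, steps)
            if step in reduce_at:  # assumed schedule for latent reduction
                latents = reduce_latents(latents)
        score = toy_reward(latents)
        if best_score is None or score > best_score:
            best_score, best_seq = score, latents
    # The last frame of the best-scoring sequence plays the role of the edit.
    return best_seq[-1], best_score, len(best_seq)


final_frame, score, frames_left = edit_via_video()
print(frames_left, len(final_frame))
```

Dropping frames part-way through generation means later steps process a shorter sequence, which is where the efficiency gain in such a scheme would come from; keeping the last frame is essential because it is the one returned as the edited image.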
Figure 5: Qualitative comparison on physical reasoning tasks across five distinct categories. Each row presents two editing variations (Type A/E). PhyWorld demonstrates superior physical plausibility and closer alignment with ground truth, notably excelling in the anti-physics scenario.
5.2 Main Results. Overall Performance. The empirical findings summarized in Tab. 2 encompass performance across both normal and ant…
Figure 6: Correlation between human and VLM evaluations across different physical stages and metrics. We report the Kendall τ rank correlation coefficient on four metrics.
… and physical plausibility reaches up to 0.65 and 0.56, respectively, during the final output stage, logically increasing as the physical transformations become more visually pronounced. Furthermore, the evaluator maintains robust agreement when as…
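For readers unfamiliar with the statistic named in the Figure 6 caption, here is a minimal pure-Python sketch of the Kendall τ rank correlation. It implements the tie-aware τ-b variant, which is an assumption: this page does not say which variant the paper uses. The score lists are made-up illustrative values, not data from the paper.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equal-length score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Hypothetical human and VLM scores for five edited images.
human_scores = [5, 4, 4, 2, 1]
vlm_scores = [5, 3, 4, 2, 2]
print(round(kendall_tau_b(human_scores, vlm_scores), 3))
```

A value near 1 means the automatic evaluator ranks edits in nearly the same order as the human raters, while a value near 0 means the two rankings are essentially unrelated.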
Original abstract

While instruction-based image editing, enabled by multi-modal generative models, has advanced significantly, existing benchmarks lack a comprehensive evaluation of physics-based reasoning, a critical capability for handling real-world scenarios. To address this, we introduce PhyEditBench, a benchmark designed to assess the physical understanding of editing models. Guided by a hierarchical taxonomy, we establish 4 primary classes and 12 subclasses. It comprises 238 high-quality, high-resolution, real-world instances meticulously extracted from videos to capture authentic physical dynamics, alongside 35 synthetic Anti-Physics instances. Our empirical analysis of current SOTA editing methods exposes substantial limitations in their physics-based reasoning. We further propose a training-free baseline named PhyWorld that uses test-time scaling and a latent reduction strategy. PhyWorld outperforms comparable models and suggests that the video generation process can effectively serve as a reasoning mechanism for image editing. The project page is available at https://github.com/Previsior/PhyEditBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhyEditBench, a benchmark for physics-aware image editing consisting of 238 real-world instances extracted from videos (organized into 4 primary classes and 12 subclasses) plus 35 synthetic Anti-Physics instances. It reports that current SOTA instruction-based editing methods exhibit substantial limitations in physics-based reasoning and proposes PhyWorld, a training-free baseline that applies test-time scaling and a latent reduction strategy on video generation models; PhyWorld is shown to outperform comparable models, suggesting video generation can serve as a reasoning mechanism for image editing.

Significance. If the benchmark construction is shown to be representative and unbiased, the work would usefully document gaps in physical reasoning for editing models and provide an initial demonstration that video-generation pipelines can be repurposed for editing. The training-free nature of PhyWorld and the multi-stage taxonomy are positive features that could be built upon.

major comments (2)
  1. [§3] §3 (Benchmark Construction): the claim that the 238 instances 'meticulously extracted from videos to capture authentic physical dynamics' is load-bearing for all subsequent empirical claims, yet the manuscript provides no explicit selection criteria, quantitative diversity statistics (e.g., distribution across subclasses, motion magnitude histograms), or protocol for deriving editing instructions from video frames. Without these, it is impossible to rule out selection bias toward salient but non-representative cases.
  2. [§4, §5] §4 (Empirical Analysis) and §5 (PhyWorld): the reported outperformance of PhyWorld and the conclusion that 'SOTA methods expose substantial limitations' rest on the assumption that the evaluation isolates physics adherence; the text does not describe how ground-truth physics correctness is scored independently of visual quality or instruction following, nor does it provide per-subclass breakdowns or inter-annotator agreement for the real-world portion.
minor comments (2)
  1. [Abstract, §1] The abstract and introduction should explicitly state the total number of editing instructions per instance and whether multiple instructions are evaluated per image.
  2. [Figure 2] Figure 2 (taxonomy diagram) would benefit from an accompanying table listing the exact number of instances per subclass to allow readers to assess balance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below, indicating revisions where appropriate to improve transparency in benchmark construction and evaluation.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): the claim that the 238 instances 'meticulously extracted from videos to capture authentic physical dynamics' is load-bearing for all subsequent empirical claims, yet the manuscript provides no explicit selection criteria, quantitative diversity statistics (e.g., distribution across subclasses, motion magnitude histograms), or protocol for deriving editing instructions from video frames. Without these, it is impossible to rule out selection bias toward salient but non-representative cases.

    Authors: We agree that additional documentation is needed to support the representativeness claim. In the revision we will expand §3 to include: (1) explicit video/frame selection criteria (e.g., minimum motion magnitude, occlusion thresholds, and class balance constraints), (2) quantitative diversity statistics (subclass counts, motion magnitude histograms, and scene-type distributions), and (3) the step-by-step protocol used to derive editing instructions from source video frames. These additions will allow readers to assess potential selection bias. revision: yes

  2. Referee: [§4, §5] §4 (Empirical Analysis) and §5 (PhyWorld): the reported outperformance of PhyWorld and the conclusion that 'SOTA methods expose substantial limitations' rest on the assumption that the evaluation isolates physics adherence; the text does not describe how ground-truth physics correctness is scored independently of visual quality or instruction following, nor does it provide per-subclass breakdowns or inter-annotator agreement for the real-world portion.

    Authors: We acknowledge the need for greater clarity on the evaluation methodology. The revised §4 will: (1) detail the independent physics-correctness scoring rubric (separate from visual quality and instruction adherence), (2) report per-subclass performance tables, and (3) include inter-annotator agreement statistics (e.g., Fleiss' kappa) computed on the real-world annotations. These changes will strengthen the isolation of physics reasoning claims. revision: yes
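The inter-annotator agreement statistic mentioned in the second response, Fleiss' kappa, can be computed as in the sketch below. The rating table is invented for illustration; the paper's annotation protocol, number of raters, and rating categories are not given on this page, so the example layout (three raters, two categories) is purely an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:
        return float("nan")  # all ratings in one category: kappa is undefined
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical table: 4 edits, 3 raters, categories (plausible, implausible).
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(round(fleiss_kappa(ratings), 3))
```

Values close to 1 indicate that raters agree far more often than chance would predict; values at or below 0 indicate agreement no better than chance.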

Circularity Check

0 steps flagged

No significant circularity; benchmark and baseline are externally constructed.

Full rationale

The paper introduces PhyEditBench as a new dataset of 238 video-extracted instances plus 35 synthetic cases, guided by a taxonomy, and evaluates existing SOTA editing methods plus a proposed training-free PhyWorld baseline. No equations, fitted parameters, or derivations are present. No self-citation chains, self-definitional steps, or renamings of known results appear in the provided text. The evaluation is presented as independent of the benchmark construction itself, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5715 in / 1020 out tokens · 24295 ms · 2026-06-29T05:14:22.644807+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 44 canonical work pages · 21 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 3

  2. [2]

    Text2LIVE: Text-Driven Layered Image and Video Editing

    Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2live: Text-driven layered image and video editing. In: European conference on computer vision. pp. 707–723. Springer (2022) 2

  3. [3]

    Physion: Evaluating Physical Prediction from Vision in Humans and Machines

    Bear, D.M., Wang, E., Mrowca, D., Binder, F.J., Tung, H.Y.F., Pramod, R., Holdaway, C., Tao, S., Smith, K., Sun, F.Y., et al.: Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261 (2021) 4, 6

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023) 2, 4, 6

  5. [5]

    InstructPix2Pix: Learning to Follow Image Editing Instructions

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023) 2, 3, 11

  6. [6]

    Video Generation Models as World Simulators

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Ramesh, A., Klar, M., Sohl-Dickstein, J., et al.: Video generation models as world simulators. OpenAI Technical Report (2024), https://openai.com/research/video-generation-models-as-world-simulators 2, 4, 6

  7. [7]

    PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

    Cai, Y., Li, K., Jia, M., Wang, J., Sun, J., Liang, F., Chen, W., Juefei-Xu, F., Wang, C., Thabet, A., et al.: Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation. arXiv preprint arXiv:2512.24551 (2025) 4

  8. [8]

    Action100M: A Large-Scale Video Action Dataset

    Chen, D., Kasarla, T., Bang, Y., Shukor, M., Chung, W., Yu, J., Bolourchi, A., Moutakanni, T., Fung, P.: Action100m: A large-scale video action dataset. arXiv preprint arXiv:2601.10592 (2026) 4

  9. [9]

    Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning

    Chen, Z., Zhou, Q., Shen, Y., Hong, Y., Sun, Z., Gutfreund, D., Gan, C.: Visual chain-of-thought prompting for knowledge-based visual reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 1254–1262 (2024) 6

  10. [10]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 2, 11

  11. [11]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 2, 11

  12. [12]

    Structure and Content-Guided Video Synthesis with Diffusion Models

    Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7346–7356 (2023) 2

  13. [13]

    Shortcut Learning in Deep Neural Networks

    Geirhos, R., Jacobsen, J.H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., Wichmann, F.A.: Shortcut learning in deep neural networks. Nature Machine Intelligence 2(11), 665–673 (2020) 7

  14. [14]

    GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

    Ghosh, D., Hajishirzi, H., Schmidt, L.: Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems 36, 52132–52152 (2023) 2

  15. [15]

    The "Something Something" Video Database for Learning and Evaluating Visual Common Sense

    Goyal, R., Ebrahimi Kahou, S., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al.: The "something something" video database for learning and evaluating visual common sense. In: Proceedings of the IEEE international conference on computer vision. pp. 5842–5850 (2017) 4

  16. [16]

    ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

    Han, F., Jiao, Y., Chen, S., Xu, J., Chen, J., Jiang, Y.G.: Controlthinker: Unveiling latent semantics for controllable image generation through visual reasoning. arXiv preprint arXiv:2506.03596 (2025) 3

  17. [17]

    UniREditBench: A Unified Reasoning-Based Image Editing Benchmark

    Han, F., Wang, Y., Li, C., Liang, Z., Wang, D., Jiao, Y., Wei, Z., Gong, C., Jin, C., Chen, J., et al.: Unireditbench: A unified reasoning-based image editing benchmark. arXiv preprint arXiv:2511.01295 (2025) 2, 3, 4, 8

  18. [18]

    Scaling Image and Video Generation via Test-Time Evolutionary Search

    He, H., Liang, J., Wang, X., Wan, P., Zhang, D., Gai, K., Pan, L.: Scaling image and video generation via test-time evolutionary search. arXiv preprint arXiv:2505.17618 (2025) 3, 9, 10

  19. [19]

    DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

    He, Z., Qu, X., Li, Y., Zhu, T., Huang, S., Cheng, Y.: Diffthinker: Towards generative multimodal reasoning with diffusion models. arXiv preprint arXiv:2512.24165 (2025) 2

  20. [20]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022) 2

  21. [21]

    SmartEdit: Exploring Complex Instruction-Based Image Editing with Multimodal Large Language Models

    Huang, Y., Xie, L., Wang, X., Yuan, Z., Cun, X., Ge, Y., Zhou, J., Dong, C., Huang, R., Zhang, R., et al.: Smartedit: Exploring complex instruction-based image editing with multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8362–8371 (2024) 3

  22. [22]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 2, 11

  23. [23]

    ThinkGen: Generalized Thinking for Visual Generation

    Jiao, S., Lin, Y., Zhong, Y., She, Q., Zhou, W., Lan, X., Huang, Z., Yu, F., Yu, Y., Zhao, Y., et al.: Thinkgen: Generalized thinking for visual generation. arXiv preprint arXiv:2512.23568 (2025) 2

  24. [24]

    Imagic: Text-Based Real Image Editing with Diffusion Models

    Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., Irani, M.: Imagic: Text-based real image editing with diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6007–6017 (2023) 2

  25. [25]

    The Treatment of Ties in Ranking Problems

    Kendall, M.G.: The treatment of ties in ranking problems. Biometrika 33(3), 239–251 (1945) 13

  26. [26]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V., Yan, J., Chiu, M.C., et al.: Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125 (2023) 2, 4

  27. [27]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., et al.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025) 2, 11

  28. [28]

    ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

    Li, H., Jiang, L., Yan, Q., Song, Y., Kang, H., Liu, Z., Lu, X., Wu, B., Cai, D.: Thinkrl-edit: Thinking in reinforcement learning for reasoning-centric image editing. arXiv preprint arXiv:2601.03467 (2026) 2

  29. [29]

    EditThinker: Unlocking Iterative Reasoning for Any Image Editor

    Li, H., Zhang, M., Zheng, D., Guo, Z., Jia, Y., Feng, K., Yu, H., Liu, Y., Feng, Y., Pei, P., et al.: Editthinker: Unlocking iterative reasoning for any image editor. arXiv preprint arXiv:2512.05965 (2025) 2, 3

  30. [30]

    CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing

    Li, Y., Liu, L., Zhang, X., Xue, W., Luo, W., Guo, Y., Tian, Q.: Cogniedit: Dense gradient flow optimization for fine-grained image editing. arXiv preprint arXiv:2512.13276 (2025) 2

  31. [31]

    Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

    Li, Z., Liu, Z., Zhang, Q., Lin, B., Wu, F., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., et al.: Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025) 2, 11

  32. [32]

    Improved Baselines with Visual Instruction Tuning

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024) 3

  33. [33]

    Improving Video Generation with Human Feedback

    Liu, J., Liu, G., Liang, J., Yuan, Z., Liu, X., Zheng, M., Wu, X., Wang, Q., Qin, W., Xia, M., et al.: Improving video generation with human feedback. arXiv preprint arXiv:2501.13918 (2025) 3, 9, 10

  34. [34]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., et al.: Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025) 2, 11

  35. [35]

    Marcus, G.: The next decade in ai: Four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177 (2020) 7

  36. [36]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021) 2

  37. [37]

    Null-Text Inversion for Editing Real Images Using Guided Diffusion Models

    Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text inversion for editing real images using guided diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6038–6047 (2023) 2

  38. [38]

    WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing

    Pan, K., Chen, W., Qiu, H., Yu, Q., Bu, W., Wang, Z., Zhu, Y., Li, J., Tang, S.: Wiseedit: Benchmarking cognition- and creativity-informed image editing. arXiv preprint arXiv:2512.00387 (2025) 2, 3, 4, 8

  39. [39]

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023) 3

  40. [40]

    Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

    Qin, L., Gong, J., Sun, Y., Li, T., Yang, M., Yang, X., Qu, C., Tan, Z., Li, H.: Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606 (2025) 2

  41. [41]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022) 2, 4

  42. [42]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 2, 3

  43. [43]

    Pathways on the Image Manifold: Image Editing via Video Generation

    Rotstein, N., Yona, G., Silver, D., Velich, R., Bensaïd, D., Kimmel, R.: Pathways on the image manifold: Image editing via video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7857– 7866 (2025) 2, 5, 9, 11

  44. [44]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025) 2, 11

  45. [45]

    Principles of Object Perception

    Spelke, E.S.: Principles of object perception. Cognitive science 14(1), 29–56 (1990) 6

  46. [46]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023) 3

  47. [47]

    Team, Q.: Qwen3.5: Accelerating productivity with native multimodal agents (February 2026), https://qwen.ai/blog?id=qwen3.5 9

  48. [48]

    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

    Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., Ross, C.: Winoground: Probing vision and language models for visio-linguistic compositionality. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5238–5248 (2022) 7

  49. [49]

    MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

    Tian, Y., Yang, L., Yang, J., Wang, A., Tian, Y., Zheng, J., Wang, H., Teng, Z., Wang, Z., Wang, Y., et al.: Mmada-parallel: Multimodal large diffusion language models for thinking-aware editing and generation. arXiv preprint arXiv:2511.09611 (2025) 2

  50. [50]

    CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

    Tong, C., Chang, M., Zhang, S., Wang, Y., Liang, C., Zhao, Z., An, R., Zeng, B., Shi, Y., Dai, Y., et al.: Cof-t2i: Video models as pure visual reasoners for text-to-image generation. arXiv preprint arXiv:2601.10061 (2026) 5

  51. [51]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  52. [52]

    Stable diffusion 3.5 large

    Wang, K., Chen, R., Zheng, T., Huang, H.: Imagent: A unified multimodal agent framework for test-time scalable image generation. arXiv preprint arXiv:2511.11483 (2025) 2

  53. [53]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022) 6

  54. [54]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) 2, 11

  55. [55]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) 2, 11

  56. [56]

    ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation

    Wu, J.Z., Ren, X., Shen, T., Cao, T., He, K., Lu, Y., Gao, R., Xie, E., Lan, S., Alvarez, J.M., et al.: Chronoedit: Towards temporal reasoning for image editing and world simulation. arXiv preprint arXiv:2510.04290 (2025) 2, 5, 9, 10, 11

  57. [57]

    KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

    Wu, Y., Li, Z., Hu, X., Ye, X., Zeng, X., Yu, G., Zhu, W., Schiele, B., Yang, M.H., Yang, X.: Kris-bench: Benchmarking next-level intelligent image editing models. arXiv preprint arXiv:2505.16707 (2025) 2, 3, 4, 8

  58. [58]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Ye, Y., He, X., Li, Z., Lin, B., Yuan, S., Yan, Z., Hou, B., Yuan, L.: Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275 (2025) 2

  59. [59]

    CLEVRER: CoLlision Events for Video REpresentation and Reasoning

    Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442 (2019) 4, 6, 7

  60. [60]

    ReasonEdit: Towards Reasoning-Enhanced Image Editing Models

    Yin, F., Liu, S., Han, Y., Wang, Z., Xing, P., Wang, R., Cheng, W., Wang, Y., Li, A., Yin, Z., et al.: Reasonedit: Towards reasoning-enhanced image editing models. arXiv preprint arXiv:2511.22625 (2025) 2

  61. [61]

    NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics

    Yuan, Y., Wang, X., Wickremasinghe, T., Nadir, Z., Ma, B., Chan, S.H.: Newtongen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics. arXiv preprint arXiv:2509.21309 (2025) 4

  62. [62]

    PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

    Zhan, Y.W., Wang, X., Chen, H., Feng, T., Feng, W., Wang, R., Li, G., Li, Q., Zhu, W.: Phyvllm: Physics-guided video language model with motion-appearance disentanglement. arXiv preprint arXiv:2512.04532 (2025) 4

  63. [63]

    MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

    Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems 36, 31428–31449 (2023) 2, 3

  64. [64]

    PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

    Zhang, T., Yu, H.X., Wu, R., Feng, B.Y., Zheng, C., Snavely, N., Wu, J., Freeman, W.T.: Physdreamer: Physics-based interaction with 3d objects via video generation. In: European Conference on Computer Vision. pp. 388–406. Springer (2024) 4

  65. [65]

    Zhang, Z., Chen, Z., Yang, Z., Yang, Y.: Are image-to-video models good zero-shot image editors? arXiv preprint arXiv:2511.19435 (2025) 2

  66. [66]

    Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

    Zhao, X., Zhang, P., Tang, K., Zhu, X., Li, H., Chai, W., Zhang, Z., Xia, R., Zhai, G., Yan, J., et al.: Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. arXiv preprint arXiv:2504.02826 (2025) 2, 3, 4, 8