pith. machine review for the scientific record.

arxiv: 2604.15871 · v1 · submitted 2026-04-17 · 💻 cs.CV · cs.AI

Recognition: unknown

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image editing benchmark · video editing evaluation · MLLM distillation · human-aligned metrics · unified protocol · cost-effective judges · visual editing assessment

The pith

A single benchmark with distilled evaluators allows fair, low-cost comparison of image and video editing methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniEditBench to solve the problem of scattered evaluation standards for visual editing, where image and video methods use incompatible protocols and automatic scores often fail to match what people prefer. It creates one taxonomy of operations for both modalities, supports both reconstruction-based and instruction-driven editing under the same rules, and replaces heavy large-model judges with much smaller distilled versions that still rate edits on fidelity, alignment, consistency, and naturalness. If successful, this would let researchers run reproducible tests without prohibitive expense and track real progress across the field.

Core claim

UniEditBench supplies a shared protocol covering nine image operations (add, remove, replace, change, stroke-based, extract, adjust, count, reorder) plus eight video operations, including compositional challenges such as counting and spatial reordering. A large MLLM teacher (Qwen3-VL-235B-A22B Instruct) is distilled into 4B and 8B evaluators that output scores across structural fidelity, text alignment, background consistency, naturalness, and, for videos, temporal-spatial consistency, and these lightweight judges retain strong agreement with human ratings while cutting deployment cost substantially.
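
For concreteness, a minimal Python sketch of how the taxonomy and per-sample rubric could be represented. The operation and dimension names come from the paper; the class names, field names, and 1-5 score scale are illustrative assumptions, not the benchmark's actual API.

from dataclasses import dataclass
from enum import Enum

# Operation names follow the paper's taxonomy; everything else here
# (class names, the 1-5 scale) is an illustrative assumption.
class ImageOp(Enum):
    ADD = "add"
    REMOVE = "remove"
    REPLACE = "replace"
    CHANGE = "change"
    STROKE_BASED = "stroke-based"
    EXTRACT = "extract"
    ADJUST = "adjust"
    COUNT = "count"
    REORDER = "reorder"

@dataclass
class EditScore:
    """One judged sample: per-dimension ratings; video samples add a temporal dimension."""
    structural_fidelity: float
    text_alignment: float
    background_consistency: float
    naturalness: float
    temporal_spatial_consistency: float | None = None  # video-only

    def overall(self) -> float:
        dims = [self.structural_fidelity, self.text_alignment,
                self.background_consistency, self.naturalness]
        if self.temporal_spatial_consistency is not None:
            dims.append(self.temporal_spatial_consistency)
        return sum(dims) / len(dims)

print(EditScore(4.0, 5.0, 4.5, 4.0).overall())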

What carries the argument

The distilled 4B/8B MLLM evaluators that deliver multi-dimensional scores aligned with human judgments.
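
A rough sketch of what such a judge looks like in use: it receives the source, the edit instruction, and the edited result, reasons per dimension, then emits numeric scores. The run_mllm stub below is a mock standing in for whatever inference call the released 4B/8B evaluators actually expose; only the dimension names and the explain-then-score pattern follow the paper.

import json

# Hypothetical judge wrapper; run_mllm is a stand-in for the real evaluator call.
RUBRIC = ["structural_fidelity", "text_alignment",
          "background_consistency", "naturalness"]

def build_prompt(instruction: str) -> str:
    return (
        f"Edit instruction: {instruction}\n"
        f"Rate the edited image against the source on: {', '.join(RUBRIC)}.\n"
        "Explain each dimension step by step, then output JSON scores in [1, 5]."
    )

def run_mllm(prompt: str, source: bytes, edited: bytes) -> str:
    # Mock response; a real call would feed both images to the distilled model.
    return json.dumps({dim: 4.0 for dim in RUBRIC})

def score_edit(instruction: str, source: bytes, edited: bytes) -> dict[str, float]:
    raw = run_mllm(build_prompt(instruction), source, edited)
    return {dim: float(v) for dim, v in json.loads(raw).items()}

print(score_edit("replace the red car with a blue bicycle", b"", b""))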

If this is right

  • Different editing approaches can be ranked directly against each other using identical rules and metrics.
  • Video editing studies gain a consistent way to report results that was previously missing.
  • Evaluation runs become cheap enough to repeat across many models and settings.
  • Challenging tasks such as spatial reordering receive explicit coverage instead of being ignored by older metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Papers on new editing techniques could adopt this protocol so that results become easier to compare across publications.
  • The same distillation method might produce affordable judges for other vision-language tasks that currently rely on expensive models.
  • Developers could test whether the evaluators remain reliable when applied to editing outputs from models released after the distillation training.

Load-bearing premise

Distilling the large teacher model into the smaller evaluators preserves accurate judgment on every operation, including the hardest compositional ones, without adding new biases.

What would settle it

Collect human ratings on a fresh set of image and video edits that includes many count and reorder cases, then measure whether the 4B or 8B evaluators show clearly lower correlation with those human ratings than the original large teacher model does.
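
A minimal sketch of that comparison, assuming per-sample human ratings and teacher/student scores have already been collected; the arrays below are placeholders, not results from the paper.

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder arrays: one entry per edited sample on the fresh,
# count/reorder-heavy test set (real values would come from the study).
human      = np.array([4.5, 2.0, 3.5, 5.0, 1.5, 4.0, 2.5, 3.0])
teacher    = np.array([4.4, 2.2, 3.6, 4.8, 1.8, 3.9, 2.7, 3.1])
student_4b = np.array([4.0, 2.5, 3.2, 4.5, 2.2, 3.5, 3.0, 3.3])

for name, preds in [("teacher", teacher), ("student_4b", student_4b)]:
    r, _   = pearsonr(human, preds)
    rho, _ = spearmanr(human, preds)
    print(f"{name}: Pearson r={r:.3f}, Spearman rho={rho:.3f}")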

Figures

Figures reproduced from arXiv: 2604.15871 by Boxi Wu, Chenyang Wang, Deng Cai, Lifan Jiang, Tianrun Wu, Yuhang Pei.

Figure 1: Overview of the UniEditBench framework. (A) Dataset Composition & Detailed Taxonomy: Displays the structured …
Figure 2: The overall pipeline of UniEditBench. (A) Multi-Source Data Aggregation: A comprehensive dataset is constructed by …
Figure 3: Visual results of selected image editing methods on UniEditBench. See Appendix C for more examples.
Figure 4: Data distribution of UniEditBench samples across …
Figure 5: Visual results of selected video editing methods on UniEditBench. See Appendix C for more examples.
Figure 6: Radar chart comparing alignment of automated …
Figure 7: Comprehensive qualitative comparison on Image Editing. The examples cover the full 9-category taxonomy on the …
Figure 8: Comprehensive qualitative comparison on Video Editing. The examples cover the full 8-category taxonomy on the …
Original abstract

The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces UniEditBench, a unified benchmark for image and video editing that supports both reconstruction-based and instruction-driven methods under a shared protocol. It defines a taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, including compositional tasks. The authors distill a large MLLM (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scores on structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency. Experiments are claimed to show strong human agreement and substantial cost reductions relative to the teacher model, with the benchmark and reward models released publicly.

Significance. If the experimental claims on agreement and cost savings are substantiated with rigorous metrics, this benchmark could standardize evaluation across fragmented visual editing paradigms and modalities, addressing misalignment between automatic metrics and human preferences while lowering barriers to scalable assessment. The public release of the benchmark and models is a clear strength for reproducibility.

major comments (2)
  1. [Abstract and Experiments section] The central claim that distilled 4B/8B evaluators 'maintain strong agreement with human judgments' lacks any quantitative details on agreement metrics (e.g., Pearson correlation, Cohen's kappa, or Spearman rho), number of human raters, test-set sizes or splits, statistical significance testing, or controls for data leakage during distillation from the teacher model. This is load-bearing for the reliability claim across all nine image and eight video operations.
  2. [Distillation and Evaluation sections] Distillation process description: No equations, loss formulations, or ablation results are provided to show how multi-dimensional scoring is preserved in the smaller models, leaving open the risk of degradation on challenging compositional tasks such as Count and Reorder. This directly affects the weakest assumption that agreement holds without hidden biases.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., agreement score or cost reduction factor) rather than qualitative statements.
  2. [Figures] Ensure all figure captions explicitly label the evaluation dimensions and clarify whether results are averaged across operations or reported per-operation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] The central claim that distilled 4B/8B evaluators 'maintain strong agreement with human judgments' lacks any quantitative details on agreement metrics (e.g., Pearson correlation, Cohen's kappa, or Spearman rho), number of human raters, test-set sizes or splits, statistical significance testing, or controls for data leakage during distillation from the teacher model. This is load-bearing for the reliability claim across all nine image and eight video operations.

    Authors: We agree that the Experiments section would benefit from more explicit quantitative support for the agreement claims. In the revised manuscript we expand this section with a dedicated table and text reporting Pearson correlations, Cohen's kappa, and Spearman rho per dimension and operation, the number of human raters and inter-rater agreement, test-set sizes and splits, statistical significance tests, and the train/test partitioning used during distillation to mitigate data leakage. These additions directly substantiate the reliability claims for all nine image and eight video operations. revision: yes

  2. Referee: [Distillation and Evaluation sections] Distillation process description: No equations, loss formulations, or ablation results are provided to show how multi-dimensional scoring is preserved in the smaller models, leaving open the risk of degradation on challenging compositional tasks such as Count and Reorder. This directly affects the weakest assumption that agreement holds without hidden biases.

    Authors: We concur that the distillation description is currently insufficient. We have revised the Distillation and Evaluation sections to include the loss equations and formulations employed to transfer multi-dimensional scoring, together with ablation results that compare the 4B/8B models against the teacher on compositional operations including Count and Reorder. These additions clarify preservation of scoring fidelity and address potential degradation or bias concerns. revision: yes
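
One plausible shape for the promised loss, sketched under the assumption that distillation is framed as regression of student-predicted dimension scores onto the teacher's ratings; the paper's actual objective, and any auxiliary loss over the judge's textual explanations, is not specified on this page.

import torch
import torch.nn as nn

# Sketch: a small head on top of pooled student-MLLM features regresses the
# teacher's five dimension scores. Feature size and score range are assumptions.
class ScoreHead(nn.Module):
    def __init__(self, hidden_dim: int = 512, num_dims: int = 5):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_dims)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.proj(pooled)

head = ScoreHead()
pooled_features = torch.randn(8, 512)            # dummy pooled student features
teacher_scores  = torch.rand(8, 5) * 4.0 + 1.0   # dummy teacher ratings in [1, 5]

loss = nn.functional.mse_loss(head(pooled_features), teacher_scores)
loss.backward()
print(f"distillation MSE: {loss.item():.4f}")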

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluator validation

Full rationale

The paper independently defines a new taxonomy of nine image and eight video operations, constructs UniEditBench under a shared protocol for reconstruction-based and instruction-driven methods, and distills lightweight evaluators from an external teacher model (Qwen3-VL-235B-A22B Instruct). Reported agreement with human judgments is obtained via separate empirical evaluation on held-out data rather than by construction from fitted parameters or self-referential equations. No load-bearing self-citations, ansatzes smuggled via prior work, or reductions of predictions to inputs are present; the derivation chain remains self-contained against external benchmarks and human annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the work relies on standard benchmark construction and knowledge distillation techniques.

pith-pipeline@v0.9.0 · 5569 in / 1070 out tokens · 43459 ms · 2026-05-10T08:59:28.327729+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 31 canonical work pages · 14 internal anchors

  1. [1]

    Qingyan Bai, Qiuyu Wang, Hao Ouyang, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Yue Yu, Zichen Liu, et al . [n. d.]. Ditto: Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset. ([n. d.])

  2. [2]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

  3. [3]

    Samyadeep Basu, Mehrdad Saberi, Shweta Bhardwaj, Atoosa Malemir Chegini, Daniela Massiceti, Maziar Sanjabi, Shell Xu Hu, and Soheil Feizi. 2023. Editval: Benchmarking diffusion based text-guided image editing methods.arXiv preprint arXiv:2310.02426(2023)

  4. [5]

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. 2023. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 18392–18402

  5. [6]

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. 2023. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF international conference on computer vision. 22560–22570

  6. [7]

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. 2025. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951(2025)

  7. [8]

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. 2024. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7310–7320

  8. [9]

    Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, et al. 2025. EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing. arXiv preprint arXiv:2512.11715 (2025)

  9. [10]

    Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. 2022. Diffedit: Diffusion-based semantic image editing with mask guidance.arXiv preprint arXiv:2210.11427(2022)

  10. [11]

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. 2025. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683(2025)

  11. [12]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning

  12. [13]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al . 2024. A survey on llm-as-a-judge.The Innovation(2024)

  13. [14]

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2023. Minillm: Knowledge distillation of large language models.arXiv preprint arXiv:2306.08543(2023)

  14. [15]

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2022. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626(2022)

  15. [16]

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi

  16. [17]

    Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing. 7514–7528

  17. [18]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models.Advances in neural information processing systems33 (2020), 6840–6851

  18. [19]

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023. 8003–8017

  19. [20]

    Lifan Jiang, Boxi Wu, Yuhang Pei, Tianrun Wu, Yongyuan Chen, Yan Zhao, Shiyu Yu, and Deng Cai. 2026. SNR-Edit: Structure-Aware Noise Rectification for Inversion-Free Flow-Based Editing.arXiv preprint arXiv:2601.19180(2026)

  20. [21]

    Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, and Qiang Xu. 2023. Pnp inversion: Boosting diffusion-based editing with 3 lines of code. In The Twelfth International Conference on Learning Representations

  21. [22]

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. 2024. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603 (2024)

  22. [23]

    Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. 2025. Flowedit: Inversion-free text-based editing using pre-trained flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19721–19730

  23. [24]

    Black Forest Labs. 2024. FLUX. https://github.com/black-forest-labs/flux

  24. [25]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. 2025. FLUX. 1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space.arXiv preprint arXiv:2506.15742(2025)

  25. [26]

    Guangzhao Li, Yanming Yang, Chenxi Song, and Chi Zhang. 2025. Flowdirector: Training-free flow steering for precise text-to-video editing.arXiv preprint arXiv:2506.05046(2025)

  26. [27]

    Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, and Mengyu Wang. 2025. Five-bench: A fine-grained video editing benchmark for evaluating emerging diffusion and rectified flow models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16672–16681

  27. [28]

    Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, and Jason Kuen. 2025. Lavida-o: Elastic large masked diffusion models for unified multimodal understanding and generation. arXiv preprint arXiv:2509.19244 (2025)

  28. [29]

    Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, et al. 2025. Uniworld-v2: Reinforce image editing with diffusion negative-aware finetuning and mllm implicit feedback. arXiv preprint arXiv:2510.16888 (2025)

  29. [30]

    Xinyao Liao, Xianfang Zeng, Ziye Song, Zhoujie Fu, Gang Yu, and Guosheng Lin. 2025. In-context learning with unpaired clips for instruction-based video editing.arXiv preprint arXiv:2510.14648(2025)

  30. [31]

    Haonan Lin, Yan Chen, Jiahao Wang, Wenbin An, Mengmeng Wang, Feng Tian, Yong Liu, Guang Dai, Jingdong Wang, and Qianying Wang. 2024. Schedule your edit: A simple yet effective diffusion noise schedule for image editing.Advances in Neural Information Processing Systems37 (2024), 115712–115756

  31. [32]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning. arXiv:2304.08485 [cs.CV] https://arxiv.org/abs/2304.08485

  32. [33]

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al . 2025. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761 (2025)

  33. [34]

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al . 2024. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177(2024)

  34. [35]

    Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, and Tao Mei. 2025. Visual autoregressive modeling for instruction-guided image editing.arXiv preprint arXiv:2508.15772(2025)

  35. [36]

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2021. Sdedit: Guided image synthesis and editing with stochastic differential equations.arXiv preprint arXiv:2108.01073(2021)

  36. [37]

    Jiteng Mu, Nuno Vasconcelos, and Xiaolong Wang. 2025. Editar: Unified conditional generation with autoregressive models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 7899–7909

  37. [38]

    Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, and Yueting Zhuang

  38. [39]

    Instructvid2vid: Controllable video editing with natural language instructions. In 2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6

  39. [40]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al

  40. [41]

    Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763

  41. [42]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

  42. [43]

    Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. 2024. Semantic image inversion and editing using rectified stochastic differential equations.arXiv preprint arXiv:2410.10792(2024)

  43. [44]

    Alexander Tanchenko. 2014. Visual-PSNR measure of image quality.Journal of Visual Communication and Image Representation25, 5 (2014), 874–878

  44. [46]

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. 2025. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314(2025)

  45. [47]

    Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. 2024. Taming rectified flow for inversion and editing.arXiv preprint arXiv:2411.04746(2024)

  46. [48]

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Shengming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. 2025. Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  47. [49]

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. 2025. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871(2025)

  48. [50]

    Jay Zhangjie Wu, Xiuyu Li, Difei Gao, Zhen Dong, Jinbin Bai, Aishani Singh, Xiaoyu Xiang, Youzeng Li, Zuwei Huang, Yuanxi Sun, et al. 2023. Cvpr 2023 text guided video editing competition.arXiv preprint arXiv:2310.16003(2023)

  49. [51]

    Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. 2025. Dreamomni2: Multimodal instruction-based editing and generation. arXiv preprint arXiv:2510.06679 (2025)

  50. [52]

    Chenxi Xie, Minghan Li, Shuai Li, Yuhui Wu, Qiaosi Yi, and Lei Zhang. 2025. Dnaedit: Direct noise alignment for text-guided rectified flow editing.arXiv preprint arXiv:2506.01430(2025)

  51. [53]

    Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, and Hao Li. 2026. Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing.arXiv preprint arXiv:2602.08820(2026)

  52. [54]

    Kaixiang Yang, Boyang Shen, Xin Li, Yuchen Dai, Yuxuan Luo, Yueran Ma, Wei Fang, Qiang Li, and Zhiwei Wang. 2025. FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing.arXiv preprint arXiv:2511.12151(2025)

  53. [55]

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. 2025. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275(2025)

  54. [56]

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. 2025. Anyedit: Mastering unified high-quality image editing for any idea. In Proceedings of the Computer Vision and Pattern Recognition Conference. 26125–26135

  55. [58]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision. 3836–3847

  56. [59]

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. 2025. Enabling instructional image editing with in-context generation in large scale diffusion transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  57. [60]

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. 2025. Swift: a scalable lightweight infrastructure for fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 29733–29735
