arxiv: 2605.21487 · v2 · pith:YGRVTMJSnew · submitted 2026-05-20 · 💻 cs.CV

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Pith reviewed 2026-05-25 05:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal models · image editing · model tuning · data synthesis pipeline · multimodal capabilities · intelligent editing instructions · VQA transformation

The pith

A single intelligent editing task with complex instructions improves understanding, generation, and editing in unified multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current approaches to training unified multimodal models mix multiple tasks, leading to conflicts and performance trade-offs that require elaborate pipelines and balancing. The paper claims that image editing, when made sufficiently demanding through complex instructions, can serve as a single general task that simultaneously strengthens all three capabilities. It introduces an automated pipeline to convert VQA data into editing instructions containing embedded questions and nested logic, producing a 148k dataset. Experiments show that tuning on this dataset alone yields gains across understanding, generation, and editing on models such as BAGEL and Janus-Pro, without auxiliary operations or multi-stage training.

Core claim

Uni-Edit shows that image editing is inherently suited as a general task for unified multimodal models because it requires both visual understanding and generation; existing simplistic editing data underuses understanding capacity, but an automated synthesis pipeline that embeds reasoning into instructions creates data that lets one training stage on one dataset improve all three capabilities at once.

What carries the argument

The automated data synthesis pipeline that converts diverse VQA data into complex editing instructions with embedded questions and nested logic to form the Uni-Edit-148k dataset.
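
As an illustration of what "embedding a question into an editing instruction" could look like, here is a minimal Python sketch. It is not the authors' pipeline: the paper reportedly uses GPT-4o for classification and rewriting, whereas this sketch substitutes fixed templates. The `VQASample` class, the `TEMPLATES` table, the field names, and the choice to keep the answer out of the instruction text are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VQASample:
    """A hypothetical VQA record; field names are assumptions."""
    image_path: str
    question: str
    answer: str
    edit_type: str  # e.g. "count" or "caption"


# Hypothetical templates standing in for the LLM rewriting step.
# Each one embeds the question and makes the edit depend on its answer.
TEMPLATES = {
    "count": (
        'Examine the original image to answer: "{question}" '
        "Then add a group of balloons whose total count equals that answer."
    ),
    "caption": (
        'Analyze the original image to answer: "{question}" '
        "Then create a chalkboard image with a handwritten caption stating that answer."
    ),
}


def build_edit_instruction(sample: VQASample) -> dict:
    """Turn a VQA sample into an editing record with an embedded question."""
    template = TEMPLATES.get(sample.edit_type)
    if template is None:
        raise ValueError(f"unsupported edit type: {sample.edit_type}")
    return {
        "source_image": sample.image_path,
        "instruction": template.format(question=sample.question),
        # Assumption: the answer is not written into the instruction, so a
        # model must work it out from the image before it can edit.
        "hidden_answer": sample.answer,
        "edit_type": sample.edit_type,
    }
```

The point of the design is that the edit cannot be carried out correctly without first answering the embedded question, which is the property the review calls reasoning-intensive.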

If this is right

  • One training stage on one dataset replaces multi-stage mixed pipelines and balancing tricks.
  • Performance on understanding, generation, and editing all rise together without task conflicts.
  • Editing data can be scaled automatically from existing VQA sources while preserving reasoning demands.
  • The approach works across different unified models without model-specific auxiliary operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Task complexity may matter more than task diversity for avoiding conflicts in multimodal training.
  • The synthesis method could be adapted to create reasoning-intensive data for other paired capabilities such as captioning paired with generation.
  • If the gains hold, unified models might be tuned more efficiently by focusing on one well-designed cross-cutting task.

Load-bearing premise

The synthesized complex editing instructions are what drive gains in understanding capacity, rather than dataset artifacts, scale, or evaluation choices.

What would settle it

Retraining the same base models on the Uni-Edit-148k dataset and measuring no gain or a drop on standard understanding and generation benchmarks would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.21487 by Dian Zheng, Hongbo Liu, Hongsheng Li, Hongyu Li, Kaituo Feng, Kai Zou, Manyuan Zhang.

Figure 1: Overview of Uni-Edit. We introduce intelligent image editing as a general tuning task for UMM. By transforming VQA into reasoning-intensive instructions and generating target images via Nano-Pro, we build Uni-Edit-148k. Breaking the trade-offs of existing multi-data mixing strategy, it enhances understanding, generation, and editing using only one task, one dataset, and one training stage. Note our automat…
Figure 2: Data Construction pipeline. We first employ GPT-4o to classify the data from LLaVA-OV1.5 into eight distinct edit types, including attribute, caption, math, grounding, and world knowledge. Next, for each category, we use GPT-4o to embed the original question into an editing instruction and explicitly require the model to perform further editing operations based on the answer to the question. This process al…
Figure 3: Data Distribution of Uni-Edit-148k and Uni-Edit-40k. To this end, we fine-tuned BAGEL on understanding tasks using two state-of-the-art open-source datasets: Bee [29] and LLaVA-OV1.5 [5]. As shown in …
Figure 4: Tuning pipeline. In stage 1, we fine-tune BAGEL on our Uni-Edit data using only the generation loss. In stage 2, we align the distribution of the understanding head with the fine-tuned model using 80k understanding samples. MOT Layer means all of the transformer blocks in BAGEL. Both Und. and Gen. heads are a single linear layer. ▷ For OCR and caption, we require the model to first generate a caption or perfo…
Figure 5: Comparison of image generation results between Uni-Edit and BAGEL. Tuned on Uni-Edit, the model demonstrates significant improvements in prompt understanding, knowledge reasoning, spatial perception, image composition, and aesthetic quality. … boosts the understanding and reasoning ability of the model, resulting in substantial gains on the WISE benchmark. Additionally, since GenEval evaluates spatial reason…
Figure 6: Comparison of image editing results between Uni-Edit and BAGEL. Tuned on Uni-Edit, the model shows significant improvements in instruction following, logic, and spatial reasoning
Figure 7: Example of Shape Type in Uni-Edit-148k. Origin Question: How many rubber balls are the same color as the small metallic block? Edit Instruction: Identify Examine the original image to determine the count of rubber balls that are the same color as the small metallic block, based on the question provided. Then, synthesize a visually distinct group of balloons, ensuring the total count matches the number of the…
Figure 8: Example of Count Type in Uni-Edit-148k. Origin Question: What is the main subject of this image? Edit Instruction: Analyze the original image to identify the main subject based on the given question. Create a new image displaying a close-up of a text medium, such as a chalkboard or parchment. Write a descriptive caption that specifically highlights the main subject of the original image using a 'Handwritten'…
Figure 9: Example of Caption Type in Uni-Edit-148k.
Figure 10
Figure 11: Example of Color Type in Uni-Edit-148k.
Figure 12: Example of OCR Type in Uni-Edit-148k. Origin Question: Given the area of the parallelogram ABCD is 102 and the lengths of sides AB and AD are 23 and 14 respectively, calculate the degree of the BAD angle. Round computations to 2 decimal places. Edit Instruction: Calculate the degree of the BAD angle in a parallelogram where the area is 102, and the lengths of sides AB and AD are 23 and 14 respectively (Roun…
Figure 13: Example of Math Type in Uni-Edit-148k.
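
The math-type example quoted in the figure captions above (parallelogram ABCD with area 102, AB = 23, AD = 14) shows the kind of reasoning an instruction can hide inside an edit. A short Python sketch of that calculation follows, using the standard relation area = AB · AD · sin(∠BAD). The function name is hypothetical, and the source does not show the expected answer, so this is a reader's check rather than the dataset's ground truth.

```python
import math


def parallelogram_angle_deg(area: float, side_a: float, side_b: float) -> float:
    """Acute angle between two adjacent sides, from area = a * b * sin(angle)."""
    sine = area / (side_a * side_b)
    if not 0.0 < sine <= 1.0:
        raise ValueError("area is inconsistent with the side lengths")
    # asin returns the acute solution; 180 degrees minus it has the same sine.
    return math.degrees(math.asin(sine))


angle_bad = parallelogram_angle_deg(102, 23, 14)
```

With full precision the result is about 18.47 degrees. The question asks for intermediate values to be rounded to two decimal places, which can shift the final digits, so the dataset's stored answer may differ slightly.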
Original abstract

Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that image editing can function as a general task for unified multimodal model (UMM) tuning. By synthesizing a dataset (Uni-Edit-148k) of complex editing instructions with embedded questions and nested logic from VQA data, tuning solely on this single task, single stage, and single dataset improves understanding, generation, and editing capabilities simultaneously on models such as BAGEL and Janus-Pro, avoiding the performance trade-offs of mixed multi-task training.

Significance. If the central empirical claim holds after proper controls, the result would be significant: it would demonstrate that a single inherently multi-capability task (editing) can produce mutual reinforcement across understanding and generation without auxiliary operations or multi-stage balancing, simplifying UMM training pipelines. The automated synthesis pipeline for reasoning-intensive editing instructions would also be a methodological contribution if shown to be the driver of gains.

major comments (2)
  1. [Abstract; Experiments] Abstract and Experiments section: The claim that 'tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities' and that gains arise specifically from the 'complex and effective editing instructions with embedded questions and nested logic' (rather than the editing task itself or dataset artifacts) is load-bearing but unsupported by controls. No ablation is reported comparing Uni-Edit to (a) simplistic editing instructions on identical image pairs or (b) non-editing multi-task baselines using the same VQA source data; without these, attribution to the 'intelligent' property cannot be verified.
  2. [Method / Data Synthesis] Data synthesis pipeline description: The pipeline is presented as transforming VQA data into complex instructions, but no quantitative validation is supplied (e.g., distribution of nesting depth, percentage of embedded questions, or inter-annotator agreement on instruction quality and edited-image fidelity). This leaves open whether the synthesized data actually demands and improves understanding capacity as asserted.
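
The second comment asks for numbers such as the share of instructions with embedded questions and a nesting-depth distribution. A rough Python sketch of how such statistics might be gathered is below. The cue-word patterns are crude, hypothetical proxies chosen for illustration; the paper defines neither metric, and a real audit would need a validated definition and human spot checks.

```python
import re
from collections import Counter

# Hypothetical cue patterns; neither is defined in the paper.
QUESTION_PATTERN = re.compile(
    r"\?|\b(determine|identify|calculate|how many|which|what)\b", re.IGNORECASE
)
DEPENDENCY_PATTERN = re.compile(
    r"\b(then|based on|according to|matches|if)\b", re.IGNORECASE
)


def instruction_stats(instructions: list[str]) -> dict:
    """Summarize how often instructions embed a question or chain steps."""
    if not instructions:
        raise ValueError("no instructions given")
    with_question = sum(
        1 for text in instructions if QUESTION_PATTERN.search(text)
    )
    # Proxy for nesting depth: number of dependency cues per instruction.
    depth_hist = Counter(
        len(DEPENDENCY_PATTERN.findall(text)) for text in instructions
    )
    return {
        "embedded_question_fraction": with_question / len(instructions),
        "dependency_depth_histogram": dict(sorted(depth_hist.items())),
    }
```

Running the same function over a conventional editing dataset would give a baseline against which the "reasoning-intensive" label could be compared.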

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important aspects of experimental rigor and methodological validation. We respond to each point below and outline revisions that will strengthen the attribution of results to the proposed intelligent editing task.

Point-by-point responses
  1. Referee: [Abstract; Experiments] Abstract and Experiments section: The claim that 'tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities' and that gains arise specifically from the 'complex and effective editing instructions with embedded questions and nested logic' (rather than the editing task itself or dataset artifacts) is load-bearing but unsupported by controls. No ablation is reported comparing Uni-Edit to (a) simplistic editing instructions on identical image pairs or (b) non-editing multi-task baselines using the same VQA source data; without these, attribution to the 'intelligent' property cannot be verified.

    Authors: We agree that isolating the contribution of instruction complexity is critical for the central claim. Our current experiments compare single-task Uni-Edit tuning against the base models and against mixed multi-task baselines reported in the literature, showing simultaneous gains without trade-offs. However, we did not include the exact controls suggested: (a) a simplistic-instruction variant on the same image pairs and (b) a non-editing multi-task setup derived directly from the VQA source data. These ablations would provide stronger evidence that the embedded questions and nested logic are the key drivers. In the revised manuscript we will add both controls on a held-out subset of Uni-Edit-148k, reporting performance deltas for understanding, generation, and editing metrics. revision: yes

  2. Referee: [Method / Data Synthesis] Data synthesis pipeline description: The pipeline is presented as transforming VQA data into complex instructions, but no quantitative validation is supplied (e.g., distribution of nesting depth, percentage of embedded questions, or inter-annotator agreement on instruction quality and edited-image fidelity). This leaves open whether the synthesized data actually demands and improves understanding capacity as asserted.

    Authors: We acknowledge that the Method section describes the pipeline at a procedural level without accompanying statistics. While the pipeline is fully automated and scalable, the absence of quantitative descriptors (nesting-depth histograms, fraction of embedded questions, or quality metrics) limits the ability to verify that the generated instructions are reasoning-intensive. In the revision we will add a dedicated subsection with these statistics for Uni-Edit-148k, including average nesting depth, percentage of instructions containing embedded questions, and results from a small-scale human evaluation of instruction quality and edited-image fidelity (reported as agreement rates). revision: yes
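
The promised ablations reduce to comparing per-capability scores between a tuned variant and its baseline and checking whether any capability regresses. A minimal Python sketch of that bookkeeping follows. The function names and the "higher is better" convention are assumptions, and no scores from the paper are used.

```python
def capability_deltas(
    baseline: dict[str, float], tuned: dict[str, float]
) -> dict[str, float]:
    """Per-benchmark change (tuned minus baseline); assumes higher is better."""
    mismatched = baseline.keys() ^ tuned.keys()
    if mismatched:
        raise ValueError(f"benchmarks differ between runs: {sorted(mismatched)}")
    return {name: tuned[name] - baseline[name] for name in baseline}


def has_tradeoff(deltas: dict[str, float], tolerance: float = 0.0) -> bool:
    """True if any benchmark dropped by more than the tolerance."""
    return any(delta < -tolerance for delta in deltas.values())
```

Applied to each control (simplistic instructions on the same image pairs, and a non-editing baseline from the same VQA data), `has_tradeoff` would flag whether a variant gains on one capability at the cost of another. Benchmarks on different scales should be compared one at a time, not summed.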

Circularity Check

0 steps flagged

No circularity; empirical results rest on external benchmarks and data synthesis, not self-referential reduction

Full rationale

The paper advances an empirical claim: tuning UMMs solely on the synthesized Uni-Edit-148k dataset yields simultaneous gains in understanding, generation, and editing on BAGEL and Janus-Pro. No equations, derivations, fitted parameters renamed as predictions, or uniqueness theorems appear. The synthesis pipeline transforms VQA data into instructions, but this is a constructive data-generation step whose outputs are then evaluated on independent benchmarks; it does not reduce the performance claim to a definitional identity. No self-citations are invoked as load-bearing mathematical facts. The central result is therefore self-contained against external test sets rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the contribution is framed as an empirical task and data-synthesis method.

pith-pipeline@v0.9.0 · 5767 in / 1072 out tokens · 36916 ms · 2026-05-25T05:42:02.951582+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 20 internal anchors

  [1] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024
  [2] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025
  [3] Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583, 2025
  [4] Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. In CVPR, 2025
  [5] Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025
  [6] Google. Nano-banana-pro. Accessed November, 2025 [Online] https://deepmind.google/models/gemini-image/pro/, 2025
  [7] OpenAI. Gpt-4o. Accessed November 18, 2024 [Online] https://chatgpt.com/, 2024
  [8] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. ar...
  [9] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
  [10] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, 2024
  [11] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
  [12] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869, 2024
  [13] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024
  [14] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429, 2024
  [15] Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, et al. Longcat-next: Lexicalizing modalities as discrete tokens. arXiv preprint arXiv:2603.27538, 2026
  [16] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025
  [17] Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report. arXiv preprint arXiv:2509.23951, 2025
  [18] Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation. arXiv preprint arXiv:2509.03498, 2025
  [19] Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Yan Feng, et al. Architecture decoupling is not all you need for unified multimodal model. arXiv preprint arXiv:2511.22663, 2025
  [20] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024
  [21] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
  [22] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024
  [23] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025
  [24] Zhipu AI. Glm-image. https://huggingface.co/zai-org/GLM-Image, 2026
  [25] Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025
  [26] NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025
  [27] Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor. arXiv preprint arXiv:2512.05965, 2025
  [28] Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, et al. Vq-va world: Towards high-quality visual question-visual answering. arXiv preprint arXiv:2511.20573, 2025
  [29] Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, and Hongsheng Li. Factuality matters: When image generation and editing meet structured visuals. arXiv preprint arXiv:2510.05091, 2025
  [30] Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, and Shi-Min Hu. Bee: A high-quality corpus and full-stack suite to unlock advanced fully open mllms. arXiv preprint arXiv:2510.13795, 2025
  [31] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
  [32] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In ECCV, 2024
  [33] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In CVPR, 2024
  [34] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023
  [35] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In CVPR, 2024
  [36] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS, 2023
  [37] Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025
  [38] Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark. arXiv preprint arXiv:2505.20275, 2025
  [39] Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, et al. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing. arXiv preprint arXiv:2504.02826, 2025
  [40] Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models. arXiv preprint arXiv:2509.07295, 2025
