{"total":89,"items":[{"citing_arxiv_id":"2606.31326","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging Video Understanding and Generation in a Unified Framework","primary_cat":"cs.CV","submitted_at":"2026-06-30T08:29:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vega unifies video understanding and generation via shared vocabulary and hybrid autoregressive-diffusion architecture, reporting strong results on VBench and VideoMME.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30054","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-29T09:45:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29308","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MirrorPPR: Exemplar-Based Portrait Photo Retouching","primary_cat":"cs.CV","submitted_at":"2026-06-28T10:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair 
dataset and self-augmentation for alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29013","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers","primary_cat":"cs.CV","submitted_at":"2026-06-27T17:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mural transfers knowledge from a frozen LLM to text-to-image synthesis via MoT shared attention, achieving 0.85 GenEval, 86.75 DPG-Bench, and 0.66 WISE while exhibiting emergent behaviors without multimodal or reasoning supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26872","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing","primary_cat":"cs.CV","submitted_at":"2026-06-25T10:58:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26551","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing","primary_cat":"cs.CV","submitted_at":"2026-06-25T02:57:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a 
training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03401","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Characterizing Scientific Image Utility and Upgradability","primary_cat":"cs.CV","submitted_at":"2026-06-02T09:42:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01985","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching","primary_cat":"cs.CV","submitted_at":"2026-06-01T09:46:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MT-EditFlow applies flow-matching RL with multi-reward aggregation to improve multi-turn image editing performance on models like FLUX.1-Kontext-dev by 6.85 points at turn-3.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01022","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ProductWebGen: Benchmarking Multimodal Product Webpage Generation","primary_cat":"cs.CV","submitted_at":"2026-05-31T05:25:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model 
workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00351","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniVerse: A Unified Modulation Framework for Segmentation-Free,Disentangled Multi-Concept Personalization","primary_cat":"cs.CV","submitted_at":"2026-05-29T20:45:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"UniVerse proposes a unified modulation framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers, claiming superior localization and fidelity over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31604","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Representation Forcing for Bottleneck-Free Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-29T17:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30248","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GenClaw: Code-Driven Agentic Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:13:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GenClaw introduces a three-stage code-driven workflow for agentic image generation that 
inserts programmatic sketches between linguistic reasoning and pixel synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29390","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T05:43:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28548","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GEM: Generative Supervision Helps Embodied Intelligence","primary_cat":"cs.CV","submitted_at":"2026-05-27T14:39:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27924","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:55:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SIGMA generates accurate IML masks via semantic feature differencing and instruction-guided cross-modal refinement, yielding a 
1.1M training set that boosts six detectors by 18.34% F1 on five datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23518","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset","primary_cat":"cs.CV","submitted_at":"2026-05-22T11:33:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22818","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MotiMotion: Motion-Controlled Video Generation with Visual Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:59:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21605","ref_index":48,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:12:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GenEvolve introduces a self-evolving agent framework for image 
generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21090","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TextSculptor: Training and Benchmarking Scene Text Editing","primary_cat":"cs.CV","submitted_at":"2026-05-20T12:22:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20158","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:46:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Existing visual attribution methods often fail to identify the visual evidence used by LVLMs in chest X-ray reasoning, while MedFocus using unbalanced optimal transport and targeted interventions substantially outperforms them across multiple models and settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20090","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint 
Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-19T16:47:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18748","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aurora: Unified Video Editing with a Tool-Using Agent","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18714","ref_index":70,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Semantic Generative Tuning for Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:46:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Semantic Generative Tuning applies segmentation-based generative proxies during post-training to align and improve both understanding and generation in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18678","ref_index":125,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lance: Unified Multimodal Modeling by Multi-Task 
Synergy","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966-12977, 2025. [125] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025. [126] Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, and Qian He. Vmix: Improving text-to-image diffusion model with cross-attention mixing control.arXiv preprint arXiv:2412.20800, 2024. [127] Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, and Qian He. 
Uso: Unified style and subject-driven generation via disentangled and reward learning."},{"citing_arxiv_id":"2605.15181","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14876","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-14T14:22:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13122","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-13T07:48:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image 
segmentation using a single denoising step.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13062","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-13T06:33:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12500","ref_index":141,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"works [17, 20, 69, 158] validate that direct pixel-space modeling can rival or even surpass latent diffusion, pointing toward a fundamentally new direction via fully end-to-end optimization from raw pixels. 2.2 Native Multimodal Unified Models Early efforts to unify multimodal understanding and generation have largely converged on shared backbones, as exemplified by Show-o [145, 146], Janus [18, 92, 140], OmniGen [141, 144], and BAGEL [28]. 
While these systems demonstrate that perception and synthesis can coexist within a single model, they remain split across fundamentally different tokenizers, diffusion heads, or decoupled pathways, reflecting a deeper mismatch between understanding and generation. A complementary line of work shifts the focus to the visual interface itself, including shared discrete"},{"citing_arxiv_id":"2605.12305","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:54:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[41] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025. [42] Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Zhonghua Wu, Qingyi Tao, Wentao Liu, Wei Li, and Chen Change Loy. Harmonizing visual representations for unified multimodal understanding and generation.arXiv preprint arXiv:2503.21979, 2025. 
[43] Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao"},{"citing_arxiv_id":"2605.12271","ref_index":33,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:35:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[32] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Towards instruction-aligned multimodal generation, 2026. URL https://arxiv.org/abs/2506.18871. [33] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17191-17202, 2025. 
[34] Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, and Noah Constant."},{"citing_arxiv_id":"2605.12088","ref_index":36,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T13:10:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11818","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition","primary_cat":"cs.CV","submitted_at":"2026-05-12T09:09:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11400","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning","primary_cat":"cs.MM","submitted_at":"2026-05-12T01:43:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniPath adaptively models 
coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11061","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Zhang, W.: Scone: Bridging composition and distinction in subject-driven image generation via unified understanding-generation modeling. arXiv preprint arXiv:2512.12675 (2025) [53] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) [54] Wu, C., Zheng, P ., Yan, R., Xiao, S., Luo, X., Wang, Y., Li, W., Jiang, X., Liu, Y., Zhou, J., et al.: Omnigen2: Exploration to advanced multimodal generation. 
arXiv preprint arXiv:2506.18871 (2025) [55] Xia, B., Peng, B., Zhang, Y., Huang, J., Liu, J., Li, J., Tan, H., Wu, S., Wang, C., Wang, Y., et al.: Dreamomni2: Multimodal instruction-based editing and generation."},{"citing_arxiv_id":"2605.10859","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Masked Generative Transformer Is What You Need for Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:05:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"819 0.897 0.261 0.052 0.879 •AnyEdit [21] CVPR'25 1B 0.872 0.285 0.070 0.821 0.898 0.2750.0510.881 •OmniGen [20] arXiv'24 4B 0.836 0.233-0.804 - - - - •PixWizard [9] ICLR'25 2B 0.845 0.2480.0690.798 0.884 0.265 0.063 0.876 •UniReal [4] CVPR'25 5B 0.851 0.285 0.099 0.790 0.903 0.3080.081 0.837 •GoT [7] NeurIPS'25 6B 0.864 0.276- - - - - - •OminiGen2 [18] arXiv'25 7B 0.876 0.309-0.822 - - - - •EditAR [13] ICLR'25 3B - - - - 0.867-0.103 0.804 •NEP [19] arXiv'25 3B 0.871 0.307 0.0780.844 - - - - •V AREdit [11] arXiv'25 8B 0.876 0.280 0.094 0.825 0.901 0.287 0.083 0.844 •EDITMGT Ours 1B 0.878 0.308 0.093 0.832 0.911 0.301 0.058 0.881 image tokensC I naturally encode the spatial location of in- tended edits (Fig."},{"citing_arxiv_id":"2605.10127","ref_index":49,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal 
Condition","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08354","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria","primary_cat":"cs.AI","submitted_at":"2026-05-08T18:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"[40] Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions?, 2025. [41] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. [42] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025. [43] Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu, and Wenhu Chen. 
Editreward: A human-aligned reward model for instruction-guided image editing."},{"citing_arxiv_id":"2605.08029","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation","primary_cat":"cs.CV","submitted_at":"2026-05-08T17:14:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07477","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:23:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"IEEE/CVF International Conference on Computer Vision (ICCV), pages 17312-17323, 2025. [50] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600-612, 2004. [51] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025. [52] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, et al. 
Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871, 2025. [53] Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505."},{"citing_arxiv_id":"2605.07457","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement","primary_cat":"cs.CV","submitted_at":"2026-05-08T09:05:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"IEEE Transactions on Image Processing (TIP) 13(4), 600-612 (2004) [51] Wei, H., Liu, H., Wang, Z., Peng, Y., Xu, B., Wu, S., et al.: Skywork unipic 3.0: Unified multi-image composition via sequence modeling. arXiv preprint arXiv:2601.15664 (2026) [52] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) [53] Wu, C., Zheng, P., Yan, R., Xiao, S., Luo, X., Wang, Y., et al.: Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025) [54] Wu, H., Zhang, Z., Zhang, W., Chen, C., Li, C., Liao, L., et al.: Q-align: Teaching lmms for visual scoring via discrete text-defined levels. 
arXiv preprint arXiv:2312.17090 (2023)"},{"citing_arxiv_id":"2605.07402","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InsHuman: Towards Natural and Identity-Preserving Human Insertion","primary_cat":"cs.CV","submitted_at":"2026-05-08T07:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"InsHuman proposes Human-Background Adaptive Fusion, Face-to-Face ID-Preserving, and Bidirectional Data Pairing to enable natural human insertion in images without altering identity.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 1: Quantitative comparisons among InsHuman and other image editing models. InsHuman achieved best or second-best performance in the following metrics. Method IDS↑ BM %↓ PCE %↓ BD %↓ BL %↓ FR %↓ FLUX.2 [5] 0.61 0.76 11.45 6.87 5.34 21.37 DreamOmni2 [6] 0.26 60.30 29.00 11.45 3.05 79.39 HunyuanImage-3.0-instruct [7] 0.21 5.34 25.95 8.40 0.76 33.59 OmniGen2 [8] 0.28 13.00 31.30 15.27 4.58 46.56 Qwen-Image-Edit-2509 [9] 0.50 21.37 29.01 7.63 13.00 59.54 InsHuman (Ours) 0.55 0.76 3.82 3.82 3.05 10.69 algorithm and minimizes identity feature distances per matched pair, with no injection required at inference. 
To our knowledge, we are the first to simultaneously learn the interactive relationship with the background and retain the facial identity in a unified training framework without additional"},{"citing_arxiv_id":"2605.05781","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Steering Visual Generation in Unified Multimodal Models with Understanding Supervision","primary_cat":"cs.CV","submitted_at":"2026-05-07T07:20:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"For image editing, we evaluate on GEdit-Bench-EN/CN [33], a comprehensive multilingual benchmark derived from real-world user instructions. Baselines. We compare against both generation-only and unified models. For image generation, generation-only baselines include SDXL [38], Stable Diffusion 3.5 Medium/Large [11], FLUX.1-dev [3], Infinity [17], OmniGen2 [57] and Wan2.2-t2i-plus, unified models include Janus [56], Janus-Pro [6], Emu3 [54], OneCAT [27], Janus-Flow [34], BLIP3-o [5], UniWorld-V1 [31], Mogao [30] and BAGEL [9]. 
For editing, generation-only models include Instruct-Pix2Pix [4], MagicBrush [70], AnyEdit [19], OmniGen [63], OmniGen2 [57], Step1X-Edit [33] and FLUX-Kontext [24], and unified models include BAGEL [9], BAGEL-NHR [22] and UniWorld-V1 [31]."},{"citing_arxiv_id":"2605.05646","ref_index":145,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality","primary_cat":"cs.CV","submitted_at":"2026-05-07T03:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04128","ref_index":86,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation","primary_cat":"cs.GR","submitted_at":"2026-05-05T15:49:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"our model maintains consistently high accuracy in both English and Chinese, indicating stable long-text rendering capability across different language settings. Table 7: Quantitative evaluation results on LongText-Bench [33]. 
Model LongText-Bench-EN↑ LongText-Bench-ZH↑ Janus-Pro [21] 0.019 0.006 BLIP3-o [17] 0.021 0.018 HiDream-I1-Full [10] 0.543 0.024 Kolors 2.0 [72] 0.258 0.329 FLUX.1 [Dev] [6] 0.607 0.005 OmniGen2 [86] 0.561 0.059 BAGEL [28] 0.373 0.310 GPT Image 1 [High] [61] 0.956 0.619 X-Omni [33] 0.900 0.814 Seedream 3.0 [32] 0.896 0.878 Z-Image-Turbo [76] 0.917 0.926 Z-Image [76] 0.935 0.936 Qwen-Image [85] 0.943 0.946 JoyAI-Image 0.963 0.963 CVTG-2k. On the CVTG-2K benchmark shown in Table 8, our model demonstrates strong text rendering capability, particularly in terms of word accuracy and structural consistency."},{"citing_arxiv_id":"2605.08163","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing","primary_cat":"cs.CV","submitted_at":"2026-05-04T16:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26341","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness","primary_cat":"cs.CV","submitted_at":"2026-04-29T06:46:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ming Yin, Shuai 
Bai, Xiao Xu, Yilei Chen, et al. 2025. Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025). [50] Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. 2025. Omnigen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871 (2025). [51] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. 2025. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025). [52] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. 2025. Omnigen: Unified image generation."},{"citing_arxiv_id":"2604.25477","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-28T10:30:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"optimizing the Thinker with this dual-faceted and group-relative feedback, DDA-Thinker learns to generate executable plans that are not only logically sound but also highly executable, leading to high-quality visual edits. IV. EXPERIMENTS A. Experimental Setup Models and Training Setup. We utilize Qwen3-VL-32B as the primary trained Thinker, with additional results reported for Qwen3-VL-8B [60]. Throughout training, the image generation Editor remains frozen. Unless otherwise specified, we use Qwen-Image-Edit-2511 [1] as the default Editor. 
The training pipeline follows a two-stage paradigm. First, the Thinker undergoes SFT on our 5k synthesized dataset, performed on 16 NVIDIA H200 GPUs. Second, it is optimized via dual-atomic RFT using GRPO [48] on a curated 1."},{"citing_arxiv_id":"2604.25072","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-04-27T23:57:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"two metrics diverge significantly for a given model, it reveals that the model's apparent consistency stems from consistent hallucination and we empirically validate this diagnostic property through a human study in Sec. 4.4. 4 Experiments 4.1 Experimental Setup We evaluate eight open-sourced uMMs with XTC-Bench, including BAGEL 7B [10], BLIP3-o-8B [4], Janus-Pro-7B [5], MMaDA-8B [41], OmniGen-2 [36], Show-o [37], Show-o2-7B [38], and Tar-7B [15]. We chose those models to cover the taxonomy of uMM architectures proposed in recent work [46]. MMaDA leverages the diffusion paradigm for both visual and text modality. 
Tar, OmniGen2, BLIP3-o-8B, and Janus-Pro-7B adopt an autoregressive next-token prediction (NTP) backbone structure with sophisticated encoding and decoding strategies"},{"citing_arxiv_id":"2604.24625","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Meta-CoT: Enhancing Granularity and Generalization in Image Editing","primary_cat":"cs.CV","submitted_at":"2026-04-27T15:52:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Unified Models GoT [11] 3.61 2.94 1.35 2.78 2.57 2.29 3.51 1.75 2.66 2.61 Ming-UniVision [22] 3.55 3.14 1.52 3.25 3.29 2.77 3.99 2.74 3.91 3.06 BAGEL (w/o think) [8] 3.56 3.31 1.70 3.38 2.62 3.24 4.49 2.38 4.17 3.20 UniWorld-V1 [34] 3.82 3.64 2.27 3.47 3.24 2.99 4.21 2.96 2.74 3.26 BAGEL (think) [8] 3.65 3.53 2.03 3.60 3.03 3.45 4.43 2.59 4.22 3.39 OmniGen2 [62] 3.57 3.06 1.77 3.74 3.20 3.57 4.81 2.52 4.68 3.44 BLIP3o-NEXT [5] 4.00 3.78 2.39 4.05 2.61 4.30 4.64 2.67 4.13 3.62 Meta-CoT+RL (Ours) 3.87 3.91 2.40 4.22 3.74 3.98 4.80 3.26 4.33 3.83 ∆ Over Base Model +6.0% +10.8% +18.2% +17.2% +23.4% +15.4% +8.4% +25.9% +2.6% +13.0% editing task is fully drawn from ComplexEdit [73]. We additionally introduce 5 new task categories (each with 100"}],"limit":50,"offset":0}