hub Mixed citations

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan · 2025 · cs.CV · arXiv 2510.16888

Mixed citation behavior. Most common role is baseline (42%).

21 Pith papers citing it

Baseline 42% of classified citations

open full Pith review browse 21 citing papers arXiv PDF

abstract

Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. \texttt{UniWorld-V2}, trained with this framework, achieves \textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 baseline 5 other 1

citation-polarity summary

baseline 5 background 4 unclear 3

representative citing papers

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

Inline Critic Steers Image Editing

cs.CV · 2026-05-12 · conditional · novelty 7.0

Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

cs.CV · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

cs.CV · 2026-02-27 · unverdicted · novelty 7.0

DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.

Setting the Stage: Text-Driven Scene-Consistent Image Generation

cs.CV · 2025-12-14 · conditional · novelty 7.0

A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.

Semantic Generative Tuning for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.

SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.

GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

cs.CV · 2026-05-20 · unverdicted · novelty 5.0 · 2 refs

GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.

Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

cs.LG · 2026-05-06 · unverdicted · novelty 5.0

Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

cs.CV · 2026-04-21 · unverdicted · novelty 5.0

SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraining, reasoning supervision, and reinforcement learning.

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

cs.CV · 2025-11-27 · unverdicted · novelty 5.0

Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

cs.GR · 2026-05-05 · unverdicted · novelty 4.0 · 2 refs

JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

cs.CV · 2025-12-14

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10

citing papers explorer

Showing 21 of 21 citing papers.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling cs.CV · 2026-05-13 · unverdicted · none · ref 20 · internal anchor
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
Inline Critic Steers Image Editing cs.CV · 2026-05-12 · conditional · none · ref 27 · internal anchor
Inline Critic uses a learnable token to critique and steer a frozen image-editing model's intermediate layers during generation, delivering state-of-the-art results on GEdit-Bench, RISEBench, and KRIS-Bench.
RewardHarness: Self-Evolving Agentic Post-Training cs.AI · 2026-05-09 · unverdicted · none · ref 12 · internal anchor
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning cs.LG · 2026-04-21 · unverdicted · none · ref 23 · internal anchor
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models cs.CV · 2026-04-19 · unverdicted · none · ref 45 · 2 links · internal anchor
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs cs.CV · 2026-04-17 · unverdicted · none · ref 29 · internal anchor
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model cs.CV · 2026-02-27 · unverdicted · none · ref 13 · internal anchor
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
Setting the Stage: Text-Driven Scene-Consistent Image Generation cs.CV · 2025-12-14 · conditional · none · ref 21 · internal anchor
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
Semantic Generative Tuning for Unified Multimodal Models cs.CV · 2026-05-18 · unverdicted · none · ref 29 · internal anchor
Semantic Generative Tuning uses image segmentation as a generative proxy to align misaligned representation spaces in unified multimodal models and improve both perception and generative layout fidelity.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping cs.CV · 2026-05-11 · unverdicted · none · ref 118 · internal anchor
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 23 · internal anchor
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing cs.CV · 2026-04-06 · unverdicted · none · ref 30 · internal anchor
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation cs.CV · 2026-05-20 · unverdicted · none · ref 24 · 2 links · internal anchor
GenEvolve introduces a self-evolving agent framework for image generation using tool-orchestrated trajectories and Visual Experience Distillation to achieve claimed SOTA results on benchmarks.
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing cs.CV · 2026-05-16 · unverdicted · none · ref 15 · internal anchor
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 72 · internal anchor
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium cs.LG · 2026-05-06 · unverdicted · none · ref 19 · internal anchor
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing cs.CV · 2026-04-21 · unverdicted · none · ref 31 · internal anchor
SmartPhotoCrafter performs automatic photographic image editing by coupling an Image Critic module that identifies deficiencies with a Photographic Artist module that generates edits, trained via multi-stage pretraining, reasoning supervision, and reinforcement learning.
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unverdicted · none · ref 42 · internal anchor
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation cs.GR · 2026-05-05 · unverdicted · none · ref 46 · 2 links · internal anchor
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling cs.CV · 2025-12-14 · unreviewed · ref 15 · internal anchor
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation cs.CV · 2025-03-10 · unreviewed · ref 27 · internal anchor

Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer