EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.
super hub Mixed citations
OmniGen2: Towards Instruction-Aligned Multimodal Generation
Mixed citation behavior. Most common role is background (43%).
abstract
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of Omn
co-cited works
representative citing papers
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.
The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.
Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.
Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
citing papers explorer
-
Masked Generative Transformer Is What You Need for Image Editing
EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.
-
PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing
PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
-
InterleaveThinker: Reinforcing Agentic Interleaved Generation
InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
-
MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation
MetaPoint represents 2D coordinates as special tokens in visual generative models to enable precise spatial control using existing positional encodings without architectural modifications.
-
Towards Characterizing Scientific Image Utility and Upgradability
The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.
-
Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation
Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
-
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
VINS-120K supplies the first large-scale set of instruction-image-edited-image triplets at ultra-high resolution together with an adaptation strategy that improves detail synthesis.
-
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion adds visual reasoning via a training-free VLM to refine primary trajectories and hallucinate secondary motions, plus a confidence-aware guidance scheme, yielding more plausible interactions on the new MotiBench benchmark.
-
MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling
MetaEarth-MM unifies multi-modal remote sensing image generation and any-to-any translation across five modalities via scene-centered joint modeling on the new EarthMM dataset.
-
Aurora: Unified Video Editing with a Tool-Using Agent
Aurora introduces a VLM-based agent that converts raw user video edit requests into structured conditioning inputs for a unified diffusion transformer, improving performance on underspecified tasks via a new benchmark.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
-
RevealLayer: Disentangling Hidden and Visible Layers via Occlusion-Aware Image Decomposition
RevealLayer decomposes natural images into multiple RGBA layers using diffusion models with region-aware attention, occlusion-guided adaptation, and a composite loss, outperforming prior methods on a new benchmark dataset.
-
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
-
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
-
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
-
Edit Where You Mean: Region-Aware Adapter Injection for Mask-Free Local Image Editing
A co-trained adapter framework enables mask-free local editing in DiTs by factorizing edit semantics from spatial location and jointly learning a mask predictor.
-
Exploring Spatial Intelligence from a Generative Perspective
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
-
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
-
OneHOI: Unifying Human-Object Interaction Generation and Editing
OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
-
ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
ASTRA disentangles subject identity from pose structure in diffusion transformers via retrieval-augmented pose guidance, asymmetric EURoPE embeddings, and a DSM adapter to improve multi-subject generation.
-
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
-
AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control
AIM-Bench is the first dedicated benchmark for editing images to evoke specific emotions with fine-grained control, paired with AIM-40k dataset that delivers a 9.15% performance gain by correcting training data imbalances.
-
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
-
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
-
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
-
MIBE: Multi-subject Interaction Benchmark and Evaluator for Personalized Image Generation
MIBE introduces a multi-subject interaction benchmark (MIB) with silver and gold sets and a dual-head evaluator (MIE) trained on VLM labels that outperforms baselines in matching human judgments.
-
GMO-E$^2$DIT: Grounded Multi-Operation Editing for E-Commerce Images
GMO-E²DIT is an agentic editing framework that decouples VLM-based planning from mask-conditioned rendering and uses reflection to execute multi-operation e-commerce image edits with error recovery.
-
MirrorPPR: Exemplar-Based Portrait Photo Retouching
MirrorPPR extracts retouching operations from exemplar pairs via a dedicated extractor and transfers them to query images through a LoRA-adapted Diffusion Transformer, enabled by a new 47-million-pair dataset and self-augmentation for alignment.
-
Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers
Mural transfers knowledge from a frozen LLM to text-to-image synthesis via MoT shared attention, achieving 0.85 GenEval, 86.75 DPG-Bench, and 0.66 WISE while exhibiting emergent behaviors without multimodal or reasoning supervision.
-
SpatialFlow-GRPO: Where Spatial Credit Drives Image Editing
SpatialFlow-GRPO adds region-level reward feedback and spatial alignment to Flow-GRPO-style RL for image editing, reporting gains on GEdit-Bench, ImgEdit-Bench, and a new MultiEditBench.
-
ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
ImageWAM shows image editing models can replace video generation in world action models, delivering better performance with 6x lower FLOPs and 4x lower latency by using edit-derived KV caches as compact context.
-
Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
UniAR uses a shared context-visual tokenizer with bitwise quantization and parallel prediction in an autoregressive framework to unify visual understanding and generation, claiming SOTA on generation and editing tasks.
-
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Qwen-RobotWorld is a language-conditioned video world model using Double-Stream MMDiT, an 8.6M-frame embodied corpus, and progressive curriculum training that ranks first on EWMBench and DreamGen Bench.
-
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
-
ProductWebGen: Benchmarking Multimodal Product Webpage Generation
Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.
-
Representation Forcing for Bottleneck-Free Unified Multimodal Models
Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same backbone.
-
GenClaw: Code-Driven Agentic Image Generation
GenClaw introduces a three-stage code-driven workflow for agentic image generation that inserts programmatic sketches between linguistic reasoning and pixel synthesis.
-
SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization
SIGMA generates accurate IML masks via semantic feature differencing and instruction-guided cross-modal refinement, yielding a 1.1M training set that boosts six detectors by 18.34% F1 on five datasets.
-
TextSculptor: Training and Benchmarking Scene Text Editing
TextSculptor supplies an automated data synthesis pipeline yielding 3.2M samples plus a four-task benchmark that raises open-source scene text editing performance.
-
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
Existing visual attribution methods often fail to identify the visual evidence used by LVLMs in chest X-ray reasoning, while MedFocus using unbalanced optimal transport and targeted interventions substantially outperforms them across multiple models and settings.
-
Lance: Unified Multimodal Modeling by Multi-Task Synergy
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
-
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning
CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
-
Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
Pretrained instruction-based image editing models exhibit early foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.
-
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.